Large Language Models Can Be Easily Distracted by Irrelevant Context

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, Denny Zhou

Introduction

Prompting large language models performs decently well in a variety of domains (Brown et al., 2020; Chowdhery et al., 2022, inter alia). However, for most of these evaluation benchmarks, all the information provided in the problem description is relevant to the problem solution, as is the case for exam problems. This is different from real-world situations, where problems usually come with several pieces of contextually related information, which may or may not be relevant to the problems that we want to solve. We have to identify what information is actually necessary when solving those problems. Studies in psychology have shown that irrelevant information may significantly decrease the problem-solving accuracy of some children and even adults (Hoyer et al., 1979; Pasolunghi et al., 1999; Marzocchi et al., 2002, inter alia).

In this work, we study the distractibility of large language models for various prompting techniques; i.e., how is large language model prompting affected by irrelevant context, and what strategies can be used to improve performance? To measure distractibility, we construct the GSM-IC dataset, a grade-school math problem dataset derived from GSM8K (Cobbe et al., 2021) and introduce two different metrics. In contrast to prior work that derives benchmark variations by substituting sentences of the base problems with variations (Patel et al., 2021; Kumar et al., 2021, inter alia), we keep the base problem description and add to it one irrelevant sentence, while making sure that it does not affect the solution of the problem (Table 1).

We use Codex (code-davinci-002) and GPT-3.5 (text-davinci-003) in the GPT3 model family (http://openai.com/api/) to evaluate state-of-the-art prompting techniques on GSM-IC, including chain-of-thought prompting (CoT; Wei et al., 2022), zero-shot chain-of-thought prompting (0-CoT; Kojima et al., 2022), least-to-most prompting (LtM; Zhou et al., 2022), and prompting with programs (Program; Chowdhery et al., 2022). We find that their performance on GSM-IC greatly decreases compared to the original GSM8K (without irrelevant context). We then investigate several approaches to mitigate this weakness, including self-consistency (Wang et al., 2022c) and adding irrelevant information to the exemplars in the prompt. In addition to demonstrating how to handle irrelevant information via exemplars, we also investigate the usage of task-specific instructions (Wei et al., 2021; Sanh et al., 2021; Ouyang et al., 2022; Suzgun et al., 2022; Chung et al., 2022), where we prepend an instruction sentence “feel free to ignore irrelevant information in the problem description” to the exemplars. We summarize our key findings below:

All investigated prompting techniques are sensitive to irrelevant information in the problem description. In particular, among the original problems that can be solved by baseline prompts with greedy decoding, no more than $18\%$ of them can be consistently solved for all types of irrelevant information, showing that the large language model is easily distracted and produces inconsistent predictions when adding a small amount of irrelevant information to the problem description.

Self-consistency improves the performance of all prompting techniques on GSM-IC. In particular, the recall rate of the correct answer for GSM-IC is as high as 99.7% with 20 samples per problem, i.e., at least one of the 20 solutions results in the correct final answer, which means that using multiple samples allows the model to almost always retrieve the correct answer.

Adding irrelevant information to the exemplars shown in the prompt consistently boosts the performance, and the same holds for adding an instruction to ignore irrelevant context. This suggests that language models are—to some extent—able to learn to ignore irrelevant information by following examples or instructions.

We identify different factors of the irrelevant information that affect the model’s sensitivity to irrelevant context. Our breakdown analysis shows that varying the numbers in the irrelevant information does not notably change the model performance, while the degree of lexical overlap with the original problem description matters.

Filtering out irrelevant information is essential for handling real-world tasks. Our evaluation indicates that despite the strong performance on challenging reasoning problems, state-of-the-art language models still have fundamental weaknesses in context understanding and identifying the relevant information from the input. Our findings suggest that in order to gain a more holistic understanding of the reasoning capability of language models, future work should also consider the model sensitivity to irrelevant context, in addition to solving more challenging problems.

Related Work

Few-shot prompting. Few-shot prompting (Brown et al., 2020; Chowdhery et al., 2022, inter alia) has been significantly boosted with various techniques, including generating intermediate steps (Ling et al., 2017; Cobbe et al., 2021; Nye et al., 2021; Wei et al., 2022; Suzgun et al., 2022; Shi et al., 2022b, inter alia), problem decomposition (Zhou et al., 2022; Drozdov et al., 2022; Dohan et al., 2022; Khot et al., 2022; Press et al., 2022, inter alia), generating programs (Austin et al., 2021; Chowdhery et al., 2022; Gao et al., 2022; Chen et al., 2022, inter alia), marginalizing intermediate steps that share the same result (Wang et al., 2022c; Shi et al., 2022a), and ensembling (Wang et al., 2022b; Drozdov et al., 2022). In addition, Kojima et al. (2022) demonstrate that an appropriate hint in the prompt also leads to decent performance, even without any exemplar. In this work, we examine these cutting-edge prompting techniques (Wei et al., 2022; Zhou et al., 2022; Kojima et al., 2022; Wang et al., 2022c) on our benchmark, and demonstrate that they are sensitive to irrelevant input context.

Natural language benchmarks with input perturbations. There has been a long line of work on adding input perturbations for natural language tasks, including model-agnostic input transformations (Liang et al., 2022; Ravichander et al., 2022, inter alia) and adversarial example generation against individual models (Jia & Liang, 2017; Shi et al., 2018; Morris et al., 2020; Wang et al., 2021). In particular, prior work has constructed arithmetic reasoning benchmarks through paraphrasing or rewriting sentences in the base problems from clean datasets (Patel et al., 2021; Kumar et al., 2021). Meanwhile, Liang et al. (2022) evaluate various large language models under several metrics, including accuracy, robustness, fairness, etc. Specifically, the input transformations in their robustness evaluation include semantics-preserving and semantics-altering perturbations, such as injecting typos and modifying sentences to change the ground-truth classification labels. In contrast to the above work, where the meaning of problem descriptions may be changed by perturbations, we keep all sentences in the original problem description, and introduce an irrelevant sentence that is ensured not to affect the standard answer.

Natural language benchmarks with irrelevant input context. Jia & Liang (2017) have shown that neural question answering systems are largely affected by adversarial distracting sentences, whereas follow-up work (Khashabi et al., 2017; Ni et al., 2019) proposes learning strategies that mitigate the problem. Similar issues have been found for general-purpose pretrained language models, on the tasks of factual reasoning (Kassner & Schütze, 2020; Pandia & Ettinger, 2021; Misra et al., 2023; Li et al., 2022), code generation (Jones & Steinhardt, 2022), and syntactic generalization (Chaves & Richter, 2021). In particular, Li et al. (2022) evaluated T5 (Raffel et al., 2020) and PaLM (Chowdhery et al., 2022) with few-shot prompts, and proposed knowledge-aware finetuning that finetunes the model on problems with counterfactual and irrelevant context, which strengthens the model robustness to noisy context. In our evaluation, we show that without training or finetuning, adding irrelevant context into demonstrations in the prompt also mitigates the distractibility of the underlying language model and significantly improves the model performance on our GSM-IC benchmark.

There exist some logical reasoning benchmarks that contain irrelevant content in task descriptions (Weston et al., 2015; Sinha et al., 2019; Clark et al., 2021; Han et al., 2022; Tafjord et al., 2020, inter alia). However, previous work largely focuses on designing models that require extra training, and prompting alone still hardly achieves the same level of performance as finetuned models for these tasks (Han et al., 2022; Creswell et al., 2022). In our work, we focus on arithmetic reasoning, where prompting techniques have achieved the state-of-the-art results, e.g., on GSM8K, while we show that adding a single irrelevant sentence into the problem description significantly degrades the performance.

Prompting with noisy ground truth. A line of work studies the model performance with incorrect prompting exemplars, i.e., the example problems are paired with wrong answers (Min et al., 2022; Kim et al., 2022). In addition, prior work has investigated the model sensitivity to other parts of the prompt, such as instruction tuning with misleading and irrelevant instructions (Webson & Pavlick, 2021) and wrong reasoning steps in the examples (Madaan & Yazdanbakhsh, 2022; Wang et al., 2022a). In particular, Madaan & Yazdanbakhsh (2022) conclude that the correctness of numbers and equations in chain-of-thought prompts does not play a key role in model performance, but using wrong entities and removing either equations or text explanation in the reasoning steps drastically hamper the performance. Different from this line of work, we always include correct answers to example problems in the prompt, and ensure that the irrelevant context added to the problem description does not change the ground truth answer. We show that the model performance significantly drops when presented with irrelevant context in problem descriptions, and different distributions of numbers and entities in the irrelevant context also lead to different levels of performance degradation.

The GSM-IC Dataset

We then generate the examples of our new dataset by adding to each base problem one sentence containing irrelevant information. We use a template-based method (Figure 1) to generate these sentences, which can be characterized by the following three factors:

Topic of the inserted sentence. We write templates for both in-topic and off-topic sentences. In-topic sentences are closely related to the topic of the original problem, whereas off-topic sentences are about a different topic.

Role name overlap. Most sentence templates contain some role name blanks, which can be filled with names that may or may not overlap with the role names that occur in the problem. For blank fillers that have overlap with original role names, we: (1) randomly pick a role name A from the original problem description and (2) create the blank fillers with templates such as A’s father and A’s sister.

Range of numbers. Since we focus on arithmetic reasoning, most sentence templates also contain a number blank. We can choose to fill in the number blank with a number of similar or different magnitude to those in the original problem description. Concretely, for a number $a$ , if there exists a number $b$ in the original problem description or solution such that $\frac{1}{10}\leq\frac{a}{b}\leq 10$ , we consider $a$ as an in-range number, and otherwise an out-of-range number. Since the standard answers to GSM8K problems are all positive integers, we only consider positive integers as the number blank fillers.
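The in-range criterion above can be checked with a few lines of Python. This is a minimal sketch rather than the authors' code: the function name `is_in_range` and the choice to skip non-positive reference numbers are assumptions made for illustration.

```python
def is_in_range(a, reference_numbers):
    """Return True if 1/10 <= a/b <= 10 for some reference number b.

    `reference_numbers` are the numbers appearing in the original problem
    description or solution. Non-positive references are skipped (an
    assumption; the ratio test is only meaningful for positive b).
    """
    # For b > 0: a/b >= 1/10  <=>  10*a >= b, and a/b <= 10  <=>  a <= 10*b.
    return any(b > 0 and 10 * a >= b and a <= 10 * b for b in reference_numbers)


# Hypothetical numbers from a base problem.
problem_numbers = [5, 30]
print(is_in_range(50, problem_numbers))    # True: 50/5 = 10
print(is_in_range(7000, problem_numbers))  # False: 7000/30 > 10
```

Using integer comparisons instead of a floating-point division avoids rounding issues at the boundary ratios 1/10 and 10.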

We manually verify that (1) all the generated sentences are acceptable in English and that (2) adding them does not affect the standard solution of the base problem. Because the above factors are orthogonal, we generate for each base example a set of derived examples with different factor combinations. The full GSM-IC benchmark consists of 58,052 examples. More details about the dataset creation process can be found in Appendix A.
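Because the three factors are orthogonal, the derived sentences for one base problem can be enumerated as a Cartesian product. The sketch below illustrates the idea only: the in-topic template, the character name "Lucy", and the concrete numbers are hypothetical, and the real dataset additionally relies on manual verification.

```python
from itertools import product

# Hypothetical fillers for one base problem whose main character is "Lucy".
templates = {
    "in_topic": ["{role} bought {number} more apples."],      # hypothetical template
    "off_topic": ["The height of {role} is {number} feet."],  # template quoted in Appendix A
}
roles = {
    "overlap": ["Lucy's father", "Lucy's sister"],
    "no_overlap": ["Ada", "Tom"],
}
numbers = {
    "in_range": [12],
    "out_of_range": [50000],
}


def generate_distractors(templates, roles, numbers):
    """Enumerate one irrelevant sentence per (template, role name, number) choice."""
    sentences = []
    factor_combos = product(templates.items(), roles.items(), numbers.items())
    for (topic, tpls), (role_kind, names), (num_kind, nums) in factor_combos:
        for tpl, name, num in product(tpls, names, nums):
            sentences.append({
                "topic": topic,
                "role": role_kind,
                "number": num_kind,
                "sentence": tpl.format(role=name, number=num),
            })
    return sentences


distractors = generate_distractors(templates, roles, numbers)
print(len(distractors))  # 16
```

Keeping the factor labels next to each sentence makes the later breakdown by topic, role overlap, and number range straightforward.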

For a problem $p$ , we denote its standard solution by $s(p)$ , and the solution of method $\mathcal{M}$ by $\mathcal{M}(p)$ . To evaluate the distractibility of $\mathcal{M}$ , we consider the following two metrics:

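Micro accuracy $\textit{Acc}_{\textit{micro}}(\mathcal{M};\mathcal{P})$ is the average accuracy of method $\mathcal{M}$ over all the test problems $\mathcal{P}$ :

$$\textit{Acc}_{\textit{micro}}(\mathcal{M};\mathcal{P})=\frac{\sum_{p\in\mathcal{P}}\mathbb{1}\left[\mathcal{M}(p)=s(p)\right]}{|\mathcal{P}|}$$

This means that the micro accuracy weighs all the individual test problems equally.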

Macro accuracy $\textit{Acc}_{\textit{macro}}(\mathcal{M};\mathcal{B})$ is the average accuracy of method $\mathcal{M}$ over classes of test problems, where each class $\mathcal{P}(b)$ consists of the set of test examples derived from the base example $b\in\mathcal{B}$ . We define $\mathcal{M}$ ’s prediction for a class $\mathcal{P}(b)$ to be correct if and only if $\mathcal{M}$ ’s predictions for all problems in this class are correct:

$$\textit{Acc}_{\textit{macro}}(\mathcal{M};\mathcal{B})=\frac{\sum_{b\in\mathcal{B}}\mathbb{1}\left[\bigwedge_{p\in\mathcal{P}(b)}\left[\mathcal{M}(p)=s(p)\right]\right]}{|\mathcal{B}|}$$

This means that the macro accuracy is the fraction of base problems that can be consistently solved no matter what irrelevant sentence is being added.

Normalized accuracy measures how a method is affected by the distractors, considering its accuracy on base problems. For a micro or macro accuracy $a_{\mathcal{M}}$ achieved by method $\mathcal{M}$ , we calculate its corresponding normalized accuracy by

$$\textit{norm}(a_{\mathcal{M}};\mathcal{M})=\frac{a_{\mathcal{M}}}{n_{\mathcal{M}}}$$

where $n_{\mathcal{M}}$ denotes the base problem accuracy of method $\mathcal{M}$ (§ 3).
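As a concrete illustration of these metrics, the following sketch computes them from per-example results. The record layout (`base_id`, `correct`) and the function names are assumptions for illustration, and normalized accuracy is taken here to be the ratio of an accuracy to the base-problem accuracy.

```python
from collections import defaultdict


def micro_accuracy(records):
    """Fraction of individual test problems answered correctly."""
    return sum(r["correct"] for r in records) / len(records)


def macro_accuracy(records):
    """Fraction of base problems whose derived problems are ALL answered correctly."""
    by_base = defaultdict(list)
    for r in records:
        by_base[r["base_id"]].append(r["correct"])
    return sum(all(flags) for flags in by_base.values()) / len(by_base)


def normalized_accuracy(accuracy, base_accuracy):
    """Accuracy relative to the method's accuracy on the clean base problems."""
    return accuracy / base_accuracy


# Hypothetical results: two base problems with two derived problems each.
records = [
    {"base_id": 0, "correct": True},
    {"base_id": 0, "correct": True},
    {"base_id": 1, "correct": True},
    {"base_id": 1, "correct": False},
]
print(micro_accuracy(records))  # 0.75
print(macro_accuracy(records))  # 0.5
```

In this toy example a single wrong prediction lowers micro accuracy by a quarter but removes an entire base problem from the macro count, which is why macro accuracy is the stricter measure of consistency.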

In the following section, we review the investigated prompting techniques (§ 4.1), present the formats of our prompts (§ 4.2), and introduce instructed prompting (§ 4.3).

Chain-of-thought prompting (CoT; Wei et al., 2022) is a prompting technique that guides the language models to solve a problem in a step-by-step manner. By presenting exemplars that solve the corresponding problems with intermediate reasoning steps in the prompts, CoT significantly improves the reasoning performance over direct answer prediction without such intermediate reasoning steps.

Zero-shot chain-of-thought prompting (0-CoT; Kojima et al., 2022) is a variation of CoT where the prompt does not contain any exemplar. Instead, the model is prompted directly with the problem of interest followed by the instruction “Let’s think step by step:”.

Least-to-most prompting (LtM; Zhou et al., 2022) teaches language models to (1) break down a problem into subproblems, and (2) solve those subproblems sequentially using CoT. The final answer is that to the last subproblem.

Program prompts (Program; Chowdhery et al., 2022) represent the arithmetic reasoning process as a program. Following prior work on solving GSM8K problems with code (Chowdhery et al., 2022; Gao et al., 2022; Chen et al., 2022), we include a Python program as the problem solution in the prompt, and execute the generated Python code using an external Python interpreter to obtain the final answer.

Self-consistency (SC; Wang et al., 2022c; Shi et al., 2022a) may further boost the reasoning performance by marginalizing over intermediate reasoning steps that share the same final result. In practice, SC can be implemented by (1) sampling several solutions from the large language model and (2) taking the majority vote. Note that SC is orthogonal to the above techniques, and can be combined with any of them.
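To make the combination of techniques concrete, the sketch below pairs program prompting with self-consistency: each sampled program is executed, and the final answers are majority-voted. It is an illustration under stated assumptions, not the authors' implementation: the sampled programs are hard-coded stand-ins for model outputs, the convention that a program stores its result in a variable named `answer` is hypothetical, and ties are broken at random.

```python
import random
from collections import Counter


def run_program(program):
    """Execute a generated Python solution and return its `answer` variable.

    Assumption: the program stores its result in a variable named `answer`.
    Model-generated code should only be executed in a sandbox.
    """
    namespace = {}
    try:
        exec(program, namespace)
    except Exception:
        return None  # a crashing program yields no answer
    return namespace.get("answer")


def majority_vote(answers, rng=None):
    """Return the most frequent answer, ignoring None; ties are broken at random."""
    if rng is None:
        rng = random.Random(0)
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    top = max(counts.values())
    tied = [a for a, c in counts.items() if c == top]
    return rng.choice(tied)


# Hypothetical sampled programs for one problem (stand-ins for model outputs).
samples = [
    "elsa = 5\nanna = elsa + 2\nanswer = elsa + anna",
    "elsa = 5\nanna = elsa + 2\nanswer = elsa + anna",
    "elsa = 5\nanna = 2\nanswer = elsa + anna",  # a wrong sample
]
prediction = majority_vote([run_program(p) for p in samples])
print(prediction)  # 12
```

Voting on final answers rather than on full reasoning chains is what lets different derivations that agree on the result reinforce each other.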

Prompt Design

We present some example prompts used in our experiments (Figure 2). For few-shot prompting techniques (i.e., CoT, LtM and Program), the input prompt includes exemplar problems and their solutions before the problem of interest. To keep the prompts simple and avoid over-fitting in prompt engineering, we follow Zhou et al. (2022) on exemplar creation; that is, we only use one simple exemplar for our main experiments. This exemplar is either based on the [Original Problem] or the [Problem with Irrelevant Context], which allows us to investigate the effect of irrelevant information in the prompt exemplar. For 0-CoT, we adhere to Kojima et al. (2022) and directly present the problem of interest followed by “A: Let’s think step by step:”.

Instructed Prompting

In addition to presenting irrelevant information in the exemplars, we also investigate whether natural language instructions help language models ignore irrelevant context and become less distracted. Extending the line of work (Suzgun et al., 2022; Sanh et al., 2021; Ouyang et al., 2022) that includes a general task description before exemplars, we add the sentence “Solve grade school math problems. Feel free to ignore irrelevant information given in the questions.” before our exemplars in the prompt (Figure 2), which explicitly instructs the language model to ignore irrelevant information in the problem description.
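A prompt of this kind can be assembled as plain text. The sketch below is illustrative only: the `Q:`/`A:` layout, the helper names, and the exemplar problem are assumptions, while the instruction sentence and the 0-CoT trigger phrase are quoted from the text.

```python
INSTRUCTION = (
    "Solve grade school math problems. "
    "Feel free to ignore irrelevant information given in the questions."
)


def build_few_shot_prompt(question, exemplar, instructed=False):
    """Build a one-exemplar prompt, optionally preceded by the instruction."""
    exemplar_question, exemplar_solution = exemplar
    parts = []
    if instructed:
        parts.append(INSTRUCTION)
    parts.append(f"Q: {exemplar_question}\nA: {exemplar_solution}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def build_zero_shot_cot_prompt(question):
    """Build a 0-CoT prompt: no exemplar, only the step-by-step trigger."""
    return f"Q: {question}\nA: Let's think step by step:"


# Hypothetical exemplar and test question, for illustration only.
exemplar = (
    "Elsa has 5 apples. Anna has 2 more apples than Elsa. "
    "How many apples do they have together?",
    "Anna has 2 more apples than Elsa, so Anna has 2 + 5 = 7 apples. "
    "Elsa and Anna have 5 + 7 = 12 apples together. The answer is 12.",
)
question = (
    "Tom has 3 pens. He buys 4 more pens. "
    "The height of Ada is 6 feet. How many pens does Tom have now?"
)
prompt = build_few_shot_prompt(question, exemplar, instructed=True)
print(prompt)
```

Swapping the exemplar for one that itself contains an irrelevant sentence, or toggling `instructed`, gives the prompt variants compared in the experiments.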

We compare the performance of different prompting techniques on GSM-IC-4K (§ 5), in terms of both micro and macro accuracies, as well as their corresponding normalized accuracies. Overall, we observe a significant performance drop for both models with all prompting techniques. The drop in macro accuracy is especially large, showing that fewer than 30% of the base problems are consistently solved after adding distractors. Comparing the results of the two models, text-davinci-003 achieves better normalized micro accuracy than code-davinci-002, though its macro accuracy is mostly worse. In Figure 3, we present a GSM-IC-4K example where a single irrelevant sentence causes different types of errors in the investigated prompting techniques. One common error type is wrongly using the number in the irrelevant sentence, as shown in the LtM prediction and other examples in Appendix B. Even if the model does not directly use the irrelevant number for numerical calculation, the presence of the irrelevant sentence in the reasoning steps alone can still cause a wrong prediction, as shown in the CoT prediction.

LtM is generally the most robust technique to irrelevant context. In terms of micro accuracy, LtM outperforms all other prompting methods across models. Using code-davinci-002, LtM achieves about double the macro accuracy of CoT. Interestingly, with text-davinci-003, although LtM outperforms CoT on micro accuracy, its macro accuracy is lower. Specifically, text-davinci-003 is highly susceptible to irrelevant context with role overlap; e.g., such irrelevant sentences decrease the macro accuracy to 0 on problems with more than 2 reasoning steps. See Table 5.2 for the breakdown performance on different types of irrelevant context.

Selecting exemplars with distractors mitigates the distractibility. For few-shot prompts, we find that using exemplars with distractors (i.e., including problems with irrelevant context) consistently outperforms using the original exemplars without distractors across prompting techniques. While prior work has shown that training and fine-tuning with different types of problems improves model robustness (Li et al., 2022), our results show that prompting with exemplars that demonstrate how to ignore irrelevant context also results in significant robustness improvement. In § 5.3, we further show that using exemplars with distractors does not cause a performance drop on the original GSM8K dataset, indicating that such a prompt design can be beneficial in achieving better accuracy and robustness simultaneously.

Self-consistency significantly reduces the distractibility. Taking the majority vote from 20 samples (if there is a tie, we take a random top-tier result for evaluation, following Wang et al., 2022c and Shi et al., 2022a), SC improves the overall micro accuracy by more than 11 percentage points. This means that in addition to improving model performance on clean arithmetic reasoning tasks (Wang et al., 2022c), SC also substantially reduces the distractibility of large language models to irrelevant context. The gain on micro accuracy is notably large on 0-CoT (35.5 percentage points). Furthermore, the correct answer for 99.7% of the problems is in the 20 sampled answers for both CoT and LtM. Even for 0-CoT, the recall of correct solutions within 20 samples is 96.5%. Despite these improvements, the best macro accuracy among all prompting techniques is only $45\%$ , suggesting that for more than half of the base problems, SC fails to prevent the model from being distracted by different variants of irrelevant information. These results imply that a better algorithm may be developed to further reduce the distractibility based on a few sampled solutions.
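The recall figure quoted above measures whether the correct answer appears anywhere among a problem's sampled answers, regardless of the vote. A minimal sketch of that computation follows; the function name and the toy data are hypothetical.

```python
def recall_within_samples(sampled_answers, gold_answers):
    """Fraction of problems whose gold answer appears among the sampled answers."""
    hits = sum(
        gold in samples
        for samples, gold in zip(sampled_answers, gold_answers)
    )
    return hits / len(gold_answers)


# Hypothetical data: three problems with four sampled answers each.
sampled_answers = [
    [12, 12, 7, 12],  # majority vote and recall both succeed
    [8, 8, 8, 5],     # vote picks 8, but the gold answer 5 was sampled
    [3, 3, 4, 4],     # gold answer 9 never appears
]
gold_answers = [12, 5, 9]
print(recall_within_samples(sampled_answers, gold_answers))
```

The second problem shows the gap the paragraph points to: the correct answer is present in the samples, yet majority voting does not select it.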

We have shown that using exemplars with distractors improves robustness to irrelevant context. We also compare the performance of instructed prompting and that of the prompts without instructions in § 5. Adding instructions to CoT, LtM, and Program consistently improves their performance. Surprisingly, instructed prompting with original exemplars reaches comparable or even better performance than uninstructed prompting that uses exemplars with distractors for both CoT and LtM. Note that adding the instruction “Solve grade school math problems.” alone does not significantly improve the performance, and it is the instruction “Feel free to ignore irrelevant information given in the questions.” that makes the difference. Similar to the instruction “Let’s think step by step.” employed by 0-CoT, this shows that language models are, to some extent, able to follow natural language instructions in a way that dramatically changes their problem-solving behavior, suggesting that such instructions may be useful for guiding the behavior of language models on more tasks.

We use the CoT and LtM prompts from Zhou et al. (2022) as the baselines, and we evaluate the prompt variants with the instruction “Solve following questions. Feel free to ignore irrelevant information given in the questions.” added before the exemplars. Note that by adding a problem reduction step in the exemplar solution, the least-to-most prompt implicitly leads the model to come up with relevant subproblems to solve the given problem. Again, we observe that the instruction consistently improves the performance of both CoT and LtM prompting (Table 7).

In this work, we introduce GSM-IC, a dataset that supports comprehensive study of the distractibility of large language models when performing arithmetic reasoning in the presence of irrelevant context. We examine a variety of prompting techniques on GSM-IC, and demonstrate that they are all sensitive to the irrelevant information in the problems. Among the studied techniques, self-consistency (Wang et al., 2022c) leads to a substantial improvement in robustness to irrelevant context across the board, and presenting example problems with irrelevant context in the prompt also consistently improves the performance. Similarly, we find that simply adding an instruction to ignore irrelevant information brings notable performance gains on our benchmark.

Despite the improvement achieved by these methods, the fundamental issue remains: a single piece of irrelevant information can distract the models and substantially degrade their performance, even on problems whose clean versions they correctly solve. We encourage researchers to also prioritize improving on this fundamental limitation when developing new training and prompting techniques. We leave further investigation on the distractibility for other tasks and different language models for future work.

We would like to thank Dale Schuurmans, Olivier Bousquet and Jack Nikodem for helpful discussion and feedback.

Appendix A GSM-IC Details

Each of the 100 base problems requires two to seven steps to solve (Figure 5).

Starting from the base problems, we follow the protocols below to create GSM-IC (§ 3.1).

For in-topic sentences, we manually write templates within a topic that is close to the original problem description. We are particularly careful about shareable items; for example, money is sometimes considered shareable between family members. In such cases, we make sure that the added sentences do not change the amount of the shareable items, to ensure that the final standard answer is not affected.

For off-topic sentences, we use general templates (Table 9) for all problems unless some of them can be considered as in-topic sentences for some problems—for example, the sentence “The height of {role} is {number} feet.” is considered as an in-topic sentence for problems about heights of people.

We make sure that all sentences derived by each template are grammatical English sentences.

We write four in-topic and choose four off-topic distractor sentence templates for each problem.

We randomly choose a role name X, and use X’s father, X’s mother, X’s brother, X’s sister and X’s neighbor as the overlapped role names.

We choose from the name set {Ada, David, Emma, Jack, John, Mary, Max, Tom} for non-overlapped role names.

We write five names that have overlap with the original character, and five names that do not have overlap for each problem.

We write four in-range numbers and four out-of-range numbers for each problem.

Finally, if adding the irrelevant sentence causes ambiguity (e.g., Table 10), we fix the question to ensure that the standard solution to the generated problem remains the same as that of the base problem.

Appendix B Sample Predictions on GSM-IC

In addition to the example outputs shown in Figure 3, we include more example problems and the predictions by different techniques (Tables 11 and 12).

Appendix C Full prompts in experiments

We list the prompts for all experiments in Tables 13 and 14.