Symbol tuning improves in-context learning in language models

Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, Quoc V. Le

cs.CL

Introduction

A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via few-shot examples given in-context (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023, inter alia). Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering (Brown et al., 2020; Reynolds & McDonell, 2021) or phrasing tasks as instructions (Wei et al., 2022a; Ouyang et al., 2022; Sanh et al., 2022, inter alia), and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown in-context exemplars with random labels (Min et al., 2022b) or flipped labels (Wei et al., 2023).

In this paper, we propose a simple finetuning procedure that we call symbol tuning, which significantly improves the ability of language models to reason with and learn from input–label mappings presented in-context. In the symbol-tuning procedure, we finetune language models on input–label pairs presented in-context where natural language labels are remapped to arbitrary symbols. We call our method symbol tuning because arbitrary designation is a key property of symbols (Newell & Simon, 1976), and manipulating symbols is a crucial part of intelligence (Newell, 1980; Santoro et al., 2021). The intuition is that when a model cannot rely on instructions or relevant natural language labels to figure out a given task, it must instead do so by reasoning with input–label mappings in-context in order to learn the mappings that reveal the task. We perform symbol tuning using a mixture of 22 NLP datasets with various arbitrary symbols as labels and experiment using several Flan-PaLM models (Chung et al., 2022, 8B, 62B, 62B-cont, 540B).

First, symbol tuning improves performance of baseline models on unseen in-context learning tasks across various settings (with/without instructions, with/without relevant labels), with larger performance gains when instructions or natural language labels are not given in the prompt. For example, when prompts do not contain instructions or relevant labels, symbol tuning yields a +11.1% average performance improvement across eleven evaluation tasks for Flan-cont-PaLM-62B.

Second, symbol-tuned models are better at algorithmic reasoning tasks, a striking result since the symbol-tuning data includes only natural language data and no numerical or algorithmic data. On a set of reasoning evaluation suites for list functions (e.g., remove the last element in a list), symbol-tuned models experience performance improvements of +18.2% for Flan-PaLM-8B, +11.1% for Flan-PaLM-62B, and +3.6% for Flan-PaLM-540B. On a set of turing concept tasks (e.g., swapping 0s and 1s in a string), symbol-tuned models also improve by +15.3% for Flan-PaLM-8B and Flan-PaLM-62B and +4.7% for Flan-PaLM-540B.

Additionally, we experiment on an in-context learning setting where inputs have flipped labels, which forces the model to override its prior knowledge when presented with contradictory information in-context. Pretrained language models have the ability to somewhat follow flipped labels—this ability is lost during instruction tuning but can be restored via symbol tuning.

Finally, we conduct ablation studies demonstrating that symbol tuning is simple to implement and only requires a relatively-small amount of compute. Symbol tuning does not require mixing instruction-tuning data or collecting a large number of datasets, and only 1k to 2k steps of tuning are needed to get its benefits. Overall, we hope that the strong empirical results from symbol tuning encourage further work in allowing language models to reason over arbitrary symbols given in-context.

Symbol tuning

Despite their ability to perform some reasoning tasks after being shown in-context exemplars (Chowdhery et al., 2022; OpenAI, 2023), language models are still sensitive to the way in which these tasks are presented in prompts (Brown et al., 2020; Reynolds & McDonell, 2021; Wei et al., 2022a), suggesting that they are not reasoning in a robust way. Instruction tuning has been shown to improve performance and allow models to better follow in-context exemplars (Mishra et al., 2022; Min et al., 2022a; Wei et al., 2022a; Ye et al., 2021; Chung et al., 2022). One shortcoming, however, is that models are not forced to learn to use the exemplars because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, in the left-hand side of Figure 1, although the exemplars can help the model understand the task, they are not strictly necessary since the model could ignore the exemplars and just read the instruction.

To make the model better at in-context learning, we propose symbol tuning, in which the model is finetuned on exemplars where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., “Foo,” “Bar,” etc.). In this setup, the task is unclear without looking at the in-context exemplars. For example, if the prompt from the previous paragraph was changed to an input followed only by “Answer: {Foo, Bar}” (as shown in the right-hand side of Figure 1), multiple in-context exemplars would be needed in order to figure out the task. Because symbol tuning teaches the model to reason over the in-context exemplars, symbol-tuned models should have much better performance on unseen tasks that require reasoning between in-context exemplars and their labels.

Experimental setup

Figure 2 shows the 22 publicly-available NLP datasets from HuggingFace (Lhoest et al., 2021) (see Section B.1 for dataset details) that we use for our symbol-tuning procedure (we ablate the number of datasets used for symbol tuning in Section 7.3). We selected NLP tasks that have been widely used in the literature (Wang et al., 2018; 2019). Each dataset is categorized into one of seven task types—we only selected classification-type tasks because symbol tuning requires discrete labels. For each dataset, we use examples from the training split to compose prompts that we use for tuning. Each prompt uses a randomly-selected input–label format (formats are shown in Section C.2) and contains a randomly-selected number between 2 and 10 of in-context exemplars per class. We remap labels to a randomly-selected label from a set of $\sim$ 30k labels from three label types as shown in Figure 3 (we ablate the number of labels in Section A.6 and the label types in Section A.7). Examples of generated tuning prompts for each task are shown in Section E.1.
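A minimal sketch of this prompt construction is given below, assuming a simple "Input/Output" template and a toy symbol pool. The function name `build_tuning_prompt`, the template, and the choice to give each class a distinct symbol are illustrative assumptions; the actual formats are those in Section C.2 and the actual label pool is the one shown in Figure 3.

```python
import random

def build_tuning_prompt(examples, symbol_pool, rng, min_k=2, max_k=10):
    """Compose one symbol-tuning prompt from (text, label) pairs.

    examples: list of (text, natural_language_label) tuples from one dataset.
    symbol_pool: list of arbitrary strings used as replacement labels.
    """
    classes = sorted({label for _, label in examples})
    # Assumption: each class receives a distinct arbitrary symbol.
    symbols = rng.sample(symbol_pool, len(classes))
    remap = dict(zip(classes, symbols))

    # Number of in-context exemplars per class, between 2 and 10 inclusive.
    k = rng.randint(min_k, max_k)
    # Hold out one example as the target whose label must be predicted.
    target_text, target_label = rng.choice(examples)

    exemplars = []
    for c in classes:
        candidates = [t for t, l in examples if l == c and t != target_text]
        for text in rng.sample(candidates, min(k, len(candidates))):
            exemplars.append((text, remap[c]))
    rng.shuffle(exemplars)

    # Hypothetical template; note that no task instruction is included.
    lines = [f"Input: {t}\nOutput: {s}" for t, s in exemplars]
    lines.append(f"Input: {target_text}\nOutput:")
    return "\n\n".join(lines), remap[target_label], remap
```

Because the natural language labels never appear in the prompt, the only way to recover the task in this sketch is to read the exemplars and their symbols.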

3.2 Evaluation tasks

We want to evaluate a model’s ability to perform on unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8k tasks). Hence, we choose 11 NLP datasets from HuggingFace (Lhoest et al., 2021) that were not used in either stage of finetuning (details are shown in Section B.2): (Conneau & Kiela, 2018, SUBJ); (Basile et al., 2019, TEH); (Mohammad et al., 2016, TEAB); (Mohammad et al., 2016, TEAT); (Mohammad et al., 2016, TEFE); (Mohammad et al., 2016, TEHI); (Alex et al., 2021, ADEC); (Alex et al., 2021, OR); (Alex et al., 2021, SOT); (Alex et al., 2021, TOS); and (Alex et al., 2021, TC). We use the validation split of each dataset to generate evaluation prompts. For each dataset, we randomly select a maximum of 100 examples to use during evaluation. Each evaluation prompt uses a randomly-selected input–label format following Section 3.1, though we fix the number of in-context exemplars per class at $k=4$ (we ablate this parameter in Section A.5).

We generate prompts for the four different in-context learning (ICL) settings described in Figure 4; each setting either contains or does not contain instructions describing the task (see Section B.2 for the instructions we use for each task) and does or does not contain relevant natural language labels. For settings that do not use relevant natural language labels, we remap original labels to a randomly-selected label from a set of approximately 270k semantically-unrelated labels as shown in Figure 3 (we removed labels that were seen during symbol tuning). Examples of generated evaluation prompts for each task are shown in Section E.2.
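The four settings can be viewed as the cross product of two binary choices. The sketch below illustrates this under assumed names (`render_eval_prompt`, `SETTINGS`) and an assumed "Input/Output" template; it is not the exact prompt format used in our evaluation.

```python
from itertools import product

def render_eval_prompt(exemplars, query, instruction, remap,
                       use_instruction, use_relevant_labels):
    """Render one evaluation prompt for a given ICL setting.

    exemplars: list of (text, natural_language_label) pairs.
    remap: dict from natural language label to an unrelated symbol.
    """
    def show(label):
        # Keep the original label, or swap in its unrelated symbol.
        return label if use_relevant_labels else remap[label]

    parts = [instruction] if use_instruction else []
    parts += [f"Input: {t}\nOutput: {show(l)}" for t, l in exemplars]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Each setting is a pair (use_instruction, use_relevant_labels).
SETTINGS = list(product([True, False], repeat=2))
```

In this sketch, the setting `(False, False)` corresponds to the hardest case, where neither instructions nor relevant labels are available.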

3.3 Models & finetuning procedure

For our experiments, we tune Flan-PaLM (Chung et al., 2022), the instruction-tuned variants of PaLM (Chowdhery et al., 2022). We use instruction-tuned variants in order to reduce the number of steps needed for tuning, since symbol tuning an instruction-tuned model does not require relearning the information learned during the original round of instruction tuning. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Chowdhery et al., 2022, PaLM-62B at 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.

Our symbol-tuning pipeline mixes all datasets and randomly samples from each dataset. To ensure that the dataset sizes are balanced (i.e., no dataset gets completely overshadowed), we limit the number of training examples per dataset to a maximum of 25k randomly-selected examples. Training examples are combined into a single sequence using packing (Raffel et al., 2020), and inputs are separated from labels using an end-of-sequence (EOS) token. We tune all models using a batch size of 32 and the Adafactor optimizer (Shazeer & Stern, 2018). For 8B and 62B models, we tune with a learning rate of $3\times 10^{-3}$, and we tune Flan-PaLM-540B with a learning rate of $1\times 10^{-3}$. We use 2048 and 512, respectively, as the input and target sequence lengths during tuning.
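The per-dataset cap can be illustrated with a short sketch. The function name `cap_and_mix` and the pool-then-shuffle strategy are assumptions made for illustration; packing and the sampling details of the actual pipeline are not modeled here.

```python
import random

MAX_PER_DATASET = 25_000  # cap stated in the text

def cap_and_mix(datasets, rng, cap=MAX_PER_DATASET):
    """Cap each dataset at `cap` random examples, then pool and shuffle.

    datasets: dict mapping dataset name to a list of training examples.
    """
    mixture = []
    for name, examples in datasets.items():
        # Small datasets are kept whole; large ones are subsampled.
        kept = examples if len(examples) <= cap else rng.sample(examples, cap)
        mixture.extend((name, ex) for ex in kept)
    rng.shuffle(mixture)
    return mixture
```

With the cap in place, a very large dataset contributes at most 25k examples, so it cannot dominate the smaller datasets in the mixture.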

Symbol tuning for 1k steps on a TPUv4 (Jouppi et al., 2023) requires approximately 16 minutes with 64 chips for Flan-PaLM-8B, 70 minutes with 128 chips for Flan-PaLM-62B, and 6 hours with 512 chips for Flan-PaLM-540B. For 8B and 62B model evaluations, we report results from the checkpoint after tuning for 4k steps, and for 540B model evaluations, we report results from the checkpoint after tuning for 1k steps (we ablate the number of tuning steps in Section 7.1). See Section C.3 for the number of finetuning steps, learning rate, batch size, and dropout used for each model. As a baseline, we compare our symbol-tuned models against the instruction-tuned models from Chung et al. (2022), and we also compare symbol tuning against continued instruction tuning in Section A.1.
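To put these figures on a common scale, the wall-clock times can be converted into chip-hours per 1k steps. The snippet below is simple arithmetic over the numbers reported above; the helper name `chip_hours` is an assumption, and the conversion assumes that cost is just time multiplied by chip count.

```python
# (minutes per 1k steps, number of TPUv4 chips), as reported in the text.
runs = {
    "Flan-PaLM-8B": (16, 64),
    "Flan-PaLM-62B": (70, 128),
    "Flan-PaLM-540B": (360, 512),
}

def chip_hours(minutes, chips):
    """Convert wall-clock minutes on `chips` chips into chip-hours."""
    return minutes / 60 * chips

for model, (minutes, chips) in runs.items():
    print(f"{model}: {chip_hours(minutes, chips):.1f} chip-hours per 1k steps")
```

This gives roughly 17.1, 149.3, and 3072.0 chip-hours per 1k steps for the 8B, 62B, and 540B models, respectively.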

Symbol-tuned models are better in-context learners

In the symbol-tuning procedure, models must learn to reason with in-context exemplars in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from natural language labels or instructions. Symbol-tuned models should thus perform better in settings where tasks are unclear and require reasoning between in-context exemplars and their labels. Additionally, since symbol tuning is meant to improve the ability to follow in-context exemplars, it should not modify prior knowledge and should thus retain the same performance in settings where exemplars are not as necessary to complete the task.

To explore these settings, we define four ICL settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions/relevant labels), as shown in Figure 4. The easiest of these settings uses prompts where both instructions and relevant labels are available (as in-context exemplars are not necessary to learn the task), while the hardest setting uses prompts where instructions and relevant labels are both unavailable.

In Table 1, we evaluate model performance before and after symbol tuning in each of these settings. We find that symbol tuning improves performance across all ICL settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms Flan-PaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on learning input–label mappings from exemplars (effectively saving $\sim$10x inference compute).

Symbol-tuned models also perform somewhat-comparably in settings with only relevant labels or only instructions, unlike baseline models whose performance in settings with only relevant labels is always better than in settings with only instructions. Performance in settings with relevant labels actually decreases for Flan-PaLM-8B after symbol-tuning, however, which may suggest that symbol tuning a small model can override its prior knowledge due to overfitting. Overall, the improvements demonstrate the strong potential of symbol tuning to improve model performance, especially when tasks are not clear and require learning from in-context exemplars.

Symbol tuning improves algorithmic reasoning

Symbol tuning is designed to force the model to learn from input–label mappings in the in-context exemplars because the symbols are unrelated to the task and no instructions are provided (and thus the model cannot rely on any other guidance to determine the task). For this reason, we posit that symbol tuning should not only improve the model’s ability to map natural language inputs to arbitrary symbols, but also its ability to learn other forms of inputs–label mappings such as algorithms.

To test this, we experiment on algorithmic reasoning tasks from BIG-Bench (Srivastava et al., 2022). We first experiment on a set of list function tasks (Rule et al., 2020; Srivastava et al., 2022) where the model needs to identify a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers. These tasks were evaluated in a four-shot setting, following our evaluation setup in Section 3.2. Additionally, we test models on a set of simple turing concepts (Telle et al., 2019; Srivastava et al., 2022) where models need to reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string). These tasks have predetermined shots for each evaluation example. We selected these algorithmic tasks because they test the model’s ability to generalize to different task types (the symbol-tuning tasks were classification problems with discrete labels, while these tasks are more open-ended generation problems) and do not require world knowledge (symbol tuning does not increase prior knowledge).
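The two example transformations mentioned above can be written out concretely. The sketch below is only an illustration of the task style: the helper names and the "Input/Output" prompt template are assumptions, not the BIG-Bench task format.

```python
def remove_last(xs):
    """List function example from the text: drop the final element."""
    return xs[:-1]

def swap_bits(s):
    """Turing concept example from the text: swap 0s and 1s in a string."""
    return s.translate(str.maketrans("01", "10"))

def few_shot_prompt(fn, shots, query):
    """Format input/output pairs produced by a hidden transformation `fn`."""
    lines = [f"Input: {x}\nOutput: {fn(x)}" for x in shots]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# A four-shot prompt for the hidden function "remove the last element".
prompt = few_shot_prompt(remove_last, [[1, 2], [3, 4, 5], [6], [7, 8, 9, 0]], [4, 2])
```

As in the symbol-tuning prompts, nothing in such a prompt names the transformation; it has to be inferred from the input–output pairs alone.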

In Figure 5, we show model performance on the twenty list function tasks with the highest human accuracy baselines (Rule, 2020), separated into five categories (category details are described in Section D.1), and on the turing concepts containing 3 or fewer instructions in the AS II subset of the simple turing concepts task. We do not directly compare with the human baselines because our evaluation format was different. On the list function tasks, symbol tuning results in an average performance improvement across all tasks of 18.2% for Flan-PaLM-8B, 11.1% for Flan-PaLM-62B, 15.5% for Flan-cont-PaLM-62B, and 3.6% for Flan-PaLM-540B. On the turing concept tasks, symbol tuning results in a performance improvement of 15.3% for Flan-PaLM-8B and Flan-PaLM-62B, 14.1% for Flan-cont-PaLM-62B, and 4.7% for Flan-PaLM-540B. Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks (in terms of average accuracy across tasks), which is equivalent to a $\sim$10x reduction in inference compute. These improvements on an unseen task type suggest that symbol tuning indeed strengthens the model’s ability to learn in-context, as the symbol-tuning procedure did not include any algorithmic data and only used natural language data.

Symbol-tuned models can override priors via flipped labels

Wei et al. (2023) showed that while pretrained language models (without instruction tuning) could, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability. Symbol tuning, on the other hand, forces models to consider the label presented in-context as an arbitrary symbol, which should reduce the model’s usage of prior knowledge that contradicts the flipped labels. For this reason, we expect that symbol tuning would be able to improve and restore the ability to follow flipped labels in-context.

To test this, we flip the labels of both in-context exemplars and the evaluation example for the tasks described in Section 3.2 (we remove tasks with more than two labels from this experiment since it is unclear how to best “flip” more than two labels). For example, for the SST2 dataset, all exemplars that are labeled as having “positive” sentiment will now be labeled as having “negative” sentiment. A perfect model that can follow these flipped labels should achieve 100% accuracy on these tasks if its accuracy on the standard in-context learning setting is also 100%.
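The flipping and scoring logic for a binary task can be sketched as follows. The function names `flip_labels` and `flipped_accuracy` are assumptions for illustration; the key point is that predictions are scored against the flipped targets.

```python
def flip_labels(pairs, label_a, label_b):
    """Swap the two labels of a binary task for every (text, label) pair."""
    swap = {label_a: label_b, label_b: label_a}
    return [(text, swap[label]) for text, label in pairs]

def flipped_accuracy(predictions, original_targets, label_a, label_b):
    """Score predictions against the flipped versions of the original targets."""
    swap = {label_a: label_b, label_b: label_a}
    correct = sum(p == swap[t] for p, t in zip(predictions, original_targets))
    return correct / len(original_targets)
```

Under this scoring, a model that ignores the exemplars and always answers from its prior knowledge scores 0% whenever its prior is correct, which is why accuracy far below random guessing signals reliance on priors rather than on the in-context labels.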

As shown in Figure 6, symbol tuning restores the ability to follow flipped labels that was lost during instruction tuning. We see that there is a similar trend across all model sizes: instruction-tuned models are generally unable to follow flipped labels (as demonstrated by their performance being far below random guessing), but symbol-tuned models are much more capable of doing so. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. For some datasets (e.g., OR, SUBJ, TC), symbol-tuned models can now override priors and follow flipped labels (i.e., achieve much better performance than random guessing), despite instruction-tuned models not being able to do so for any datasets. Additionally, symbol-tuned models achieve average performance similar to or better than that of pretraining-only models, indicating that symbol tuning has, to some extent, restored the model’s original ability to follow flipped labels.

These results further indicate another type of generalized in-context learning capability, as we did not include any flipped labels during symbol tuning. Although the performance improvement from symbol tuning is large, we note that more work should be done in this area since performance on the flipped-labels settings is, on average, not significantly better than random guessing.

Ablation studies

A question that may come to mind is how many steps of finetuning are needed to get the benefits of symbol tuning. In particular, Chung et al. (2022) performed instruction tuning on PaLM models for 40k steps for PaLM-8B and PaLM-62B, 21k steps for PaLM-540B, and 60k steps for cont-PaLM-62B, so it is unclear if symbol tuning would require such extensive tuning. Intuitively, however, since our symbol-tuning dataset is much smaller than the tuning data from Chung et al. (2022), symbol tuning should require fewer steps for finetuning than instruction tuning does. To analyze this, we examine model performance in each of the four ICL settings from Figure 4 with respect to the number of steps tuned. We train 8B and 62B models for up to 10k steps and 540B models for up to 5k steps, and we evaluate checkpoints every 1k steps on the same evaluation tasks and settings from Section 4.

We show these results in Figure 7. As expected, we see that symbol tuning does not require many steps of finetuning for any model. Moreover, the largest changes in performance occur within the first 1k to 2k steps of symbol tuning, after which model performance stays relatively constant. Flan-PaLM-540B also seems to experience performance drops in all settings after 1k steps, which may indicate that larger models require a more-diverse or larger set of symbol-tuning data. These results suggest that symbol tuning does not require extensive compute for exhaustive tuning.

7.2 Mixing instruction-tuning data

In Section 4, we found that small models may actually overfit to the symbol-tuning data, resulting in performance drops in ICL settings where relevant labels are available. One potential way of preventing this is to include instruction-tuning data during symbol tuning. Since instruction-tuning examples contain relevant labels and instructions that match a model’s prior knowledge, they may help reinforce prior knowledge and prevent small models from “forgetting” their priors. We create several mixtures of instruction-tuning data and symbol-tuning data to test this idea. For each mixture, we use varying ratios of instruction-tuning data to symbol-tuning data (e.g., a mixture with 33.3% symbol-tuning data means that instruction-tuning data is weighted twice as heavily as symbol-tuning data). Our instruction-tuning data is directly taken from Chung et al. (2022) and then mixed with our symbol-tuning data from Section 3.1.
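The mixture ratios can be made concrete with a small sketch. The function names and the per-example sampling scheme are assumptions for illustration; only the interpretation of the percentages follows the text.

```python
import random

def mixture_weights(symbol_fraction):
    """Return sampling weights (instruction, symbol) for a given symbol share."""
    if not 0.0 <= symbol_fraction <= 1.0:
        raise ValueError("symbol_fraction must be in [0, 1]")
    return 1.0 - symbol_fraction, symbol_fraction

def sample_source(symbol_fraction, rng):
    """Pick which data source the next training example is drawn from."""
    w_inst, w_sym = mixture_weights(symbol_fraction)
    return rng.choices(["instruction", "symbol"], weights=[w_inst, w_sym])[0]

# A 33.3% symbol-tuning mixture weights instruction data twice as heavily.
w_inst, w_sym = mixture_weights(1 / 3)
```

Setting `symbol_fraction` to 1.0 in this sketch recovers the default setting that uses only symbol-tuning data.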

We then tune models on these mixtures and evaluate their performance (we exclude Flan-PaLM-540B from this ablation study to reduce computational costs). In Figure 8, we show model performance on the ICL settings from Section 4. We find that even a small mixture of symbol-tuning data (e.g., 16%) versus instruction-tuning data can significantly change model performance.

Furthermore, higher proportions of symbol-tuning data after this initial change generally do not significantly affect model performance (Flan-PaLM-8B experiences a performance drop in the settings that include relevant natural language labels, which was also seen in Section 4). These results indicate that, in terms of a model’s ability to succeed in these ICL settings, the proportion of symbol-tuning data used is not important as long as some non-trivial amount of symbol-tuning data is used. As shown in Figure 9, however, the proportion of symbol-tuning data is much more impactful for succeeding in flipped-label settings. We find that there is a strong correlation between a higher mixture of symbol-tuning data and a model’s ability to follow flipped labels, a trend that holds regardless of the size of the model. Combining this result with the trend shown in Figure 9, we propose using only symbol-tuning data as a default setting because it does not significantly decrease model performance (for large-enough models) and because a higher percentage of symbol-tuning data significantly improves the model’s ability to override prior knowledge with in-context exemplars.

7.3 Number of tuning datasets

The overall goal of symbol tuning is to teach models that any arbitrary label for an input–label mapping should be treated as a symbol to be learned. The symbol-tuning procedure should thus only be successful if a diverse-enough set of tasks is shown such that the model can learn to generalize its behavior to new tasks. To test this, we randomly remove a varying number of tasks from the mixture and retune models on these new mixtures (we exclude Flan-PaLM-540B from this ablation study to reduce computational costs). We then evaluate these models on the ICL settings from Section 4.

We show these results in Figure 10. First, we see that as a general trend, using more datasets for symbol tuning improves performance. This effect seems to slightly plateau as more datasets are added, and 62B models benefit more from added datasets than the 8B model does. Second, we find that symbol tuning with a small number of datasets (e.g., only one or two datasets) can hurt performance in settings where relevant labels are available. For example, while symbol tuning using just one dataset can significantly improve performance in settings without relevant labels, it simultaneously decreases model performance in settings where relevant labels are available. These results imply that symbol tuning works best when a large variety of tasks are used, and symbol tuning with only a small number of tasks may result in models that perform worse in settings with relevant labels. Given these results, we note that future work may be needed to investigate the effects of scaling up the symbol-tuning procedure.

Related work

Recent studies on in-context learning suggest that prior knowledge plays a significant role in how models learn in-context. For example, Wei et al. (2023) showed that some small models and instruction-tuned models cannot follow flipped labels presented in-context, suggesting that these models primarily utilize prior knowledge for in-context learning. Min et al. (2022b) found a similar result that using random ground-truth labels in in-context exemplars does not significantly affect performance, meaning that performance may be driven by other factors such as the label space.

Reynolds & McDonell (2021) also showed that cleverly-constructed prompts in a zero-shot setting could outperform prompts in a few-shot setting, implying that, for some tasks, models can achieve better performance by leveraging their existing knowledge than from attempting to learn the task from in-context exemplars. Additionally, in chain-of-thought prompting (Wei et al., 2022b), Madaan & Yazdanbakhsh (2022) and Wang et al. (2022) showed that performance on multi-step reasoning tasks does not decrease when models are provided with logically-incorrect prompts. Raghu et al. (2020) also demonstrated that systems such as MAML can effectively “memorize” labels when trained in a way where all labels can be memorized, which further illustrates that, when possible, models may attempt to use prior knowledge rather than adapt to each new task.

Our findings do not dispute the idea that semantic prior knowledge can provide significant benefits to in-context learning. Indeed, we showed that instruction-tuned models cannot follow flipped labels in-context, which is consistent with the findings from Wei et al. (2023). We instead aim to demonstrate that through symbol tuning, language models can retain the benefits of utilizing prior knowledge while also improving their ability to learn from the input–label pairs shown in the in-context exemplars.

8.2 In-context learning via in-context exemplars

At the same time, however, other recent work has suggested that language models can, in fact, learn in-context using the given exemplars. This ability may be more useful than the ability to use semantic prior knowledge because it would allow models to perform tasks that are not seen in or contradict pretraining data. Garg et al. (2022), for instance, showed that transformers trained from scratch can perform in-context learning on linear-regression tasks at a similar performance level as the least-squares estimator. This capability was shown to result from transformers implementing standard learning algorithms such as gradient descent (Akyürek et al., 2023; von Oswald et al., 2022; Dai et al., 2023). Furthermore, Webson & Pavlick (2022) demonstrated that, in a natural language setting, language models can learn at the same rate during finetuning even when given irrelevant or misleading prompts. On a broader level, Rajendran et al. (2020) and Yin et al. (2020) found that adding noise to, shuffling, or regularizing the label space can make systems better at learning and adapting to new tasks. In this paper, we attempt to improve the degree to which language models are able to learn tasks via input–label mappings. Our symbol-tuning method can be seen as a form of label augmentation and is thus similar to the proposed methods from Rajendran et al. (2020) and Yin et al. (2020), though it differs crucially in that we apply them to tune large language models. We found that symbol-tuned models saw significant improvements in their ability to learn in-context (e.g., on algorithmic tasks or settings with underspecified prompts).

8.3 Tuning language models

Our work presented symbol tuning, a form of finetuning on input–label pairs where labels are remapped to arbitrary symbols. Symbol tuning relates to a broader body of work showing that finetuning language models can significantly alter their behavior and performance in different settings. For example, Wei et al. (2022a) first presented instruction tuning (finetuning on tasks phrased as instructions) and showed that this finetuning procedure substantially improves model performance in zero-shot settings. Chung et al. (2022) further scaled this procedure by adding more tasks, increasing model sizes, and adding chain-of-thought data, demonstrating that, with these changes, tuned models are significantly better at chain-of-thought reasoning, open-ended generation, and several evaluation benchmarks. Our experimental findings match these results, though our work differs by not only focusing on settings with in-context exemplars and underspecified prompts, but also by modifying the tuning procedure to make tasks harder to learn and require additional reasoning with exemplars.

Conclusions

In this paper, we presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based on the intuition that when a model cannot use instructions or relevant labels to determine a presented task, it must do so by instead learning from in-context exemplars. We tuned four language models (Flan-PaLM-8B, Flan-PaLM-62B, Flan-cont-PaLM-62B, and Flan-PaLM-540B) using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30k arbitrary symbols as labels.

Experimentally, we showed that symbol tuning can significantly improve a model’s ability to learn from in-context exemplars in not only natural language settings, but also on algorithmic tasks. First, we showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Moreover, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning. Finally, we demonstrated that symbol tuning does not require extensive compute or complex implementations in order to achieve these improvements.

Through symbol tuning, we aim to increase the degree to which models can examine and learn from input–label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in-context.

References

Appendix

One unanswered question that arises is whether our results come from the symbol-tuning data or whether they come from the additional steps of tuning. To answer this question, we continue tuning Flan-PaLM models using the same instruction-tuning mixture from Chung et al. (2022) for the same number of steps used for symbol tuning (see Section C.3). We then compare these instruction-tuned models with our symbol-tuned models on each reasoning task from Section 5, the flipped-label setting from Section 6, and the ICL settings from Section 4 in Table 2 (we exclude comparisons on the ICL settings with relevant natural language labels because, as shown in Section 4, symbol tuning did not significantly improve performance in these settings).

We find that our symbol-tuned models significantly outperform the models with continued instruction tuning on each of these evaluations. These results suggest that, indeed, the performance improvements on these tasks were not a result of simply tuning the model for more steps. Instead, we conclude that the symbol-tuning data itself is the root cause of the results we observed in this paper.

A.2 Does symbol tuning affect performance on benchmarks?

As shown in Section 4, symbol-tuned models see only minor performance improvements in ICL settings with relevant labels, and small models (e.g., Flan-PaLM-8B) experience performance drops on these settings after symbol tuning. A natural question that follows is whether these differences on our unseen tasks translate to similar differences in well-studied benchmarks, as examples from these benchmarks often contain instructions and relevant labels. In particular, we examine model performance on the MMLU (Hendrycks et al., 2021) and BIG-Bench Hard (Suzgun et al., 2022) benchmarks. For this experiment, we set prompts in a 5-shot setting for MMLU and a 3-shot setting for BIG-Bench Hard, following the settings used in Chung et al. (2022).

In Figure 11, we show model performance on these benchmarks for each symbol-tuned model. We find that small models (i.e., Flan-PaLM-8B) may experience minor performance drops after symbol tuning. This aligns with the result shown in Section 4 and further bolsters the possibility that, after symbol tuning, small models may tend to use prior knowledge less and purely attempt to learn in-context instead. For larger models, on the other hand, symbol tuning only results in performance changes within approximately $\pm 1\%$, indicating relatively consistent performance before and after symbol tuning. This consistent performance is expected, however, as symbol tuning is meant to improve a model’s ability to learn from and reason with in-context exemplars, and models likely do not use in-context exemplars in order to succeed on these benchmarks. (Instruction-tuned models achieve similar performance in zero-shot settings versus few-shot settings on these benchmarks (Chung et al., 2022), suggesting that in-context exemplars are not crucial for completing these tasks.)

A.3 Can symbol tuning improve chain-of-thought reasoning?

One limitation of symbol tuning is that it does not include any data with chain-of-thought (CoT) reasoning (Wei et al., 2022b) since it is unclear how to best replace intermediate steps with symbols. We thus want to examine whether symbol tuning affects chain-of-thought reasoning given its ability to improve in-context learning. To analyze this, we reformat prompts from the two benchmarks in Section A.2 to use chain-of-thought prompting and evaluate all symbol-tuned models. We use the same chain-of-thought prompts that were used in Chung et al. (2022).

We show these results in Figure 12. We find that performance is mostly consistent between symbol-tuned models and their base variants when using CoT prompting. One outlier, however, is that Flan-PaLM-8B experienced a significant drop in CoT performance on BIG-Bench Hard after symbol tuning, though it is unclear why this occurred since it did not experience a drop in CoT performance on MMLU. Other than this outlier, the results are expected, as symbol tuning did not include any CoT prompts and thus should not change a model’s performance in CoT settings.

A.4 Does symbol tuning affect zero-shot performance?

Our setup for symbol tuning does not include any zero-shot examples, as an arbitrary symbol that maps an input to a label cannot be learned without any exemplars. This raises the question of whether symbol tuning would harm a model’s zero-shot performance, especially since we do not mix in any instruction-tuning data during symbol tuning for the reasons stated in Section 7.2. Intuitively, symbol tuning should not affect zero-shot performance because it should modify a model’s ability to learn in-context and not its prior knowledge (which is what would primarily be used in zero-shot settings). To test this, we evaluate the models on the MMLU benchmark (Hendrycks et al., 2021) with prompts reformatted to a zero-shot setting.

In Figure 13, we compare each of our symbol-tuned models’ performance on zero-shot MMLU against that of their respective Flan-PaLM model. We find that performance is somewhat consistent after symbol tuning. Symbol-tuned models saw a maximum decrease in performance of 1.7%, though we note that this difference is not sufficiently large to conclude that symbol tuning reduces zero-shot performance due to the variance within the evaluation. For example, continuing instruction tuning on Flan-PaLM-8B for 1k steps reduces MMLU 5-shot performance from 49.5% to 47.2%, and continuing for another 1k steps improves performance back to 49.0%, which may indicate that for these benchmarks, small differences in performance are not enough to suggest an actual reduction or improvement in a model’s true performance. For this reason, we posit that the zero-shot performance before and after symbol tuning is relatively consistent for all base models, though we note that there is some ambiguity in this conclusion due to the variance in the performance metric.

A.5 Do symbol-tuned models require fewer in-context exemplars?

In Section 4, we showed that symbol-tuned models perform much better than Flan-PaLM models in difficult ICL settings without relevant labels. Our evaluations, however, were all in a setting using four in-context exemplars per class, making it unclear how symbol-tuned models perform relative to baselines when there are fewer or more in-context exemplars that the model can use. Intuitively, symbol tuning should be more effective when there are fewer in-context exemplars available, as having fewer exemplars makes it more difficult to identify the task (and we already showed in Section 4 that symbol-tuned models are better in ICL settings where the task is unclear).

To investigate this, we regenerate evaluations using the same process as described in Section 3.2, except we vary the number of in-context exemplars per class. (If a dataset does not have enough examples to create a prompt with a particular number of in-context exemplars, we exclude that dataset from the evaluation for that number of in-context exemplars.) We then test models on the hardest ICL setting from Section 4 in order to study how instruction-tuned and symbol-tuned models behave relative to the number of available exemplars. These results are shown in Figure 14. We find that the performance difference between symbol-tuned models and their base variants is relatively consistent in all settings except when there is only one in-context exemplar per class. In this setting, symbol-tuned models perform much better than base models, and this trend is consistent across all of our tested models. We posit that this could be a result of the Flan-PaLM models not recognizing that arbitrary symbols are meant to be used as labels (which is implied because they perform significantly worse than random guessing), while symbol-tuned models have already learned that arbitrary symbols can be used as labels. These results suggest that in ICL settings where the task is unclear, symbol tuning improves model performance regardless of the number of in-context exemplars that are provided.
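The per-class sampling and dataset-exclusion rule described above can be illustrated with a minimal sketch. The function name `sample_exemplars_per_class` and the exact eligibility check are hypothetical and are not taken from the paper's implementation; in particular, the sketch assumes a dataset is excluded whenever any class has fewer than $k$ examples.

```python
import random
from collections import defaultdict

def sample_exemplars_per_class(examples, k, rng):
    """Pick k in-context exemplars for every class, or return None.

    `examples` is a list of (text, label) pairs. Returning None stands in for
    "exclude this dataset for this value of k" (an assumed reading of the rule).
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    # Assumed eligibility check: every class must offer at least k examples.
    if any(len(items) < k for items in by_label.values()):
        return None
    chosen = []
    for label in sorted(by_label):
        chosen.extend(rng.sample(by_label[label], k))
    rng.shuffle(chosen)  # avoid grouping exemplars by class in the prompt
    return chosen

# Toy data: three examples of class "A" and a single example of class "B".
toy = [("a1", "A"), ("a2", "A"), ("a3", "A"), ("b1", "B")]
one_per_class = sample_exemplars_per_class(toy, 1, random.Random(0))
two_per_class = sample_exemplars_per_class(toy, 2, random.Random(0))
```

With this toy data, `one_per_class` holds one exemplar from each class, while `two_per_class` is `None` because class "B" has only one example.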

A.6 Does symbol tuning require using all 30k labels?

As described in Section 3.1, our symbol-tuning procedure remapped original labels using a set of approximately 30k possible arbitrary symbols. This raises the question, however, of whether symbol tuning requires this large of a label space, and exactly how large of a label space is necessary for successful symbol tuning. Intuitively, we expect that models that are symbol tuned using larger label spaces should match or outperform those that are symbol tuned using smaller label spaces because a larger label space increases the diversity of the symbol-tuning data, which may make it easier to learn that any arbitrary symbol can be used as a label. We study how the size of the label space used for symbol tuning affects model performance by shrinking the label space for each category in Section 3.1. As our experiments from Section 3.1 use 10k possible labels per category, we decrease the label space size by only using 1k, 100, and 10 labels per category for possible labels.

We retune models (we exclude Flan-PaLM-540B from this ablation study to reduce computational costs) and evaluate their performance on the ICL settings from Section 4, showing these results in Figure 15. We find that, in general, models perform slightly better after symbol tuning using larger label spaces, but that the performance improvement from using larger label spaces is greater for the smallest model, Flan-PaLM-8B. The improvement seen in Flan-PaLM-8B may suggest that the larger label space’s ability to increase the diversity of the symbol-tuning data is important for smaller models that may have a harder time learning a general trend from a small sample size. Combined with the overall trend of improved performance with larger label spaces across model sizes and across ICL settings, we posit that using a larger label space can indeed improve the symbol-tuned model performance to some degree, possibly because the larger label space creates a more diverse set of prompts for the model to learn from.

A.7 Which category of symbols is most important during symbol tuning?

For our symbol-tuning procedure, we used symbols drawn from three categories (integers, combinations of characters, and words). Here, we investigate whether any particular category is more important for symbol tuning (one might expect, for example, using labels that are more similar to natural language might better teach models to examine in-context exemplars before using prior knowledge since models are more likely to have priors for those labels). We retune models (we exclude Flan-PaLM-540B to reduce computational costs) using only integers, only character combinations, and only words as labels. In Table 3, we evaluate these models on the algorithmic reasoning tasks from Section 5, the flipped-label setting from Section 6, and the ICL settings from Section 4.

We find that for all model sizes, using only words as labels results in the best performance on flipped labels, indicating that this category best teaches models to examine in-context exemplars before using prior knowledge. Additionally, symbol tuning using words often yields the best performance when relevant labels are unavailable, but for Flan-PaLM-8B, yields the worst performance when relevant labels are available. This may suggest that small models learn to treat all natural language labels as arbitrary symbols, even when the label is relevant and could be utilized to better learn the task. Finally, while one might expect symbol tuning with numbers to be key to improving on algorithmic tasks, Flan-PaLM-8B and Flan-PaLM-62B actually perform better when tuned using only words (there is no consistently-better label type for Flan-cont-PaLM-62B).

A.8 Can symbol tuning be successful using random labels?

As a sanity check, we want to show that symbol tuning cannot improve in-context learning when the tuning data is randomized. We expect this behavior since if the input–label mappings are randomized, there is no task to learn from the in-context exemplars and thus no reason to learn to use exemplars. To show this, we use the same symbol-tuning procedure as before but when remapping labels, we randomly select a symbol for each in-context exemplar rather than assigning a symbol for each label and consistently remapping all instances of that label to the new symbol. This ensures that the labels (despite being arbitrary symbols) are randomized and that there is no meaningful task to learn. We then retune models using symbol-tuning data generated using this modified process. (We exclude Flan-PaLM-540B from this ablation study to reduce computational costs.)
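The difference between the standard remapping and the randomized ablation can be made concrete with a short sketch. The helper `remap_labels` and the symbol pool below are hypothetical and only illustrate the two modes described above; they are not the code used for the experiments.

```python
import random

def remap_labels(labels, symbols, rng, randomized=False):
    """Replace natural language labels with arbitrary symbols.

    randomized=False: one symbol per original label, applied consistently.
    randomized=True:  an independent symbol for every exemplar (the ablation),
                      so the input-label mapping carries no learnable task.
    """
    if randomized:
        return [rng.choice(symbols) for _ in labels]
    classes = sorted(set(labels))
    # Sample without replacement so distinct labels get distinct symbols.
    chosen = rng.sample(symbols, k=len(classes))
    mapping = dict(zip(classes, chosen))
    return [mapping[label] for label in labels]

labels = ["entailment", "not entailment", "entailment", "not entailment"]
pool = ["4348", "forests", "MIC", "certification", "JMH", "8529"]  # toy pool

consistent = remap_labels(labels, pool, random.Random(0))
shuffled = remap_labels(labels, pool, random.Random(0), randomized=True)
```

In `consistent`, exemplars that shared an original label still share a symbol, so the task remains recoverable from context; in `shuffled`, that guarantee is gone.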

In Figure 16, we show these models’ performance on the ICL settings from Section 4. We find that the randomized symbol-tuning procedure is almost always worse than the standard symbol-tuning procedure. In settings without relevant targets, symbol tuning with randomized labels results in equal or worse performance compared with no symbol tuning at all, and model performance is strictly worse than that achieved by standard symbol tuning. In settings with relevant targets, while randomized symbol tuning results in worse performance than no symbol tuning, it outperforms standard symbol tuning for Flan-PaLM-8B, our smallest model. This result is not surprising, however, since in Section 4, we observed a large drop in model performance after symbol tuning for Flan-PaLM-8B in settings with relevant labels (which we posited resulted from the model treating all labels as arbitrary symbols, even when the label could have helped the model learn the task). Overall, these results indicate that, as expected, models do not learn to better utilize in-context exemplars when symbol tuned using exemplars with randomized labels.

Appendix B Dataset Details

Here, we show details of the tasks we used for symbol tuning as described in Section 3.1. We selected 22 publicly-available tasks from HuggingFace (Lhoest et al., 2021), ensuring that each task has discrete labels so that there would be labels to swap with our symbols. For each dataset, we used examples from the training split, and because some datasets had more examples than other datasets by multiple orders of magnitude, we capped the number of examples taken from any single dataset at 25,000. As shown in Table 4, our tuning dataset consists of 291,693 total unique examples.

We selected datasets from several task types as follows: natural language inference (Wang et al., 2019, RTE), (Wang et al., 2018, WNLI), (Rajpurkar et al., 2016; Wang et al., 2018, QNLI), (Wang et al., 2018, MNLI), (Bowman et al., 2015, SNLI), and (Wang et al., 2019, CB); sentiment analysis (Socher et al., 2013, SST2), (Pang & Lee, 2005, RT), and (Rosenthal et al., 2017, TES); paraphrase detection (Chen et al., 2017; Wang et al., 2018, QQP), (Wang et al., 2018, MRPC), and (Zhang et al., 2019, PAWS); common sense answering (Wang et al., 2019, COPA) and (Bisk et al., 2020, PIQA); topic classification (Zhang et al., 2015, AGN) and (Li & Roth, 2002, TREC); coreference resolution (Levesque et al., 2012; Wang et al., 2019, WSC) and (Keisuke et al., 2021, WINO); offensive language identification (Zampieri et al., 2019, TEO); irony detection (Van Hee et al., 2018, TEI); equal-meaning identification (Wang et al., 2019, WIC); and sentence acceptability classification (Wang et al., 2018, COLA).

B.2 Evaluation datasets

In this section, we list the eleven tasks from Section 3.2 that we used for our evaluation. We selected eleven publicly-available tasks from HuggingFace (Lhoest et al., 2021). In order to ensure that evaluation tasks were not seen during tuning, we select datasets that were not used in symbol tuning (Section B.1) and not used in instruction tuning (specifically, the datasets used in Chung et al. (2022), Wei et al. (2022a), and Sanh et al. (2022)). For each dataset, we select examples from the validation split when available (we use the train split if there is no validation split). Some evaluation tasks had significantly more available examples than other evaluation tasks, so we cap the number of examples per evaluation task at 100 in order to make evaluation set sizes similar and reduce the computational costs of each evaluation.

As shown in Table 5, we use the following tasks: subjectivity detection (Conneau & Kiela, 2018, SUBJ), hate speech detection (Basile et al., 2019, TEH), abortion stance classification (Mohammad et al., 2016, TEAB), atheism stance classification (Mohammad et al., 2016, TEAT), feminism stance classification (Mohammad et al., 2016, TEFE), Hillary Clinton stance classification (Mohammad et al., 2016, TEHI), adverse drug event classification (Alex et al., 2021, ADEC), overruling classification (Alex et al., 2021, OR), organization classification (Alex et al., 2021, SOT), potentially-unfair terms-of-service detection (Alex et al., 2021, TOS), and Twitter complaint detection (Alex et al., 2021, TC). In Table 6, we also show the instructions that we provided for each dataset when instructions are included in the prompt setting.

Appendix C Symbol tuning details

In this paper, we experimented using a set of $\sim$30k arbitrary symbols as shown in Figure 3. When selecting a symbol to replace natural language labels with, we first randomly select a type of symbol from the three categories: integers, combinations of characters (obtained by converting integers to characters, e.g., $0\to A$, $1\to B$, $26\to AA$, etc.), and words (obtained from MIT’s list of 10k words (www.mit.edu/~ecprice/wordlist.10000) and list of 100k words (www.mit.edu/~ecprice/wordlist.100000)). We then select a random symbol from the available symbols for that category. We did not test other ways of generating arbitrary symbols (e.g., picking random words from the prompt, combining multiple words, combining alphabetical characters and numbers, etc.) and leave this for future work.
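As an illustration of the two-stage selection (category first, then symbol), the following sketch implements the integer-to-character conversion from the examples above ($0\to A$, $1\to B$, $26\to AA$) and a simple sampler. The names `int_to_letters` and `sample_symbol`, the integer range, and the tiny word list are assumptions made for this example; the actual word lists are the ones cited above.

```python
import random

def int_to_letters(n: int) -> str:
    """Convert 0 -> A, 1 -> B, ..., 25 -> Z, 26 -> AA, 27 -> AB, ..."""
    if n < 0:
        raise ValueError("n must be non-negative")
    letters = []
    n += 1  # work 1-based so that 26 becomes "AA" rather than "BA"
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters.append(chr(ord("A") + rem))
    return "".join(reversed(letters))

# Placeholder standing in for the cited word lists (assumption).
WORDS = ["forests", "certification", "root", "peoples"]

def sample_symbol(rng, words=WORDS, per_category=10_000):
    """Pick a category uniformly, then a symbol uniformly within it."""
    category = rng.choice(["integer", "characters", "word"])
    if category == "integer":
        # Assumed range: integers 0 .. per_category - 1.
        return str(rng.randrange(per_category))
    if category == "characters":
        return int_to_letters(rng.randrange(per_category))
    return rng.choice(words[:per_category])

rng = random.Random(0)
symbols = [sample_symbol(rng) for _ in range(5)]
```

Shrinking `per_category` to 1k, 100, or 10 gives the smaller label spaces studied in Section A.6.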

C.2 Prompt formatting

We used ten distinct prompt templates to format inputs and outputs into prompts. During both tuning and evaluation, prompts are randomly generated using one of the following templates ([input] and [label] stand for the input and label of a given example, respectively):

“Sentences: [input] \n Mapped To: [label]”

For evaluation prompts with instructions, however, we format the prompt as “Question: [instruction] \n [input] \n Answer: [label]” where [instruction] stands for the instruction for a given task (see Table 6 for instructions that we used). Section E.2 contains examples of prompts that were generated using these prompt templates with instructions.
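A minimal sketch of how exemplars might be assembled into a prompt with the template shown above and with the instruction format. The helper names, the blank line between exemplars, and the handling of whitespace around "\n" are assumptions; only the two template strings come from this section.

```python
# Templates from this section; whitespace around the newline is an assumption.
TEMPLATE = "Sentences: {input}\nMapped To: {label}"
INSTRUCTION_TEMPLATE = "Question: {instruction}\n{input}\nAnswer: {label}"

def format_example(text, label, instruction=None):
    """Format one (input, label) pair, optionally with a task instruction."""
    if instruction is not None:
        return INSTRUCTION_TEMPLATE.format(
            instruction=instruction, input=text, label=label
        )
    return TEMPLATE.format(input=text, label=label)

def build_prompt(exemplars, query, instruction=None):
    """Join formatted exemplars and end with the unlabeled evaluation input."""
    parts = [format_example(text, label, instruction) for text, label in exemplars]
    # The final label is left empty for the model to complete.
    parts.append(format_example(query, "", instruction).rstrip())
    return "\n\n".join(parts)  # exemplar separator is an assumption

exemplars = [("The film was a delight.", "4348"), ("A tedious mess.", "forests")]
prompt = build_prompt(exemplars, "I would watch it again.")
```

Here `prompt` contains two labeled exemplars followed by the query, ending in "Mapped To:".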

C.3 Tuning procedure

In Table 7, we show tuning details for each model that we symbol tuned. We primarily follow the hyperparameter selection from Chung et al. (2022)—in particular, we use the same batch size, dropout, and learning rate for each model. On the other hand, we showed in Section 7.1 that symbol tuning does not require tuning for as long as instruction tuning does. Because we use packing (Raffel et al., 2020), the effective batch size is larger than the reported number.

Appendix D Full experimental results

We experimented on twenty list function tasks from the List Functions benchmark from BIG-Bench (Srivastava et al., 2022). These list function tasks were selected as the tasks with the highest human accuracy baseline reported in Rule (2020). We describe each of the tasks that we tested in Figure 5 and categorize them into five distinct categories based on the list function used by that task.

The pairings in all tasks are composed of input and output lists that contain numbers from 0 to 9 or numbers from 0 to 99 (these two ranges are separated such that a single list function can have two associated tasks, one for each range). Each task contains 32 input–output pairs. Each pairing is used as an evaluation example, and for each evaluation example, in-context exemplars are randomly selected from the remaining 31 pairs. In Section 5, we evaluated models on evaluation examples generated with four in-context exemplars. We show per-task results from this experiment for base models, continued instruction-tuned variants, and symbol-tuned variants in Table 8.
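The leave-one-out construction described above can be sketched as follows. The function `make_eval_examples` and the toy list function (dropping the first element) are hypothetical and serve only to illustrate the procedure.

```python
import random

def make_eval_examples(pairs, num_exemplars=4, seed=0):
    """Use every pair once as the target; draw exemplars from the other pairs."""
    rng = random.Random(seed)
    examples = []
    for i, target in enumerate(pairs):
        remaining = pairs[:i] + pairs[i + 1:]  # the other len(pairs) - 1 pairs
        exemplars = rng.sample(remaining, k=num_exemplars)
        examples.append({"exemplars": exemplars, "target": target})
    return examples

# Toy task with 32 pairs: the (hypothetical) list function drops the first element.
inputs = [[i, i + 1, i + 2] for i in range(32)]
pairs = [(xs, xs[1:]) for xs in inputs]
eval_examples = make_eval_examples(pairs)
```

Each of the 32 resulting evaluation examples carries four exemplars, none of which is the target pair itself.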

D.2 In-context learning

We evaluated each model’s in-context learning abilities on a set of eleven datasets as described in Section 3.2. We reported results on these tasks using an unweighted average of the per-task accuracies. In Table 10, Table 10, Table 12, and Table 12, we show base model, continued instruction-tuned model, and symbol-tuned model performance for each task. Models have been tuned with the same specifications described in Section C.3.

D.3 MMLU

MMLU consists of 57 tasks that test a model’s knowledge and problem-solving abilities (Hendrycks et al., 2021). We evaluate on MMLU in a five-shot setting where few-shot exemplars are from the “dev” set, following Chung et al. (2022). In this section, we report the “validation” set performance on MMLU for each task. We use the same prompts as Chung et al. (2022), which can be found at https://github.com/jasonwei20/flan-2. Prompts for STEM datasets are also the same as in Chung et al. (2022), which originated from Lewkowycz et al. (2022). We show full experimental results for Flan-PaLM models and symbol-tuned variants (after tuning for 4k steps for 8B and 62B models and 1k steps for 540B models) on MMLU in Table 15, Table 15, Table 15, Table 18, Table 18, and Table 18.

D.4 BIG-Bench Hard

BIG-Bench Hard is a collection of challenging tasks from BIG-Bench. Tasks were selected in Suzgun et al. (2022) by choosing tasks where model performance as recorded by Srivastava et al. (2022) was worse than the average human rater. There are a total of 23 tasks in BIG-Bench Hard; two of these tasks have three subtasks (Suzgun et al., 2022). Following Chung et al. (2022), we treat these subtasks as distinct tasks and take an unweighted average. Our prompts are the same as those used in Chung et al. (2022), which are also the same as the ones given in Suzgun et al. (2022). These prompts contain three in-context exemplars. We show full experimental results for Flan-PaLM models and symbol-tuned variants (after tuning for 4k steps for 8B and 62B models and 1k steps for 540B models) on BIG-Bench Hard in Table 21, Table 21, and Table 21.

D.5 MMLU (zero-shot)

In this section, we show full experimental results for Flan-PaLM models and symbol-tuned variants (after tuning for 4k steps for 8B and 62B models and 1k steps for 540B models) on MMLU (Hendrycks et al., 2021). These results are from evaluating models in a zero-shot setting rather than in a five-shot setting as was tested in Section D.3.

Appendix E Example Prompts

In this section, we provide an example of a full few-shot prompt for each of the 22 datasets used in the main paper. When generating these prompts, we follow the procedure described in Section 3.1. Namely, prompts use one of ten possible formats shown in Section C.2 and contain 2-10 in-context exemplars per class. Original labels are remapped to arbitrary symbols as described in Section 3.1.

Overview. This prompt contains $k=2$ in-context exemplars per class. The original natural language labels [“entailment”, “not entailment”] have been remapped to [“4348”, “forests”], respectively.

Input: A zoo worker is dead and two visitors are seriously injured after a Siberian tiger escaped from the San Francisco Zoo in San Francisco, California in the United States and attacked three people who were inside a cafe. The tiger was shot dead by police who were called to the scene. They found the tiger attacking one of the zoo visitors when they killed it.

A tiger attacked three people in San Francisco.

Input: After the 1979 Soviet invasion and occupation, 3 million Afghans fled to Pakistan, which was encouraged by hefty Western aid to take them in.

Afghanistan was invaded by the Soviet Union in 1979.

Input: In the May 2005 general election Michael Howard failed to unseat the Labour Government, although the Conservatives did gain 33 seats, playing the most significant role in reducing Labour’s majority from 167 to 66.

In the May 2005 general election Conservatives got 33 seats.

Input: David Millar retained his Tour de France leader’s yellow jersey despite crashing in the final two kilometres of the third stage of the race to Nantes.

Tour de France winner is likely to lose the crown.

Input: New Zealand’s Qualifications Authority said Friday that it still strongly discourages students from using anything other than full English, but that credit will be given if the answer "clearly shows the required understanding," even if it contains text-speak.

Full English is recommended by New Zealand’s Qualifications Authority.

Overview. This prompt contains $k=6$ in-context exemplars per class. The original natural language labels [“entailment”, “not entailment”] have been remapped to [“MIC”, “certification”], respectively.

Input: Even before they reached town, they could hear a sound like corn popping. Dora asked what it was, and Dad said it was firecrackers.

Input: Alice tried frantically to stop her daughter from barking at the party, leaving us to wonder why she was behaving so strangely.

Alice’s daughter was behaving so strangely.

Input: The sack of potatoes had been placed above the bag of flour, so it had to be moved first.

Input: Jim signaled the barman and gestured toward his empty glass.

Jim signaled the barman and gestured toward Jim’s empty glass.

Input: Bob collapsed on the sidewalk. Soon he saw Carl coming to help. He was very concerned.

Input: Tom said "Check" to Ralph as he moved his bishop.

Tom said "Check" to Ralph as he moved Tom’s bishop.

Input: Bob paid for Charlie’s college education, but now Charlie acts as though it never happened. He is very hurt.

Input: The Wainwrights treated Mr. Crowley like a prince until he made his will in their favor; then they treated him like dirt. Folks said he died just to be rid of their everlasting nagging.

Folks said he died just to be ride of the Wainwrights’ everlasting nagging.

Input: Susan knows all about Ann’s personal problems because she is indiscreet.

Input: No one joins Facebook to be sad and lonely. But a new study from the University of Wisconsin psychologist George Lincoln argues that that’s exactly how it makes us feel.

That’s exactly how the study makes us feel.

Input: Fred is the only man alive who still remembers my father as an infant. When Fred first saw my father, he was twelve months old.

When Fred first saw my father, My father was twelve months old.

Input: Anna did a lot better than her good friend Lucy on the test because she had studied so hard.

Input: George got free tickets to the play, but he gave them to Eric, because he was not particularly eager to see it.

George was not particularly eager to see it.

Overview. This prompt contains $k=3$ in-context exemplars per class. The original natural language labels [“entailment”, “not entailment”] have been remapped to [“JMH”, “8529”], respectively.

In Kazakhstan on June 19, 1989, young men carrying guns, firebombs, iron bars and stones rioted in Zhanaozen, causing a number of deaths.

X = What status did the Marshall Islands have in Germany?

It has been speculated that the crisis over the Carolines with Spain, which almost provoked a war, was in fact “a feint to cover the acquisition of the Marshall Islands”, which went almost unnoticed at the time, despite the islands being the largest source of copra in Micronesia.

X = How much of the island was controlled by Turks after international pressure led to a ceasefire?

Among a variety of sanctions against Turkey, in mid-1975 the US Congress imposed an arms embargo on Turkey for using American-supplied equipment during the Turkish invasion of Cyprus in 1974.

X = What body was overthrown by the October Revolution?

Under the leadership of Vladimir Lenin, the Bolsheviks established the Soviet state on 7 November [O.S. 25 October] 1917, immediately after the Russian Provisional Government, which governed the Russian Republic, was overthrown during the October Revolution.

X = Which restaurant did Madonna work in New York City?

In 1978, she dropped out of college and relocated to New York City.

X = What part of China did the earthquake occur in?

Swaminathan Krishnan, assistant professor of civil engineering and geophysics at the California Institute of Technology said: the earthquake occurred in the rural part of China.

X = The initiations are part allegory and part what?

The initiations are part allegory and part lecture, and revolve around the construction of the Temple of Solomon, and the artistry and death of his chief architect, Hiram Abiff.

Overview. This prompt contains $k=7$ in-context exemplars per class. The original natural language labels [“entailment”, “neutral”, “contradiction”] have been remapped to [“root”, “KVA”, “peoples”], respectively.

Input: If we don’t spend seven evenings a week together, if we don’t talk on the phone each day during work, if I want to spend any time alone, my girlfriend pouts and gets angry, or cries.

If I’m not with my girlfriend, she gets mad at me.

Input: i try to keep it pretty reasonable

There was not a single city doing the activity.

Input: (In that sense, the Internet was the ultimate hack.)

That’s an example of how the internet is the ultimate hack.

Input: You can obtain a complete schedule of events from the tourist office on the Champs-Elysees.

The events covered by the schedule do not include the daily tours that begin in the city center.

Input: Plans are in place to turn the house into a museum charting the life and works of this extraordinary man.

Input: As the road climbs, though, it offers spectacular views back to Little Langdale in the east; get out at the small car park at the top of the pass and take photographs.

Input: is that that’s because you’re you’re natives there and that’s what you’re used to you you’ve grown up that way

People who have not grown up there would not be used to it.

Input: In a few cases, we toured the organizations’ facilities and observed practices in operation.

We were unable to tour any of the facilities.

Input: Woodland floors are blanketed with swathes of bluebells, and Gowbarrow Park, immortalized by Wordsworth, has its host of golden daffodils.

Gowbarrow Park is known for its lack of daffodils.

Input: The northernmost village in the National Park and once a mining town, Caleeck, with its pa stel cottages on either side of Chalk Beck, is now rather sleepy.

Caleeck was once a popular tourist spot with its pastel cottages.

Input: In its fiscal year 2000 performance report, the Veterans Administration reported that performance declined with respect to its rating-related claims-processing timeliness and national accuracy rate.

In the fiscal year 2000 report, the VA said performance went down and fewer people were served.

Input: A final factor affecting the environment is the agency’s relationship with the Congress and central oversight agencies such as OMB.

Agency’s relationship with the Congress do not affect the environment.

Input: Effects of ambient air pollution on nonelderly asthma hospital admissions in Seattle, Washington 1987-1994.

In Seattle, the effects of pollution on asthma patients were measured.

Input: that’s i’m going to have to start going out to eat more often i’d i guess i would like to see some things like that

I’m going to need to eat out more often because I want to see things similar to that.

Input: Tickets to shows and concerts can be booked either at the venue itself or (if paying with a major credit card) by telephone from ticket agencies such as Ticketmaster (Tel.

Show tickets can’t be booked at the venue itself.

Input: like that because you know i’ve talked to many people and we wouldn’t mind going its extra effort to do it uh

So far, I’ve talked to over three hundred people.

Input: health effects assessment, environmental fate and effects assessment, EPA correspondence, and registrant comments).

The correspondence with the EPA was responded to promptly.

Everyone’s safety is relatively less than Oprah’s.

Input: usually i can talk all day but this is something to me that’s sad

Input: However, none of the EPA rules that we could access through the agency’s web site had this feature.

None of the EPA rules could receive comments online.

Overview. This prompt contains $k=2$ in-context exemplars per class. The original natural language labels [“entailment”, “neutral”, “contradiction”, “unknown”] have been remapped to [“MSO”, “HWI”, “NGL”, “whilst”], respectively.

A dad told her daughter she wasn’t allowed to wear her outfit.

Input: An experienced young surfer in California enjoying the waves on a sunny Saturday.

A pro surfer is surfing the waves on a Saturday in California.

Input: A man in brown shirt wearing blue pants and brown boots watches from top of a tree

Input: A man adjusts the cymbal for a drummer.

Input: Four people are in some type of cement building with the number 93 painted on the wall.

Four people are in a cement building with numbers painted on the wall.

Input: The man has a poof on top of his woolen hat.

Input: A group of smiling teenagers sits at a table while playing a board game.

The group of angry teenagers sat far apart at the table.

Input: A woman in a multicolored shirt makes a hammock.

Overview. This prompt contains $k=3$ in-context exemplars per class. The original natural language labels [“entailment”, “neutral”, “contradiction”] have been remapped to [“under”, “6749”, “exposure”], respectively.

Input: B: I did, too. A: I mean, it was just more for my money. B: Yeah. I didn’t think it was too long at all.

Input: A: Well, I don’t know, uh, I have a hard time getting, uh, people on the telephone. B: Oh really. A: Uh-huh, getting through to anybody. Sometimes I call off and on all day, B: Huh. A: but anyway, uh, I guess we’re supposed to be talking about family reunions aren’t we.

they’re supposed to be talking about family reunions

Input: Under the Racketeer Influenced and Corrupt Organizations law, or RICO, the government has the authority to seek to freeze or seize a defendant’s assets before trial. According to individuals familiar with Mr. Antar’s case, prosecutors issued their warning this week after one of Mr. Antar’s attorneys asked whether legal fees might be subject to seizure. In a letter, prosecutors told Mr. Antar’s lawyers that because of the recent Supreme Court rulings, they could expect that any fees collected from Mr. Antar may be seized.

any fees collected from Mr. Antar may be seized

Input: A: How do you feel about gun control? B: Well, uh, I mean I don’t think that guns should be outlawed

Input: It is all very well, in these changing times, to adapt one’s work to take in duties not traditionally within one’s realm. But bantering is of another dimension altogether. For one thing how would one know for sure that at any given moment a response of the bantering sort is truly what is expected?

at any given moment a response of the bantering sort is truly what is expected

Input: B: Uh, uh, I’ve had one or two American cars I think, and they were okay. I had a Pontiac once and I never had a problem with it, but, uh, my mother had a Dodge at one point and I had driven it a few times and I really did not feel that I would buy a Dodge just from, A: Um. B: well, actually, I had uh, a Dodge Omni at one point A: Uh-huh. B: and that was, I think, what really prejudiced me against American cars because I did not feel that it was a very quality, uh, car.

Input: It is part of their religion, a religion I do not scoff at as it holds many elements which match our own even though it lacks the truth of ours. At one of their great festivals they have the ritual of driving out the devils from their bodies. First the drummers come on - I may say that no women are allowed to take part in this ritual and the ladies here will perhaps agree with me that they are fortunate in that omission.

no women are allowed to take part in this ritual

Input: A: Sometimes you hear things on the radio that, you know, could be true or couldn’t be. B: Uh-huh. A: Uh, do you feel like this is, I guess they’re spending a billion or so a year on this AIDS research. B: Uh-huh. A: Do you think they should spend more?

Input: B: when you’ve lost something or uh, uh, don’t have what other people have that’s when you tend to realize, you know, what’s out there and you know, what you have and what you don’t have. A: Yeah I agree. B: So the original question, do we think they’re you know, a security threat?

Input: B: I understand we are doing care of the elderly, right? A: Yes. B: And how do you feel about putting someone in the nursing home? A: Well, I don’t think that uh, any of my relatives would really like to go there.

some of her relatives would really like to go there

Overview. This prompt contains $k=4$ in-context exemplars per class. The original natural language labels [“positive”, “negative”] have been remapped to [“1132”, “peter”], respectively.

Input: ’s never too late to believe in your dreams .

Input: terrifically entertaining specimen

Input: is one of those war movies that focuses on human interaction rather than battle and action sequences

Overview. This prompt contains $k=10$ in-context exemplars per class. The original natural language labels [“positive”, “negative”] have been remapped to [“4839”, “3804”], respectively.

Input: . . . pays tribute to heroes the way julia roberts hands out awards–with phony humility barely camouflaging grotesque narcissism .

Input: an uninspired preachy and clichéd war film .

Input: hawke draws out the best from his large cast in beautifully articulated portrayals that are subtle and so expressive they can sustain the poetic flights in burdette’s dialogue .

Input: by candidly detailing the politics involved in the creation of an extraordinary piece of music , [jones] calls our attention to the inherent conflict between commerce and creativity .

Input: de niro may enjoy the same free ride from critics afforded to clint eastwood in the lazy bloodwork . but like bruce springsteen’s gone-to-pot asbury park , new jersey , this sad-sack waste of a movie is a city of ruins .

Input: zigzag might have been richer and more observant if it were less densely plotted .

Input: the pianist is the film roman polanski may have been born to make .

Input: after all the big build-up , the payoff for the audience , as well as the characters , is messy , murky , unsatisfying .

Input: the movie is . . . very funny as you peek at it through the fingers in front of your eyes .

Input: the entire cast is first-rate , especially sorvino .

Input: saddled with an unwieldy cast of characters and angles , but the payoff is powerful and revelatory .

Input: this may be the first cartoon ever to look as if it were being shown on the projection television screen of a sports bar .

Input: this pathetic junk is barely an hour long . nevertheless , it still seems endless .

Input: the woodman seems to have directly influenced this girl-meets-girl love story , but even more reassuring is how its makers actually seem to understand what made allen’s romantic comedies so pertinent and enduring .

Input: awesome creatures , breathtaking scenery , and epic battle scenes add up to another ’spectacular spectacle . ’

Input: the angst-ridden , affluent slacker characters are more grating than engaging .

Input: a compelling pre-wwii drama with vivid characters and a warm , moving message .

Input: even those who would like to dismiss the film outright should find much to mull and debate .

Input: a graceless , witless attempt at mating some like it hot with the wwii espionage thriller .

Overview. This prompt contains $k=2$ in-context exemplars per class. The original natural language labels [“positive”, “neutral”, “negative”] have been remapped to [“licensing”, “8517”, “1527”], respectively.

spot price on 14KT gold is $49.08 dwt in Tampa Bay today - crazy that gold is over $1800/ounce - remember when the real price was $300.00.. -> 1527

“Bargain said that ““Iran was the 1st 2help us””. Is that means each take a piece or credit goes2 1country or another? @user @user -> 8517

Willis McGahee had a pretty gruesome knee injury in the Fiesta Bowl… Bills still drafted him late in 1st round. -> 1527

Hip-Hop Rumors: Is Maino Headed To Reality TV?: Brooklyn rapper Maino may be the latest rapper to be making the … -> 8517

"3rd hat-trick in 4 games for Ronaldo. Outrageous. Madrid imperious again tonight. Bale, James, Benzema, Modric, Kroos, Isco. Class overload." -> licensing

Watch her stock go up Sept 17th. How Carly Fiorina earned a spot on the big stage at the GOP debate -> licensing

#GRASS/WOODS_FIRE - (Lawrence County) Bridgeport Fire dispatched to the 6th Curve into Petrolia (trust me if you… ->
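The prompt above follows a simple pattern: each exemplar is written as `input -> symbol`, with the natural language labels replaced by arbitrary symbols, and the final query is left open after the arrow. The following is a minimal sketch of how such a prompt could be assembled. The function names `remap_labels` and `build_prompt`, the symbol pool, and the toy exemplars are hypothetical and chosen only for illustration; this is not the authors' data pipeline.

```python
import random


def remap_labels(labels, symbol_pool, rng):
    """Assign each natural language label a distinct arbitrary symbol."""
    chosen = rng.sample(symbol_pool, len(labels))
    return dict(zip(labels, chosen))


def build_prompt(exemplars, query, mapping, k, rng):
    """Format k exemplars per class as 'input -> symbol', then the open query.

    exemplars: list of (text, natural_language_label) pairs.
    mapping:   dict from natural language label to arbitrary symbol.
    """
    lines = []
    for label, symbol in mapping.items():
        pool = [text for text, y in exemplars if y == label]
        for text in rng.sample(pool, k):
            lines.append(f"{text} -> {symbol}")
    rng.shuffle(lines)            # mix classes so order does not reveal the label
    lines.append(f"{query} ->")   # the model is expected to complete the symbol
    return "\n\n".join(lines)


# Toy data (hypothetical strings, not taken from any dataset).
rng = random.Random(0)
labels = ["positive", "neutral", "negative"]
symbol_pool = ["licensing", "8517", "1527", "whilst", "MSO"]
toy_exemplars = [(f"example tweet {i}", labels[i % 3]) for i in range(9)]

mapping = remap_labels(labels, symbol_pool, rng)
prompt = build_prompt(toy_exemplars, "unlabeled tweet", mapping, k=2, rng=rng)
print(prompt)
```

Because the symbols are drawn at random and carry no meaning related to the task, the only way to answer the final query is to infer the input-label mapping from the exemplars themselves.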

Overview. This prompt contains $k=8$ in-context exemplars per class. The original natural language labels [“duplicate”, “not duplicate”] have been remapped to [“womens”, “NDY”], respectively.

Input: How do I increase Negotiation skill?

Input: What is the sum of all 4 digit numbers that can be formed by the digits 2, 3, 4,5 with repeatation?

What is the sum of all 4 digit numbers that can be formed by the digits 0, 1, 2, 3, 4? No repetitions allowed.

Input: What would happen if the moon fell to earth?

What would happen to a chain attaching the Moon to the Earth?

Input: How will scrapping the INR 500 and INR 1000 currency notes affect the daily lives of Indians?

How is scrapping of Rs 500 and Rs 1000 currency notes affect our daily life?

Input: What are some of the most mind-blowing facts about Kapil Sharma?

What are some less known facts about Kapil Sharma?

what steps should I follow to learn machine learning?

Input: Has Zac Efron ever worked with Kevin Bacon?

Input: How can we tackle poverty in India?

Can poverty be totally eliminated from india?

Input: Which is best antivirus for computer?

Input: Does a Master’s in Data Science and Analytics carry a future for a chemical engineering graduate?

Input: What is the most evil thing you have done in your life?

What is the most evil thing you have ever done?

Input: What is the difference between Chinese parents and other countries’?

What is the difference between Chinese parents and other countries?

Overview. This prompt contains $k=8$ in-context exemplars per class. The original natural language labels [“equivalent”, “not equivalent”] have been remapped to [“AFM”, “1352”], respectively.

Input: “ It ’s going to happen , ” said Jim Santangelo , president of the Teamsters Joint Council 42 in El Monte .

“ That really affects the companies , big time , ” said Jim Santangelo , president of the Teamsters Joint Council 42 in El Monte .

Input: Most other potential buyers are interested only in cherry-picking the most attractive assets .

Other potential suitors are not interested in acquiring only the music business .

Input: Recall proponents claim to have turned in more than 1.6 million signatures .

Recall sponsors say they have submitted 1.6 million signatures .

Input: Appellate courts across the country have issued differing rulings on the issue , allowing public displays of the Ten Commandments in some cases and banning them in others .

Lower courts have splintered on the issue , allowing depictions of the Ten Commandments in some instances and not in others .

Input: Martha Stewart shares fell $2.03 , about 18 percent , to $9.17 and were the NYSE ’s biggest percentage loser .

Its shares fell 4.6 percent , or $4.04 , to $83.38 and was the blue-chip Dow ’s biggest percent loser .

Input: A new variant of Blaster also appeared Wednesday and seemed to be spreading , according to antivirus companies .

The new variation of Blaster was identified Wednesday , according to antivirus company Sophos .

Input: While robbery appeared to be the motive , the suspects drove off before taking anything .

While robbery appeared to be the motive , the suspects fled before they could take anything , he said .

Input: Both NASA and Russian space officials said it posed no danger to the crew .

American and Russian space officials stressed there is no immediate danger to the crew or the operation of the orbiting outpost .

Input: It was developed with consultation from more than 300 leaders in academia , industry , government and the public .

The plan , called The NIH Roadmap , was developed over 14 months with help from more than 300 consultants in industry and academia .

Input: A picture of the doctor ’s son holding the guitar appeared in the National Enquirer just two weeks after George died .

A photograph of the doctor ’s son holding the guitar appeared in the National Enquirer two weeks after Harrison ’s death .

Input: The dollar was last at $ 1.1149 to the euro , close to its strongest level since April 30 .

The dollar pushed as high as $ 1.1115 to the euro in early trade , extending Tuesday ’s one percent rally to hit its strongest level since April 30 .

Input: Aspen Technology ’s shares dropped 74 cents , or 23 percent , to close at $ 2.48 on the Nasdaq .

In afternoon trading , Aspen ’s shares were off 89 cents or more than 27 percent at $ 2.33 per share .

Input: Egyptologists cast doubt Tuesday on an expedition ’s claim that it may have found the mummy of Queen Nefertiti , one of the best-known ancient Egyptians .

Egyptologists think they may have identified the long-sought mummy of Queen Nefertiti , one of the ancient world ’s legendary beauties .

Input: The moment of reckoning has arrived for this West African country founded by freed American slaves in the 19th century .

Taylor is now expected to leave the broken shell of a nation founded by freed American slaves in the 19th century .

Input: Trade deals between manufacturers and grocery retailers or distributors have long been governed by complicated contracts that offer retailers discounts , money for advertising or payments for prominent shelf space .

Manufacturers and grocers or distributors have a long history of complicated contracts offering retailers discounts , money for advertising or payments for prominent shelf space .

Input: Nigeria and other African oil producers are increasingly important in U.S. plans to lessen dependence on Middle Eastern suppliers for its energy security .

Nigeria and other African producers are increasingly important in the former Texas oilman ’s plans to lessen dependence on Middle Eastern suppliers for energy security .

Input: “ Our own history should remind us that the union of democratic principle and practice is always a work in progress , ” Rice said in reference to Iraq .

“ Our own histories should remind us that the union of democratic principle and practice is always a work in progress , ” she said .

Overview. This prompt contains $k=4$ in-context exemplars per class. The original natural language labels [“paraphrase”, “not paraphrase”] have been remapped to [“constitution”, “DDX”], respectively.

Sentences: He captained South Africa on 29 August 1891 against the British Isles in Kimberley .

He listed South Africa against the British Isles in Kimberley on 29 August 1891 .

Sentences: In 2007 he continued as a Toyota test driver and also drove for GP2 team Trident Racing .

In 2007 he drove as a Toyota test driver and also continued the GP2 - Team Trident Racing .

Sentences: The mountain was named by Jules de Blosseville , after French naval officer Marie Henri Daniel Gauthier , comte de Rigny ( 1782 – 1835 ) .

The mountain was named after the French navy officer Marie Henri Daniel Gauthier , Comte de Rigny ( 1782 - 1835 ) , by Jules de Blosseville .

Sentences: 22.0 % were German according to the 2000 census , 20.5 % Irish , 16.4 % Italian , 8.9 % Polish and 7.8 % of English origin .

22.0 % were of German , 20.5 % Irish , 16.4 % Italian , 8.9 % Polish and 7.8 % English ancestry according to Census 2000 .

Sentences: They are purple , dense black-hard rocks with a considerable pyrite content .

They are purple , dense black-hard rocks with considerable content of pyrite .

Sentences: In “ The Guardian ” in 2006 , Stanage argued in opposition to George Monbiot , who had written that the Iraqi insurgency was comparable to the IRA :

In “ The Guardian ” of 2006 , George Monbiot argued in contrast to Stanage , who had written that the Iraqi insurgency was comparable to the IRA .

Sentences: He invented “ A new geometrical method of measuring the human figure ” ( 1860 ) , and wrote and patented various improvements in boats and weapons .

He wrote “ A New Geometrical Method of Measuring the Human Figure ” ( 1860 ) , and invented and patented various improvements in boats and guns .

Sentences: Restovich was traded to the Chicago White Sox from the Arizona Diamondbacks on July 27 , 2011 .

On 27 July 2011 , Restovich was traded from the Chicago White Sox into Arizona Diamondbacks .

Sentences: Vera Zvonareva won the title by beating Caroline Wozniacki in the final with 6 – 3 , 3 – 6 , 6 – 3 .

Vera Zvonareva won the title by beating Caroline Wozniacki in the final 6 – 3 , 3 – 6 , 6 – 3 .

Overview. This prompt contains $k=9$ in-context exemplars per class. The original natural language labels [“choice 1”, “choice 2”] have been remapped to [“NFG”, “brother”], respectively.

Input: I dabbed the floor with a paper towel.

Input: The service at the restaurant was slow.

Input: The woman tolerated her friend’s difficult behavior.

The woman knew her friend was going through a hard time.

The woman felt that her friend took advantage of her kindness.

Input: I clumsily bumped into the stranger.

Input: The man caught his shirtsleeve on the rungs of the fence.

Input: The librarian located the book in the stacks.

Input: I ran the ice cube under warm water.

Input: The water in the teapot started to boil.

Overview. This prompt contains $k=4$ in-context exemplars per class. The original natural language labels [“choice 1”, “choice 2”] have been remapped to [“lf”, “serial”], respectively.

Student: How do I make chocolate chip cookie dough?

In a bowl, mix butter, sugar, brown sugar, vanilla extract, and eggs. To this, add a mixture of flour, baking soda, and salt. Finally, add chocolate chips to the bowl and put the bowl into the oven.

In a bowl, mix butter, sugar, brown sugar, vanilla extract, and eggs. To this, add a mixture of flour, baking soda, and salt. Finally, add chocolate chips to the bowl and mix thoroughly with a wooden spoon.

Student: How do you remove a flat tire from a car?

Lift the car up on a car jack, using a tire iron loosen the bolts on the wheel until they are off, and pull the tire off.

Lift the car up on a car jack, using a screwdriver loosen the bolts on the wheel until they are off, and pull the tire off.

Student: To prevent chance of a oil boilover and subsequent fire during deep frying a turkey

Measure and cut the 1x2 to go around the frame and attach the edge pieces on top of one another with glue

Measure and cut the 1x2 to go around the frame and attach to the edges of the project with glue

Student: To make butter become a liquid to use it in a project, you can

Melt it in a pan for a few minutes until it softens

Melt it in a pan for a few hours until it softens

Student: To properly put a top sheet on a bed

spread the sheet over the bed, tuck the sheet under the foot of the bed, grab the sheet on the sides of the bed 1 foot from the foot, pull up and tuck excess under mattress by foot, and then let the sheet drop hang that way.

spread the sheet over the bed, tuck the sheet under the head of the bed, grab the sheet on the sides of the bed 1 foot from the head, pull up and tuck excess under mattress by foot, and then let the sheet drop hang that way.

Put them on a baking sheet and put them in the freezer for 10 minutes at 375 degrees. They’ll come out like new.

Put them on a baking sheet and put them in the oven for 10 minutes at 375 degrees. They’ll come out like new.

Student: how ot make mashed potatoes with skin

Bring a pot of lightly salted water to a boil. Add peeled potatoes, and cook until tender, about 15 minutes. Drain potatoes, and transfer to a bowl. Add butter, and mash with a potato masher or electric mixer until potatoes are starting to become smooth. Add milk and sour cream, and mix to your desired texture.

Bring a pot of lightly salted water to a boil. Add unpeeled potatoes, and cook until tender, about 15 minutes. Drain potatoes, and transfer to a bowl. Add butter, and mash with a potato masher or electric mixer until potatoes are starting to become smooth. Add milk and sour cream, and mix to your desired texture.

Overview. This prompt contains $k=7$ in-context exemplars per class. The original natural language labels [“world”, “sports”, “business”, “science/technology”] have been remapped to [“KYX”, “european”, “pillow”, “3863”], respectively.

Sentences: GM Europe to Cut 12,000 Jobs in Deal (AP) AP - General Motors Corp.’s European unit Thursday announced a deal that will allow the struggling automaker to cut up to 12,000 jobs; most of them in Germany, where it will offer generous incentives for employees to leave.

Sentences: Special to ESPN.com Oklahoma sports information director Kenny Mossman has a fresh, new story about quarterback Jason White that he’s dying to tell.

Sentences: Voeller resigns as AS Roma coach Rudi Voeller resigned as coach of AS Roma on Saturday following a 3-1 loss to nine-man Bologna in an early fourth round match. Franco Baldini, sport director of the Roman team, told

Sentences: Transactions BASEBALL Anaheim (AL): Exercised 2005 option on C Bengie Molina’s contract; declined 2005 option on P Ramon Ortiz’s contract; purchased P T.J. Stanton from Winnipeg (Northern). Milwaukee (NL): Purchased C Kelley Gulledge from Fargo-Moorhead (Northern). New York (NL): Declined 2005 option on P Al Leiter’s contract. San Francisco (NL): Purchased P Oscar Montero from Winnipeg (Northern). Seattle (AL): Named Jeff …

Sentences: Lonard wins second straight Australian Open Australia’s Peter Lonard successfully defended his title in the centennial Australian Open on Sunday, shooting a 3-under 68 for one-stroke victory over countryman Stuart Appleby.

Sentences: UMC continues to grow faster than TSMC Growth at foundry United Microelectronics Corporation (UMC) continues to outpace that at Taiwan Semiconductor Manufacturing Company (TSMC), which ended its record-sales streak in September.

Sentences: Mad Catz Signs Games Accessory Deal with Disney LOS ANGELES (Reuters) - Video game accessory maker Mad Catz Interactive Inc. will release a series of game-related products based on Disney properties starting with “The Incredibles,” Mad Catz Chief Executive Darren Richardson said on Wednesday.

Sentences: Pacifiers Could Help Teach Babies to Eat KANSAS CITY, Kan. (AP) – Researchers at the University of Kansas Medical Center are testing a high-tech pacifier that could help premature babies learn to eat…

Sentences: SAP expands offshore to cater to growth markets BANGALORE, INDIA - SAP plans to more than double the number of staff at its software development centers in Bangalore, India, and Shanghai by 2006, and is also considering setting up a new development center in Eastern Europe, according to a company executive.

Sentences: Miami Battles Back to Edge Louisville The No. 18 Louisville Cardinals flummoxed the No. 3 team in the nation Thursday night. They forcibly seized the notice of the college football world.

Sentences: Sacked EU whistleblower defiant The European Commission’s former chief accountant, who said the EU budget was open to fraud, will fight her dismissal.

Sentences: Two Iraqi Ministers Targeted in Separate Attacks Iraqi officials say assailants targeted the convoys of two Iraqi government ministers in separate attacks in Baghdad Tuesday. Neither official was hurt, but five other people were killed in the attacks.

Sentences: Chinese secession law may seek legal basis for use of force; BEIJING, (AFP) - A secession law being drafted by China could provide the legal basis for using force against Taiwan, but it is unlikely to include a clear deadline for when reunification must take place, analysts said.

Sentences: Iraq, Sadr Militia Begin Peace Talks Description: Efforts are under way to arrange a ceasefire in the Baghdad slum of Sadr City, where militiamen loyal to Shiite cleric Muqtada al-Sadr have been battling US forces.

Sentences: Magna spinoffs pledge to keep to long-term strategies amid; TORONTO (CP) - Top executives at two Magna International spinoffs the the auto parts giant wants to take private reaffirmed Tuesday that independent committees will analyse the bids and denied that outcomes in Magna’s favour have been pre-determined.

Sentences: Life after Howard Loss of marquee name and ad slowdown show big hurdles face Infinity and other radio broadcasters. By Krysten Crawford, CNN/Money staff writer.

Sentences: Sony sees PSP Asia launch in spring Sony plans to launch its new PlayStation Portable game console in Asia at the same time the product goes live in North America and Europe toward the end of next year’s first quarter, the company said on Friday.

Sentences: IBM supersizes storage arrays IBM is expected to debut its highest-capacity storage arrays, pitting them against high-end offerings from competitors including EMC and Hitachi.

Sentences: Rival engines catch up with Google GOOGLE, the world’s number one search engine, has lost its edge. That’s the considered view of software engineers who have been testing an early version of Microsoft’s MSN Search service, released last week.

Sentences: Investors Watching Consumers’ Jitters NEW YORK Sept. 26, 2004 - When nervous consumers hold on to their money, Wall Street gets nervous about profits. So the question investors hope will be answered in the coming week is, just how nervous are consumers these days?

Sentences: Microsoft To Beef Up Interoperability with Vintela Investment (NewsFactor) NewsFactor - Microsoft (Nasdaq: MSFT) has become a minority investor in Vintela, a Utah-based maker of software that allows the Windows operating system to communicate with other software types, such as Unix, Linux or Mac OS.

Sentences: French Troops Deploy in Ivory Coast After Rioting ABIDJAN (Reuters) - France deployed troops in the Ivory Coast’s main city on Sunday to protect its citizens from mob violence which erupted overnight after French forces destroyed most of the small West African nation’s air force.

Sentences: August Chip Sales Growth Slows on High Inventory Global semiconductor sales growth slowed to 1 percent in August as electronics makers reacted to growing inventories in Asia by limiting orders of chips, an industry trade group said on Thursday.

Sentences: Oil Won’t Derail U.S. Expansion -Bernanke WASHINGTON (Reuters) - Rising oil prices will weigh on U.S. economic growth but the increases seen so far will not derail the expansion and need not fuel a troubling inflation, Federal Reserve Board Governor Ben Bernanke said on Monday.

Sentences: Kenya Pushes for Ban on Hunting Lions (AP) AP - Kenya is pushing for an international ban on trade in lion trophies and skins, arguing that the number of the animals has declined sharply over the years as a result of hunting, loss of habitat and lack of prey.

Sentences: Morning dawns for championship Motorsport.com. The hotly contested Champ Car World series concludes the years competition today as nineteen drivers will determine the outcome of what has been the most thrilling season in recent memory.

Sentences: Keith Miller passes away, aged 84 Australian cricket legend Keith Miller passed away peacefully today, aged 84, in a nursing home on the Mornington Peninsula, south of Melbourne.

Sentences: Castro Breaks Knee in Dramatic Public Fall HAVANA (Reuters) - Cuban President Fidel Castro tripped and shattered a kneecap in a tumble captured on live television that raised new questions about the political future of the communist-run country he has led for 45 years.

Sentences: PeopleSoft sweetens severance packages In a move to retain employees, PeopleSoft quadruples the size of the minimum severance package for employees.

Overview. This prompt contains $k=3$ in-context exemplars per class. The original natural language labels [“abbreviation”, “entity”, “description and abstract concept”, “human being”, “location”, “numeric value”] have been remapped to [“publicity”, “cry”, “trains”, “ya”, “HNY”, “substances”], respectively.

Input: What Aesop ’s fable has the moral : “ The race is not always to the swift. Slow and steady is bound to win ” ?

Input: How many quarts of whole milk is needed to make one pound of butter ?

Input: How many pins are used in skittles ?

Input: The major league baseball team in Pittsburgh is called what ?

Input: How does a hydroelectric dam work ?

Input: Who wrote the song , “ Stardust ” ?

Input: What is the federal minimum wage ?

Input: What are the world ’s four oceans ?

Input: What is the abbreviated term used for the National Bureau of Investigation ?

Input: What do the names Andrew and Christina mean ?

Input: In what sport are these following numbers relevant : 118 , 126 , 134 , 142 , 15 , 158 , 167 , 177 , and 19 ?

Input: What is the only gland in humans that can regenerate itself ?

Input: What is the name of the planet that the Ewoks live on ?

Input: What is Drew Barrymore ’s middle name ?

Input: What city ’s the kickoff point for climbs of Mount Everest ?

Overview. This prompt contains $k=4$ in-context exemplars per class. The original natural language labels [“true”, “false”] have been remapped to [“FQG”, “testament”], respectively.

Sentences: So Mark slept. It was daylight when he woke with Warren ’s hand upon his shoulder.

Sentences: Lily spoke to Donna , breaking her concentration.

Sentences: Ann asked Mary what time the library closes, because she had forgotten.

Sentences: Bob paid for Charlie ’s college education, but now Charlie acts as though it never happened. He is very ungrateful.

Sentences: The mothers of Arthur and Celeste have come to the town to fetch them. They are very happy to have them back, but they scold them just the same because they ran away.

Sentences: James asked Robert for a favor but he was refused.

Sentences: Beth didn’t get angry with Sally , who had cut her off, because she stopped and apologized.

Sentences: The journalists interviewed the stars of the new movie. They were very persistent, so the interview lasted for a long time.

Sentences: Sir Clifford wants me to find him a new groom , about twenty or twenty-one, who knows his business. His old coachman is getting feeble, and he wants a man to work with him and get into his ways, who would be able, when the old man was pensioned off, to step into his place

Overview. This prompt contains $k=4$ in-context exemplars per class. The original natural language labels [“choice 1”, “choice 2”] have been remapped to [“4389”, “2093”], respectively.

Question: Craig had a chatting addiction unlike Kevin because _ spent too much time online playing games.

Question: The hard working bartender put ice cubes from the bucket into the glass until the _ was empty.

Question: The woman hung dry the sweaters but tried to shrink the trousers in the dryer because the _ fit loosely.

Question: James picked the item in the car and moved it to a box and then the _ was empty.

Question: Emily was a very good cook while Megan was not, so _ decided to teach a cooking class.

Question: Tanya used their cell phone signal, while Rebecca used wi-fi, because _ had a strong cell signal.

Question: The math department at the school excelled, while the history department floundered, as the district put miniscule funds into the _ department.

Question: Jessica’s health condition was better than Amy’s as _ didn’t eat as well and never exercised.

Question: The man tried to put the beanie inside the drawer but the _ was too small.

Overview. This prompt contains $k=7$ in-context exemplars per class. The original natural language labels [“offensive”, “not offensive”] have been remapped to [“desktops”, “possible”], respectively.

Input: @user @user And that’s where you are seeing what you want to see. You think these people want to look manly and powerful. I see them expressing solidarity. what a useless thing to mock.

Input: @user @user At least you got to experience the 90s though. These 90s and 2000s born missed out on a great decade. It’s been an increasingly worse shit show since lol

Input: @user & this statement by Jewish Antifa Berlin is so on point

Input: @user @user @user Agree Liberals will not have a majority but you will see People’s Party of Canada in the running your little c party will be lower in the ranks the the NDP.

Input: @user She lives in the hearts of all gun nut conservatives.

Input: @user your pappy is a big fat fool the Democrats doesn’t have to make him look bad he makes himself look just what he is an ignorant 5th level FOOL

Input: @user It’s only significant if he is packing his bags.

Input: @user liberals drink their own Kool-aid. They live in these bubbles and they think a lot of people like them. If they could see themselves how clownish they look and act.

Input: @user Shorty’s dad taught him eveything about Pussygrabbing Sexual assault #trump family time. #maga girls have to stay silent.

Input: #Kavanaugh this is the common use of guns … the conservatives and their prays of dead

Input: @user @user Dude why do you want #DeepState documents to be kept secret? #MAGA @user #LockThemUp

Input: @user @user way to go Liberals… taking away from Americans … smh

Input: @user No reporting on the death threats faced by Kavanaugh or ANY conservatives huh?

Overview. This prompt contains $k=9$ in-context exemplars per class. The original natural language labels [“irony”, “not irony”] have been remapped to [“DEA”, “3390”], respectively.

Input: Making a pair of lungs out of all the tobacco boxes I’ve collected for my sculpture project

Input: You know it’s going to be a great day when you’re Garmin resets itself and you spill some cinnamon down yourself #slowclap

Input: Ahhh 7 a.m bedtimes, how I’ve missed you #examproblems

Input: I’m seeing #catfish while I search for a #fishsitter for my #goldfish, #LoveLife

Input: @user Did I really need to put a hashtag? Oh dear!

Input: @user My entire body sends sympathies too - I was diagnosed with Rheumatoid Arthritis 11 yrs ago - yay flare-ups!!

Input: Update! Punt an Ostrich ? January 2, 2015 via @user

Input: @user Now, now, Islam is the "religion of peace" and only Christians can hurt anybody! Get it right! D:

Input: Glad to know my friends are supportive

Input: JLPT tomorrow. I am feeling vaguely confident. But just vaguely.

Input: At – #Sketch #today #spudshed #fresh #fruit .only.for.eating #for.drawing.as.well #drawing #Perth

Input: @user Hey #Russia ! MAKE MONEY #TWEETING YET?|| <-This #FOXNews Clip explains how||

Input: @user excited for 2015 its gonna be a good year for music ur gonna smash it #Ed

Input: Ah @user soul show making my Sunday morning at work more bearable #cold yet #happy

Input: Today stats: 5 followers, 8 unfollowers and followed 2 people via

Overview. This prompt contains $k=4$ in-context exemplars per class. The original natural language labels [“true”, “false”] have been remapped to [“completion”, “availability”], respectively.

X = He danced hypnotically while she beat the atabaque.

X = The ballet dancer walked with a graceful attitude.

X = Any good golf club will have a range where you can practice.

They used to drive the cattle across the open range every spring.

I have the authority to penalise the staff in my department, but not the authority to sack them.

They had to put their family pet to sleep.

X = I’m cold; can you roll over here and cuddle me, honey?

X = This receptiveness is the key feature in oestral behavior, enabling natural mating to occur.

He was testing the government’s receptiveness to reform.

Overview. This prompt contains $k=6$ in-context exemplars per class. The original natural language labels [“acceptable”, “not acceptable”] have been remapped to [“IRH”, “FOT”], respectively.

Q: Mary considers John a fool and Bill a wimp.

Q: The ship’s sinking to collect the insurance was very devious.

Q: There is likely to be no student absent.

Q: Sharon shivered at the thought of the cold sea.

Q: These are the things for which to be thankful.

Q: for discussion of the same phenomenon in Russian.

E.2 Evaluation task prompts

Here, we provide an example of a full evaluation prompt for each of the 11 datasets used in the main paper. For each dataset, we show an example from one of the four ICL settings in Figure 4, selected at random. Each prompt contains $k=4$ in-context exemplars per class for simplicity. We follow the process in Section 3.2 for remapping original labels to arbitrary symbols for evaluation.
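As an illustration of how such prompts can be assembled, the following is a minimal sketch and not the exact pipeline used in the paper. The function names (`remap_labels`, `build_prompt`), the "Input:"/"Output:" template, the symbol pool, and the toy exemplars are all hypothetical and chosen for illustration only. The sketch covers the four ICL settings by toggling whether labels are remapped and whether an instruction is prepended to each exemplar.

```python
import random


def remap_labels(labels, symbol_pool, rng, relevant=False):
    """Map each original label to a symbol.

    With relevant=True the original labels are kept (relevant-label setting);
    otherwise each label gets a distinct symbol drawn at random from the pool.
    """
    if relevant:
        return {label: label for label in labels}
    return dict(zip(labels, rng.sample(symbol_pool, len(labels))))


def build_prompt(exemplars, query, mapping, k, rng, instruction=None):
    """Render k exemplars per class plus one unanswered query.

    exemplars: list of (text, original_label) pairs.
    instruction: optional string placed before every input (assumed template).
    """
    chosen = []
    for label in mapping:
        pool = [ex for ex in exemplars if ex[1] == label]
        chosen.extend(rng.sample(pool, k))  # k exemplars for this class
    rng.shuffle(chosen)  # interleave the classes

    header = f"{instruction}\n" if instruction else ""
    blocks = [
        f"{header}Input: {text}\nOutput: {mapping[label]}"
        for text, label in chosen
    ]
    # The final block is the evaluation example; the model fills in the output.
    blocks.append(f"{header}Input: {query}\nOutput: ")
    return "\n\n".join(blocks)


# Toy usage with placeholder texts (not taken from any dataset).
rng = random.Random(0)
labels = ["irony", "not irony"]
symbol_pool = ["DEA", "3390", "IRH", "FOT", "blob", "SVN"]  # hypothetical pool
mapping = remap_labels(labels, symbol_pool, rng)
exemplars = [
    ("toy text a", "irony"),
    ("toy text b", "irony"),
    ("toy text c", "not irony"),
    ("toy text d", "not irony"),
]
prompt = build_prompt(exemplars, "toy query", mapping, k=2, rng=rng)
print(prompt)
```

Passing `relevant=True` to `remap_labels` keeps the natural language labels, and passing an `instruction` string adds a task description before each input, so the same two functions can produce all four combinations shown in the examples that follow.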

Overview. This prompt contains no relevant labels but has instructions. The original natural language labels [“objective”, “subjective”] have been remapped to [“69651”, “BNDQ”], respectively.

Question: Is the following sentence subjective or objective?

however , boey and wayne get closer and johnny ( who had broken up with samantha ) falls for his new secretart , the paranoid sabrina .

Question: Is the following sentence subjective or objective?

the film is almost eerily calm and refuses to take sides . but that lets its insights penetrate all the deeper .

Question: Is the following sentence subjective or objective?

what they soon realize , though , is they are not alone in hypertime .

Question: Is the following sentence subjective or objective?

writer/director david caesar ladles on the local flavour with a hugely enjoyable film about changing times , clashing cultures and the pleasures of a well-made pizza .

Question: Is the following sentence subjective or objective?

seven years later , the submarine uss tunny successfully launched the regulus nuclear cruise missile , and a whole new era in the history of the navy , the submarine and the cold war began !

Question: Is the following sentence subjective or objective?

the video work is so grainy and rough , so dependent on being ’naturalistic ’ rather than carefully lit and set up , that it ’s exhausting to watch .

Question: Is the following sentence subjective or objective?

evelyn may be a weightless picture , but it ’s hardly torture to sit through .

Question: Is the following sentence subjective or objective?

he fails to win eve ’s heart and is consequently dejected .

Question: Is the following sentence subjective or objective?

in a last ditch effort to stop memnon from taking over the world , the leaders of the remaining free tribes hire the assassin mathayus to kill the sorceress .

Overview. This prompt contains relevant labels but no instructions. The natural language labels are [“hate”, “not hate”].

Q: You girls were working it today @user @user Its too bad you both come across as hysterical women.

Q: Stephen Miller - Public Charge Rule is not New its been on the books since 1882 50% of Immigrants on Welfare 90% will Remain on Welfare after 20 years burdening U.S Taxpayers #Immigration #Trump #MAGA #SendThemBack via @user

Q: EU keen to strike deal with Muammar Gaddafi on immigration | World news | The Guardian

Q: Juncker response on migrants a step forward says Conte

Q: @user Andrew Cuomo, a self proclaimed “undocumented” immigrant who frees criminal #illegalAliens says #GOP is on a “Jihad” to deport illegals. That’s akin to flipping the bird at taxpayers. You’re an embarrassment & disappointment to law abi

Q: Europe wants centers in Africa to vet migrants Critics say it’s abdicating its responsibilities

Q: I think Booker is a more hysterical woman than Kamala

Q: Copper Sandwich Maker Sweeps Sponsored by Money Nuts and Kitchen Authority! Enter today!

Q: Why is #MSM so quiet? How many more innocent people need to die before congress gets off their butts and passes legislation to #BuildThatWall Quit showboating over a judge you know is very qualified and do your job to protect American Citizens! #MagaOneVoice

Overview. This prompt contains no relevant labels and no instructions. The original natural language labels [“against”, “none”, “favor”] have been remapped to [“41098”, “blob”, “SVN”], respectively.

Student: @user 1/3 of my generation is missing. And it can’t be changed. But we can change the future. #ProLifeYouth #SemST

Student: I just want to sit in a corner and cry. I wish I was a thicker-skinned feminist but this shit is personal! #MyBodyMyRights #SemST

Student: How ironic is it that im sitting in @user & swiping on @user right now? #SemST

Student: Thank you for another day of life Lord. #Christian #Catholic #TeamJesus #SemST

Student: Every time you respond to something that frustrates you, you let it steal away your time and happiness. #EasyWeightLoss #SemST

Student: Adding to the progress of this week the Supreme Court is also allowing Texas abortion clinics to stay open! #SCOTUS #SemST

Student: Thank you @user for treating me with kindness & respect & TLC during my wellness exam. #womenshealth #yaycondoms #SemST

Student: If ’tis not human beings in the womb, how do we harvest and transplant their organs onto human beings? #SemST

Student: Everyone who disagrees has always had the right to mind their own damn business!!!! #LoveWins #SemST

Student: When is abortion a responsible choice? When a woman chooses it to be #SemST

Student: My body, my life. You fuck it up in a way I’m not prepared for and I will kill you. #SemST

Student: A person’s a person, no matter how small. - Dr. Suess #WAAR #SemST

Student: Can we get a law for the little ones who can’t even speak for themselves? #ProLifeYouth #EVERYLIFEMATTERS #gay #straight #baby #SemST

Overview. This prompt contains relevant labels and instructions. The original natural language labels are [“against”, “none”, “favor”].