MetaICL: Learning to Learn In Context

Sewon Min, Mike Lewis, Luke Zettlemoyer, Hannaneh Hajishirzi

Introduction

Large language models (LMs) have recently been shown to be able to do in-context learning (Brown et al., 2020), where they learn a new task simply by conditioning on a few training examples and predicting which tokens best complete a test input. This type of learning is attractive because the model learns a new task through inference alone, without any parameter updates. However, performance significantly lags behind supervised finetuning, results are often high variance (Zhao et al., 2021; Perez et al., 2021), and it can be difficult to engineer the templates that convert existing tasks to this format.

In this paper, we address these challenges by introducing MetaICL: Meta-training for In-Context Learning. MetaICL tunes a pretrained language model on a large set of tasks to learn how to in-context learn, and is evaluated on strictly new unseen tasks. Each meta-training example matches the test setup—it includes $k+1$ training examples from one task that will be presented together as a single sequence to the language model, and the output of the final example is used to calculate the cross-entropy training loss. Simply finetuning the model in this data setup directly leads to better in-context learning—the model learns to recover the semantics of the task from the given examples, as must be done for in-context learning of a new task at test time. This approach is related to recent work that uses multi-task learning for better zero-shot performance at test time (Khashabi et al., 2020; Zhong et al., 2021; Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022). However, MetaICL is distinct as it allows learning new tasks from $k$ examples alone, without relying on a task reformatting (e.g., reducing everything to question answering) or task-specific templates (e.g., converting different tasks to a language modeling problem).

We experiment on a large, diverse collection of tasks taken from Ye et al. (2021) and Khashabi et al. (2020), including 142 text classification, question answering, natural language inference and paraphrase detection datasets. We report seven different settings, all with no overlap between meta-training and target tasks. This leads to 52 unique target tasks in total, which is the largest among all recent related work to the best of our knowledge.

Experimental results show that MetaICL consistently outperforms baselines including (1) a variety of LM in-context learning baselines without meta-training (Brown et al., 2020; Zhao et al., 2021; Holtzman et al., 2021; Min et al., 2022), and (2) multi-task learning followed by zero-shot transfer (Zhong et al., 2021; Wei et al., 2022; Sanh et al., 2022). Gains over multi-task zero-shot transfer are particularly significant when meta-training tasks and target tasks are dissimilar, e.g., when there are large differences in task formats, domains, or required skills. This demonstrates that MetaICL enables the model to recover the semantics of the task in context during inference even when the target does not share similarities with meta-training tasks. MetaICL often gets close to (and sometimes beats) the performance of models trained with supervised finetuning on the target datasets, and performs as well as models with 8x as many parameters. We also perform extensive ablations to identify key ingredients for the success of MetaICL, such as the number and diversity of meta-training tasks. Finally, we demonstrate that MetaICL without any templates is better than recent work using human-written natural instructions, while the best performance is achieved by combining both approaches. Code and data are publicly released at github.com/facebookresearch/MetaICL.

Related Work

Brown et al. (2020) propose to use a language model (LM) conditioned on a concatenation of training examples for few-shot learning with no parameter updates. It has been further improved by later work (Zhao et al., 2021; Holtzman et al., 2021; Min et al., 2022), showing promising results on a variety of tasks. However, in-context learning with an LM achieves poor performance when the target task is very different from language modeling in nature or the LM is not large enough. Moreover, it can have high variance and poor worst-case accuracy (Perez et al., 2021; Lu et al., 2021).

Our paper is based on the core idea of in-context learning by conditioning on training examples. We show that, by explicitly training on an in-context learning objective, MetaICL achieves substantial improvements even with smaller LMs.

Meta-training via multi-task learning

Our work is broadly inspired by a large body of work in meta-learning (Vilalta and Drissi, 2002; Finn et al., 2017) and multi-task learning (Evgeniou and Pontil, 2004; Ruder, 2017). Prior work has shown that multi-task learning on a large collection of tasks leads to better performance on a new task, either when tested zero-shot (Khashabi et al., 2020; Zhong et al., 2021; Mishra et al., 2022; Wei et al., 2022) or when further finetuned (Aghajanyan et al., 2021; Ye et al., 2021). In particular, the former is closely related to our work, as it eliminates the need for parameter updates on a target task. However, these zero-shot models are either limited to tasks sharing the same format as training tasks (e.g., a question answering format) (Khashabi et al., 2020; Zhong et al., 2021), or rely heavily on task-specific templates (Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022) which are difficult to engineer due to high variance in performance from very small changes (Mishra et al., 2021).

In this paper, we propose a meta-training method for better in-context learning that improves few-shot performance. We show that it effectively learns the semantics of a new task with no manual effort, significantly outperforming zero-shot transfer methods. We show that MetaICL without instructions is still better than zero-shot transfer with instructions, but that using instructions further improves the performance of MetaICL (Section 5.2). Furthermore, while Wei et al. (2022) show that meta-training helps only when the model has 68B or more parameters, our experiments demonstrate improvements with a much smaller model (770M).

Chen et al. (2022), concurrently to our work, propose meta-training for in-context learning. Our approach differs in a number of ways: we remove requirements of human-written templates or instructions, and include more diverse tasks, stronger baselines, and extensive experiments in much larger scale with many meta-training/target splits.

MetaICL

We introduce MetaICL: Meta-training for In-Context Learning. Table 1 provides an overview of the approach. The key idea is to use a multi-task learning scheme over a large collection of meta-training tasks, in order for the model to learn how to condition on a small set of training examples, recover the semantics of a task, and predict the output based on it. Following previous literature (Brown et al., 2020), the training examples are concatenated and provided as a single input to the model, which is feasible for $k$-shot learning (e.g., $k=16$). At test time, the model is evaluated on an unseen target task that comes with $k$ training examples, and inference directly follows the same data format as in meta-training.

The model is meta-trained on a collection of tasks which we call meta-training tasks. For every iteration, one meta-training task is sampled, and $k+1$ training examples $(x_{1},y_{1}),\cdots,(x_{k+1},y_{k+1})$ are sampled from the training examples of the chosen task. We then supervise the model by feeding the concatenation of $x_{1},y_{1},\cdots,x_{k},y_{k},x_{k+1}$ to the model as an input and train the model to generate $y_{k+1}$ using a negative log likelihood objective. This simulates in-context learning at inference where the first $k$ examples serve as training examples and the last $(k+1)$ -th example is regarded as the test example.
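The construction of one meta-training instance described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `build_meta_training_instance`, the `loss_mask` representation, and the toy tokenizer are assumptions introduced here. The mask marks only the tokens of $y_{k+1}$, so that a negative log likelihood loss would be computed on the final output alone.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input x, output y)


def build_meta_training_instance(
    task_examples: Sequence[Example],
    k: int,
    tokenize: Callable[[str], List[int]],
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Sample k+1 examples from one task and return (token_ids, loss_mask).

    loss_mask is 1 only on the tokens of y_{k+1}; every other position
    (x_1, y_1, ..., x_k, y_k, x_{k+1}) is context and is not supervised.
    """
    sampled = rng.sample(list(task_examples), k + 1)
    token_ids: List[int] = []
    loss_mask: List[int] = []
    for i, (x, y) in enumerate(sampled):
        x_ids, y_ids = tokenize(x), tokenize(y)
        is_last = i == k  # the (k+1)-th example plays the role of the test example
        token_ids += x_ids + y_ids
        loss_mask += [0] * len(x_ids) + [1 if is_last else 0] * len(y_ids)
    return token_ids, loss_mask


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in tokenizer: one id per whitespace-separated word.
    return [sum(map(ord, word)) % 1000 for word in text.split()]
```

In an actual training loop, the per-token loss would be multiplied by `loss_mask` before averaging; separators between inputs and outputs are omitted here for brevity.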

Inference

For a new target task, the model is given $k$ training examples $(x_{1},y_{1}),\cdots,(x_{k},y_{k})$ as well as a test input $x$. It is also given a set of candidates $\mathcal{C}$, which is either a set of labels (in classification) or answer options (in question answering). As in meta-training, the model takes a concatenation of $x_{1},y_{1},\cdots,x_{k},y_{k},x$ as the input, and computes the conditional probability of each candidate $c_{i}\in\mathcal{C}$. The candidate with the maximum conditional probability is returned as the prediction.
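The inference rule above amounts to an argmax over candidates. The sketch below assumes a scoring function `log_prob(context, candidate)` that returns the conditional log probability of a candidate given the context; that function, the name `predict`, the space separator, and the toy scorer are all hypothetical placeholders rather than part of the method's released code.

```python
from typing import Callable, List, Sequence, Tuple


def predict(
    demonstrations: Sequence[Tuple[str, str]],
    test_input: str,
    candidates: Sequence[str],
    log_prob: Callable[[str, str], float],
) -> str:
    """Return the candidate c maximizing log P(c | x_1, y_1, ..., x_k, y_k, x)."""
    # Concatenate the k demonstrations and the test input into one context.
    context = " ".join(f"{x} {y}" for x, y in demonstrations)
    context = f"{context} {test_input}".strip()
    scores: List[float] = [log_prob(context, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


def toy_log_prob(context: str, candidate: str) -> float:
    # Hypothetical scorer: favors candidates that appear often in the context.
    return float(context.split().count(candidate))
```

With a real LM, `log_prob` would sum the token-level log probabilities of the candidate conditioned on the context.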

Channel MetaICL

Experimental Setup

We use a large collection of tasks taken from CrossFit (Ye et al., 2021) and UnifiedQA (Khashabi et al., 2020). We have 142 unique tasks in total, covering a variety of problems including text classification, question answering (QA), natural language inference (NLI) and paraphrase detection. All tasks are in English.

We experiment with seven distinct settings as shown in Table 2, where there is no overlap between the meta-training and target tasks. The number of unique target tasks in total is 52, which is significantly larger than other relevant work (Khashabi et al., 2020; Zhong et al., 2021; Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022). Each target task is either classification or multi-choice, where a set of candidate options ( $\mathcal{C}$ in Table 1) is given.

HR $\rightarrow$ LR (High resource to low resource): We experiment with a setting where datasets with 10,000 or more training examples are used as meta-training tasks and the rest are used as target tasks. We think using high resource datasets for meta-training and low resource datasets as targets is a realistic and practical setting for few-shot learning.

X $\rightarrow$ X (X={Classification, QA}): We experiment with two settings with meta-training and target tasks sharing the task format, although with no overlap in tasks.

Non-X $\rightarrow$ X (X={Classification, QA, NLI, Paraphrase}): Lastly, we experiment with four settings where meta-training tasks do not overlap with target tasks in task format and required capabilities. These settings require the most challenging generalization capacities.

Each setting has a subset of target tasks with no domain overlap with any meta-training tasks (e.g., finance, poem, climate or medical). We report results both on all target tasks and on target tasks with no domain overlap only. Full details of the settings and datasets with citations are provided in Appendix A.
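As an illustration of the HR $\rightarrow$ LR setting, the split by training-set size can be sketched as below. The function name `split_hr_lr` and the task names and sizes in the usage note are hypothetical; only the 10,000-example threshold comes from the text.

```python
from typing import Dict, List, Tuple


def split_hr_lr(
    train_sizes: Dict[str, int], threshold: int = 10_000
) -> Tuple[List[str], List[str]]:
    """Tasks with at least `threshold` training examples become meta-training
    tasks; all remaining tasks become target tasks."""
    meta_train = sorted(t for t, n in train_sizes.items() if n >= threshold)
    targets = sorted(t for t, n in train_sizes.items() if n < threshold)
    return meta_train, targets
```

For example, `split_hr_lr({"task_a": 50_000, "task_b": 800})` places `task_a` in meta-training and `task_b` among the targets, so the two sets never overlap.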

Baselines

We compare MetaICL and Channel MetaICL with a range of baselines, as summarized in Table 3.

0-shot: We use a pretrained LM as it is and run zero-shot inference, following Brown et al. (2020).

In-context: We use the pretrained LM as it is and use in-context learning by conditioning on a concatenation of $k$ training examples, following Brown et al. (2020).

PMI 0-shot, PMI In-context: We use the PMI method from Holtzman et al. (2021); Zhao et al. (2021) for 0-shot and In-context learning.

Channel 0-shot, Channel In-context: We use the noisy channel model from Min et al. (2022) for 0-shot and In-context learning.

Multi-task 0-shot: We train the LM on the same meta-training tasks without the in-context learning objective, i.e., we maximize $P(y|x)$ without $k$ other training examples, and then use zero-shot transfer on a target task. This is equivalent to MetaICL with $k=0$. This is a typical multi-task learning approach from previous work (Khashabi et al., 2020; Zhong et al., 2021; Wei et al., 2022).

Channel Multi-task 0-shot: We have a channel variant of Multi-task 0-shot.

Fine-tune: We fine-tune the LM on an individual target task. This is not directly comparable to other methods as parameter updates are required for every target task.

Fine-tune w/ meta-train: We train the LM on meta-training tasks first and then further fine-tune it on a target task. This is not directly comparable to other methods for the same reason as above.

Evaluation

We use Macro-F1 (more suitable than accuracy for imbalanced classification) and Accuracy as evaluation metrics for classification tasks and non-classification tasks, respectively.

For a target task, we use $k=16$ training examples, sampled uniformly at random. We relax the assumption of perfect balance between labels on $k$ training examples, following Min et al. (2022). Because in-context learning is known to have high variance (Zhao et al., 2021; Perez et al., 2021; Lu et al., 2021), we use 5 different sets of $k$ training examples. We first compute the average and the worst-case performance over seeds for every target task, and then report the macro-average of them over all target tasks.
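The reporting protocol above can be sketched as follows: per-task mean and worst-case scores are computed over the seeds, and both are then macro-averaged over target tasks. The function names `macro_f1` and `aggregate` and the input layout are assumptions made for this illustration; the Macro-F1 here is the standard unweighted mean of per-label F1 over labels seen in either the gold or predicted labels.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 scores."""
    f1s = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return mean(f1s)


def aggregate(scores: Dict[str, List[float]]) -> Tuple[float, float]:
    """`scores` maps each target task to its per-seed scores
    (one score per sampled set of k training examples).

    Returns (macro-average of per-task means,
             macro-average of per-task worst cases)."""
    per_task_mean = [mean(s) for s in scores.values()]
    per_task_worst = [min(s) for s in scores.values()]
    return mean(per_task_mean), mean(per_task_worst)
```

Averaging per task first and across tasks second keeps tasks with large test sets from dominating the reported number.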

Experiment Details

As a base LM, we use GPT-2 Large (Radford et al., 2019), which consists of 770M parameters (Appendix C.2 reports performance for other LM sizes). For baselines without meta-training (raw LMs), we also compare with GPT-J (Wang and Komatsuzaki, 2021), which is the largest public causal LM at the time of writing, consisting of 6B parameters.

Prior work uses human-authored templates to transform the input-output pair to a natural language sentence (Zhong et al., 2021; Mishra et al., 2022; Wei et al., 2022; Chen et al., 2022). They require expensive manual effort (as 136 different templates are required for 136 tasks in this paper) and cause unstable model performance due to many different ways of writing (Mishra et al., 2021). We eliminate templates, using the given input (or a concatenation of inputs if there are multiple) and label words provided in the original datasets. (In our preliminary experiments, we explored templates taken from prior work, but found that they do not consistently improve few-shot performance, even when they do improve zero-shot performance.) A comparison of input-output schemes from prior work and our approach is shown in Table 4.

Training details

All implementation is done in PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2020). For meta-training, we use up to 16,384 training examples per task. We use a batch size of $8$, a learning rate of $1\times 10^{-5}$ and a sequence length of $1024$. For multi-task 0-shot baselines (the baselines with no in-context learning), we use a sequence length of $256$. We train the model for $30,000$ steps (we also explored training longer, but it did not improve performance). To save memory during meta-training, we use an 8-bit approximation (Dettmers et al., 2022) of an Adam optimizer (Kingma and Ba, 2015) and mixed precision (Micikevicius et al., 2017). Training was done for 4.5 hours with eight 32GB GPUs. This is drastically more efficient than recent prior work, e.g., 270 hours of a 512GB TPU in Sanh et al. (2022).

More details about preprocessing and training can be found in Appendix B.

Experimental Results

Table 5 reports the full results using GPT-2 Large, where we compute the average and the worst-case performance of every target task and report the macro-average over them. The top and the bottom respectively evaluate on all target tasks and target tasks in unseen domains only.

We first discuss the results of our baselines. Among raw LMs without meta-training (the first six rows of Table 5), we observe that channel in-context baselines are the most competitive, consistent with findings from Min et al. (2022). We then find that Multi-task 0-shot baselines do not outperform the best raw LM baseline in most settings, despite being supervised on a large set of meta-training tasks. This somewhat contradicts findings from Wei et al. (2022); Sanh et al. (2022). This is likely for two reasons. First, our models are much smaller than theirs (770M vs. 11B–137B); in fact, Wei et al. (2022) report that Multi-task 0-shot starts to be better than raw LMs only when the model size is 68B or larger. Second, we compare with much stronger channel baselines, which they did not; Multi-task 0-shot outperforms non-channel LM baselines but not channel LM baselines.

MetaICL outperforms baselines

MetaICL and Channel MetaICL consistently outperform a range of strong baselines. In particular, Channel MetaICL achieves the best performance in 6 out of 7 settings. Gains are particularly significant in the HR $\rightarrow$ LR, non-NLI $\rightarrow$ NLI and non-Para $\rightarrow$ Para settings (6–15% absolute). This is noteworthy because HR $\rightarrow$ LR targets the common low-resource case where new tasks have very few labeled examples, and the other two represent large data distribution shifts where the test tasks are relatively different from the meta-training tasks. This demonstrates that MetaICL can infer the semantics of new tasks in context even when there are no closely related training tasks.

While MetaICL significantly outperforms baselines in most settings, it only marginally outperforms Multi-task 0-shot in the QA $\rightarrow$ QA setting, as an exception. This is likely because the meta-training and target tasks are relatively similar, allowing the Multi-task 0-shot baseline to achieve very strong performance. Nonetheless, performance of Multi-task 0-shot in QA significantly drops when the model is trained on non-QA tasks, while performance of MetaICL drops substantially less.

Gains are larger on unseen domains

Gains over Multi-task 0-shot are more significant on target tasks in unseen domains. In particular, Multi-task 0-shot is generally less competitive compared to raw LM baselines on these tasks, likely because unseen domains require more challenging generalization. MetaICL suffers less from this problem and is consistently better than or comparable to raw LM baselines across all settings.

Comparison to fine-tuning

MetaICL matches or sometimes even outperforms fine-tuned models without meta-training. This is a promising signal, given that no prior work has shown models with no parameter updates on the target can match or outperform supervised models. Nonetheless, fine-tuning with meta-training exceeds both MetaICL and fine-tuning without meta-training, because meta-training helps in supervised learning as it does in in-context learning. This indicates that there is still room for improvement in methods that allow learning without parameter updates.

Comparison to GPT-J

In Table 6, we compare GPT-2 Large based models with raw LM baselines based on GPT-J which consists of 6B parameters. MetaICL, despite being 8x smaller, outperforms or matches GPT-J baselines.

Ablations

We vary the number of training examples ($k$) over 0, 4, 8, 16 and 32. In-context learning with $k=0$ is equivalent to the zero-shot method. Results are shown in Figure 1. Increasing $k$ generally helps across all models, and Channel MetaICL outperforms the raw in-context learning over all values of $k$. We additionally find that the performance tends to saturate when $k$ is closer to $16$, likely because the sequence length limit of the language model makes it hard to encode many training examples.

Number of meta-training tasks

To see the impact of the number of meta-training tasks, we subsample $\{7,15,30\}$ meta-training tasks out of 61 in the HR $\rightarrow$ LR setting. For each, we use ten different random seeds to additionally see the impact of the choice of meta-training tasks.

Figure 2 reports the results. On average, performance generally increases as the number of tasks increases, which is consistent with results in Mishra et al. (2022); Wei et al. (2022). Across different numbers of meta-training tasks, Channel MetaICL consistently outperforms other models. Nonetheless, there is nonnegligible variance across different choices of meta-training tasks (the bottom of Figure 2), indicating that the choice of meta-training tasks has a substantial impact on performance.

Diversity in meta-training tasks

We hypothesize that the diversity in meta-training tasks may impact performance of MetaICL. To verify this hypothesis, we create two settings by subsampling 13 out of 61 meta-training datasets in the HR $\rightarrow$ LR setting. One setting is diverse in its task formats and required capacities: QA, NLI, relation extraction, sentiment analysis, topic classification, hate speech detection and more. The other setting is less diverse, including tasks related to sentiment analysis, topic classification and hate speech detection only. A full list of datasets is reported in Appendix A. Using these two settings, we compare multi-task zero-shot transfer baselines and MetaICL.

Results are reported in Table 7. We find that MetaICL with a diverse set outperforms MetaICL with a non-diverse set by a substantial margin. This shows that diversity among meta-training tasks is one of the substantial factors for the success of MetaICL.

In Appendix C.3, we include ablations that provide more insights on the choice of meta-training tasks, such as (1) high-quality data with diverse domains tends to help (e.g., GLUE family (Wang et al., 2018)) and (2) adversarially collected data tends to be unhelpful. However, more systematic studies on how to choose the best meta-training tasks and how they relate to particular target tasks should be done, which we leave for future work.

Are instructions necessary?

Most recent work has used human-written natural instructions for zero- or few-shot learning (Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022). While we argue for not using instructions to avoid manual engineering and high variance, we also ask: are instructions still useful with MetaICL? On one hand, learning to condition on $k$ examples may remove the necessity of instructions. On the other hand, instructions may still be complementary and provide the model with extra useful information.

We aim to answer this question by using 32 meta-training tasks and 12 target tasks from the HR $\rightarrow$ LR setting for which human-written instructions are available in Sanh et al. (2022) (github.com/bigscience-workshop/promptsource). We have two variants: (a) using one instruction per meta-training task, and (b) using all available instructions, 267 instructions in total (8.3 per meta-training task), which Sanh et al. (2022) found to be better than (a). We then compare MetaICL and a range of baselines with and without instructions.

Results are reported in Table 8. As in Wei et al. (2022) and Sanh et al. (2022), Multi-task 0-shot outperforms the raw-LM 0-shot baseline. However, MetaICL with no instructions is better than Multi-task 0-shot with instructions. Furthermore, MetaICL achieves further improvements when instructions are jointly used, significantly outperforming all baselines. In fact, when increasing the number of instructions per task from 0 to 1 to 8.3, performance of MetaICL improves much more than performance of Multi-task 0-shot does. To summarize, (1) learning to in-context learn (MetaICL) outperforms learning to learn from instructions; (2) MetaICL and using instructions are largely complementary; and (3) MetaICL actually benefits more from using instructions than Multi-task 0-shot does.

Importantly, Channel MetaICL trained on available tasks and instructions still achieves lower performance than Channel MetaICL without templates/instructions ($46.9$ from Table 8 vs. $49.1$ from Table 5). This is likely because the model with instructions was trained with fewer meta-training tasks, which was unavoidable since instructions are only available on 32 out of 61 meta-training tasks. This supports our earlier choice of not using human-written templates/instructions, since writing templates and instructions for every task requires extensive effort.

It is worth noting that it is nonetheless difficult to make direct comparisons with Wei et al. (2022) and Sanh et al. (2022) because there are many moving components: size of LMs, types of LMs (e.g., causal LM vs. masked LM), splits between meta-training and target tasks, and more.

Conclusion

In this paper, we introduced MetaICL, a new few-shot learning method where an LM is meta-trained to learn to in-context learn, i.e. condition on training examples to recover the task and make predictions. We experiment with a large, diverse collection of tasks, consisting of 142 unique tasks in total and 52 unique target tasks, using seven different settings. MetaICL outperforms a range of strong baselines including in-context learning without meta-training and multi-task learning followed by zero-shot transfer, and outperforms or matches 8x bigger models. We identify ingredients for success of MetaICL such as the number and diversity of meta-training tasks. We also demonstrate that, while MetaICL is better than recent work using natural instructions, they are complementary and the best performance is achieved by integrating MetaICL with instructions.

Our work is limited in multiple dimensions. First, in-context learning approaches in general require much longer context at both meta-training and inference due to feeding the concatenation of the training data, thus being less efficient compared to baselines that do not use in-context learning. Second, our work experiments with a causal language model of modest size (GPT-2 Large, 770M parameters). Future work may investigate extending our approach to a masked language model and a larger model. Third, our experiments focus on classification and multi-choice tasks where a set of candidate options is given. Future work may study applying our approach to a wider range of tasks including free-form generation. Other avenues for future work include further improving MetaICL to outperform supervised models with meta-training, identifying which meta-training tasks are helpful on target tasks, and determining how to better combine human-written instructions and MetaICL.

Acknowledgements

We thank Ari Holtzman and Victoria Lin for comments and discussions, and Tim Dettmers for help with experiments. This research was supported by NSF IIS-2044660, ONR N00014-18-1-2826, an Allen Distinguished Investigator Award, and a Sloan Fellowship.

References

Appendix A Dataset List

Table 14 and Table 15 report a list of datasets used in the settings detailed in Section 4.1. The first 10 rows are for settings described in Section 4.1; the next two rows are for settings used for ablations on the diversity of meta-training tasks (Table 7 of Section 5.2); the last two rows are for settings used for ablations on using natural instructions (Table 8 of Section 5.2). Bold datasets are target datasets with no overlap in domain with meta-training tasks. All datasets are taken from CrossFit (Ye et al., 2021) (except that we exclude datasets that are unavailable from their repository, github.com/INK-USC/CrossFit, or whose scope is notably different from other tasks, e.g., solving math problems or breaking down compositional questions) and UnifiedQA (Khashabi et al., 2020).

The HR $\rightarrow$ LR setting is created based on the training data size as described in Section 4.1. Settings involving Classification, NLI and Paraphrase are taken from CrossFit. Settings involving QA are created by combining QA datasets from CrossFit and datasets from UnifiedQA.

Statistics are reported in Table 2 and Table 9. The number of tasks is the largest among recent related work: we have 142 unique tasks, while Khashabi et al. (2020), Zhong et al. (2021), Mishra et al. (2022), Wei et al. (2022) and Sanh et al. (2022) use 32, 62, 61, 42 and 62 tasks, respectively. References for all datasets are provided in Table 15. Data and splits are available at github.com/facebookresearch/MetaICL.

Appendix B Implementation Details

For all models with meta-training and the raw GPT-J, we separate the input and the output with one newline ($\backslash$n), and separate examples with three newlines. For the raw GPT-2, we use spaces instead of newlines. This choice was made in order to report the best baseline performance we were able to achieve: when raw LMs are used, GPT-2 is significantly better with spaces than with newlines, and GPT-J is significantly better with newlines than with spaces. For example, in the HR $\rightarrow$ LR setting, the raw GPT-2 is about $4$% better with spaces than with newlines, and the raw GPT-J is about $5$% better with newlines than with spaces (all with the channel in-context learning method). We note that MetaICL is less sensitive to these formatting differences, with less than 2% difference between using spaces and using newlines.
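The separator convention above can be illustrated with a small formatting helper. The name `format_sequence` is hypothetical, and the space variant assumes a single space both within and between examples, which the text does not specify.

```python
from typing import Sequence, Tuple


def format_sequence(
    demonstrations: Sequence[Tuple[str, str]],
    test_input: str,
    use_newlines: bool = True,
) -> str:
    """Join k (input, output) pairs and the test input into one string.

    Newline variant: one newline between an input and its output, three
    newlines between examples. Space variant (assumed here to use a single
    space everywhere): spaces instead of newlines.
    """
    io_sep = "\n" if use_newlines else " "
    example_sep = "\n\n\n" if use_newlines else " "
    parts = [f"{x}{io_sep}{y}" for x, y in demonstrations]
    parts.append(test_input)
    return example_sep.join(parts)
```

Calling `format_sequence(pairs, x, use_newlines=False)` gives the space-separated form used for the raw GPT-2 baseline in this sketch.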

When the concatenation of $k$ examples is too long, we truncate each example to have at most $256$ tokens, and truncate the earlier tokens of the concatenation so that the LM sees the recent tokens. Additionally, for extractive question answering datasets as meta-training tasks, the input passage is truncated with a guarantee that the groundtruth answer is included in the input passage. We do not do this truncation for target datasets.
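The two-stage truncation can be sketched on token-id lists as follows. The function name `truncate_concatenation` is hypothetical; the 256 and 1024 limits come from the text and the training details, while clipping each example from its end (keeping its first 256 tokens) is an assumption, since the text does not say which side of an example is cut.

```python
from typing import List, Sequence


def truncate_concatenation(
    example_token_ids: Sequence[Sequence[int]],
    max_per_example: int = 256,
    max_total: int = 1024,
) -> List[int]:
    """Clip each example to `max_per_example` tokens, concatenate, then drop
    the earliest tokens so that at most `max_total` recent tokens remain.

    `max_total` must be positive.
    """
    # Assumption: keep the first tokens of each over-long example.
    clipped = [list(ids[:max_per_example]) for ids in example_token_ids]
    flat = [tok for ids in clipped for tok in ids]
    # Keep the most recent tokens, so the test input at the end survives.
    return flat[-max_total:]
```

Dropping from the front means that, under a tight budget, the earliest demonstrations are lost first while the test input stays intact.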

Comparison with baselines in training and inference cost

Although being trained for the same global steps (30,000 steps), it takes 3 hours to train Multi-task 0-shot baselines (in contrast to 4.5 hours for MetaICL), likely because the sequence length is 4x shorter. At inference, Multi-task 0-shot baselines are roughly 4x more efficient, also because the sequence length is 4x shorter. (Let $L$ be the sequence length; the memory requirements for attention layers and feed-forward layers are $O(L^{2})$ and $O(L)$, respectively. In practice, feed-forward layers are responsible for most memory usage when the size of the transformers is large, thus empirical memory usage tends to be linear in $L$.) We did not control for the training time and the inference time for comparison since both models are efficient enough.

Ablations in using instructions

When we choose one instruction per meta-training task, we choose one by (1) first excluding the instruction if its name contains no_option, (2) then taking the instruction whose name contains multiple_choice, most_correct or most_suitable if there are any, and (3) if not, then randomly sampling one. We choose one instruction per target task at test time using the same process. This is different from Sanh et al. (2022), where the median of the performance over all instructions is reported. We think our choice better reflects the real use-case scenario: choosing one instruction that looks the most reasonable to a human.
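The three-step selection rule can be sketched as below. The function name `choose_instruction` is hypothetical, and taking the first match when several instruction names satisfy step (2) is an assumption; the text does not state a tie-breaking rule.

```python
import random
from typing import List, Sequence

PREFERRED_SUBSTRINGS = ("multiple_choice", "most_correct", "most_suitable")


def choose_instruction(names: Sequence[str], rng: random.Random) -> str:
    """Pick one instruction name for a task.

    Raises IndexError if every name is excluded in step (1).
    """
    # (1) Exclude instructions whose name contains "no_option".
    kept: List[str] = [n for n in names if "no_option" not in n]
    # (2) Prefer a name containing one of the preferred substrings.
    preferred = [n for n in kept if any(s in n for s in PREFERRED_SUBSTRINGS)]
    if preferred:
        return preferred[0]  # assumption: first match wins
    # (3) Otherwise sample one uniformly at random.
    return rng.choice(kept)
```

The same function would be applied to both meta-training and target tasks, matching the statement that the process is shared.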

Appendix C Additional Results & Analyses

Table 10 reports the full results of raw LM baselines based on GPT-J, consisting of 6B parameters. See Section 5.1 for discussion.

C.2 Varying LM sizes

We vary the size of the GPT-2 models—small, medium, large, and XL—with 124M, 355M, 774M, and 1.5B parameters, respectively. Results are reported in Table 11. We find that (1) increasing the model size generally helps, (2) for all model sizes, Channel MetaICL significantly outperforms baselines, and (3) MetaICL enables a much smaller model to outperform a bigger model, e.g., Channel MetaICL based on GPT-2 Small outperforms the GPT-2 XL baseline that is 12x bigger (46.2 vs. 43.5).

C.3 Which meta-training tasks are more helpful?

Based on large variance across different choices of meta-training (Figure 2 of Section 5.2), we think certain tasks are more helpful for meta-training than other tasks. In this context, we create $50$ sets of seven meta-training tasks using $50$ different random seeds. We then measure the correlation between tasks/task pairs/task triples and average performance of Channel MetaICL when the task is included in the meta-training tasks.

Table 12 reports the result. We first find that high-quality datasets with diverse domains, like the GLUE family (Wang et al., 2018), are often helpful. We also find that datasets that are collected adversarially (e.g., paws, art) or are notably dissimilar from all other tasks (e.g., wikisql, which requires semantic parsing) are often unhelpful. Nonetheless, we were not able to find good explanations for other cases, e.g., many sentiment analysis datasets being particularly helpful even though only 3 out of 26 target datasets are sentiment analysis, and dbpedia_14/cosmos_qa/race-middle being unhelpful. Moreover, we think which tasks are helpful largely depends on the choice of target tasks, and we should not make early conclusions that certain tasks are helpful/unhelpful in all cases. We think future work should investigate these impacts in a more systematic way.

C.4 Does MetaICL generalize when semantic hints from label words are removed?

Our experiments use label words taken from the original dataset, which often contain semantic hints, i.e., hints on what each label is supposed to mean (entailment and not_entailment for the NLI task, and positive and negative for the sentiment analysis task). If the model is truly learning the task in-context, it should generalize when label words are replaced with random English words, e.g., entailment and not_entailment are replaced with apple and orange, thus not giving any hints about the task. In this context, we run experiments where each label word is replaced with a random word sampled from 61,569 common English words (pypi.org/project/english-words). We use five seeds for sampling random words, and report the average and the worst-case performance.

Results in Table 13 show that raw LMs (the first block of the table) and models trained on the original data (the second block) achieve near random guessing performance. This indicates that having semantic hints from label words is a necessary condition for all models to perform the task.

Next, we meta-train the MT 0-shot baseline and MetaICL where, for each iteration of meta-training, we similarly map label words to random words. The mapping from the label set to sampled English words is independent for each iteration, so that the model never sees the same mapping during meta-training and hence does not overfit to a specific mapping. Results are reported in the third block of Table 13. MT 0-shot baselines are still not better than random guessing, which is expected as they have no way to grasp the meaning of each label. On the other hand, MetaICL benefits from training on the replaced data, improving performance from 30.1% to 43.5% while retaining most performance on the original data ($43.4\%\rightarrow 40.7\%$).
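The per-iteration label replacement can be sketched as follows. The function name `remap_labels` and the tiny vocabulary in the usage note are hypothetical; sampling distinct words without replacement is an assumption made so that different labels never collapse onto the same word.

```python
import random
from typing import Dict, List, Sequence, Tuple


def remap_labels(
    examples: Sequence[Tuple[str, str]],
    label_set: Sequence[str],
    vocabulary: Sequence[str],
    rng: random.Random,
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Replace every label word with a freshly sampled random word.

    A new mapping is drawn on each call (i.e., each meta-training iteration),
    but the mapping is shared by all examples within one sequence so that the
    task remains learnable from the k demonstrations.
    """
    words = rng.sample(list(vocabulary), len(label_set))  # distinct words
    mapping = dict(zip(label_set, words))
    return [(x, mapping[y]) for x, y in examples], mapping
```

For instance, with labels `entailment` and `not_entailment` and a vocabulary such as `["apple", "orange", "river", "stone"]`, one call might map them to `apple` and `orange` and the next call to a different pair.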

Still, overall performance is relatively poor. We think future work should investigate a model that can in-context learn any task.

Appendix D Potential Risks

MetaICL is based on a large language model that is pretrained on a web corpus, which potentially includes harmful and biased content, despite the original authors’ best efforts to mine the text. There are also potential risks in privacy and security. For instance, Carlini et al. (2021) reported that it is possible to design an attack algorithm to extract a substantial amount of training data. We thus highlight that MetaICL should be considered a research prototype rather than a system deployable to real users, and continuing efforts are needed to reduce potential risks of the model.