A Survey on In-context Learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, Zhifang Sui

cs.CL cs.AI

Introduction

With the scaling of model size and corpus size (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2022), large language models (LLMs) demonstrate an in-context learning (ICL) ability, that is, learning from a few examples in the context. Many studies have shown that LLMs can perform a series of complex tasks through ICL, such as solving mathematical reasoning problems (Wei et al., 2022c). These strong abilities have been widely verified as emergent abilities of large language models (Wei et al., 2022b).

The key idea of in-context learning is to learn from analogy. Figure 1 gives an example describing how language models make decisions with ICL. First, ICL requires a few examples to form a demonstration context. These examples are usually written in natural language templates. Then, ICL concatenates a query question and a piece of demonstration context together to form a prompt, which is then fed into the language model for prediction. Different from supervised learning requiring a training stage that uses backward gradients to update model parameters, ICL does not conduct parameter updates and directly performs predictions on the pretrained language models. The model is expected to learn the pattern hidden in the demonstration and accordingly make the right prediction.
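The prompt construction described above can be sketched in a few lines. This is a minimal illustration only: the function name `build_prompt`, the review/sentiment template, and the example texts are assumptions made for this sketch, not a format prescribed by any particular model.

```python
from typing import List, Tuple

def build_prompt(demos: List[Tuple[str, str]], query: str,
                 template: str = "Review: {x}\nSentiment: {y}") -> str:
    """Concatenate templated demonstrations with the templated query."""
    # Each demonstration is an (input, label) pair written with the template.
    blocks = [template.format(x=x, y=y) for x, y in demos]
    # The query uses the same template with the answer slot left empty.
    blocks.append(template.format(x=query, y="").rstrip())
    return "\n\n".join(blocks)

# Hypothetical demonstrations and query
demos = [("Delicious food!", "Positive"), ("The service was awful.", "Negative")]
prompt = build_prompt(demos, "Good meal!")
print(prompt)
```

The resulting string is fed to the language model as is; no parameters are updated at any point.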

As a new paradigm, ICL has multiple attractive advantages. First, since the demonstration is written in natural language, it provides an interpretable interface to communicate with LLMs (Brown et al., 2020). This paradigm makes it much easier to incorporate human knowledge into LLMs by changing the demonstration and templates (Liu et al., 2022; Lu et al., 2022; Wu et al., 2022; Wei et al., 2022c). Second, in-context learning is similar to the decision process of human beings by learning from analogy (Winston, 1980). Third, compared with supervised training, ICL is a training-free learning framework. This not only greatly reduces the computation costs of adapting the model to new tasks, but also makes language-model-as-a-service (Sun et al., 2022) possible, so that ICL can be easily applied to large-scale real-world tasks.

Despite being promising, ICL still raises interesting questions and has intriguing properties that require further investigation. While the vanilla GPT-3 model itself shows promising ICL abilities, several studies observed that the ability could be significantly boosted via adaptation during pretraining (Min et al., 2022b; Chen et al., 2022c). In addition, the performance of ICL is sensitive to specific settings, including the prompting template, the selection of in-context examples, and the order of examples (Zhao et al., 2021). Furthermore, while intuitively reasonable, the working mechanism of ICL remains unclear, and only a few studies have provided preliminary explanations (Dai et al., 2022; von Oswald et al., 2022).

With the rapid growth of studies in ICL, our survey aims to sensitize the community toward the current progress. Specifically, we present a detailed paper survey with a paper list that will be continuously updated, and provide an in-depth discussion of related studies of ICL. We highlight the challenges and potential directions and hope our work may provide a useful roadmap for beginners interested in this area and shed light on future research.

Overview

The strong performance of ICL relies on two stages: (1) the training stage that cultivates the ICL ability of LLMs, and (2) the inference stage where LLMs predict according to task-specific demonstrations. In terms of the training stage, LLMs are directly trained on language modeling objectives, such as left-to-right generation. Although the models are not specifically optimized for in-context learning, they still exhibit the ICL ability. Existing studies on ICL basically take a well-trained LLM as the backbone, and thus this survey will not cover the details of pretraining language models. In the inference stage, as the input and output labels are all represented in interpretable natural language templates, there are multiple directions for improving ICL performance. This paper gives a detailed description and comparison of these directions, such as selecting suitable examples for demonstrations and designing specific scoring methods for different tasks.

We organize the current progress in ICL following the taxonomy above (as shown in Figure 2). With a formal definition of ICL (§3), we provide a detailed discussion of the warmup approaches (§4), the demonstration designing strategies (§5), and the main scoring functions (§6). §7 provides in-depth discussions of current explorations on unveiling the secrets behind ICL. We further provide useful evaluation and resources for ICL (§8) and introduce potential application scenarios where ICL shows its effectiveness (§10). Finally, we summarize the challenges and potential directions (§11) and hope this could pave the way for researchers in this field.

Definition and Formulation

Following the GPT-3 paper (Brown et al., 2020), we provide a definition of in-context learning: In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration. Essentially, it estimates the likelihood of the potential answer conditioned on the demonstration by using a well-trained language model.

Formally, given a query input text $x$ and a set of candidate answers $Y=\{y_{1},\ldots,y_{m}\}$ ($Y$ could be class labels or a set of free-text phrases), a pretrained language model $\mathcal{M}$ takes the candidate answer with the maximum score as the prediction, conditioned on a demonstration set $C$. $C$ contains an optional task instruction $I$ and $k$ demonstration examples; therefore, $C=\{I,s(x_{1},y_{1}),\ldots,s(x_{k},y_{k})\}$ or $C=\{s(x_{1},y_{1}),\ldots,s(x_{k},y_{k})\}$, where $s(x_{k},y_{k},I)$ is an example written in natural language texts according to the task. The likelihood of a candidate answer $y_{j}$ could be represented by a scoring function $f$ of the whole input sequence with the model $\mathcal{M}$:

$$P(y_{j}\mid x)\triangleq f_{\mathcal{M}}(y_{j},C,x)$$

The final predicted label $\hat{y}$ is the candidate answer with the highest probability:

$$\hat{y}=\arg\max_{y_{j}\in Y}P(y_{j}\mid x)$$

The scoring function $f$ estimates how likely the current answer is given the demonstration and the query text. For example, we could predict the class label in a binary sentiment classification by comparing the token probability of Negative and Positive. There are many $f$ variants for different applications, which will be elaborated in §6.
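As a minimal sketch of this formulation, the prediction step reduces to an argmax over candidate answers under some scoring function. The names `predict` and `toy_score`, and the fixed log-probabilities standing in for a real language model, are assumptions made purely for illustration.

```python
from typing import Callable, Sequence

def predict(candidates: Sequence[str], context: str, query: str,
            score: Callable[[str, str, str], float]) -> str:
    """Return the candidate y_j that maximizes the score f(y_j, C, x)."""
    return max(candidates, key=lambda y: score(y, context, query))

# Hypothetical stand-in for a language model: fixed log-probabilities
# of the label tokens "Positive" and "Negative" after the prompt.
TOY_LOGPROBS = {"Positive": -0.4, "Negative": -1.1}

def toy_score(y: str, context: str, query: str) -> float:
    return TOY_LOGPROBS[y]

label = predict(["Negative", "Positive"], context="<demonstrations>",
                query="Good meal!", score=toy_score)
print(label)
```

Swapping `toy_score` for a different function changes the scoring variant without touching the prediction rule.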

According to the definition, we can see the difference between ICL and other related concepts. (1) Prompt Learning: prompts can be discrete templates or soft parameters that encourage the model to predict the desired output. Strictly speaking, ICL can be regarded as a subclass of prompt tuning where the demonstration is part of the prompt. Liu et al. (2021) made a thorough survey on prompt learning, but it does not cover ICL. (2) Few-shot Learning: few-shot learning is a general machine learning approach that uses parameter adaptation to learn the best model parameters for the task with a limited number of supervised examples (Wang and Yao, 2019). In contrast, ICL does not require parameter updates and is directly performed on pretrained LLMs.

Model Warmup

Although LLMs have shown promising ICL capability, many studies also show that the ICL capability can be further improved through a continual training stage between pretraining and ICL inference, which we call model warmup for short. Warmup is an optional procedure for ICL, which adjusts LLMs before ICL inference, including modifying the parameters of the LLMs or adding additional parameters. Unlike finetuning, warmup does not aim to train the LLM for specific tasks but enhances the overall ICL capability of the model.

To enhance ICL capability, researchers proposed a series of supervised in-context finetuning strategies by constructing in-context training data and multitask training. Since the pretraining objectives are not optimized for in-context learning (Chen et al., 2022a), Min et al. (2022b) proposed a method MetaICL to eliminate the gap between pretraining and downstream ICL usage. The pretrained LLM is continually trained on a broad range of tasks with demonstration examples, which boosts its few-shot abilities. To further encourage the model to learn input-label mappings from the context, Wei et al. (2023a) propose symbol tuning. This approach fine-tunes language models on in-context input-label pairs, substituting natural language labels (e.g., "positive/negative sentiment") with arbitrary symbols (e.g., "foo/bar"). As a result, symbol tuning demonstrates an enhanced capacity to utilize in-context information for overriding prior semantic knowledge.
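The label substitution behind symbol tuning can be illustrated with a small data transformation. This sketch only shows how in-context input-label pairs might be remapped before finetuning; the function name `symbol_tune_examples` and the example sentences are hypothetical, and the actual training procedure is not shown.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]

def symbol_tune_examples(examples: List[Example],
                         symbols: Dict[str, str]) -> List[Example]:
    """Replace each natural language label with its arbitrary symbol."""
    return [(text, symbols[label]) for text, label in examples]

# Hypothetical sentiment examples with natural language labels
examples = [("I loved this film.", "positive"), ("A waste of time.", "negative")]
# Arbitrary, semantically unrelated symbols
symbols = {"positive": "foo", "negative": "bar"}

remapped = symbol_tune_examples(examples, symbols)
print(remapped)
```

Because the symbols carry no prior meaning, the model can only solve the remapped task by reading the input-label pairs in the context.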

Besides, recent work indicates the potential value of instructions (Mishra et al., 2021), and there is a research direction focusing on supervised instruction tuning. Instruction tuning enhances the ICL ability of LLMs through training on task instructions. By tuning the 137B LaMDA-PT (Thoppilan et al., 2022) on over 60 NLP datasets verbalized via natural language instruction templates, FLAN (Wei et al., 2022a) improves both the zero-shot and the few-shot ICL performance. Compared to MetaICL, which constructs several demonstration examples for each task, instruction tuning mainly considers an explanation of the task and is easier to scale up. Chung et al. (2022) and Wang et al. (2022c) proposed to scale up instruction tuning with more than 1000 task instructions.

4.2 Self-supervised In-context Training

Leveraging raw corpora for warmup, Chen et al. (2022a) proposed constructing self-supervised training data aligned with ICL formats in downstream tasks. They transformed raw text into input-output pairs, exploring four self-supervised objectives, including masked token prediction and classification tasks. Alternatively, PICL (Gu et al., 2023) also utilizes raw corpora but employs a simple language modeling objective, promoting task inference and execution based on context while preserving pre-trained models’ task generalization. Consequently, PICL outperforms the method of Chen et al. (2022a) in effectiveness and task generalizability.

$\Diamond$ Takeaway: (1) Supervised training and self-supervised training both propose to train the LLMs before ICL inference. The key idea is to bridge the gap between pretraining and downstream ICL formats by introducing objectives close to in-context learning. Compared to in-context finetuning, which involves demonstrations, instruction finetuning, which needs no demonstration examples, is simpler and more popular. (2) To some extent, these methods all improve the ICL capability by updating the model parameters, which implies that the ICL capability of the original LLMs has great potential for improvement. Therefore, although ICL does not strictly require model warmup, we recommend adding a warmup stage before ICL inference. (3) The performance advancement made by warmup encounters a plateau as the training data is scaled up. This phenomenon appears both in supervised in-context training and self-supervised in-context training, indicating that LLMs only need a small amount of data to adapt to learn from the context during warmup.

Demonstration Designing

Many studies have shown that the performance of ICL strongly relies on the demonstration surface, including demonstration format, the order of demonstration examples, and so on (Zhao et al., 2021; Lu et al., 2022). As demonstrations play a vital role in ICL, in this section, we survey demonstration designing strategies and classify them into two groups: demonstration organization and demonstration formatting, as shown in Table 1.

Given a pool of training examples, demonstration organization focuses on how to select a subset of examples and the order of the selected examples.

Demonstration selection aims to answer a fundamental question: Which examples are good examples for ICL? We classify related studies into two categories: unsupervised methods based on pre-defined metrics and supervised methods.

Liu et al. (2022) showed that selecting the closest neighbors as the in-context examples is a good solution. The distance metrics are pre-defined L2 distance or cosine-similarity distance based on sentence embeddings. They proposed KATE, a $k$NN-based unsupervised retriever for selecting in-context examples. Similarly, $k$NN cross-lingual demonstrations can be retrieved for multi-lingual ICL (Tanwar et al., 2023) to strengthen source-target language alignment. In addition to distance metrics, mutual information is also a valuable selection metric (Sorensen et al., 2022). The advantage of mutual information is that it does not require labeled examples or specific LLMs. In addition, Gonen et al. (2022) attempted to choose prompts with low perplexity. Levy et al. (2022) considered the diversity of demonstrations to improve compositional generalization, selecting diverse demonstrations to cover different kinds of training demonstrations. Different from these studies, which select examples from human-labeled data, Kim et al. (2022a) proposed to generate demonstrations from the LLM itself.
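Nearest-neighbor selection of this kind can be sketched as follows. This is a simplified illustration in the spirit of $k$NN retrieval, not the KATE implementation: the two-dimensional vectors stand in for sentence embeddings, and the names `cosine` and `select_knn` are assumptions of this sketch.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_knn(query_emb: Sequence[float],
               pool_embs: List[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k pool examples most similar to the query."""
    ranked = sorted(range(len(pool_embs)),
                    key=lambda i: cosine(query_emb, pool_embs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical embeddings of four training examples and one query
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]
selected = select_knn(query, pool, k=2)
print(selected)
```

Replacing `cosine` with a negative L2 distance gives the other pre-defined metric mentioned above.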

Some other methods utilized the output scores of LMs $P(y|C,x)$ as unsupervised metrics to select demonstrations. Wu et al. (2022) selected the best subset permutation of $k$NN examples based on the code-length for data transmission to compress label $y$ given $x$ and $C$. Nguyen and Wong (2023) measured the influence of a demonstration $x_{i}$ by calculating the difference between the average performance of the demonstration subsets $\{C|x_{i}\in C\}$ and $\{C|x_{i}\notin C\}$. Furthermore, Li and Qiu (2023a) used infoscore, i.e., the average of $P(y|x_{i},y_{i},x)-P(y|x)$ for all $(x,y)$ pairs in a validation set with a diversity regularization.
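The influence measure described above, the gap in average performance between subsets that contain a demonstration and subsets that do not, can be written down directly. The data layout (a list of subsets paired with a validation score) and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from statistics import mean
from typing import List, Set, Tuple

def influence(example_id: int, runs: List[Tuple[Set[int], float]]) -> float:
    """Average score of subsets containing the example minus the
    average score of subsets that do not contain it."""
    with_x = [score for subset, score in runs if example_id in subset]
    without_x = [score for subset, score in runs if example_id not in subset]
    # mean() raises StatisticsError if either group is empty.
    return mean(with_x) - mean(without_x)

# Hypothetical (demonstration subset, validation accuracy) pairs
runs = [({0, 1}, 0.80), ({0, 2}, 0.70), ({1, 2}, 0.60), ({2, 3}, 0.50)]

print(influence(0, runs))  # subsets with example 0 score higher on average
print(influence(2, runs))  # subsets with example 2 score lower on average
```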

Rubin et al. (2022) proposed a two-stage retrieval method to select demonstrations. For a specific input, it first built an unsupervised retriever (e.g., BM25) to recall similar examples as candidates and then built a supervised retriever EPR to select demonstrations from candidates. A scoring LM is used to evaluate the concatenation of each candidate example and the input. Candidates with high scores are labeled as positive examples, and candidates with low scores are hard negative examples. Li et al. (2023f) further enhanced the EPR by adopting a unified demonstration retriever to unify the demonstration selection across different tasks. Ye et al. (2023a) retrieved the entire set of demonstrations instead of individual demonstrations to model inter-relationships between demonstrations. They trained a DPP retriever to align with LM output scores by contrastive learning and obtained the optimal demonstration set with maximum a posteriori at inference.

Based on prompt tuning, Wang et al. (2023e) view LLMs as topic models that can infer concepts $\theta$ from a few demonstrations and generate tokens based on the concept variables $\theta$. They use task-related concept tokens to represent latent concepts. Concept tokens are learned to maximize $P(y|x,\theta)$. They select demonstrations that are most likely to infer the concept variable based on $P(\theta|x,y)$. Besides, reinforcement learning was introduced by Zhang et al. (2022a) for example selection. They formulated demonstration selection as a Markov decision process (Bellman, 1957) and selected demonstrations via Q-learning. The action is choosing an example, and the reward is defined as the accuracy on a labeled validation set.

5.1.2 Demonstration Ordering

Ordering the selected demonstration examples is also an important aspect of demonstration organization. Lu et al. (2022) showed that order sensitivity is a common problem that persists across various models. To handle this problem, previous studies have proposed several training-free methods to sort examples in the demonstration. Liu et al. (2022) sorted examples in descending order of their distances to the input, so the rightmost demonstration is the closest example. Lu et al. (2022) defined global and local entropy metrics and found a positive correlation between the entropy metrics and the ICL performance. They directly used the entropy metrics to select the best ordering of examples.
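The distance-based ordering can be sketched as a single sort. The function name `order_by_distance` and the toy distances are assumptions of this illustration; the entropy-based ordering is not shown because its metric definitions are not given here.

```python
from typing import List, Sequence

def order_by_distance(demos: Sequence[str],
                      distances: Sequence[float]) -> List[str]:
    """Sort demonstrations from farthest to closest, so the example
    nearest to the query ends up rightmost in the prompt."""
    paired = sorted(zip(distances, demos), key=lambda p: p[0], reverse=True)
    return [demo for _, demo in paired]

# Hypothetical demonstrations with their distances to the query
ordered = order_by_distance(["a", "b", "c"], [0.2, 0.9, 0.5])
print(ordered)
```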

5.2 Demonstration Formatting

A common way to format demonstrations is concatenating examples $(x_{1},y_{1}),\ldots,(x_{k},y_{k})$ with a template $\mathcal{T}$ directly. However, in some tasks that need complex reasoning (e.g., math word problems, commonsense reasoning), it is not easy to learn the mapping from $x_{i}$ to $y_{i}$ with only $k$ demonstrations. Although template engineering has been studied in prompting (Liu et al., 2021), some researchers aim to design a better format of demonstrations for ICL by describing tasks with the instruction $I$ (§5.2.1) and adding intermediate reasoning steps between $x_{i}$ and $y_{i}$ (§5.2.2).

In addition to well-designed demonstration examples, good instructions that describe the task precisely are also helpful to the inference performance. However, unlike the demonstration examples, which are common in traditional datasets, the task instructions depend heavily on human-written sentences. Honovich et al. (2022) found that given several demonstration examples, LLMs can generate the task instruction. Building on the generation ability of LLMs, Zhou et al. (2022c) proposed Automatic Prompt Engineer for automatic instruction generation and selection. To further improve the quality of the automatically generated instructions, Wang et al. (2022b) proposed to use LLMs to bootstrap off their own generations. Existing work has achieved good results in automatically generating instructions, which provides opportunities for future research on combining human feedback with automatic instruction generation.

5.2.2 Reasoning Steps Formatting

Wei et al. (2022c) added intermediate reasoning steps between inputs and outputs to construct demonstrations, which are called chain-of-thought (CoT). With CoT, LLMs predict the reasoning steps and the final answer. CoT prompting helps LLMs learn complex reasoning by decomposing input-output mappings into many intermediate steps. There are many studies on CoT prompting strategies (Qiao et al., 2022), including prompt designing and process optimization. In this paper, we mainly focus on CoT designing strategies.

Similar to demonstration selection, CoT designing also considers CoT selection. Different from Wei et al. (2022c), who manually wrote CoTs, AutoCoT (Zhang et al., 2022b) used LLMs with the prompt "Let’s think step by step" to generate CoTs. In addition, Fu et al. (2022) proposed a complexity-based demonstration selection method. They selected demonstrations with more reasoning steps for CoT prompting.
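Complexity-based selection can be illustrated with a simple ranking by step count. How a reasoning step is counted is an assumption of this sketch (one step per non-empty line), as are the function names and the example chains.

```python
from typing import List

def count_steps(chain: str) -> int:
    """Count reasoning steps, assuming one step per non-empty line."""
    return sum(1 for line in chain.splitlines() if line.strip())

def select_most_complex(chains: List[str], k: int) -> List[str]:
    """Keep the k chains with the most reasoning steps."""
    return sorted(chains, key=count_steps, reverse=True)[:k]

# Hypothetical chain-of-thought demonstrations
chains = [
    "3 + 4 = 7.\nThe answer is 7.",
    "There are 3 boxes.\nEach box has 4 pens.\n3 * 4 = 12.\nThe answer is 12.",
    "The answer is 5.",
]
chosen = select_most_complex(chains, k=2)
print([count_steps(c) for c in chosen])
```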

As input-output mappings are decomposed into step-by-step reasoning, some researchers apply multi-stage ICL for CoT prompting and design CoT demonstrations for each step. Multi-stage ICL queries LLMs with different demonstrations in each reasoning step. Self-Ask (Press et al., 2022) allows LLMs to generate follow-up questions for the input and ask themselves these questions. Then the questions and intermediate answers will be added to CoTs. iCAP (Wang et al., 2022a) proposes a context-aware prompter that can dynamically adjust contexts for each reasoning step. Least-to-Most Prompting (Zhou et al., 2022a) is a two-stage ICL including question reduction and subquestion solution. The first stage decomposes a complex question into subquestions; in the second stage, LLMs answer subquestions sequentially, and previously answered questions and generated answers will be added into the context.
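The two-stage procedure of Least-to-Most Prompting can be sketched as a loop in which each answered subquestion is appended to the context of the next query. The functions `toy_decompose` and `toy_answer` are hypothetical stubs standing in for LLM calls, and the question is made up for illustration.

```python
from typing import Callable, List, Tuple

def least_to_most(question: str,
                  decompose: Callable[[str], List[str]],
                  answer: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Stage 1 reduces the question to subquestions. Stage 2 answers them
    in order, adding each answered subquestion to the context."""
    context: List[str] = [question]
    last_answer = ""
    for sub in decompose(question):
        prompt = "\n".join(context + [f"Q: {sub}\nA:"])
        last_answer = answer(prompt)
        context.append(f"Q: {sub}\nA: {last_answer}")
    return last_answer, context

# Hypothetical stubs standing in for LLM calls
prompts_seen: List[str] = []

def toy_decompose(question: str) -> List[str]:
    return ["How many apples does Anna have?",
            "How many apples do Elsa and Anna have together?"]

def toy_answer(prompt: str) -> str:
    prompts_seen.append(prompt)
    return "7" if len(prompts_seen) == 1 else "12"

final, transcript = least_to_most(
    "Elsa has 5 apples. Anna has 2 more apples than Elsa.",
    toy_decompose, toy_answer)
print(final)
```

The second prompt already contains the first subquestion and its answer, which is the property that distinguishes this scheme from querying each subquestion independently.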

Xu et al. (2023b) proposed SuperICL, which fine-tunes small LMs on specific tasks as plug-ins to generate pseudo reasoning steps. Given an input-output pair $(x_{i},y_{i})$, SuperICL regards the prediction $y_{i}^{\prime}$ and confidence $c_{i}$ of the small LMs for the input $x_{i}$ as reasoning steps by concatenating $(x_{i},y_{i}^{\prime},c_{i},y_{i})$.
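A demonstration of the form $(x_{i},y_{i}^{\prime},c_{i},y_{i})$ could be serialized as follows. The field names in the template and the function name `format_supericl_demo` are hypothetical; the source only specifies which four elements are concatenated, not how they are verbalized.

```python
def format_supericl_demo(x: str, small_pred: str,
                         confidence: float, gold: str) -> str:
    """Concatenate input, small-model prediction, confidence, and gold label."""
    return (f"Input: {x}\n"
            f"Small-model prediction: {small_pred} (confidence: {confidence:.2f})\n"
            f"Label: {gold}")

# Hypothetical example
demo = format_supericl_demo("The movie was fine.", "positive", 0.91, "positive")
print(demo)
```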

$\Diamond$ Takeaway: (1) Demonstration selection strategies improve the ICL performance, but most of them are instance level. Since ICL is mainly evaluated under few-shot settings, the corpus-level selection strategy is more important yet under-explored. (2) The output score or probability distribution of LLMs plays an important role in instance selection. (3) For $k$ demonstrations, the size of the search space of permutations is $k!$. How to find the best orders efficiently or how to approximate the optimal ranking better is also a challenging question. (4) Adding chain-of-thought can effectively decompose complex reasoning tasks into intermediate reasoning steps. During inference, multi-stage demonstration designing strategies are applied to generate CoTs better. How to improve the CoT prompting ability of LLMs is also worth exploring. (5) In addition to human-written demonstrations, the generative nature of LLMs can be utilized in demonstration designing. LLMs can generate instructions, demonstrations, probing sets, chain-of-thought, and so on. By using LLM-generated demonstrations, ICL can largely get rid of human efforts on writing templates.

Scoring Function

The scoring function decides how we can transform the predictions of a language model into an estimation of the likelihood of a specific answer. A direct estimation method (Direct) adopts the conditional probability of candidate answers that can be represented by tokens in the vocabulary of language models (Brown et al., 2020). The answer with the highest probability is selected as the final answer. However, this method poses some restrictions on the template design, e.g., the answer tokens should be placed at the end of input sequences. Perplexity (PPL) is another commonly used metric, which computes the sentence perplexity of the whole input sequence $S_{j}=\{C,s(x,y_{j},I)\}$, consisting of the tokens of the demonstration examples $C$, the input query $x$, and the candidate label $y_{j}$. As PPL evaluates the probability of the whole sentence, it removes the limitations of token positions but requires extra computation time. Note that in generation tasks such as machine translation, ICL predicts the answer by decoding tokens with the highest sentence probability combined with diversity-promoting strategies such as beam search or top-$p$ and top-$k$ (Holtzman et al., 2020) sampling algorithms.
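The difference between Direct and PPL scoring can be made concrete with toy numbers. In this sketch the per-token log-probabilities are hypothetical values standing in for a language model's outputs, and the answer token is assumed to be the last token of each sequence $S_{j}$.

```python
import math
from typing import Dict, List

def direct_score(label_logprob: float) -> float:
    """Direct: conditional probability of the answer token given the prompt."""
    return math.exp(label_logprob)

def perplexity(token_logprobs: List[float]) -> float:
    """PPL of a whole sequence: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities of S_j for each candidate label;
# the last entry is the log-probability of the label token itself.
sequences: Dict[str, List[float]] = {
    "Positive": [-1.0, -0.5, -0.5, -0.4],
    "Negative": [-1.0, -0.5, -0.5, -1.6],
}

direct = {y: direct_score(lp[-1]) for y, lp in sequences.items()}
ppl = {y: perplexity(lp) for y, lp in sequences.items()}

direct_prediction = max(direct, key=direct.get)  # highest answer probability
ppl_prediction = min(ppl, key=ppl.get)           # lowest sequence perplexity
print(direct_prediction, ppl_prediction)
```

Direct only looks at the final token, which is why it constrains where the answer sits in the template, whereas PPL scores every token of the sequence and therefore needs one full pass per candidate.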

Different from previous methods, which estimate the probability of the label given the input context, Min et al. (2022a) proposed to utilize channel models (Channel) to compute the conditional probability in a reversed direction, i.e., estimating the likelihood of input query given the label. In this way, language models are required to generate every token in the input, which could boost the performance under imbalanced training data regimes. We summarize all three scoring functions in Table 2. As ICL is sensitive to the demonstration (see §5 for more details), normalizing the obtained score by subtracting a model-dependent prior with empty inputs is also effective for improving the stability and overall performance (Zhao et al., 2021).
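Both ideas in this paragraph can be sketched with toy scores. The log-probabilities below are hypothetical, and performing the prior subtraction in log space is an assumption of this sketch rather than a description of the exact calibration procedure in the cited work.

```python
from typing import Dict

def channel_predict(input_logprob_given_label: Dict[str, float]) -> str:
    """Channel: pick the label under which the input query is most likely."""
    return max(input_logprob_given_label, key=input_logprob_given_label.get)

def calibrate(label_logprobs: Dict[str, float],
              prior_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Subtract the label scores obtained with an empty input."""
    return {y: label_logprobs[y] - prior_logprobs[y] for y in label_logprobs}

# Hypothetical log P(x | y, C) for each label
channel_label = channel_predict({"Positive": -12.0, "Negative": -15.5})

# Hypothetical label scores for a real query and for an empty input
raw = {"Positive": -0.6, "Negative": -0.8}
prior = {"Positive": -0.3, "Negative": -1.2}
calibrated = calibrate(raw, prior)

raw_label = max(raw, key=raw.get)
calibrated_label = max(calibrated, key=calibrated.get)
print(channel_label, raw_label, calibrated_label)
```

In this toy case the model prefers "Positive" even for an empty input, so removing that prior flips the prediction to "Negative".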

Another direction is to incorporate information beyond the context length constraint to calibrate the score. Structured Prompting (Hao et al., 2022b) proposes to encode demonstration examples separately with special positional embeddings, which then are provided to the test examples with a rescaled attention mechanism. $k$NN Prompting (Xu et al., 2023a) first queries LLMs with training data for distributed representations, then predicts test instances by simply referring to the nearest neighbors among the stored anchor representations.

$\Diamond$ Takeaway: (1) We summarize the characteristics of three widely used scoring functions in Table 2. Although directly adopting the conditional probability of candidate answers is efficient, this method still poses some restrictions on the template design. Perplexity is also a simple and widely used scoring function. This method has universal applications, including both classification tasks and generation tasks. However, both methods are still sensitive to the demonstration surface, while Channel is a remedy that works especially well under imbalanced data regimes. (2) Existing scoring functions all compute a score straightforwardly from the conditional probability of LLMs. There is limited research on calibrating the bias or mitigating the sensitivity via scoring strategies. For instance, some studies add additional calibration parameters to adjust the model predictions (Zhao et al., 2021).

Analysis

To understand ICL, many analytical studies attempt to investigate what factors may influence the performance and aim to figure out why ICL works. We summarize the factors that have a relatively strong correlation to ICL performance in Table 3 for easy reference.

We first introduce influence factors in the LLM pretraining stage. Shin et al. (2022a) investigated the influence of the pretraining corpora. They found that the domain source is more important than the corpus size. They also found that putting multiple corpora together may give rise to emergent ICL ability, that pretraining on corpora related to the downstream tasks does not always improve the ICL performance, and that models with lower perplexity do not always perform better in the ICL scenarios. Wei et al. (2022b) investigated the emergent abilities of many large-scale models on multiple tasks. They suggested that a pretrained model suddenly acquires some emergent ICL abilities when it achieves a large scale of pretraining steps or model parameters. Brown et al. (2020) also showed that the ICL ability grows as the parameters of LLMs increase from 0.1 billion to 175 billion.

In the inference stage, the properties of the demonstration samples also influence the ICL performance. Min et al. (2022c) found that the influence of demonstration samples comes from four aspects: the input-label pairing format, the label space, the input distribution, and the input-label mapping. They showed that the input-label pairing format, the exposure of the label space, and the input distribution all contribute substantially to the ICL performance. Counter-intuitively, the input-label mapping matters little to ICL. In terms of the effect of input-label mapping, Kim et al. (2022b) drew an opposite conclusion that correct input-label mapping does impact the ICL performance, depending on specific experimental settings. Wei et al. (2023b) further found that when a model is large enough, it will show an emergent ability to learn input-label mappings, even if the labels are flipped or semantically unrelated. From the compositional generalization perspective, An et al. (2023) validated that ICL demonstrations should be diverse, simple, and similar to the test example in terms of the structure. Lu et al. (2022) indicated that the demonstration sample order is also an important factor. In addition, Liu et al. (2022) found that the demonstration samples that have closer embeddings to the query samples usually bring better performance than those with farther embeddings.

7.2 Understanding Why ICL Works

Concentrating on the pretraining data, Chan et al. (2022) showed that the ICL ability is driven by data distributional properties. They found that the ICL ability emerges when the training data have examples appearing in clusters and have enough rare classes. Xie et al. (2022) explained ICL as implicit Bayesian inference and constructed a synthetic dataset to prove that the ICL ability emerges when the pretraining distribution follows a mixture of hidden Markov models.

By learning linear functions, Garg et al. (2022) proved that Transformers could encode effective learning algorithms to learn unseen linear functions according to demonstration samples. They also found that the learning algorithm encoded in an ICL model can achieve a comparable error to that from a least squares estimator. Li et al. (2023g) abstracted ICL as an algorithm learning problem and showed that Transformers can implement a proper function class through implicit empirical risk minimization for the demonstrations. Pan et al. (2023) decoupled the ICL ability into task recognition ability and task learning ability, and further showed how they utilize demonstrations. From an information-theoretic perspective, Hahn and Goyal (2023) showed an error bound for ICL under linguistically motivated assumptions to explain how next-token prediction can bring about the ICL ability. Si et al. (2023) found that large language models exhibit prior feature biases and showed a way to use intervention to avoid unintended features in ICL.

Another series of work attempted to build connections between ICL and gradient descent. Taking linear regression as a starting point, Akyürek et al. (2022) found that Transformer-based in-context learners can implement standard finetuning algorithms implicitly, and von Oswald et al. (2022) showed that linear attention-only Transformers with hand-constructed parameters and models learned by gradient descent are highly related. Based on softmax regression, Li et al. (2023e) found that self-attention-only Transformers showed similarity with models learned by gradient-descent. Dai et al. (2022) figured out a dual form between Transformer attention and gradient descent and further proposed to understand ICL as implicit finetuning. Further, they compared GPT-based ICL and explicit finetuning on real tasks and found that ICL indeed behaves similarly to finetuning from multiple perspectives.

Focusing on specific functional modules, Olsson et al. (2022) found that there exist some induction heads in Transformers that copy previous patterns to complete the next token. Further, they expanded the function of induction heads to more abstract pattern matching and completion, which may implement ICL. Wang et al. (2023b) focused on the information flow in Transformers and found that during the ICL process, demonstration label words serve as anchors, which aggregate and distribute key information for the final prediction.

$\Diamond$ Takeaway: (1) Knowing and considering how ICL works can help us improve the ICL performance, and the factors that strongly correlate to ICL performance are listed in Table 3. (2) Although some analytical studies have taken a preliminary step to explain ICL, most of them are limited to simple tasks and small models. Extending analysis on extensive tasks and large models may be the next step to be considered. In addition, among existing work, explaining ICL with gradient descent seems to be a reasonable, general, and promising direction for future research. If we build clear connections between ICL and gradient-descent-based learning, we can borrow ideas from the history of traditional deep learning to improve ICL.

Evaluation and Resources

As a general learning paradigm, ICL can be examined on various traditional datasets and benchmarks, e.g., SuperGLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2018). Implementing ICL with 32 randomly sampled examples on SuperGLUE, Brown et al. (2020) found that GPT-3 can achieve results comparable to state-of-the-art (SOTA) finetuning performance on COPA and ReCoRD, but still falls behind finetuning on most NLU tasks. Hao et al. (2022b) showed the potential of scaling up the number of demonstration examples. However, the improvement brought by scaling is very limited. At present, ICL still leaves room for improvement relative to finetuning on traditional NLP tasks.

8.2 New Challenging Tasks

In the era of large language models with in-context learning capabilities, researchers are more interested in evaluating the intrinsic capabilities of large language models without downstream task finetuning (Bommasani et al., 2021).

To explore the capability limitations of LLMs on various tasks, Srivastava et al. (2022) proposed BIG-Bench, a large benchmark covering a wide range of tasks, including linguistics, chemistry, biology, social behavior, and beyond. The best models have already outperformed the average reported human-rater results on 65% of the BIG-Bench tasks through ICL (Suzgun et al., 2022). To further explore tasks actually unsolvable by current language models, Suzgun et al. (2022) proposed a more challenging ICL benchmark, BIG-Bench Hard (BBH). BBH includes 23 unsolved tasks, constructed by selecting challenging tasks where the state-of-the-art model performance is far below human performance. Besides, researchers are searching for inverse scaling tasks (https://github.com/inverse-scaling/prize), that is, tasks where model performance degrades as the model size is scaled up. Such tasks also highlight potential issues with the current paradigm of ICL. To further probe the model generalization ability, Iyer et al. (2022) proposed OPT-IML Bench, consisting of 2000 NLP tasks from 8 existing benchmarks and, in particular, serving as a benchmark for ICL on held-out categories.

Specifically, a series of studies focus on exploring the reasoning ability of ICL. Saparov and He (2022) generated an example from a synthetic world model represented in first-order logic and parsed the ICL generations into symbolic proofs for formal analysis. They found that LLMs can make correct individual deduction steps via ICL. Shi et al. (2022) constructed the MGSM benchmark to evaluate the chain-of-thought reasoning abilities of LLMs in multilingual settings, finding that LLMs manifest complex reasoning across multiple languages. To further probe more sophisticated planning and reasoning abilities of LLMs, Valmeekam et al. (2022) provided multiple test cases for evaluating various reasoning abilities on actions and change, where existing ICL methods on LLMs show poor performance.

3 Open-source Tools

Noticing that ICL methods are often implemented differently and evaluated using different LLMs and tasks, Wu et al. (2023) developed OpenICL, an open-source toolkit enabling flexible and unified ICL assessment. With its adaptable architecture, OpenICL facilitates the combination of distinct components and offers state-of-the-art retrieval and inference techniques to accelerate the integration of ICL into advanced research.

$\Diamond$ Takeaway: (1) Due to the restrictions of ICL on the number of demonstration examples, the traditional evaluation tasks must be adapted to few-shot settings; otherwise, the traditional benchmarks cannot evaluate the ICL capability of LLMs directly. (2) As ICL is a new paradigm that is different from traditional learning paradigms in many aspects, the evaluation of ICL presents new challenges and opportunities. Regarding the challenges, the results of existing evaluation methods are unstable and especially sensitive to the demonstration examples and the instructions. Chen et al. (2022b) observed that existing accuracy-based evaluations underestimate the sensitivity of ICL to instruction perturbation. How to conduct consistent ICL evaluation is still an open question, and OpenICL (Wu et al., 2023) represents a valuable initial attempt to address this challenge. Regarding the opportunities, as ICL only requires a few instances for the demonstration, it lowers the cost of evaluation data construction.
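One simple way to make the instability mentioned above visible is to report the spread of accuracy across different demonstration orderings rather than a single number. The sketch below assumes a caller-supplied `predict(demos, x)` function standing in for an LLM call; that function, the name `order_sensitivity`, and the choice of mean, standard deviation, minimum, and maximum as summary statistics are illustrative assumptions, not a protocol taken from the cited work.

```python
import itertools
import statistics
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def order_sensitivity(
    demos: Sequence[Example],
    eval_set: Sequence[Example],
    predict: Callable[[Sequence[Example], str], str],
    max_orders: int = 24,
) -> Dict[str, float]:
    """Accuracy statistics over (up to max_orders) demonstration orderings.

    eval_set is assumed to be non-empty.
    """
    accuracies = []
    for order in itertools.islice(itertools.permutations(demos), max_orders):
        correct = sum(predict(order, x) == y for x, y in eval_set)
        accuracies.append(correct / len(eval_set))
    return {
        "mean": statistics.mean(accuracies),
        "std": statistics.pstdev(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
    }


if __name__ == "__main__":
    # Hypothetical stand-in for a model: it simply repeats the label of the
    # last demonstration, mimicking an extreme recency bias.
    def last_label(demos: Sequence[Example], x: str) -> str:
        return demos[-1][1]

    demos = [("great film", "pos"), ("dull plot", "neg")]
    eval_set = [("loved it", "pos"), ("wonderful", "pos")]
    print(order_sensitivity(demos, eval_set, last_label))
```

A large gap between the minimum and maximum indicates that the reported score depends heavily on the ordering that happened to be used.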

In-Context Learning Beyond Text

The tremendous success of ICL in NLP has inspired researchers to explore its potential in different modalities, including vision, vision-language, and speech tasks.

Bar et al. (2022) employ an image patch infilling task in grid-like images using masked auto-encoders (MAE) to train their model. At the inference stage, the model generates output images consistent with provided input-output examples for a novel input image, showcasing promising ICL capabilities for unseen tasks such as image segmentation. Painter (Wang et al., 2023c) extends this approach by incorporating multiple tasks to build a generalist model, achieving competitive performance compared to task-specific models. Building upon this, SegGPT (Wang et al., 2023d) integrates diverse segmentation tasks into a unified framework and investigates ensemble techniques from spatial and feature perspectives to enhance the quality of prompt examples. Wang et al. (2023f) propose to utilize an extra text prompt to guide a generative model in comprehensively producing the desired image. The resulting Prompt Diffusion model is the first diffusion-based model that exhibits ICL ability. Figure 3 illustrates the key difference between image-only and text-prompt-augmented visual in-context learning.

Similar to ICL in NLP, the effectiveness of visual in-context learning is significantly influenced by the selection of in-context demonstration images (Zhang et al., 2023a; Sun et al., 2023). To address this, Zhang et al. (2023a) investigate two approaches: (1) an unsupervised retriever that selects nearest samples using an off-the-shelf model, and (2) a supervised method training an additional retriever model to maximize ICL performance. The retrieved samples notably enhance performance, exhibiting semantic similarity to the query and closer contextual alignment regarding viewpoint, background, and appearance. Beyond prompt retrieval, Sun et al. (2023) further explore a prompt fusion technique for improving the results.
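The unsupervised retrieval idea can be sketched as nearest-neighbor search over precomputed feature vectors. The sketch below makes several assumptions that are not stated in the source: features are taken as already extracted by some off-the-shelf encoder, cosine similarity is used as the similarity measure, and the names `cosine` and `retrieve_nearest` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_nearest(
    query_vec: Sequence[float],
    candidate_vecs: Sequence[Sequence[float]],
    k: int = 1,
) -> List[int]:
    """Return indices of the k candidates most similar to the query."""
    ranked = sorted(
        range(len(candidate_vecs)),
        key=lambda i: cosine(query_vec, candidate_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy 2-d features standing in for encoder outputs of candidate images.
    candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    print(retrieve_nearest([1.0, 0.0], candidates, k=2))
```

The returned indices identify which candidate examples would be used as in-context demonstrations for the query image.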

2 Multi-Modal In-Context Learning

In the vision-language area, Tsimpoukelli et al. (2021) utilize a vision encoder to represent an image as a prefix embedding sequence that is aligned with a frozen language model after training on the paired image-caption dataset. The resulting model, Frozen, is capable of performing multi-modal few-shot learning. Further, Alayrac et al. (2022) introduce Flamingo, which combines a vision encoder with LLMs and adopts LLMs as the general interface to perform in-context learning on many multi-modal tasks. They show that training on large-scale multi-modal web corpora with arbitrarily interleaved text and images is key to endowing them with in-context few-shot learning capabilities. Kosmos-1 (Huang et al., 2023b) is another multi-modal LLM that demonstrates promising zero-shot, few-shot, and even multimodal chain-of-thought prompting abilities. Hao et al. (2022a) present METALM, a general-purpose interface to models across tasks and modalities. With a semi-causal language modeling objective, METALM is pretrained and exhibits strong ICL performance across various vision-language tasks.

It is natural to further enhance the ICL ability with instruction tuning, and this idea has also been explored in multi-modal scenarios. Recent explorations first generate instruction tuning datasets, either by transforming existing vision-language task datasets (Xu et al., 2022; Li et al., 2023a) or by using powerful LLMs such as GPT-4 (Liu et al., 2023; Zhu et al., 2023a), and then connect LLMs with powerful vision foundation models such as BLIP-2 (Li et al., 2023c) by tuning on these multi-modal datasets (Zhu et al., 2023a; Dai et al., 2023).

3 Speech In-Context Learning

In the speech area, Wang et al. (2023a) treat text-to-speech synthesis as a language modeling task. They use audio codec codes as an intermediate representation and propose the first TTS framework with strong in-context learning capability. Subsequently, VALL-E X (Zhang et al., 2023b) extends the idea to multi-lingual scenarios, demonstrating superior performance in zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.

$\Diamond$ Takeaway: (1) Recent studies have explored in-context learning beyond natural language with promising results. Properly formatted data (e.g., interleaved image-text datasets for vision-language tasks) and architecture designs are key factors for activating the potential of in-context learning. Exploring it in a more complex structured space such as for graph data is challenging and promising (Huang et al., 2023a). (2) Findings in textual in-context learning demonstration design and selection cannot be trivially transferred to other modalities. Domain-specific investigation is required to fully leverage the potential of in-context learning in various modalities.

Application

ICL manifests excellent performance on traditional NLP tasks and methods (Kim et al., 2022a; Min et al., 2022b), such as machine translation (Zhu et al., 2023b; Sia and Duh, 2023), information extraction (Wan et al., 2023; He et al., 2023), and text-to-SQL (Pourreza and Rafiei, 2023). In particular, through demonstrations that explicitly guide the reasoning process, ICL is remarkably effective on tasks that require complex reasoning (Wei et al., 2022c; Li et al., 2023b; Zhou et al., 2022b) and compositional generalization (Zhou et al., 2022a).

Moreover, ICL offers potential for popular methods such as meta-learning and instruction-tuning. Chen et al. (2022d) applied ICL to meta-learning, adapting to new tasks with frozen model parameters, thus addressing the complex nested optimization issue. Ye et al. (2023b) enhanced zero-shot task generalization performance for both pretrained and instruction-finetuned models by applying in-context learning to instruction learning.

Specifically, we explore several emerging and prevalent applications of ICL, showcasing their potential in the following paragraphs.

ICL has manifested the potential to be widely applied in data engineering. Benefiting from the strong ICL ability, it costs 50% to 96% less to use labels from GPT-3 than using labels from humans for data annotation. Combining pseudo labels from GPT-3 with human labels leads to even better performance at a small cost (Wang et al., 2021). In more complex scenarios, such as knowledge graph construction, Khorashadizadeh et al. (2023) have demonstrated that ICL has the potential to significantly improve the state of the art of automatic construction and completion of knowledge graphs, resulting in a reduction in manual costs with minimal engineering effort. Therefore, leveraging the capabilities of ICL in various data engineering applications can yield significant benefits. Compared to human annotation (e.g., crowd-sourcing) or noisy automatic annotation (e.g., distant supervision), ICL generates relatively high-quality data at a low cost. However, how to use ICL for data annotation remains an open question. For example, Ding et al. (2022) performed a comprehensive analysis and found that generation-based methods are more cost-effective in using GPT-3 than annotating unlabeled data via ICL.

The context-flexible nature of ICL demonstrates significant potential to enhance retrieval-augmented methods. By keeping the LM architecture unchanged and prepending grounding documents to the input, in-context RALM (Ram et al., 2023) effectively utilizes off-the-shelf general-purpose retrievers, resulting in substantial LM gains across various model sizes and diverse corpora. In addition to efficiency and flexibility, ICL for retrieval also shows potential to improve safety (Panda et al., 2023). Meade et al. (2023) use retrieved demonstrations in ICL to steer a model towards safer generations, reducing bias and toxicity in the model.
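The core mechanism described for in-context retrieval augmentation, prepending retrieved documents to the input while leaving the LM unchanged, can be sketched in a few lines. Everything specific here is an assumption for illustration: the word-overlap scorer is a deliberately crude stand-in for an off-the-shelf retriever, and the names `overlap_score` and `build_ralm_input` are hypothetical rather than part of the method of Ram et al. (2023).

```python
from typing import List


def overlap_score(query: str, doc: str) -> int:
    """Crude relevance score: number of shared lowercase whitespace tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def build_ralm_input(query: str, corpus: List[str], top_k: int = 1) -> str:
    """Prepend the top_k highest-scoring documents to the query text."""
    ranked = sorted(corpus, key=lambda d: overlap_score(query, d), reverse=True)
    # The language model itself is untouched; only its input changes.
    return "\n\n".join(ranked[:top_k] + [query])


if __name__ == "__main__":
    corpus = [
        "Paris is the capital of France.",
        "The mitochondria is the powerhouse of the cell.",
    ]
    print(build_ralm_input("The capital of France is", corpus, top_k=1))
```

Swapping in a stronger retriever only requires replacing the scoring function, which reflects the appeal of keeping the LM architecture fixed.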

LLMs may contain outdated or incorrect knowledge, but ICL demonstrates the potential for effectively editing and updating this information. In an initial trial, Si et al. (2022) found that GPT-3 updated its answers 85% of the time when provided with counterfactual examples, with larger models performing better at in-context knowledge updating. However, this approach may impact other correct knowledge in LLMs. Compared to knowledge editing for fine-tuned models De Cao et al. (2021), ICL has proven effective for lightweight model editing. Si et al. (2022) explored the possibility of editing LLMs’ memorized knowledge through in-context demonstrations, discovering that a larger model scale and a mix of demonstration examples improved ICL-based knowledge editing success rates. In a comprehensive study, Zheng et al. (2023) investigated ICL strategies for editing factual knowledge, finding that well-designed demonstrations enabled competitive success rates compared to gradient-based methods, with significantly fewer side effects. This underlines the potential of ICL for knowledge editing.

Challenges and Future Directions

In this section, we review some of the existing challenges and propose possible directions for future research on ICL.

As investigated by Shin et al. (2022b), language model objectives are not equal to ICL abilities. Researchers have proposed to bridge the gap between pretraining objectives and ICL through intermediate tuning before inference (Section 4), which shows promising performance improvements. To take it further, tailored pretraining objectives and metrics for ICL have the potential to yield LLMs with superior ICL capabilities.

2 ICL Ability Distillation

Previous studies have shown that in-context learning for reasoning tasks emerges as the scale of computation and the number of parameters exceed a certain threshold (Wei et al., 2022b). Transferring the ICL ability to smaller models could greatly facilitate model deployment. Magister et al. (2022) showed that it is possible to distill the reasoning ability to small language models such as T5-XXL. The distillation is achieved by finetuning the small model on the chain-of-thought data (Wei et al., 2022c) generated by a large teacher model. Although promising performance is achieved, the improvements are likely task-dependent. Further investigation on improving the reasoning ability by learning from larger LLMs could be an interesting direction.

3 ICL Robustness

Previous studies have shown that ICL performance is extremely unstable, from random guess to SOTA, and can be sensitive to many factors, including demonstration permutation, demonstration format, etc. (Zhao et al., 2021; Lu et al., 2022). The robustness of ICL is a critical yet challenging problem.

However, most existing methods face a trade-off between accuracy and robustness (Chen et al., 2022c), or even sacrifice inference efficiency. To effectively improve the robustness of ICL, we need a deeper analysis of its working mechanism. We believe that analyzing the robustness of ICL from a theoretical rather than an empirical perspective can guide future research on more robust ICL.

4 ICL Efficiency and Scalability

ICL necessitates prepending a significant number of demonstrations within the context. This presents two challenges: (1) the quantity of demonstrations is constrained by the maximum input length of LMs, so far fewer examples can be used than in fine-tuning (scalability); (2) as the number of demonstrations increases, the computation cost becomes higher due to the quadratic complexity of the attention mechanism (efficiency). Previous work in §5 focused on exploring how to achieve better ICL performance using a limited number of demonstrations and proposed several demonstration designing strategies. Scaling ICL to more demonstrations and improving its efficiency remains a challenging task.
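The two constraints can be made concrete with a short back-of-the-envelope calculation. The symbols below are notation introduced here for illustration and do not appear elsewhere in the survey: $k$ is the number of demonstrations, $\ell$ their average length in tokens, $q$ the query length, $L_{\max}$ the maximum input length of the LM, and $d$ the hidden dimension of a standard self-attention layer.

```latex
\begin{align}
  n &= k\ell + q, \qquad
  n \le L_{\max} \;\Longrightarrow\;
  k \le \left\lfloor \frac{L_{\max} - q}{\ell} \right\rfloor
  && \text{(scalability)} \\
  \mathrm{cost}_{\mathrm{attn}} &= O\!\left(n^{2} d\right)
  = O\!\left((k\ell + q)^{2} d\right)
  && \text{(efficiency)}
\end{align}
```

Under these assumptions, when $k\ell \gg q$, doubling the number of demonstrations roughly quadruples the per-layer attention cost, while the context window places a hard upper bound on $k$.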

Recently, some works have been proposed to address the issues of scalability and efficiency of ICL. Efforts were made to optimize prompting strategies with structured prompting (Hao et al., 2022b), demonstration ensembling (Khalifa et al., 2023), dynamic prompting (Zhou et al., 2023), and iterative forward tuning (Yang et al., 2023). Additionally, Li et al. (2023d) proposed EVaLM with longer context length and enhanced long-range language modeling capabilities. This model-level improvement aims to improve the scalability and efficiency of ICL. As LMs continue to scale up, exploring ways to effectively and efficiently utilize a larger number of demonstrations in ICL remains an ongoing area of research.

Conclusion

In this paper, we survey the existing ICL literature and provide an extensive review of advanced ICL techniques, including training strategies, demonstration designing strategies, evaluation datasets and resources, as well as related analytical studies. Furthermore, we highlight critical challenges and potential directions for future research. To the best of our knowledge, this is the first survey about ICL. We hope this survey can highlight the current research status of ICL and shed light on future work on this promising paradigm.