Structured Prompting: Scaling In-Context Learning to 1,000 Examples

Yaru Hao, Yutao Sun, Li Dong, Zhixiong Han, Yuxian Gu, Furu Wei

Introduction

In-context learning prompts pretrained language models to perform downstream tasks without any parameter update. Rather than fine-tuning the parameters for few-shot learning, we feed task-specific instructions and input-output demonstrations into large language models. Then the evaluation input is conditioned on the given context to make predictions. The paradigm is appealing since we can host language modeling inference as a general-purpose service for a wide range of tasks.

Most previous studies of in-context learning are conducted for few-shot learning. For example, PaLM typically conditions on five demonstration examples for most benchmarks. However, the restricted number of training instances potentially limits the usage of in-context learning in practice, especially when we have many examples. In comparison, fine-tuning is able to consume many more training examples for supervision despite the costly training. This data utilization issue motivates us to empower in-context learning with more demonstration examples. Directly scaling up the number of demonstrations is challenging. For example, language models with absolute position embeddings are pretrained with a predefined length, so the naive concatenation of many examples typically exceeds the maximum length. Moreover, the conventional self-attention mechanism suffers from quadratic complexity in terms of computation and memory consumption, rendering scaling up infeasible. In addition, more shots tend to reduce the performance variance caused by different choices and permutations of demonstration examples.

In this paper, we propose structured prompting to scale the number of examples to orders of magnitude larger and significantly improve stability. Rather than simply concatenating all demonstrations together, we divide a large number of demonstrations into multiple groups, which are independently encoded by the language model. So the encoding complexity becomes linear with respect to the number of groups, instead of quadratic complexity with respect to all examples. The position embeddings of grouped prompts are right-aligned to be next to the test input. Next, the input is encoded by conditioning on grouped prompts, where rescaled attention is proposed to normalize the attention scores. Our structured prompting is flexible to encode plenty of context in an efficient way. We conduct experiments on a variety of tasks, such as text classification, multi-choice, and open-ended tasks. Structured prompting successfully scales the number of demonstrations to much larger sizes. Our method substantially outperforms conventional in-context learning across various model sizes and tasks. Moreover, the approach greatly improves the stability of in-context learning.

Background: In-Context Learning

In-context learning allows language models to recognize the desired task and generate answers for given inputs by conditioning on instructions and input-output demonstration examples, rather than updating model parameters as in fine-tuning.

Formally, given a set of $N$ labeled examples $\mathcal{D}_{\text{train}}=\{(x_{i},y_{i})\}_{i=1}^{N}$ (i.e., $N$ -shot in-context learning), each of them is transformed into a semantically meaningful demonstration $d_{i}=\mathcal{T}(x_{i},y_{i})$ using a hand-crafted template $\mathcal{T}$ . For example, the template of a binary sentiment classification task can be “Sentence: $x_{i}$ . Sentiment: $y_{i}$ ”, where $y_{i}$ is Negative or Positive. All demonstrations are concatenated as the context $\mathcal{Z}=d_{1}\oplus...\oplus d_{N}$ . Newlines or special tokens are used to delimit them. For each test input $x_{\text{test}}$ , we prompt the language model with the concatenation of $\mathcal{Z}$ and $x_{\text{test}}$ . The predicted answer is the completion with the highest language model probability, i.e., $\operatorname*{arg\,max}_{c\in\mathcal{Y}}P_{\text{LM}}(y^{c}|\mathcal{Z}\oplus\mathcal{T}(x_{\text{test}}))$ , where $\mathcal{Y}$ is the set of all possible candidates. For conventional in-context learning, the number of demonstrations $N$ is restricted by the maximum length of pretrained Transformers (typically $2048$ ), which typically fits $5$ to $100$ examples depending on the datasets.
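The formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `render`, `build_context`, `predict`, and `toy_log_prob` are hypothetical, and `toy_log_prob` is a stand-in for a real language model score $P_{\text{LM}}$ (it merely counts matching labels in the prompt so that the example runs on its own).

```python
from typing import Callable, Sequence, Tuple


def render(x: str, y: str = "") -> str:
    """Template T for binary sentiment: 'Sentence: x. Sentiment: y'."""
    return f"Sentence: {x}. Sentiment: {y}".rstrip()


def build_context(train: Sequence[Tuple[str, str]], sep: str = "\n") -> str:
    """Concatenate demonstrations d_1 ... d_N, delimited by `sep`."""
    return sep.join(render(x, y) for x, y in train)


def predict(
    context: str,
    x_test: str,
    candidates: Sequence[str],
    log_prob: Callable[[str, str], float],
    sep: str = "\n",
) -> str:
    """Return the candidate completion with the highest score given the prompt."""
    prompt = context + sep + render(x_test)
    return max(candidates, key=lambda c: log_prob(prompt, c))


def toy_log_prob(prompt: str, completion: str) -> float:
    """Hypothetical scorer: NOT a language model, only counts label occurrences."""
    return float(prompt.count("Sentiment: " + completion))


train = [("great movie", "Positive"), ("dull plot", "Negative"), ("loved it", "Positive")]
context = build_context(train)
answer = predict(context, "a fine film", ["Negative", "Positive"], toy_log_prob)
```

In practice `log_prob` would be replaced by a call that scores the completion under the pretrained model; the control flow stays the same.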

Methods

In this section, we introduce structured prompting, which scales in-context learning to many examples under limited computation complexity. An overview of our approach is shown in Figure 1. First, we divide examples into groups. We obtain representations of group-structured exemplars independently with right-aligned position embeddings. Second, we incorporate the encoding results into the test input through the rescaled attention mechanism in each layer. Then the language model generates the answer.

Suppose we have $N$ demonstration examples. We randomly divide these examples into $M$ groups $\{\mathcal{Z}_{i}\}_{i=1}^{M}$ . Each group is a concatenation of exemplars $\mathcal{Z}_{i}=d_{N_{i-1}+1}\oplus...\oplus d_{N_{i}}$ , where $N_{0}=0$ and $N_{M}=N$ . As shown in Figure 1, all exemplar groups are separately encoded by the language model. Then we use the context encoding results for structured prompting. Notice that only key and value vectors of self-attention need to be cached, which are attended by the test input.
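As a rough sketch of the grouping step, the snippet below shuffles the demonstrations and cuts them into $M$ contiguous groups. The function name `split_into_groups` and the choice of near-equal group sizes are assumptions made for illustration; the text only states that the division is random.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_groups(demos: Sequence[T], num_groups: int, seed: int = 0) -> List[List[T]]:
    """Randomly divide N demonstrations into M groups of near-equal size."""
    if not 1 <= num_groups <= len(demos):
        raise ValueError("num_groups must be between 1 and the number of demonstrations")
    shuffled = list(demos)
    random.Random(seed).shuffle(shuffled)
    base, extra = divmod(len(shuffled), num_groups)
    groups: List[List[T]] = []
    start = 0
    for i in range(num_groups):
        # The first `extra` groups take one additional demonstration.
        size = base + (1 if i < extra else 0)
        groups.append(shuffled[start:start + size])
        start += size
    return groups


groups = split_into_groups(list(range(10)), num_groups=3)
```

Each group would then be encoded separately, keeping only its key and value vectors for later use.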

Grouped context encoding is able to consume longer sequences. In contrast, conventional in-context learning cannot exploit context efficiently because the concatenation of all examples far exceeds the window size of pretrained Transformers. Moreover, the computation complexity of conventional in-context learning is quadratic to the number of demonstration examples $N$ because of the matrix multiplication of queries and keys. It is infeasible when $N$ increases. Our approach improves it through splitting groups, which reduces the complexity from $\mathcal{O}(N^{2})$ to $\mathcal{O}(N^{2}/M)$ .
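The reduction can be checked with a short derivation, assuming for simplicity that the $M$ groups have equal size $N/M$ (an assumption made here only to keep the arithmetic clean). Each group is encoded independently, so its self-attention cost is quadratic in its own size, and the costs add up over groups:

```latex
\underbrace{M}_{\text{groups}} \cdot \mathcal{O}\!\left(\left(\frac{N}{M}\right)^{2}\right)
  = \mathcal{O}\!\left(\frac{N^{2}}{M}\right),
\qquad
\text{e.g. } N = 1000,\; M = 10:\quad
N^{2} = 10^{6}
\;\longrightarrow\;
\frac{N^{2}}{M} = 10^{5}.
```

For a fixed group size, $N/M$ is constant, so the total encoding cost grows linearly with the number of groups.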

We right-align all the groups so that they have the same maximum position index. Hence all groups can have the same relative distance with respect to the test input. It is critical that the test input can be adjacent to all exemplars and pay equal attention to them. One way is to use left padding, i.e., pad tokens or space tokens. The other way is to set a maximum length for grouped context, truncating exemplars from the left side.

Structured Prompting

After encoding context exemplars in groups, the next step is to use them for prompting. As shown in Figure 1, all exemplars are incorporated into representations of the test input through a rescaled attention mechanism. Specifically, the test input is fed into the language model, conditioning on both itself and grouped exemplars. Let $L$ denote the maximum length of grouped context. The position index of the test input starts with $L+1$ so that it is contiguous to all groups.
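The position layout described above can be illustrated with a small sketch. The helper names `group_positions` and `query_positions` are hypothetical, and 1-based indexing is assumed so that every group ends at index $L$ and the test input starts at $L+1$.

```python
from typing import List


def group_positions(group_len: int, max_len: int) -> List[int]:
    """Right-aligned 1-based positions: the last token of the group sits at max_len."""
    if group_len > max_len:
        raise ValueError("group is longer than the maximum grouped-context length")
    start = max_len - group_len + 1
    return list(range(start, max_len + 1))


def query_positions(test_len: int, max_len: int) -> List[int]:
    """Positions of the test input, contiguous to every right-aligned group."""
    return list(range(max_len + 1, max_len + 1 + test_len))


L = 5
long_group = group_positions(5, L)   # fills the whole window
short_group = group_positions(3, L)  # shifted right so it also ends at L
test_input = query_positions(2, L)
```

Both groups end at the same index, so the test input has the same relative distance to the last token of every group.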

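We use $x$ instead of $\mathcal{T}(x_{\text{test}})$ for brevity. In each layer, we concatenate the keys and values of all exemplars and the test input, i.e., $\hat{K}=[K_{\mathcal{Z}_{1}},...,K_{\mathcal{Z}_{M}},K_{x}],\hat{V}=[V_{\mathcal{Z}_{1}},...,V_{\mathcal{Z}_{M}},V_{x}]$. The test input $x$ attends to both demonstrations and itself with causal masks. Then the attention output is computed via:

$$\operatorname{Attention}(Q_{x},\hat{K},\hat{V})=A\hat{V} \tag{1}$$

$$A_{ij}=\begin{cases}\dfrac{M\exp\left({\bm{q}}_{i}\cdot{\bm{k}}_{j}/\sqrt{d}\right)}{Z_{i}}, & {\bm{k}}_{j}\in K_{x}\\[2ex] \dfrac{\exp\left({\bm{q}}_{i}\cdot{\bm{k}}_{j}/\sqrt{d}\right)}{Z_{i}}, & {\bm{k}}_{j}\in K_{\mathcal{Z}_{1}},...,K_{\mathcal{Z}_{M}}\end{cases} \tag{2}$$

$$Z_{i}=\sum_{m=1}^{M}\sum_{{\bm{k}}\in K_{\mathcal{Z}_{m}}}\exp\left({\bm{q}}_{i}\cdot{\bm{k}}/\sqrt{d}\right)+M\sum_{{\bm{k}}^{\prime}\in K_{x}}\exp\left({\bm{q}}_{i}\cdot{\bm{k}}^{\prime}/\sqrt{d}\right)$$

where $\sum_{j}{A_{ij}}=1$, the query vector ${\bm{q}}_{i}\in Q_{x}$, the key vector ${\bm{k}}_{j}\in\hat{K}^{\intercal}$, and $d$ is the dimension of queries and keys.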

Compared with vanilla self-attention used in Transformers, the only difference is the scaling factor $M$ in Equation (2). Without rescaled attention, the test input will attend too much to exemplars and ignore itself as the number of exemplars increases. Intuitively, our method modifies the softmax function in self-attention by repeating test input tokens $M$ times. So we can augment the test input with multiple groups of context.
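The intuition that rescaling is equivalent to repeating the test input tokens $M$ times can be checked numerically. The sketch below is a single-query, single-head illustration in plain Python, not the authors' implementation; the function names are hypothetical, and `test_keys` is assumed to already contain only the positions visible under the causal mask.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Vanilla scaled dot-product attention weights for one query."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rescaled_attention_weights(
    query: Vector, group_keys: List[List[Vector]], test_keys: List[Vector]
) -> List[float]:
    """Weights over [K_Z1, ..., K_ZM, K_x]; test-input terms are multiplied by M."""
    m = len(group_keys)
    scale = math.sqrt(len(query))
    context_scores = [dot(query, k) / scale for keys in group_keys for k in keys]
    test_scores = [dot(query, k) / scale for k in test_keys]
    shift = max(context_scores + test_scores)
    unnorm = [math.exp(s - shift) for s in context_scores]
    unnorm += [m * math.exp(s - shift) for s in test_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


q = [1.0, 0.0]
groups = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]  # M = 2 groups
test_keys = [[1.0, 1.0]]

rescaled = rescaled_attention_weights(q, groups, test_keys)
flat_context = [k for g in groups for k in g]
repeated = softmax_weights(q, flat_context + test_keys * len(groups))
```

Here the weight that `rescaled` assigns to the test key equals the total weight that the vanilla softmax assigns to its $M$ repeated copies, while the weights on context keys are identical in both computations.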

Experiments

We conduct experiments on publicly released, open-source GPT-like (i.e., decoder-only Transformer) models. We use three models of different sizes with 1.3B, 6.7B, and 13B parameters. The context window contains up to 2048 tokens. For large-scale experiments, we use BLOOM-176B.

Datasets

We evaluate structured prompting on a wide range of tasks grouped into text classification, multi-choice, and open-ended generation tasks. For text classification, we use datasets of sentiment: SST-2, SST-5, MR, Subj; topic: DBPedia, AGNews, TREC; natural language inference: CB, RTE; and question answering: BoolQ. For multi-choice tasks, we consider sentence completion: HellaSwag, StoryCloze; commonsense reasoning: PIQA, OpenBookQA, ARC-Easy, ARC-Challenge; and COPA from the SuperGLUE benchmark. For open-ended generation, we consider closed-book question answering: NaturalQS, WebQS, TriviaQA; and extractive reading comprehension: SQuAD, SQuADv2.

Evaluation Protocol

Following prior work, we randomly draw $N$ fixed examples from the training set as conditioning and report evaluation results on the development set. The demonstrations are separated by a special token. For StoryCloze, there is no available training set, so we draw from the development set and evaluate on the test set. To reduce cost, we use 4k test examples for inference. There are only six datasets with development sets larger than 4k, and for each of them we randomly sample a fixed subset.

We design a hand-crafted template for each text classification dataset. For other datasets, we follow the same template in GPT-3. All templates are listed in Appendix A. Notice that demonstrations for reading comprehension datasets (SQuAD, SQuADv2) are constructed slightly differently from the original GPT-3. In GPT-3, the demonstrations provided for each test input are question-answer pairs from the same background passage as it. Here we consider a more strict setting where demonstrations are constructed with different passage-question-answer combinations from the training set.

For multi-choice tasks, we score each completion by the per-token language model likelihood (normalize perplexity by length) and pick the one with the highest score as the final answer. For text classification, we treat it as a multi-choice task with only one token per option and design meaningful names for each option. For open-ended generation tasks, we use beam search with a beam width of 3, a length penalty of $\alpha=0.6$ , and a maximum generation length of 30. We report exact-match accuracy for closed-book QA and F1 score for SQuAD and SQuADv2.
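The length-normalized scoring rule for multi-choice tasks can be sketched as follows. The per-token log-probabilities are made-up numbers used only for illustration, and the function names are hypothetical; a real setup would obtain the log-probabilities from the language model.

```python
from typing import Dict, List


def length_normalized_score(token_log_probs: List[float]) -> float:
    """Average log-likelihood per token of one candidate completion."""
    return sum(token_log_probs) / len(token_log_probs)


def pick_answer(candidates: Dict[str, List[float]]) -> str:
    """Choose the completion with the highest per-token score."""
    return max(candidates, key=lambda name: length_normalized_score(candidates[name]))


# Illustrative values: the longer option has a lower total log-likelihood
# (-3.0 vs. -2.0) but a higher per-token score (-0.6 vs. -1.0).
candidates = {
    "short": [-1.0, -1.0],
    "long": [-0.6, -0.6, -0.6, -0.6, -0.6],
}
chosen = pick_answer(candidates)
unnormalized = max(candidates, key=lambda name: sum(candidates[name]))
```

The example shows why normalization matters: without it, longer completions are penalized simply for having more tokens.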

For conventional prompting, we report results for 0-shot and the largest shot that fills one context window ( $1\times$ ). The maximum number of shots is calculated based on the average length of each dataset. Structured prompting is no longer limited by the context window size and can scale in-context learning to thousands of examples. For datasets with shorter lengths (e.g., SST-2, Subj), we report results of 500-shot and 1000-shot. For datasets with longer lengths (e.g., AGNews, SQuAD), we choose the number of shots according to their average lengths. We find it beneficial to put as many demonstrations as possible in each group. Thus we adopt it in our main experiments. Under each setting, we use six different random seeds for all tasks and report the mean and variance.
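A back-of-the-envelope version of the shot calculation and the seed aggregation is sketched below. The exact procedure is not spelled out in the text, so the helper names, the optional `reserved_tokens` budget for the test input, the use of population variance, and all numbers are assumptions for illustration only.

```python
import statistics
from typing import Sequence, Tuple


def max_shots(window_tokens: int, avg_demo_tokens: float, reserved_tokens: int = 0) -> int:
    """Largest number of average-length demonstrations that fit in one window."""
    return int((window_tokens - reserved_tokens) // avg_demo_tokens)


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and (population) variance of scores across random seeds."""
    return statistics.mean(scores), statistics.pvariance(scores)


shots_per_group = max_shots(window_tokens=2048, avg_demo_tokens=40)
total_shots = 10 * shots_per_group  # ten groups, each filled to one window
mean_acc, var_acc = summarize([90.0, 92.0, 94.0])  # hypothetical accuracies
```

Filling each group up to one window, as the main experiments do, means the total number of shots scales with the number of groups.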

Results

First, we consider text classification tasks. The results of nine datasets are shown in Table 1. Conditioning on thousands of examples, structured prompting brings consistent and significant improvements (absolute gains of 3 to 5 points) for in-context learning. Moreover, our method makes in-context learning much more stable across multiple seeds, while conventional in-context learning is sensitive to different demonstration selections and permutations. In most cases, providing more examples leads to better performance and lower variance. For easier tasks like sentiment (SST-2, Subj, and MR) and topic classification (TREC and DBPedia), the improvement is more obvious and stable (the variance is generally less than 1.0). For natural language inference (CB), the result is relatively unstable, e.g., the 6.7B model has an outlier that classifies all examples into the same class, which indicates that inference is still a challenging task for in-context learning.

Multi-Choice Tasks

The performance comparison of multi-choice tasks is shown in Table 2. Structured prompting still brings consistent gains on these tasks. However, we notice that the improvement of our method on these tasks is relatively small compared with text classification. Besides, utilizing more demonstrations does not always lead to better performance. Scaling up the model size instead of the number of demonstrations can bring more improvements in these tasks.

Open-Ended Generation

Both text classification and multi-choice tasks restrict the label space. To evaluate structured prompting on open-ended generation tasks, we consider two types of datasets: closed-book question answering without conditioning on auxiliary information and extractive reading comprehension. Results are shown in Table 3. We observe that for all datasets except Natural Questions (NQ), incorporating more demonstrations via structured prompting leads to a monotonically increasing performance boost and a decreasing variance. In particular, for SQuAD, our method improves over the baseline by nearly five points on the 13B LM with only 50 examples. For NQ, we make the negative observation that structured prompting yields little gain compared with conventional in-context learning. For SQuAD, we tried up to 50-shot settings because of the long length of a single demonstration. We believe that the performance can still increase if the number of demonstrations can be further expanded.

Scale up to 176-Billion Model

The previous results show that the gains from structured prompting decrease slightly as the model size increases. To verify the effectiveness of our method on huge models, we conduct experiments with a subset of datasets on BLOOM-176B.

Our experiments are implemented on 8 $\times$ 80GB A100 GPUs. We evaluate conventional prompting under different prompt lengths (0.5 $\times$ means the prompt fills half of the context window size) and structured prompting under different group numbers (5 $\times$ means five groups of prompts). Under each setting, we use five different random seeds for all tasks and report the average results.

Large Model Results

The performance comparisons on the 176-billion-parameter model are shown in Table 4. For most datasets, the performance gets better as the number of groups increases. The variance results also show that structured prompting is highly stable when using five groups. For TREC and PIQA, we observe that 3 $\times$ achieves better results than 5 $\times$, but they both outperform conventional in-context learning. Our experiments show that large language models still have the potential to achieve better prompting results when utilizing more demonstrations.

Stability Analysis

Prior efforts demonstrate that different choices and permutations of conditioning examples can cause a high variance for in-context learning. We now investigate their impacts on structured prompting. Figure 2 shows how the performance and variance of in-context learning change as the number of examples increases. We observe that our method can significantly reduce inference variance for text classification and open-ended generation tasks. This phenomenon is less pronounced for multi-choice tasks that correlate better with pretraining. For larger LMs (13B), the baseline approach is stable enough in the max-shot case. Structured prompting can further boost its stability while achieving better performance. It suggests that in-context learning is underestimated under the few-shot setting and our method can bring key benefits to maximize its stability and effectiveness.

Ablation Studies

The results of different prompt lengths are shown in Figure 3(a). We hold the number of examples constant, so the group number is inversely proportional to the prompt length. Generally, a longer prompt length means better accuracy. For the small model (1.3B), the performance drops sharply when the prompt length is small (0.25 $\times$ ). This shows that the auto-regressive structure is still the most “natural” one. The best way to use structured prompting is to fill each group up to the maximal sequence length used in pretraining. Therefore, we believe that the fundamental problem is language models’ ability to deal with sequences that exceed the pretraining length. If the model’s extrapolation performance is satisfactory, the benefits of structured prompting will reduce to memory saving. We leave this for future work.

The Effect of Scaling Factor

The scaling factor is essential in structured prompting. The results of various scaling factors are shown in Figure 3(b). We observe that without the scaling factor, the attention distribution focuses on the demonstrations and neglects the query. As illustrated before, a natural way is to repeat the query as many times as the number of groups. With that, the query tokens behave as if they were concatenated with every demonstration group. The experiments show that an appropriate scaling factor yields a large improvement over naive inference. Moreover, the exact multiplier $M$ performs best among the surrounding values, although large language models are more robust to perturbations of the scaling factor.

The Effect of Alignment Strategies

In structured prompting, all groups should have the same length to ensure that the query is contiguous to every group of demonstrations. For each group, we initialize a sequence of padding tokens whose length is set to a constant. Then, we fill it with examples sequentially from right to left. When the remaining vacancy is shorter than the incoming example, there are three options for handling the left padding:

Attention Mask: Masking the padding tokens during the whole process, including the calculation for ( $K_{\mathcal{Z}_{i}}$ , $V_{\mathcal{Z}_{i}}$ ) and the attention stage at inference step.

Pad Space: Replacing the padding tokens with blank space tokens. In this case, the attention is calculated in every token.

Truncate: Filling the padding tokens with the incoming example, whose front is truncated to maintain the constant length.
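The three options can be sketched on token lists as follows. This is a simplified illustration under stated assumptions: `PAD` and `SPACE` are hypothetical token strings, `fill_group` is a hypothetical helper, and any vacancy left under “Truncate” when the examples run out is treated as masked padding, which the text does not specify.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding token
SPACE = " "    # hypothetical blank-space token


def fill_group(examples: List[List[str]], length: int, strategy: str) -> Tuple[List[str], List[int]]:
    """Fill a fixed-length group from right to left; return (tokens, attention_mask)."""
    if strategy not in {"attention_mask", "pad_space", "truncate"}:
        raise ValueError(f"unknown strategy: {strategy}")
    tokens: List[str] = []
    for example in examples:  # the first example ends up right-most
        vacancy = length - len(tokens)
        if len(example) <= vacancy:
            tokens = example + tokens
            continue
        if strategy == "truncate" and vacancy > 0:
            tokens = example[-vacancy:] + tokens  # drop the front of the example
        break
    vacancy = length - len(tokens)
    if strategy == "pad_space":
        # Blank-space tokens are attended to like ordinary tokens.
        return [SPACE] * vacancy + tokens, [1] * length
    # Padding tokens are masked out (mask value 0).
    return [PAD] * vacancy + tokens, [0] * vacancy + [1] * len(tokens)


examples = [["a", "b"], ["c", "d", "e"]]
masked = fill_group(examples, 4, "attention_mask")
spaced = fill_group(examples, 4, "pad_space")
truncated = fill_group(examples, 4, "truncate")
```

With a vacancy of two tokens and an incoming example of three, “Truncate” keeps only the last two tokens of that example, while the other two strategies leave the vacancy as padding or blank space.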

Table 5 shows that the “Truncate” strategy works well for both models. In FairseqLM, there is a token so that the subsequent tokens are less disturbed. In BLOOM, the translation invariance of ALiBi is used to deal with padding and masking. However, BLOOM shows a significant drop with “Pad Space”. The absence of such a token seems to amplify the noise brought by blank space tokens. In conclusion, the “Truncate” strategy is the easiest and most natural way to align groups.

Related Work

Despite being surprisingly effective, in-context learning suffers from certain vulnerabilities. For instance, the order of demonstrations and the choice of templates can cause a high variance in performance. Prior work shows that the variance arises because of three types of biases (majority label, recency, and common token bias) and proposes to calibrate model prediction by content-free output. Other work demonstrates that these biases cause the decision boundary to shift and proposes calibrating it by estimating the distribution of prototypical clusters. Another line of work focuses on prompt engineering, including selecting a performant demonstration permutation and retrieving semantically similar in-context examples with a retrieval module. We aim to improve in-context learning by scaling up the number of demonstrations.

Understanding In-Context Learning

Another line of work investigates how in-context learning works. One study proposes a Bayesian inference framework to explain it, where the language model implicitly infers a concept when making a prediction. Since in-context learning emerges after pretraining on a large corpus, some efforts study the correlation between the pretraining corpus and in-context learning performance. Additionally, previous work investigates whether the label-mapping of demonstrations matters as expected.

Fusion-In-Decoder

Fusion-In-Decoder was proposed for encoder-decoder fine-tuning. The method was applied to open-domain question answering in order to leverage retrieved passages. Specifically, each retrieved supporting passage is encoded by bidirectional encoders. Then the decoder performs conventional attention over the concatenation of the representations of passages. In comparison, we focus on in-context learning with decoder-only models (such as GPT), without fine-tuning the original parameters. There are also several key technical differences, which are critical to making the method work well. First, we propose rescaled attention to balance the attention allocation between context and test input. Second, we right-align position embeddings for structured context so that they have the same relative distance with respect to the input.

Discussion and Conclusion

In this work, we explore how to utilize more examples for in-context learning and propose structured prompting to scale up the number of examples under restricted computation complexity. We encode each group of demonstrations independently and prompt the language model with the concatenations of their representations via the rescaled attention. Experimental results across a diverse set of tasks show that our method outperforms the conventional approach. As the number of examples increases, our method achieves further gains and is much more stable.

Despite the promising results, the current method still has some limitations. Ideal in-context learning should be invariant to demonstration permutations. If each group contains only one example, our method indeed satisfies this property. However, in our experiments, we find that this setting works well on smaller models (i.e., 1.3B) but does not work on larger models (i.e., 13B). We hypothesize that larger models benefit more from autoregressive information, so we include multiple examples in each group. For future work, we will investigate this direction more deeply. Moreover, there is a mismatch between patterns of language model pretraining and in-context learning. The current objective makes language models aware only of sequential relationships but not parallel relationships. We would like to incorporate this prior knowledge during pretraining so that it aligns better with the scheme of downstream inference. In addition, structured prompting can be used to inject many long documents as context, e.g., using retrieved texts to augment generation.