In-Context Learning Creates Task Vectors

Roee Hendel, Mor Geva, Amir Globerson

Introduction

Large language models have improved dramatically over the last several years. One striking property of these models is that they can learn new rules from very few demonstrations. For instance, a model can be prompted with the input “Apple → Red, Lime → Green, Corn →” and produce the output “Yellow”. The model has thus learned a mapping based on just two examples, which it can apply correctly to new examples. This capability, referred to as In-Context Learning (ICL), has been used extensively, yielding impressive empirical results Brown et al. (2020); Liu et al. (2023); Dong et al. (2022).

It is unclear whether ICL operates by selecting a rule from a hypothesis class, because the prediction is performed via $T([S,x])$, where $T$ is typically an auto-regressive transformer and $[S,x]$ is a concatenation of the tokens in $S$ and $x$. Thus, in the general case, it can be an arbitrary function that operates on $S$ and $x$ to produce the output. This can include “non-parametric” methods such as nearest-neighbor. Recent work has begun to explore this question. For example, it was shown that when training a transformer from scratch to perform linear regression in context, the emerging learning algorithm is similar to Stochastic Gradient Descent Akyürek et al. (2022); von Oswald et al. (2022). However, for LLMs performing more complex natural language tasks, it is not at all clear what the hypothesis space may be.

In this work, we show that on a wide range of tasks, ICL in LLMs can be viewed as working on a very natural hypothesis space. We argue that, given a training set $S$, the transformer maps it into a “task vector” $\boldsymbol{\theta}(S)$ that essentially represents the mapping/rule described in $S$. (The term “task vector” was coined by Ilharco et al. (2023) for directions in weight space that correspond to a particular task. Although our vectors are in “activations space”, they share a similar motivation and thus we overload the term.) Namely, given the transformer $T$ and a vector $\boldsymbol{\theta}$, we can construct a new function $f(x;\boldsymbol{\theta})$ that implements the task. The function $f$ is very similar to the original transformer applied to $x$ without demonstrations but instead modulated by $\boldsymbol{\theta}$ (see Fig. 2).

Our view is also related to soft prompts Lester et al. (2021), since both approaches modulate the function of the transformer towards a particular task. However, in ICL, task vectors are calculated in the forward pass rather than being fine-tuned.

Our contributions include proposing a hypothesis-class based mechanistic view of ICL, and conducting experiments to validate our view on a range of publicly available LLMs and a diverse set of tasks. Our results further the understanding of ICL and may have practical implications for the efficient adaptation of LLMs to perform specific tasks.

A Hypothesis Class View of ICL

Motivated by the hypothesis class view of learning theory, our goal is to understand if ICL maps the set of demonstrations $S$ to a function on the query $x$ and how this mapping occurs. Specifically, we seek to see if ICL converts $S$ into $\boldsymbol{\theta}$, the “parameters” of a function within a certain hypothesis space. Our empirical findings suggest this view is applicable, shedding light on the structure of the hypothesis space on which ICL can be viewed to operate.

We use $T$ to denote a decoder-only transformer LLM, $S$ to denote the set of demonstrations (i.e. training examples) used as input to ICL, and $x$ to denote the query that ICL is asked to provide an output for. We use $T([S,x])$ to denote the output of ICL on the concatenation of $S$ and $x$.

To demonstrate that ICL operates within a hypothesis space, we aim to show that its underlying mechanism can be broken down into two parts:

A “Learning Algorithm” (denoted by $\mathcal{A}$) that maps $S$ into a “task vector” $\boldsymbol{\theta}$, independent of the query $x$. Given that attention layers can access both $S$ and $x$, this independence is not trivial.

A “Rule Application” (denoted by $f$) which maps the query $x$ to the output, based on $\boldsymbol{\theta}\equiv\mathcal{A}(S)$, without direct dependence on $S$. Again, this independence is not trivial.

Thus, we consider the following mapping from a set of demonstrations and a query to the predicted output: $T([S,x])=f(x;\mathcal{A}(S))$.

If we can break down the forward pass of the LLM into the above two components, we can view ICL as operating on the following hypothesis class: $\mathcal{H}=\{f(\cdot;\boldsymbol{\theta})\mid\boldsymbol{\theta}\}$. In the next section we propose an implementation of such a class.
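As a purely illustrative analogy for this factorization, the following minimal sketch uses a toy numeric task rather than a transformer. The functions `learn`, `apply_rule`, and `icl` are hypothetical names introduced here; they only mirror the roles of the learning algorithm, the rule application, and the full ICL mapping, under the assumption that the task is "add a constant offset".

```python
from typing import List, Tuple

Demo = Tuple[float, float]  # one (input, output) demonstration


def learn(S: List[Demo]) -> float:
    """Toy stand-in for the learning algorithm A.

    Compresses the demonstrations into a single parameter theta
    (the average offset y - x). The query is not an argument, so
    theta cannot depend on it.
    """
    return sum(y - x for x, y in S) / len(S)


def apply_rule(x: float, theta: float) -> float:
    """Toy stand-in for the rule application f: uses theta only, never S."""
    return x + theta


def icl(S: List[Demo], x: float) -> float:
    """The factorized prediction f(x; A(S))."""
    return apply_rule(x, learn(S))


S = [(1.0, 3.0), (5.0, 7.0)]  # demonstrations of the rule "add 2"
theta = learn(S)
print(theta)         # 2.0
print(icl(S, 10.0))  # 12.0
```

In this analogy the hypothesis class is the family of functions `apply_rule(., theta)` indexed by `theta`; the question studied in the paper is whether the forward pass of an LLM admits a comparable split.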

2.2 A Proposed Hypothesis Class

There are many possible realizations of the above framework, corresponding to different choices of $\mathcal{A}$ and $f$. We next describe the realization we focus on, which naturally follows from the transformer architecture. We consider an ICL setting as in Fig. 1, where the input ends with a query $x$ (i.e., Corn) followed by an “→” symbol. As mentioned above, we view learning as composed of two steps: calculating a parameter vector $\boldsymbol{\theta}$ based on the training sample $S$, and applying the rule defined by this parameter vector to the query $x$. A presumably simple way for a transformer to do this is for the first $L$ layers of the → representations to calculate $\boldsymbol{\theta}$ and then for the remaining layers to take $\boldsymbol{\theta}$ and $x$ as input and produce an output. See Fig. 1. Recall that $S$ and $x$ are accessible to the transformer at any layer, presenting a challenge with our view.

In the following sections, we address this challenge and present experiments validating our view. Namely, we show that we can isolate our proposed $\mathcal{A}$ and $f$ in the forward pass of LLMs performing ICL. We also show that the $\boldsymbol{\theta}$ vectors are interpretable and correspond to learned tasks.

Validity of the Hypothesis Class View

We first show that separating the forward pass into the two distinct components $\mathcal{A}$ and $f$, defined in §2.2, maintains the high accuracy of ICL.

We face some challenges in a regular forward pass. First, the initial $L$ layers that correspond to $\mathcal{A}$, updating the representations of → to create $\boldsymbol{\theta}$, can attend to the query $x$. Thus, they may depend on $x$, creating an unwanted dependence of $\boldsymbol{\theta}$ on $x$. Second, the remaining layers that correspond to $f$ may directly access $S$, instead of using only $x$ and $\boldsymbol{\theta}$.

We propose the following procedure to tackle these challenges. To solve the first problem, we introduce a “dummy query” $x'$ and calculate the representations of → using that query. We use the representation of → after the first $L$ layers, calculated using $x'$, as the vector $\boldsymbol{\theta}$ (as demonstrated on the left side of Fig. 2). An alternative was to block attention to $x$, but it led to poor performance. To solve the second problem of calculating $f(x;\boldsymbol{\theta})$ without allowing direct dependence on $S$, we perform a forward pass of the transformer only on $x$ and → (ignoring positional embeddings, this is equivalent to blocking the attention to $S$ in these layers), and “patch” the $\boldsymbol{\theta}$ we previously extracted at the $L$-th layer of the → (right side of Fig. 2). Note that the second token can actually be anything, because it is overridden by patching; we use → for simplicity.
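The mechanics of this extract-and-patch procedure can be sketched on a toy model. The sketch below is not the authors' implementation and does not use a real LLM: `embed`, `make_layers`, `layer_step`, and `forward` are hypothetical helpers, the weights are random, and each "layer" is a simplified causal mixing step with a residual connection. It is only meant to show where $\boldsymbol{\theta}$ is read and where it is written.

```python
import math
import random
from typing import List, Optional

Vector = List[float]
Matrix = List[Vector]


def embed(token: str, dim: int) -> Vector:
    """Hypothetical embedding: a fixed pseudo-random vector per token string."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def make_layers(n_layers: int, dim: int, seed: int = 0) -> List[Matrix]:
    """Random weights for the toy model (one dim x dim matrix per layer)."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
            for _ in range(n_layers)]


def layer_step(W: Matrix, states: List[Vector]) -> List[Vector]:
    """One toy causal layer: position i mixes the mean of positions 0..i, plus a residual."""
    dim = len(states[0])
    out = []
    for i, h in enumerate(states):
        prefix = states[: i + 1]
        ctx = [sum(p[d] for p in prefix) / len(prefix) for d in range(dim)]
        mixed = [math.tanh(sum(W[r][c] * ctx[c] for c in range(dim))) for r in range(dim)]
        out.append([h[d] + mixed[d] for d in range(dim)])
    return out


def forward(layers: List[Matrix], tokens: List[str], dim: int,
            patch_layer: Optional[int] = None,
            patch_vector: Optional[Vector] = None) -> List[Vector]:
    """Return the last position's hidden state after each layer (index 0 = after layer 1).

    If patch_layer is given, the last position's state is overwritten with
    patch_vector right after that layer, before the remaining layers run.
    """
    states = [embed(t, dim) for t in tokens]
    trace = []
    for idx, W in enumerate(layers, start=1):
        states = layer_step(W, states)
        if patch_layer == idx and patch_vector is not None:
            states[-1] = list(patch_vector)
        trace.append(list(states[-1]))
    return trace


DIM, N_LAYERS, L = 4, 6, 3  # arbitrary toy sizes, not values from the paper
layers = make_layers(N_LAYERS, DIM)
S = ["Apple", "→", "Red", "Lime", "→", "Green"]

# A: run on [S, x', →] with a dummy query x' and read the → state after L layers.
theta = forward(layers, S + ["Plum", "→"], DIM)[L - 1]

# f: run on [x, →] only, patching theta into the → position at layer L.
patched = forward(layers, ["Corn", "→"], DIM, patch_layer=L, patch_vector=theta)

# Baseline: the same pass on [x, →] without patching.
baseline = forward(layers, ["Corn", "→"], DIM)
```

Because the weights are random, the toy model learns no task; the point is only that the patched pass never sees $S$, and that its layers before $L$ are identical to the baseline pass. With a real model, the same read and write would typically be done through hooks on the hidden states of the chosen layer.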

3.2 Tasks and Models

We consider a diverse set of 18 tasks across 4 categories: algorithmic, translation, linguistic, and factual knowledge. For simplicity, we limit ourselves to single-token outputs. A representative subset of the tasks is described in Tab. 1. A complete detailed table, as well as more information regarding the data, is provided in § A.1.

Models

We use multiple open LLMs: LLaMA 7B, 13B, and 30B Touvron et al. (2023), GPT-J 6B Wang and Komatsuzaki (2021), and Pythia 2.8B, 6.9B, and 12B Biderman et al. (2023).

3.3 Finding $L$

The mechanism we described in §2.2 has a free parameter: the layer $L$ where $\mathcal{A}$ ends and $f$ begins. We use the proposed $(\mathcal{A},f)$ implementation for different choices of $L$ and evaluate the accuracy on a development set to find the best layer.

Fig. 3 shows the accuracy on the development set, for different choices of $L$. We focus here on the LLaMA models and include the rest in § A.2. Interestingly, all models exhibit a performance peak at a similar intermediate layer, irrespective of their parameters and layer count differences.
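The layer search described above amounts to a one-dimensional sweep. A minimal sketch, assuming a caller-supplied function `dev_accuracy(L)` (a hypothetical name) that runs the separated procedure with the split at layer `L` and returns development-set accuracy:

```python
from typing import Callable, Dict, Tuple


def find_best_layer(n_layers: int,
                    dev_accuracy: Callable[[int], float]) -> Tuple[int, Dict[int, float]]:
    """Evaluate every candidate split layer and return the best one with all scores."""
    scores = {L: dev_accuracy(L) for L in range(1, n_layers + 1)}
    best = max(scores, key=scores.get)
    return best, scores


def fake_dev_accuracy(L: int) -> float:
    """Made-up accuracy curve peaking at layer 4; not data from the paper."""
    return 1.0 - abs(L - 4) / 8


best_layer, scores = find_best_layer(8, fake_dev_accuracy)
print(best_layer)  # 4
```

The accuracy curve here is invented purely so the example runs; in practice `dev_accuracy` would wrap the extract-and-patch evaluation for a given model and task.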

3.4 Accuracy of Hypothesis Based Prediction

We next compare the accuracy of the $(\mathcal{A},f)$ mechanism to that of a regular forward pass performing ICL. For each model and task, we evaluate the following three procedures:

Regular: An application of the LLM to the demonstrations $S$ and query $x$. Namely $T([S,x])$, as in regular ICL.

Hypothesis: Our proposed procedure from § 3.1 where $\mathcal{A}$ generates $\boldsymbol{\theta}$ using a dummy $x'$, and $f(\cdot;\boldsymbol{\theta})$ is applied to $x$ by running the transformer on $[x,\rightarrow]$ with $\boldsymbol{\theta}$ patched at layer $L$ of →.

Baseline: A forward pass of the LLM only on $x$, without demonstrations $S$. That is, $T([x,\rightarrow])$. This is the same as the application of $f$ from our separated procedure, but without patching $\boldsymbol{\theta}$.

Fig. 4 shows the average accuracy across all tasks of these three procedures, for each model. Full results are reported in the main results table in § A.2. Across all models, our procedure maintains around 80-90% of the accuracy of regular ICL, while the baseline reaches only 10-20%. This shows that our proposed separation into $\mathcal{A}$ and $f$ provides a good empirical approximation of the process underlying ICL.

Robustness of Task Vectors

In our setting, $\boldsymbol{\theta}$ is derived from $S$ and a dummy query $x'$. It is natural to examine the robustness of $\boldsymbol{\theta}$ to variations in these inputs. Intuitively, if it represents the task, it should remain stable across different $S$ and $x'$ values.

To test this, we use LLaMA 7B to generate 50 task vectors per task with varied $S$ and $x'$ and conduct two analyses.

A t-SNE dimensionality reduction (Fig. 5) reveals that the task vectors form distinct clusters, each containing task vectors of a single task. Fig. 9 further shows proximity between tasks of the same category, strengthening the idea that they encapsulate task understanding.

Variability of $\boldsymbol{\theta}$

Fig. 8 shows histograms of distances within and across tasks. It can be seen that vectors within the same task are closer than those between different tasks, indicating that $\boldsymbol{\theta}$ is stable within tasks and not highly influenced by $x'$ or $S$.
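A within-task versus across-task comparison of this kind can be sketched as follows. The choice of cosine distance is an assumption made for illustration (the text above does not specify the metric), `split_distances` is a hypothetical helper, and the two-dimensional vectors are toy values rather than real task vectors.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity (assumed metric for this sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def split_distances(vectors_by_task: Dict[str, List[Vector]]) -> Tuple[List[float], List[float]]:
    """Pairwise distances, separated into same-task pairs and different-task pairs."""
    labelled = [(task, v) for task, vs in vectors_by_task.items() for v in vs]
    within, across = [], []
    for (task_a, va), (task_b, vb) in combinations(labelled, 2):
        (within if task_a == task_b else across).append(cosine_distance(va, vb))
    return within, across


# Toy vectors: two per task, each pair pointing in roughly the same direction.
toy_vectors = {
    "task_a": [[1.0, 0.1], [1.0, -0.1]],
    "task_b": [[0.1, 1.0], [-0.1, 1.0]],
}
within, across = split_distances(toy_vectors)
print(max(within) < min(across))  # True for these toy vectors
```

The two returned lists are what one would histogram; well-separated histograms correspond to task vectors that are stable within a task.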

Dominance of $\boldsymbol{\theta}$ Patching

In §3 we prevented $f$ from directly accessing $S$. However, in a regular forward pass during ICL, the last token can attend to $S$. Here we verify that even in this case, $f$ mainly uses the task vector $\boldsymbol{\theta}$, without directly accessing the demonstrations $S$. To this end, we use a pair of tasks, $A$ and $B$, sharing the input space but differing on the output. We first use a “Regular” forward pass, where we provide the model with demonstrations $S$ for task $A$ (denoted $S_A$), to verify the model can perform this task using ICL. Then, we do a “Conflicting” forward pass, still providing $S_A$, while injecting $\boldsymbol{\theta}_B$. For more details, refer to Fig. 6 in §A.1.

In Tab. 2, the “Regular” forward pass shows high accuracy on task $A$ (90%+), as anticipated. However, the “Conflicting” forward pass yields high accuracy on task $B$, corresponding to the injected task vector $\boldsymbol{\theta}_B$. This implies that the model mainly relies on $\boldsymbol{\theta}$, largely disregarding the demonstrations $S$ for task $A$. We note that the accuracy on task $B$ is slightly low, likely consistent with the performance dip seen in the main results table in § A.2, and potentially further affected by the presence of $S$.

Interpreting $\boldsymbol{\theta}$

The learned vector $\boldsymbol{\theta}$ intuitively captures information about the task demonstrated by $S$. Here we provide evidence supporting this interpretation. Since $\boldsymbol{\theta}$ is an intermediate hidden state of the transformer, we can employ a vocabulary projection method nostalgebraist (2020); Dar et al. (2022). Namely, we examine the top tokens in the distribution over the vocabulary induced by the hidden state.

Tab. 3 shows the top tokens for three tasks for LLaMA 13B (more models and tasks are provided in Tab. 7 in §A). In multiple cases, we observe tokens that directly describe the task. Importantly, these terms never explicitly appeared in the context. For example, in the task of translation from French to English, we observe tokens such as “English” and “translate”. This supports our view that $\boldsymbol{\theta}$ carries significant, non-trivial semantic information about the task.
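The vocabulary projection step can be sketched as multiplying a hidden state by the model's output (unembedding) matrix and reading off the highest-probability tokens. The sketch below is a simplified illustration: `top_tokens` is a hypothetical helper, the four-token vocabulary and its two-dimensional weights are made up, and any final normalization layer that a real model applies before the output matrix is omitted.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def top_tokens(hidden: Vector, unembedding: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """Project a hidden state onto the vocabulary and return the k most likely tokens."""
    logits = {tok: sum(w * h for w, h in zip(row, hidden))
              for tok, row in unembedding.items()}
    # Numerically stable softmax over the toy vocabulary.
    max_logit = max(logits.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Made-up unembedding rows and hidden state, for illustration only.
toy_unembedding = {
    "English": [1.0, 0.0],
    "translate": [0.8, 0.2],
    "Red": [0.0, 1.0],
    "the": [-0.5, -0.5],
}
toy_theta = [2.0, 0.1]

for token, prob in top_tokens(toy_theta, toy_unembedding, k=2):
    print(token, round(prob, 3))
```

With a real model, `toy_unembedding` would be replaced by the rows of the output embedding matrix and `toy_theta` by a task vector extracted at layer $L$.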

Related Work

A key question with ICL is how it emerges as a capability from pre-training the LLMs. Levine et al. (2022) provide results in this direction that highlight the importance of training data structure. Xie et al. use probabilistic analysis and model pre-training data using Hidden Markov Models to theoretically explain the emergence of ICL, while Chan et al. (2022) empirically explore the effect of several distributional properties of the pre-training data.

Meta-Learning in Transformers

Studies by Akyürek et al. (2022); von Oswald et al. (2022); Garg et al. focus on the meta-learning capabilities of transformers. They typically train models from scratch on elementary tasks such as linear regression, drawing theoretical parallels with algorithms like Gradient Descent and demonstrating how transformers could implement them. A key assumption of these works is a known parameter space within which gradient descent operates. Our work focuses on identifying such a parameter space for LLMs.

ICL in LLMs

Olsson et al. (2022) identify “induction heads” in transformers as a likely main mechanism of ICL. Dai et al. (2022) provide empirical evidence for the connection of ICL to Gradient Descent in LLMs, focusing on classification tasks. Concurrent work by Merullo et al. (2023) also explores a phenomenon similar to the task vectors we study here, where a single vector can encode learned functions. Our findings are complementary to theirs, and future work could explore the relationship between the two more closely.

Conclusions

Through this exploration of ICL in LLMs, we have shed light on a new perspective on the learning mechanisms of ICL. We have revealed a simple and elegant structure: ICL functions by compressing a given training set into a single task vector, which then guides the transformer to generate appropriate outputs given queries. Our work provides a stepping stone towards understanding how LLMs perform ICL. In light of our findings, future work could focus on understanding how the task vector is constructed as well as how it is used to calculate the output.

Limitations

We study relatively simple tasks, whereas ICL can learn to perform more complex tasks, such as solving arithmetic reasoning problems. It remains to be seen if and how the mechanisms we observe here will translate to these cases. E.g., our approach focuses on cases where a single task vector suffices, while more complex ICL cases may require more elaborate parameterization. We also focus on tasks where the output is a single token, while some other tasks require multi-token outputs.

Finally, as noted above, we do not provide a mechanistic explanation for how the task vector is formed or how it is used. Namely, we do not explain how the transformer performs these calculations using its parameters.

Acknowledgements

This project is funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant ERC HOLI 819080).

References

Appendix A

Here we provide additional details and results.

Our study covers 18 tasks in 4 categories: Algorithmic, Translation, Linguistic and Knowledge. A detailed description of all tasks is provided in Tab. 5.

Model Details

More details on the models used in the study are provided in Tab. 4.

Task Data

Here we detail the sources of the data for each task. The accompanying GitHub repository contains the data itself as well as the code used to create it.

Translation: For each language pair, the most frequent words in the source language are first retrieved from https://github.com/frekwencja/most-common-words-multilingual and are then translated to the destination language using the open-source package nltk.

Linguistic: The data for the tenses tasks is parsed from https://github.com/Drulac/English-Verbs-Conjugates. The data for the plural-singular task is taken from https://github.com/sindresorhus/irregular-plurals. Finally, the data for the antonyms task is taken from https://github.com/SuzanaK/english_synonyms_antonyms_list.

Knowledge: Data for the knowledge tasks is taken from the counterfactual dataset introduced in Meng et al. (2022).

Conflicting Tasks Experiment

In Fig. 6, we provide more details and a visualization of the experiment described in §5.

A.2 Additional Results

Fig. 7 shows results similar to Fig. 3, but for different models. It is interesting to observe that the curves are similar across different-sized models.

Detailed results for Fig. 4.

Fig. 4 presented results for our $(\mathcal{A},f)$ hypothesis-based approach, averaged across tasks. The main results table provides these results for all the specific tasks considered.

Dependence of $\mathcal{A}$ on $x$

Fig. 9 and Fig. 8 provide more results on the geometry of the $\boldsymbol{\theta}$ vectors (see main text for discussion).

Inspecting Task Vectors

Tab. 7 is an expanded version of Tab. 3, providing more vocabulary projections of $\boldsymbol{\theta}$ for additional tasks and on multiple LLMs.