Transformers generalize differently from information stored in context vs in weights

Stephanie C. Y. Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K. Lampinen, Felix Hill

Introduction

Transformer-based architectures have an impressive ability to use both information stored in weights during training (“in-weights learning”) and information stored only in the inputs provided at inference time (without any gradient updates to the weights of the model; “in-context learning”) (Chan et al., 2022). In-context learning on pretrained models enables learning efficiently from a few examples (“few-shot learning”) (Brown et al., 2020), or even efficiently compressing a large dataset (“prompt tuning”) (Li and Liang, 2021; Lester et al., 2021; Sun et al., 2022). Given the evident current and future potential of this new learning paradigm, it is important to understand its inductive biases, especially how they differ from those of in-weights learning.

One way to understand inductive bias is by examining how models generalize to held-out data. In this work, we adapt the experimental paradigm of Dasgupta et al. (2022), which poses a classification task that distinguishes between two previously defined kinds of generalization behavior (see Fig 1). A “rule-based” decision is made on the basis of minimal features that support the category boundary (Ashby and Townsend, 1986), while an “exemplar-based” decision generalizes on the basis of similarity to examples from training data (Shepard and Chang, 1963), invoking many or all features available.

This distinction is particularly interesting when comparing in-weights vs in-context learning. Exemplar-based generalization (which uses all available features) is useful in a low-data regime where there is not enough information to form an abstract sparse rule (Feldman, 2020). On the other hand, sparser rule-based generalization may help avoid sensitivity to spurious correlations when training on the large, noisy, naturalistic datasets that are commonly used for in-weights learning.

We find that transformers exhibit a striking difference in their generalization from in-context vs in-weights information. Transformers display a strong inductive bias towards exemplar-based generalization from in-context information. In contrast, transformers display a strong inductive bias towards rule-based generalization from in-weights information.

However, when we pose a similar task to large transformer models pretrained on language, they exhibit stronger rule-based generalization from in-context information. One interpretation of these results is that the distribution of natural language is more compatible with rule-based generalization from context (rule-based generalization is in fact optimal in compositional domains like language (Arjovsky et al., 2019)), and such patterns might present strong enough learning pressure to overcome – and even reverse – transformers’ inherent bias towards exemplar-based generalization from context.

Experimental Design

We adapted the “partial exposure” paradigm from Dasgupta et al. (2022), in which each stimulus has two features and only one of the features predicts the label. We evaluate how the model generalizes to a held-out combination, i.e. whether it uses sparse rules or similarity to exemplars (see Fig 1).

First, we explore generalization in transformers trained on controlled synthetic data, where we can examine generalization from both in-weights and in-context information and directly compare them. Second, we repeat this experiment on pretrained language models and characterize their in-context generalization. Finally, we compare the patterns observed and investigate factors that explain the differences.

Results

For the trained-from-scratch transformers, we passed sequences of stimulus-label pairs as inputs to the transformer model (Vaswani et al., 2017). The sequences consisted of two parts: a context (24 tokens; i.e. 12 stimulus-label pairs) and a query (stimulus). The model was trained to minimize a softmax cross-entropy loss on the prediction for the final (query) stimulus. Each stimulus consists of two subvectors concatenated together into a single token (Fig 4(c)) – these subvectors comprise the two features of the partial exposure paradigm. See Appendix A for further details.

To investigate generalization from in-weights information, we trained the model on partial exposure data, and evaluated on the held-out combination. During training, the label for each stimulus class was fixed, so that the stimulus-label mappings were stored in weights. The context tokens are uninformative for the query. After training, we measured the model’s biases by evaluating on the held-out class combination. See Appendix Fig 4(d) for details. When trained and evaluated in this way, transformers displayed fully rule-based generalization (Fig 2(a)), i.e. based on a sparse rule that only took the first feature into account.

Generalization from in-context information is exemplar-based.

To investigate generalization from in-context information, we first pretrained the model for few-shot in-context learning (see A.2 for more details), i.e. to refer to information provided in-context when making a query prediction. Importantly, pretraining for few-shot learning imparts no bias towards either rule-based or exemplar-based generalization, because either is an equally valid strategy for solving the few-shot problems. Then we evaluated the model on partial exposure sequences. See Appendix Fig 4(b) for example sequences used for training and evaluation. When trained and evaluated in this way, transformers displayed totally exemplar-based generalization, in striking contrast to the rule-based generalization that was observed from weights. That is, when queried on the held-out combination BX, the models were equally likely to output the labels associated with either AX or BW (Fig 2(b)), indicating that they are comparing the query directly (and along all features) to examples that were shown in context.
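The contrast between the two strategies on the held-out query BX can be made concrete with a toy sketch. This is not the transformer's mechanism. It assumes that the partial exposure context contains AW and AX under one label and BW (shown twice) under another, and the similarity rule (count of shared features over distinct exemplars) is an illustrative choice of ours, as are the function names.

```python
from collections import Counter

# Assumed partial exposure context: the first feature (A/B) predicts the label.
# AW and AX share label 0; BW has label 1 and appears twice to balance labels.
context = [("A", "W", 0), ("A", "X", 0), ("B", "W", 1), ("B", "W", 1)]
query = ("B", "X")  # held-out feature combination


def rule_based(context, query, feature_index=0):
    """Vote using only exemplars that match the query on the predictive feature."""
    votes = Counter(
        ex[2] for ex in context if ex[feature_index] == query[feature_index]
    )
    total = sum(votes.values())
    return {label: n / total for label, n in votes.items()}


def exemplar_based(context, query):
    """Weight each distinct exemplar by how many features it shares with the query."""
    weights = Counter()
    for f1, f2, label in set(context):  # de-duplicate repeated exemplars
        weights[label] += (f1 == query[0]) + (f2 == query[1])
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


print("rule-based:", rule_based(context, query))
print("exemplar-based:", exemplar_based(context, query))
```

Under these assumptions the rule-based vote puts all probability on BW's label, while the exemplar-based vote splits evenly between the labels of AX and BW, which is the pattern described for in-context generalization above.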

Pretrained language models

Do these patterns hold for pretrained language models, arguably the most widely used transformer-based models at the moment? We used a large (70B parameters) pretrained language model (LM) trained for autoregressive language prediction on large-scale web data (Hoffmann et al., 2022). To investigate in-context generalization in this model, we use the same “partial exposure” experimental paradigm, but with shape and color words as the two features comprising the stimuli, and nonsense words for the labels (see Fig 5 for an example). We take the completion generated by the LM as the predicted query label.

With the synthetic data, we could ensure no a priori bias towards any feature; we have no such guarantee here. We account for any possible bias toward generalizing along a specific feature with a control condition from Dasgupta et al. (2022) in which neither feature is more predictive in the training data, so the model must use pure exemplar-based generalization. Performance in this condition serves as the baseline for pure exemplar-based generalization, and deviation from this baseline is the measure of rule-based generalization. See Appendix B for further details. We do not investigate in-weights generalization in these models, since that would require manipulation and control of the training data and large-scale re-training.

First, in the control condition where the model is forced to use exemplar-based generalization, we find that the language model prefers generalizing along the color dimension rather than the shape dimension (Fig 3(a)). Since bias towards shape gives better classification performance on naturalistic visual stimuli (Landau et al., 1988; Geirhos et al., 2018), it is interesting that we see the opposite bias in a system trained on naturalistic text data; future work should look into possible explanations (e.g. the “pragmatic” nature of language; Degen et al. (2019)). The LM predominantly produces one of the two labels provided in context, but does sometimes produce an unseen word (‘other’ in Fig 3). The LM produces an unseen word more frequently when the query is also unseen in-context (not reported), suggesting a mutual exclusivity bias (Gandhi and Lake, 2020), also worth future investigation.

To measure the degree of rule-based generalization, we compared each partial exposure result to the respective control: for instance, we compared the probability of generalizing along color in the color-predictive partial exposure evaluation condition (Fig 3(c)) vs in the control condition. If a model uses pure exemplar-based generalization, there will be no difference in the classification pattern between the partial exposure and the control conditions. Alternatively, rule-based generalization predicts an increased sensitivity to the predictive feature dimension (either shape or color) compared to control. Here, we found evidence of rule-based generalization for both color and shape. This contrasts with the purely exemplar-based in-context generalization we saw in our synthetic experiment.

Smaller models are less rule-based.

We investigate this further by evaluating language models of different sizes. We measure ‘rule-ness’ as how much more likely the model is to generalize along the predictive feature (as supported by a sparse rule) in the partial exposure condition, compared to the corresponding (model-specific) control condition. This corresponds to the difference between the purple bar and dotted control in Figs 2 and 3. We find that smaller models (1B and 7B parameters) are steadily less rule-based, with the 1B model effectively performing exact exemplar-based generalization – similar to those trained from scratch (Fig 3(d), further details in Appendix Fig 6).
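As a minimal sketch of the ‘rule-ness’ measure, the difference can be computed from categorized model completions. The outcome categories ("color", "shape", "other"), the function names, and the counts below are hypothetical illustrations, not results from the experiments.

```python
def fraction(outcomes, feature):
    """Fraction of completions that generalize along the given feature."""
    return sum(o == feature for o in outcomes) / len(outcomes)


def rule_ness(partial_outcomes, control_outcomes, feature):
    """Extra probability of generalizing along `feature` relative to the control."""
    return fraction(partial_outcomes, feature) - fraction(control_outcomes, feature)


# Hypothetical categorized completions for a color-predictive condition.
partial = ["color"] * 8 + ["shape"] * 1 + ["other"] * 1
control = ["color"] * 5 + ["shape"] * 4 + ["other"] * 1

score = rule_ness(partial, control, feature="color")
print(f"rule-ness along color: {score:.2f}")
```

A score near zero corresponds to pure exemplar-based generalization (no change from control), and larger positive values indicate more rule-based behavior.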

In-context generalization can be made more rule-based with pre-training.

We did not observe the same effect of scale in transformers that were trained from scratch on synthetic data – increasing number of layers, number of attention heads, and number of classes did not lead to more rule-basedness. This may be because we were not able to achieve the necessary scale with those experiments or due to synergistic effects between scale and the type of training data.

To evaluate the role of training data, we evaluated in-context generalization on a transformer trained on synthetic stimuli where the query explicitly required rule-based classification (more details in Appendix A.4). With this training data, the transformer learns rule-based generalization to held-out sequences (Fig 2(c)). Thus, while transformers exhibit an inherent bias towards exemplar-based generalization from context, when trained on data that encourages rule-based generalization from context, they can learn to do so. This supports the interpretation that pretrained language models show rule-based generalization because natural language contains implicitly rule-based data, but crucially, this structure only seems to be picked up and used by larger models.

Conclusions

We found distinct patterns of generalization when transformers generalize from information stored in weights vs in context. When trained on synthetic data, generalization from in-weights information is completely rule-based, whereas generalization from in-context information is almost entirely exemplar-based. However, a pre-trained language model is surprisingly rule-based when generalizing from in-context information, and this is increasingly true for larger models. We find that it is indeed possible to induce rule-based generalization from in-context information by pretraining a transformer on an explicitly rule-based classification problem. Together, these findings support the possibility that natural language data (perhaps because of its combinatorial nature) provides a strong learning pressure towards rule-like generalization, which works in concert with model scale.

References

Appendix A Experiment details: Trained-from-scratch transformers

Each subvector belongs to a particular “subvector class”, and each subvector class is characterized by a different centroid. The subvectors are sampled from a multivariate normal centered on that centroid. A “stimulus class” is the concatenation of two subvector classes, and a stimulus is a sample from a stimulus class. Example stimuli are shown in Fig 4. This design is a 64-dimensional generalization of the 2-D classification example from Dasgupta et al. (2022), and ensures that there is no a priori bias towards one feature or another.

Number of features per stimulus: 2 | Feature length: 32 | Number of classes per feature: 10 | Number of values per class: 100 | Covariance scaling on the per-class normal distributions: 0.1
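The sampling procedure can be sketched as follows, using only the Python standard library. Several details are assumptions made for illustration: centroids are drawn from a standard normal, the covariance is taken to be isotropic (0.1 times the identity), and the names `sample_subvector` and `sample_stimulus` are ours.

```python
import math
import random

FEATURE_LEN = 32          # length of each subvector
CLASSES_PER_FEATURE = 10  # subvector classes per feature position
COV_SCALE = 0.1           # covariance scaling (assumed isotropic)

rng = random.Random(0)

# One centroid per subvector class and feature position (assumed standard normal).
centroids = [
    [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_LEN)]
     for _ in range(CLASSES_PER_FEATURE)]
    for _ in range(2)
]


def sample_subvector(feature, cls):
    """Draw a subvector from a normal centered on the class centroid."""
    std = math.sqrt(COV_SCALE)
    return [mu + rng.gauss(0.0, std) for mu in centroids[feature][cls]]


def sample_stimulus(cls_first, cls_second):
    """A stimulus is the concatenation of one subvector per feature."""
    return sample_subvector(0, cls_first) + sample_subvector(1, cls_second)


stimulus = sample_stimulus(3, 7)
print(len(stimulus))  # 2 features x 32 dimensions
```

Because both feature positions are generated by the same procedure, neither subvector is privileged, which is the property the design relies on.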

A.2 Pretraining for few-shot learning

To pretrain the model for in-context learning, we pretrained the model on few-shot (4-shot 3-way) sequences, i.e. sequences for which the context consisted of 3 different stimulus classes each repeated 4 times, and where the query class was one of those 3 classes. The classes and labels were randomly assigned for each sequence.
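A minimal sketch of how such a 4-shot 3-way sequence might be assembled is given below. Stimulus classes are represented by integer IDs rather than sampled vectors, the total of 100 stimulus classes (10 x 10 subvector classes) and the shuffling of the context are assumptions, and `make_few_shot_sequence` is a hypothetical helper name.

```python
import random


def make_few_shot_sequence(rng, num_stimulus_classes=100, n_way=3, n_shot=4):
    """Return (context, query_class, query_label) for one few-shot sequence."""
    # Classes and labels are re-drawn for every sequence, so the mapping
    # can only be recovered from the context, not from the weights.
    classes = rng.sample(range(num_stimulus_classes), n_way)
    labels = rng.sample(range(n_way), n_way)
    label_of = dict(zip(classes, labels))

    context = [(c, label_of[c]) for c in classes for _ in range(n_shot)]
    rng.shuffle(context)  # assumed: context order is randomized

    query_class = rng.choice(classes)
    return context, query_class, label_of[query_class]


rng = random.Random(0)
context, query_class, query_label = make_few_shot_sequence(rng)
print(len(context), query_class, query_label)
```

With 3 classes repeated 4 times, the context contains 12 stimulus-label pairs, matching the context length used in the main experiments.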

A.3 Evaluating inductive biases

Models were trained and evaluated as shown in Figs 4(b) and 4(d).

An additional label ("2", in the example in Fig 4(b)) and corresponding examples were included in the partial exposure sequences, to ensure that if the model is equally likely to select the labels associated with "A" and "B", it is because of those examples’ similarity to the query stimulus, rather than because the model is selecting labels at chance. This was similarly done for training on partial exposure data in weights (not shown in Fig 4(d)).

Note also that the BW stimulus was shown twice as often as other stimuli in the partial exposure data, as was done in Dasgupta et al. (2022), in order to ensure that there is no bias induced by having one label more frequent than another. We also ran experiments where BW was not repeated, and the pattern of results was the same.

A.4 Pretraining for rule-based generalization

To pretrain the model for rule-based generalization, we used the partial exposure sequences shown in Fig 4(b), but for training as well as for evaluation. In training, the label for the query BX was always the same as the label associated with BW in context (note that the stimuli and labels are randomly assigned, so that BX could be manifested as any of the stimulus classes, and the query label could be any of .)

A.5 Architecture, training, and evaluation details

Num layers: 12 | Embedding size: 64 | Optimizer: Adam | Batch size: 32 | Learning rate schedule: Linear warmup and square root decay, described as min(3e-4 / 4000 * global_step, power(4000, 0.5) * 3e-4 * power(global_step, -0.5))
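The learning rate schedule above can be written as a small function. This is a direct transcription of the stated expression; the function and parameter names are ours, and returning 0 at step 0 is an explicit choice to avoid raising 0 to a negative power.

```python
def learning_rate(global_step, max_lr=3e-4, warmup_steps=4000):
    """Linear warmup to max_lr, followed by inverse square root decay."""
    if global_step <= 0:
        return 0.0  # the warmup branch is 0 at step 0
    warmup = max_lr / warmup_steps * global_step
    decay = (warmup_steps ** 0.5) * max_lr * (global_step ** -0.5)
    return min(warmup, decay)


for step in (0, 2000, 4000, 16000, 200000):
    print(step, learning_rate(step))
```

The two branches meet at step 4000, where the rate peaks at 3e-4; after that the rate falls as the inverse square root of the step count, so it halves each time the step count quadruples.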

For each experiment, we trained on 16 TPUv3 cores and 4 V100 GPU cores for 200,000 training steps.

Models are evaluated on evaluation data throughout training, and bar plots in Fig 2 show evaluation outputs averaged over the last half of training (100k-200k steps). Error bars indicate 1.96 standard errors across 10 training runs (there is zero variance for the results in Figs 2(a) and 2(c)).
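The error bars can be computed as sketched below. The use of the sample standard deviation (n - 1 denominator) is an assumption, since the estimator is not specified here, and the per-run values in the example are hypothetical.

```python
import math
import statistics


def mean_and_error_bar(values, z=1.96):
    """Return the mean and z standard errors of the mean across runs."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, 0.0
    # Standard error: sample standard deviation divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, z * sem


# Hypothetical per-run values for 10 training runs.
runs = [0.48, 0.52, 0.50, 0.47, 0.53, 0.51, 0.49, 0.50, 0.52, 0.48]
mean, bar = mean_and_error_bar(runs)
print(f"{mean:.3f} +/- {bar:.3f}")
```

When every run produces the same value, as reported for Figs 2(a) and 2(c), the standard deviation and hence the error bar are exactly zero.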

Appendix B Experiment details: Large language model experiments

See Fig 5 for example evaluation sequences.

In order to account for potential low-level word-order effects, we evaluated on four different formats for the stimuli (‘red circle’, ‘a circle that is red’, ‘an object that is circular and red’, ‘an object that is red and circular’). We found no significant differences in qualitative patterns across the different word orderings, so we report the average across all four formats in our results.
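The four stimulus formats can be generated from a color word and a shape word as sketched below. Only the "red circle" example comes from the text; the adjective lookup table, the "triangle" entry, and the function name are hypothetical additions for illustration.

```python
# Adjective forms of shape nouns; "triangle" is an illustrative addition.
SHAPE_ADJECTIVE = {"circle": "circular", "triangle": "triangular"}


def stimulus_formats(color, shape):
    """Return the four word orderings used to describe one stimulus."""
    adjective = SHAPE_ADJECTIVE[shape]
    return [
        f"{color} {shape}",
        f"a {shape} that is {color}",
        f"an object that is {adjective} and {color}",
        f"an object that is {color} and {adjective}",
    ]


for text in stimulus_formats("red", "circle"):
    print(text)
```

Averaging results over these orderings, as done above, helps separate a preference for a feature from a preference for whichever word happens to come first or last.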

We also include more detailed information about the experiments using different model sizes in Fig 6. These show the raw performances (including model-specific controls) on all model sizes and feature sets.