Linear Representations of Sentiment in Large Language Models

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

Introduction

Large language models (LLMs) have displayed increasingly impressive capabilities (Brown et al., 2020; Radford et al., 2019; Bubeck et al., 2023), but their internal workings remain poorly understood. Nevertheless, recent evidence (Li et al., 2023) has suggested that LLMs are capable of forming models of the world, i.e., inferring hidden variables of the data generation process rather than simply modeling surface word co-occurrence statistics. There is significant interest (Christiano et al., 2021; Burns et al., 2022) in deciphering the latent structure of such representations.

In this work, we investigate how LLMs represent sentiment, a variable in the data generation process that is relevant and interesting across a wide variety of language tasks (Cui et al., 2023). Approaching our investigations through the frame of causal mediation analysis (Vig et al., 2020; Pearl, 2022; Geiger et al., 2023a), we show that these sentiment features are represented linearly by the models, are causally significant, and are utilized by human-interpretable circuits (Olah et al., 2020; Elhage et al., 2021a).

We find the existence of a single direction scientifically interesting as further evidence for the linear representation hypothesis (Mikolov et al., 2013; Elhage et al., 2022), namely that models tend to extract properties of the input and internally represent them as directions in activation space. Understanding the structure of internal representations is crucial to begin to decode them, and linear representations are particularly amenable to detailed reverse-engineering (Nanda et al., 2023b).

We show evidence of a phenomenon we have labeled the “summarization motif”, where rather than sentiment being directly moved from valenced tokens to the final token, it is first aggregated on intermediate summarization tokens without inherent valence such as commas, periods and particular nouns. (Our use of the term “summarization” is distinct from typical NLP summarization tasks.) This summarization structure for next token prediction can be seen as a naturally emerging analogue to the explicit classification token in BERT-like models (Devlin et al., 2018). We show that the sentiment stored on summarization tokens is causally relevant for the final prediction. We find this an intriguing example of an “information bottleneck”, where the data generation process is funnelled through a small subset of tokens used as information stores. Understanding the existence and location of information bottlenecks is a key first step to deciphering world models. This finding additionally suggests the models’ ability to create summaries at various levels of abstraction, in this case a sentence or clause rather than a token.

Our contributions are as follows. In Section 3, we demonstrate methods for finding a linear representation of sentiment using a toy dataset and show that this direction correlates with sentiment information in the wild and matters causally in a crowdsourced dataset. In Section 4, we show through activation patching (Vig et al., 2020; Geiger et al., 2020) and ablations that the learned sentiment direction captures summarization behavior that is causally important to circuits performing sentiment tasks. Through this case study, we model an investigation of what a single interpretable direction means on the full data distribution.

Methods

ToyMovieReview: A templatic dataset of continuation prompts that we generated with the form

where ADJECTIVE and VERB are either two positive words (e.g., incredible and enjoyed) or two negative words (e.g., horrible and hated) that are sampled from a fixed pool of 85 adjectives (split 55/30 for train/test) and 8 verbs. The expected completion for a positive review is one of a set of positive descriptors we selected from among the most common completions (e.g. great) and the expected completion for a negative review is a similar set of negative descriptors (e.g., terrible).

ToyMoodStories: A similar toy dataset with prompts of the form “NAME1 VERB1 parties, and VERB2 them whenever possible. NAME2 VERB3 parties, and VERB4 them whenever possible. One day, they were invited to a grand gala. QUERYNAME feels very”. To evaluate the model’s output, we measure the logit difference between the “excited” and “nervous” tokens.

SST (Socher et al., 2013) consists of 10,662 one-sentence movie reviews with human-annotated sentiment labels for every phrase from every review.

OpenWebText (OWT; Gokaslan & Cohen, 2019) is the pretraining dataset for GPT-2, which we use as a source of random text for correlational evaluations.

GPT-2 and Pythia (Radford et al., 2019; Biderman et al., 2023): These are families of decoder-only transformer models with sizes varying from 85M to 2.8B parameters. We use GPT2-small for movie review continuation, pythia-1.4b for classification and pythia-2.8b for multi-subject tasks.

2.2 Finding Directions

Distributed Alignment Search (DAS) (Geiger et al., 2023b): The direction is a learned parameter $\theta$, where the training objective is the average logit difference

after patching using direction $\theta$ (see Section 2.3).

2.3 Causal Interventions

In activation patching (Geiger et al., 2020; Vig et al., 2020), we create two symmetrical datasets, where each prompt $x_{\text{orig}}$ and its counterpart prompt $x_{\text{flipped}}$ are of the same length and format but where key words are changed in order to flip the sentiment; e.g., “This movie was great” could be paired with “This movie was terrible.” We first conduct a forward pass using $x_{\text{orig}}$ and capture these activations for the entire model. We then conduct forward passes using $x_{\text{flipped}}$, iteratively patching in activations from the original forward pass for each model component. We can thus determine the relative importance of various parts of the model with respect to the task currently being performed. Geiger et al. (2023b) introduce distributed interchange interventions, a variant of activation patching that we call “directional activation patching”. The idea is that rather than modifying the standard basis directions of a component, we instead only modify the component along a single direction in the vector space, replacing it during a forward pass with the value from a different input.
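The directional variant described above can be illustrated with a minimal NumPy sketch. It operates on toy arrays rather than a real model: the function name `directional_patch`, the array shapes and the random data are all assumptions made for illustration, not the implementation used in the experiments.

```python
import numpy as np

def directional_patch(act_flipped, act_orig, direction):
    """Overwrite the component of `act_flipped` along `direction` with the
    corresponding component of `act_orig`; orthogonal components are kept."""
    unit = direction / np.linalg.norm(direction)
    coef_flipped = act_flipped @ unit          # scalar projection per position
    coef_orig = act_orig @ unit
    return act_flipped + (coef_orig - coef_flipped)[..., None] * unit

# Toy stand-ins for residual stream activations of shape (positions, d_model).
rng = np.random.default_rng(0)
d_model = 8                                    # hypothetical width
direction = rng.normal(size=d_model)           # stand-in for a sentiment direction
unit = direction / np.linalg.norm(direction)
act_orig = rng.normal(size=(5, d_model))       # activations on x_orig
act_flipped = rng.normal(size=(5, d_model))    # activations on x_flipped

patched = directional_patch(act_flipped, act_orig, direction)
```

After patching, the projection onto the direction matches the original run while everything orthogonal to it still comes from the flipped run, which is what distinguishes this from patching the full activation.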

We use two evaluation metrics: the logit difference (difference in logits for correct and incorrect answers) metric introduced in Wang et al. (2022), and a “logit flip” accuracy metric (Geiger et al., 2022), which quantifies the proportion of cases where we induce an inversion in the predicted sentiment.
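As a rough illustration of these two metrics, the sketch below computes them on a made-up two-token vocabulary. The function names and the choice to count a flip whenever the sign of the logit difference changes are assumptions for this example; the exact definitions used in the experiments may differ (for instance when sets of answer tokens are used).

```python
import numpy as np

def logit_diff(logits, correct_ids, incorrect_ids):
    """Per-example logit of the correct answer minus the incorrect answer."""
    rows = np.arange(logits.shape[0])
    return logits[rows, correct_ids] - logits[rows, incorrect_ids]

def logit_flip_rate(clean_diff, patched_diff):
    """Fraction of examples whose predicted sentiment is inverted by patching."""
    return float(np.mean((clean_diff > 0) != (patched_diff > 0)))

# Toy final-position logits over a 2-token vocabulary (illustrative values only).
correct = np.array([0, 1, 0])
incorrect = np.array([1, 0, 1])
logits_clean = np.array([[2.0, 0.5], [1.0, 3.0], [0.2, 0.1]])
logits_patched = np.array([[0.5, 2.0], [1.0, 2.5], [0.0, 0.4]])

clean = logit_diff(logits_clean, correct, incorrect)
patched = logit_diff(logits_patched, correct, incorrect)
flip_rate = logit_flip_rate(clean, patched)
```

The logit difference is a continuous measure of how far an intervention moves the output, whereas the flip rate only registers interventions strong enough to change the predicted class.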

We eliminate the contribution of a particular component to a model’s output, usually by replacing the component’s output with zeros (zero-ablation) or the mean over some dataset (mean-ablation), in order to measure how important it is. We also perform directional ablation, in which a component’s activations are ablated only along a specific (e.g. sentiment) direction.
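The three kinds of ablation can be sketched on toy arrays as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and whether directional ablation sets the projection to zero or to its dataset mean is left as an option here because the text does not specify it.

```python
import numpy as np

def zero_ablate(act):
    """Replace the activation with zeros."""
    return np.zeros_like(act)

def mean_ablate(act, reference_acts):
    """Replace every row with the mean activation over a reference dataset."""
    return np.broadcast_to(reference_acts.mean(axis=0), act.shape).copy()

def directional_ablate(act, direction, reference_acts=None):
    """Ablate only the component along `direction`.
    With no reference data the projection is set to zero; otherwise it is set
    to the mean projection over the reference data (an assumed variant)."""
    unit = direction / np.linalg.norm(direction)
    coef = act @ unit
    target = 0.0 if reference_acts is None else (reference_acts @ unit).mean()
    return act + (target - coef)[..., None] * unit

rng = np.random.default_rng(0)
reference_acts = rng.normal(size=(100, 8))     # toy "dataset" of activations
act = reference_acts[:4]
direction = rng.normal(size=8)
unit = direction / np.linalg.norm(direction)

zeroed = zero_ablate(act)
meaned = mean_ablate(act, reference_acts)
dir_zero = directional_ablate(act, direction)
dir_mean = directional_ablate(act, direction, reference_acts)
```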

Finding and Evaluating a ‘Sentiment Direction’

The first question we investigate is whether there exists a direction in the residual stream in a transformer model that represents the sentiment of the input text, as a special case of the linear representation hypothesis (Mikolov et al., 2013). We show that the methods discussed above (Section 2.2) all arrive at a similar sentiment direction. Given some input text to a model, we can project the residual stream at a given token/layer onto a sentiment direction to get a ‘sentiment activation’.

We fit directions using the ToyMovieReview dataset (Section 2.1) across various methods and find extremely high cosine similarity between the learned sentiment directions (Figure 2). This suggests that these are all noisy approximations of the same single direction. Indeed, we generally found that the following results were very similar regardless of exactly how we specified the sentiment direction. The directions we found were not sparse vectors, as expected since the residual stream is not a privileged basis (Elhage et al., 2021b).
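To make the comparison between fitting methods concrete, the sketch below fits two candidate directions on synthetic data and compares them by cosine similarity. Everything here is an assumption for illustration: the data are random clusters rather than model activations, the $K$-means routine is a bare-bones Lloyd's algorithm, and the labelled "difference of class means" direction is a simple supervised baseline chosen for contrast.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def kmeans_direction(x, n_iter=50):
    """Difference between the two centroids found by Lloyd's algorithm (K=2)."""
    first = x[0]
    second = x[np.argmax(np.linalg.norm(x - first, axis=1))]  # furthest point
    centroids = np.stack([first, second])
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):            # keep old centroid if cluster empties
                centroids[k] = x[labels == k].mean(axis=0)
    return unit(centroids[0] - centroids[1])

# Synthetic "positive" and "negative" activations separated along one axis.
rng = np.random.default_rng(0)
d_model = 16
true_direction = unit(rng.normal(size=d_model))
pos = 4.0 * true_direction + 0.3 * rng.normal(size=(30, d_model))
neg = -4.0 * true_direction + 0.3 * rng.normal(size=(30, d_model))

mean_diff = unit(pos.mean(axis=0) - neg.mean(axis=0))        # uses labels
kmeans = kmeans_direction(np.concatenate([pos, neg]))         # unsupervised
similarity = abs(cosine(mean_diff, kmeans))   # sign of a K-means axis is arbitrary

# A 'sentiment activation' is then just a projection onto the chosen direction.
sentiment_activation = pos @ mean_diff
```

Because the cluster labels from $K$-means are arbitrary, the sign of the unsupervised direction carries no meaning on its own, so the absolute cosine similarity is the quantity compared.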

Here we show a visualisation in the style of Neuroscope (Nanda, 2023a) where the projection is represented by color, with red being negative and blue being positive. It is important to note that the direction being examined here was trained on just 30 positive and 30 negative English adjectives in an unsupervised way (using $K$-means with $K=2$). Nevertheless, the extreme values along this direction appear readily interpretable in the wild in diverse text domains such as the opening paragraphs of Harry Potter in French (Figure 1). An interactive visualisation of residual stream directions in GPT2-small is available here (Yedidia, 2023) and sentiment directions here.

It is important to note that this type of analysis is qualitative and should not substitute for rigorous statistical tests, as it is susceptible to interpretability illusions (Bolukbasi et al., 2021). We therefore evaluate our directions using correlational and causal methods.

3.2 Correlational Evaluation

In a correlational analysis, we classify word sentiment by ‘sentiment activation’ and show that the sentiment direction is sensitive to negation flipping sentiment.

To test the meaning of the sentiment axis, we binned the sentiment activations of OpenWebText tokens from the first residual stream layer of GPT2-small into 20 equal-width buckets and sampled 20 tokens from each. We then asked GPT-4 to classify each token as Positive/Neutral/Negative. Specifically, we gave the GPT-4 API prompts of the following form: “Your job is to classify the sentiment of a given token (i.e. word or word fragment) into Positive/Neutral/Negative. Token: ‘{token}’. Context: ‘{context}’. Sentiment: ” where the context length was 20 tokens centered around the sampled token. Only a cursory human sanity check was performed.
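The binning and sampling step can be sketched as below. This is an assumed reconstruction on random numbers standing in for sentiment activations; the function name, the seed and the handling of sparsely populated bins (taking fewer than 20 samples when a bucket is small) are illustrative choices, not details taken from the experiments.

```python
import numpy as np

def sample_tokens_by_bin(activations, n_bins=20, per_bin=20, seed=0):
    """Split activations into equal-width bins and sample token indices from each."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    # Using only the interior edges yields bin ids in 0..n_bins-1,
    # with the maximum value landing in the last bin.
    bin_ids = np.digitize(activations, edges[1:-1])
    samples = {}
    for b in range(n_bins):
        members = np.flatnonzero(bin_ids == b)
        k = min(per_bin, members.size)         # tail bins may hold few tokens
        samples[b] = rng.choice(members, size=k, replace=False)
    return edges, bin_ids, samples

# Stand-in for per-token sentiment activations on a text corpus.
activations = np.random.default_rng(1).normal(size=5000)
edges, bin_ids, samples = sample_tokens_by_bin(activations)
```

Equal-width (rather than equal-count) buckets mean the tails are sampled as heavily as the dense centre, which is what lets the extremes of the direction be inspected.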

In Figure 3, we show an area plot of the classifications by activation bin. We contrast the results for different methods in Table 3. In the area plot we can see that the left side area is dominated by the “Negative” label, whereas the right side area is dominated by the “Positive” label and the central area is dominated by the “Neutral” label. Hence the tails of the activations seem highly interpretable as representing a bipolar sentiment feature. The large space in the middle of the distribution simply occupied by neutral words (rather than a more continuous degradation of positive/negative) indicates superposition of features (Elhage et al., 2022).

Using the $K$-means sentiment direction after the first layer of GPT2-small, we can obtain a view of how the model updates its view of sentiment during the forward pass, analogous to the “logit lens” technique from nostalgebraist (2020). In Figure A.5, we see how the sentiment activation flips when the context of the sentiment word denotes that it is negated. Words like ‘fail’, ‘doubt’ and ‘uncertain’ can be seen to flip from negative in the first couple of layers to being positive after a few layers of processing. An interesting task for future circuits analysis research could be to better understand the circuitry used to flip the sentiment axis in the presence of a negation context. We suspect significant MLP involvement (see Section A.5).

3.3 Causal Evaluation

We evaluate the sentiment direction using directional patching in Figure 4. These evaluations are performed on prompts with out-of-sample adjectives and the direction was not trained on any verbs. Unsupervised methods such as $K$-means are still able to shift the logit differences and DAS is able to completely flip the prediction.

If the sentiment direction were simply a trivial feature of the token embedding, then one might expect that directional patching would be most effective in the first or final layer. However, we see in Figure 6 that it is in fact in intermediate layers of the model where we see the strongest out-of-distribution performance on SST. This suggests the speculative hypothesis that the model uses the residual stream to form abstract concepts in intermediate layers and that this is where the latent knowledge of sentiment is most prominent.

A further verification of causality is shown in Figure A.3. Here we use the technique of “activation addition” from Turner et al. (2023). We add a multiple of the sentiment direction to the first layer residual stream during each forward pass while generating sentence completions. Here we start from the baseline of a positive movie review: “I really enjoyed the movie, in fact I loved it. I thought the movie was just very…”. By adding increasingly negative multiples of the sentiment direction, we find that indeed the completions become increasingly negative, without completely destroying the coherence of the model’s generated text. We are wary of taking the model’s activations out of distribution using this technique, but we believe that the smoothness of the transition in combination with the knowledge of our findings in the patching setting give us some confidence that these results are meaningful.
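The arithmetic of activation addition is simple enough to show on toy arrays. The sketch below is an assumption-laden illustration: it adds a multiple of a unit-normalised direction to every position of a random stand-in for the residual stream, and does not involve a real model or text generation. The multipliers are arbitrary example values.

```python
import numpy as np

def add_direction(resid, direction, alpha):
    """Add `alpha` times the unit-normalised direction at every position."""
    unit = direction / np.linalg.norm(direction)
    return resid + alpha * unit

rng = np.random.default_rng(0)
resid = rng.normal(size=(6, 8))                # toy (positions, d_model) activations
direction = rng.normal(size=8)                 # stand-in for a sentiment direction
unit = direction / np.linalg.norm(direction)

alphas = [0.0, -2.0, -4.0, -8.0]               # increasingly negative multiples
mean_projection = [
    float((add_direction(resid, direction, a) @ unit).mean()) for a in alphas
]
```

Each step shifts the projection onto the direction by exactly the multiplier and leaves all orthogonal components untouched, so large multipliers move activations far from values seen in training, which is the out-of-distribution concern raised above.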

We validate our sentiment directions derived from toy datasets (Section 3.3) on SST. We collapsed the labels down to a binary “Positive”/“Negative”, used only the unique phrases rather than any information about their source sentences, restricted to the ‘test’ partition and took a subset where pythia-1.4b can achieve 100% zero-shot classification accuracy, removing 17% of examples. Then we paired up phrases with an equal number of tokens (to maximise the chances of sentiment tokens occurring at similar positions), making up 460 clean/corrupted pairs. We used the scaffolding “Review Text: TEXT, Review Sentiment:” and evaluated the logit difference between “Positive” and “Negative” as our patching metric. Using the same DAS direction from Section 3 trained on just a few examples and flipping the corresponding sentiment activation between clean/corrupted in a single layer, we can flip the output 53.5% of the time (Figure 4).

The Summarization Motif for Sentiment

In this sub-section, we present circuit analyses that give qualitative hints of the summarization motif, and restrict quantitative analysis of the summarization motif to Section 4.2. (We use the term “circuit” as defined by Wang et al. (2022), in the sense of a computational subgraph that is responsible for a significant proportion of the behavior of a neural network on some predefined task.) Through an iterative process of path patching (see Section 2.3) and analysing attention patterns, we have identified the circuit responsible for the ToyMovieReview task in GPT2-small (Figure 7) as well as the circuit for the ToyMoodStories task. Below, we provide a brief overview of the circuits we identified, reserving the full details for Section A.3.

Mechanistically, this is a binary classification task, and a naive hypothesis is that attention heads attend directly from the final token to the valenced tokens and map positive sentiment to positive outputs and vice versa. This happens, but in addition attention head output is causally important at intermediate token positions, which are then read from when producing output at END. We consider this an instance of summarization, in which the model aggregates causally-important information relating to an entity at a particular token for later usage, rather than simply attending back to the original tokens that were the source of the information.

We find that the model performs a simple, interpretable algorithm to perform the task (using a circuit made up of 9 attention heads):

Identify sentiment-laden words in the prompt, at ADJ and VRB.

Write out sentiment information to SUM (the final “movie” token).

Read from ADJ, VRB and SUM and write to END. (We note that our patching experiments indicate that there is no causal dependence on the output of other model components at the ADJ and VRB positions, only at the SUM position.)

The results of activation patching the residual stream can be seen in the Appendix, Fig. A.7. The output of attention heads is only important at the movie position, which we designate as SUM. We label these heads “sentiment summarizers.” Specific attention heads attend to and rely on information written to this token position as well as to ADJ and VRB.

To validate this circuit and the involvement of the sentiment direction, we patched the entirety of the circuit at the ADJ and VRB positions along the sentiment direction only, achieving a 58.3% rate of logit flips and a logit difference drop of 54.8% (in terms of whether a positive or negative word was predicted). Patching the circuit at those positions along all directions resulted in flipping 97% of logits and a logit difference drop of 75%, showing that the sentiment direction is responsible for the majority of the function of the circuit.
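One way to read the "majority" claim is to express the direction-only effect as a fraction of the all-directions effect, using only the four figures reported above. This ratio is our own back-of-the-envelope framing rather than a metric defined in the experiments:

```latex
\[
\frac{58.3\%}{97\%} \approx 0.60 \quad \text{(logit flips)}, \qquad
\frac{54.8\%}{75\%} \approx 0.73 \quad \text{(logit difference drop)} .
\]
```

On either metric, patching along the single sentiment direction recovers more than half of the effect of patching the full activations at those positions.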

We next examined the circuit that processes the mood dataset in Pythia-2.8b (the smallest model that could perform the task), which is a more complex task that requires more summarization. As such it presents a better object for study of this motif. We reserve a detailed description of the circuit for the Appendix, but here we observed increasing reliance on summarization, specifically:

A set of attention heads attended primarily to the comma following the preference phrase for the queried subject (e.g. John hates parties,), and secondarily to other words in the phrase, as seen in Figure 5. We observed this phenomenon both with regular attention and value-weighted attention, and found via path patching that these heads relied partially on the comma token for their function, as seen in Figure A.9.

Heads attending to preference phrases (both commas and other tokens) tended to write to the repeated name token near the end of the sentence (John) as well as to the feels token, another type of summarization behavior. Later heads attended to the repeated name and feels tokens with an output important to END.

4.2 Exploring and validating summarization behavior in punctuation

Our circuit analyses reveal suggestive evidence that summarization behavior at intermediate tokens like commas, periods and certain nouns plays an important part in sentiment processing, despite these tokens having no inherent valence. We focus on summarization at commas and periods and explore this further in a series of ablation and patching experiments. We find that in many cases this summarization results in a partial information bottleneck, in which the summarization points become as important as (or sometimes more important than) the phrases that precede them for sentiment tasks.

In order to determine the extent of the information bottleneck presented by commas in sentiment processing, we tested the model’s performance on the multi-subject mood stories dataset mentioned above. We froze the model’s attention patterns to ensure the model used the information from the patched commas in exactly the same way as it would have used the original information. Without this step, the model could simply avoid attending to the commas. We then performed activation patching either on the pre-comma phrases (e.g., patching “John hates parties,” with “John loves parties,”) while freezing the commas so that they retain their original, unflipped values, or on the two commas alone. We found a similar drop in the logit difference for both, as shown in Table 1(a).

We also observed that reliance on summarization tends to increase with greater distances between the preference phrases and the final part of the prompt that would reference them. To test this, we injected irrelevant text after each of the preference phrases in our multi-subject mood stories (after “John loves parties.” etc.), e.g. “John loves parties. He has a red hat and wears it everywhere, especially when he is riding his bicycle through the city streets. Mark hates parties. He has a purple hat but only wears it on Sundays, when he takes his weekly walk around the lake. One day, they were invited to a grand gala. John feels very”. We then measured the ratio between logit difference change for the periods at the end of these phrases vs. pre-period phrases, with higher values indicating more reliance on period summaries (Table 1(b)). We found that the periods can be up to 15% more important than the actual phrases as this distance grows. Although these results are only a first step in assessing the importance of summarization relative to prompt length, our findings suggest that this motif may only increase in relative importance as models grow in context length, and thus merits further study.

4.3 Validating summarization behavior in SST

In order to study more rigorously how summarization behaves with natural text, we examined this phenomenon in SST. We appended the suffix “Review Sentiment:” to each of the prompts and evaluated Pythia-2.8b on zero-shot classification according to whether positive or negative has the higher probability and is among the top 10 tokens predicted. We then took the subset of examples that Pythia-2.8b succeeds on and that have at least one comma, which means we start with a baseline of 100% accuracy. We performed ablation and patching experiments on comma representations. If comma representations do not summarize sentiment information, then our experiments should not damage the model’s abilities. However, our results reveal a clear summarization motif for SST.

We performed two baseline experiments in order to obtain a control for our later experiments. First, to measure the total effect of the sentiment directions, we performed directional ablation (as described in Section 2.3) using the sentiment directions found with DAS at every token at every layer, resulting in a 71% reduction in the logit difference and a 38% drop in accuracy (to 62%). Second, we performed directional ablation on all tokens with a small set of random directions, resulting in a $<1\%$ change to the same metrics.

We then performed directional ablation, using the DAS (Section 2.2) sentiment direction, at every comma in each prompt, regardless of position. This resulted in an 18% drop in the logit difference and an 18% drop in zero-shot classification accuracy, indicating that nearly 50% of the model’s sentiment-direction-mediated ability to perform the task accurately was mediated via sentiment information at the commas. We find this particularly significant because we did not take any special effort to ensure that commas were placed at the end of sentiment phrases.
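The "nearly 50%" figure appears to follow from comparing the comma-only ablation with the all-token baseline on accuracy. Spelling out the ratios from the reported numbers (our reading of the arithmetic, not an additional result):

```latex
\[
\frac{\text{accuracy drop (commas only)}}{\text{accuracy drop (all tokens)}}
  = \frac{18\%}{38\%} \approx 0.47,
\qquad
\frac{\text{logit-difference drop (commas only)}}{\text{logit-difference drop (all tokens)}}
  = \frac{18\%}{71\%} \approx 0.25 .
\]
```

The share attributable to commas is therefore close to half when measured by accuracy and about a quarter when measured by logit difference.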

Separately from the above, we performed mean ablation at all comma positions as in Section 2.3, replacing each comma activation vector with the mean comma activation from the entire dataset in a layerwise fashion. Note that this changes the entire activation on the comma token, not just the activation in the sentiment direction. This resulted in a 17% drop in logit difference and an accuracy drop of 19%.
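A layerwise mean ablation restricted to one token type can be sketched as follows. The array layout, the function name and the token id `COMMA_ID` are all hypothetical; in practice the id would be looked up in the model's tokenizer and the activations would come from cached forward passes.

```python
import numpy as np

COMMA_ID = 11  # hypothetical token id for ","; look it up in the real tokenizer

def mean_ablate_token(resid_by_layer, token_ids, token_id=COMMA_ID):
    """Replace activations at every `token_id` position with that layer's mean
    activation over all such positions in the dataset.
    resid_by_layer: (n_layers, batch, pos, d_model); token_ids: (batch, pos)."""
    mask = token_ids == token_id
    ablated = resid_by_layer.copy()
    if not mask.any():
        return ablated
    for layer in range(resid_by_layer.shape[0]):
        # One mean vector per layer, computed only over the selected positions.
        ablated[layer][mask] = resid_by_layer[layer][mask].mean(axis=0)
    return ablated

rng = np.random.default_rng(0)
token_ids = np.array([[5, 11, 7, 11], [11, 3, 4, 9]])   # toy prompts
resid = rng.normal(size=(3, 2, 4, 8))                    # toy cached activations
mask = token_ids == COMMA_ID
ablated = mean_ablate_token(resid, token_ids)
```

Unlike directional ablation, this overwrites the whole activation vector at the comma, so any comma-specific information, sentiment or otherwise, that varies between prompts is removed.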

4.4 The big picture of summarization

We have identified a phenomenon across multiple models and tasks where sentiment information is not directly transferred from valenced tokens to the final output but is first aggregated at intermediate, non-valenced tokens like commas and periods (and sometimes noun tokens for specific referents). We call this behavior the “summarization motif.” These summarization points serve as partial information bottlenecks and are causally significant for the model’s performance on sentiment tasks. Through a series of ablation and patching experiments, we have validated the importance of this summarization behavior in both toy tasks and real-world datasets like the Stanford Sentiment Treebank. Additional findings suggest that as models grow in context length, the importance of this internal summarization behavior may increase, a subject that warrants further investigation. Overall, the discovery of this summarization behavior adds a new layer of complexity to our understanding of how sentiment is processed and represented in LLMs, and seems likely to be an important part of how LLMs create internal world representations.

Related Work

Understanding the emotional valence in text data is one of the first NLP tasks to be revolutionized by deep learning (Socher et al., 2013) and remains a popular task for benchmarking NLP models (Rosenthal et al., 2017; Nakov et al., 2016; Potts et al., 2021; Abraham et al., 2022). For a review of the literature, see Pang & Lee (2008), Liu (2012) and Grimes (2014).

This research was inspired by the field of Mechanistic Interpretability, an agenda which aims to reverse-engineer the learned algorithms inside models (Olah et al., 2020; Elhage et al., 2021b; Nanda et al., 2023a). Exploring representations (Section 3) and world-modelling behavior inside transformers has garnered significant recent interest. This was studied in the context of synthetic game-playing models by Li et al. (2023) and evidence of linearity was demonstrated by Nanda (2023b) in the same context. Other work studying examples of world-modelling inside neural networks includes Li et al. (2021); Patel & Pavlick (2022); Abdou et al. (2021). Another framing of a very similar line of inquiry is the search for latent knowledge (Christiano et al., 2021; Burns et al., 2022). Prior to the transformer, representations of emotion were studied in Goh et al. (2021) and sentiment was studied by Radford et al. (2017); notably, the latter found a sentiment neuron, which implies a linear representation of sentiment. A linear representation of truth in LLMs was found by Marks & Tegmark (2023).

Our study of the summarization motif (Section 4) follows from the search for information bottlenecks in models (Li et al., 2021). Our use of the word ‘motif’, in the style of Olah et al. (2020), is originally inspired by systems biology (Alon, 2006). The idea of exploring representations at different frequencies or levels of abstraction was explored further in Tamkin et al. (2020). Information storage after the relevant token was observed in how GPT2-small predicts gender (Mathwin et al., ).

We approach our experiments from a causal mediation analysis perspective. Our approach to identifying computational subgraphs that utilize feature representations is inspired by the ‘circuits analysis’ framework (Stefan Heimersheim, 2023; Varma et al., 2023; Hanna et al., 2023), especially the tools of mean ablation and activation patching (Vig et al., 2020; Geiger et al., 2021; 2023a; Meng et al., 2023; Wang et al., 2022; Conmy et al., 2023; Chan et al., 2023; Cohen et al., 2023). We use Distributed Alignment Search (Geiger et al., 2023b) in order to apply these ideas to specific subspaces.

Conclusion

The two central novel findings of this research are the existence of a linear representation of sentiment and the use of summarization to store sentiment information. We have seen that the sentiment direction is causal and central to the circuitry of sentiment processing. Remarkably, this direction is so stark in the residual stream space that it can be found even with the most basic methods and on a tiny toy dataset, yet generalises to diverse natural language datasets from the real world. Summarization is a motif present in larger models with longer context lengths and greater proficiency in zero-shot classification. These summaries present a tantalising glimpse into the world-modelling behavior of transformers.

We also see this research as a model for how to find and study the representation of a particular feature. Whereas in dictionary learning (Bricken et al., 2023) we enumerate a large set of features which we then need to interpret, here we start with an interpretable feature and subsequently verify that a representation of this feature exists in the model, analogously to Zou et al. (2023). One advantage of this is that our fitting process is much more efficient: we can use toy datasets and very simple fitting methods. It is therefore very encouraging to see that the results of this process generalise well to the full data distribution, and indeed we focus on providing a variety of experiments to strengthen the case for the existence of our hypothesised direction.

Did we find a truly universal sentiment direction, or merely the first principal component of directions used across different sentiment tasks? As found by Bricken et al. (2023), we suspect that this feature could be “split” further into more specific sentiment features.

Similarly, one might wonder if there is really a single bipolar sentiment direction or if we have simply found the difference between a “positive” and a “negative” sentiment direction. It turns out that this distinction is not well-defined, given that we find empirically that there is a direction corresponding to “valenced words”. Indeed, if ${\bm{x}}$ is the valence direction and ${\bm{y}}$ is the sentiment direction, then ${\bm{p}}={\bm{x}}+{\bm{y}}$ represents positive sentiment and ${\bm{n}}={\bm{x}}-{\bm{y}}$ represents negative sentiment. Conversely, we can start from the positive/negative directions ${\bm{p}}$ and ${\bm{n}}$, and then re-derive ${\bm{x}}=\frac{{\bm{p}}+{\bm{n}}}{2}$ and ${\bm{y}}=\frac{{\bm{p}}-{\bm{n}}}{2}$.
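The equivalence of the two framings can be checked by direct substitution of the definitions of ${\bm{p}}$ and ${\bm{n}}$ given above; the short derivation below adds nothing beyond that algebra:

```latex
\begin{align*}
\frac{{\bm{p}}+{\bm{n}}}{2}
  &= \frac{({\bm{x}}+{\bm{y}})+({\bm{x}}-{\bm{y}})}{2}
   = \frac{2{\bm{x}}}{2} = {\bm{x}}, \\
\frac{{\bm{p}}-{\bm{n}}}{2}
  &= \frac{({\bm{x}}+{\bm{y}})-({\bm{x}}-{\bm{y}})}{2}
   = \frac{2{\bm{y}}}{2} = {\bm{y}} .
\end{align*}
```

The pairs $({\bm{x}},{\bm{y}})$ and $({\bm{p}},{\bm{n}})$ therefore span the same two-dimensional subspace, which is why neither description is privileged over the other.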

Many of our causal abstractions do not explain 100% of sentiment task performance. There is likely circuitry we’ve missed, possibly as a result of distributed representations or superposition (Elhage et al., 2022) across components and layers. This may also be a result of self-repair behavior (Wang et al., 2022; McGrath et al., 2023). Patching experiments conducted on more diverse sentence structures could also help to better isolate the circuitry for sentiment from more task-specific machinery.

The use of small datasets versus many hyperparameters and metrics poses a constant risk of gaming our own measures. Our results on the larger and more diverse SST dataset, and the consistent results across a range of models help us to be more confident in our results.

Distributed Alignment Search (DAS) outperformed the other methods on most of our metrics but presents possible dangers of overfitting to a particular dataset and taking the activations out of distribution (Lange et al., 2023). We include simpler tools such as Logistic Regression as a sanity check on our findings. Ideally, we would love to see a set of best practices to avoid such illusions.

The summarization motif emerged naturally during our investigation of sentiment, but we would be very interested to study it in a broader range of contexts and understand what other factors of a particular model or task may influence the use of summarization.

When studying the circuitry of sentiment, we focused almost exclusively on attention heads rather than MLPs. However, early results suggest that further investigation of the role of MLPs and individual neurons is likely to yield interesting results (Section A.5).

Finally, we see the long-term goal of this line of research as being able to help detect dangerous computation in language models such as deception. Even if the existence of a single “deception direction” in activation space seems a bit naive to postulate, hopefully in the future many of the tools developed here will help to detect representations of deception or of knowledge that the model is concealing, helping to prevent possible harms from LLMs.

Author Contributions

Oskar and Curt made equal contributions to this paper. Curt’s focus was on circuit analysis and he discovered the summarization motif, leading to Section 4. Oskar was focused on investigating the direction and eventually conducted enough independent experiments to convince us that the direction was causally meaningful, leading to Section 3. Neel was our mentor as part of SERI MATS; he suggested the initial project brief and provided considerable mentorship during the research. He also did the neuron analysis in Section A.5. Atticus acted as a secondary source of mentorship and guidance. His advice was particularly useful as someone with more of a background in causal mediation analysis. He suggested the use of Stanford Sentiment Treebank and the discrete accuracy metric.

Acknowledgments

SERI MATS provided funding, lodging and office space for 2 months in Berkeley, California. The transformer-lens package (Nanda & Bloom, 2022) was indispensable for this research. We are very grateful to Alex Tamkin for his extensive feedback. Other valuable feedback came from Georg Lange, Alex Makelov and Bilal Chughtai. Atticus Geiger is supported by a grant from Open Philanthropy.

Reproducibility Statement

To facilitate reproducibility of the results presented in this paper, we have provided detailed descriptions of the datasets, models, training procedures, algorithms, and analysis techniques used. The ToyMovieReview dataset is fully specified in Section A.7. We use publicly available models including GPT-2 and Pythia, with details on the specific sizes provided in Section 2.1. The methods for finding sentiment directions are described in full in Section 2.2. Our causal analysis techniques of activation patching, ablation, and directional patching are presented in Section 2.3. Circuit analysis details are extensively covered for two examples in Appendix Section A.3. The code for data generation, model training, and analyses is available here.

References

Appendix A

In Section 2.2, we outline just a few of the many possible techniques for determining a direction which hopefully corresponds to sentiment. Is it overly optimistic to presume the existence of such a direction? The most basic requirement for such a direction to exist is that the residual stream space is clustered. We confirm this in two different ways.

First, we fit 2-D PCA to the token embeddings for a set of 30 positive and 30 negative adjectives. In Figure A.1, we see that the positive adjectives (blue dots) are very well clustered compared to the negative adjectives (red dots). Moreover, we see that sentiment words which are out-of-sample with respect to the PCA (squares) also fall naturally into the cluster of their appropriate color. This applies not just to unseen adjectives (Figure A.1(a)) but also to verbs, an entirely out-of-distribution class of word (Figure A.1(b)).
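The out-of-sample check described above can be sketched as follows. This is a minimal illustration, not the code used for Figure A.1: synthetic vectors stand in for real token embeddings, and the helper names (`fit_pca_2d`, `project`, `fake_embeddings`) are hypothetical.

```python
import numpy as np

def fit_pca_2d(embeddings):
    """Fit a 2-D PCA; returns the mean and the top-2 principal axes."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:2]

def project(embeddings, mean, axes):
    """Project embeddings onto a previously fitted PCA basis."""
    return (embeddings - mean) @ axes.T

# Synthetic stand-ins (assumption): real inputs would be rows of the
# model's token embedding matrix for the chosen adjectives and verbs.
rng = np.random.default_rng(0)
d_model = 64
sentiment_axis = rng.normal(size=d_model)
sentiment_axis /= np.linalg.norm(sentiment_axis)

def fake_embeddings(n, sign):
    noise = rng.normal(scale=0.3, size=(n, d_model))
    return noise + sign * 2.0 * sentiment_axis

# 30 "positive" and 30 "negative" words are used to fit the PCA.
train = np.vstack([fake_embeddings(30, +1), fake_embeddings(30, -1)])
mean, axes = fit_pca_2d(train)

# Held-out words are projected with the already-fitted basis, never refit.
held_out_pos = project(fake_embeddings(10, +1), mean, axes)
held_out_neg = project(fake_embeddings(10, -1), mean, axes)
```

The key design point is that the held-out words play no part in fitting the mean or the axes, so any separation they show in the 2-D plot is not an artifact of the fit.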

Secondly, we evaluate the accuracy of 2-means clustering fitted on the Simple Movie Review Continuation adjectives (Section 2.1). The fact that we can classify in-sample is not very strong evidence, but we verify that we can also classify out-of-sample with respect to the $K$-means fitting process. Indeed, even on held-out adjectives and on the verb tokens (which are totally out of distribution), we find that the accuracy is generally very strong across models. We also evaluate on a fully out-of-distribution toy dataset (“simple adverbs”) of the form “The traveller [adverb] walked to their destination. The traveller felt very”. The results can be found in Figure A.2. This strongly suggests that we have found a genuine representation of sentiment.
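A minimal sketch of the 2-means evaluation, under stated assumptions: the data is synthetic, the clustering is plain Lloyd's algorithm with a deterministic initialisation (the paper does not specify its implementation), and all function names are hypothetical. Clusters are named using in-sample labels only, and accuracy is then measured on held-out points.

```python
import numpy as np

def assign(x, centroids):
    """Index of the nearest centroid for each row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def fit_two_means(x, n_iter=20):
    """Lloyd's algorithm with K=2; returns the two centroids."""
    # Deterministic init: the first point and the point farthest from it.
    first = x[0]
    far = x[np.argmax(np.linalg.norm(x - first, axis=1))]
    centroids = np.stack([first, far]).astype(float)
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for k in range(2):
            if np.any(labels == k):
                centroids[k] = x[labels == k].mean(axis=0)
    return centroids

# Synthetic stand-ins for residual-stream activations (assumption).
rng = np.random.default_rng(0)
d_model = 64
sentiment_axis = rng.normal(size=d_model)
sentiment_axis /= np.linalg.norm(sentiment_axis)

def fake_activations(n, sign):
    noise = rng.normal(scale=0.3, size=(n, d_model))
    return noise + sign * 2.0 * sentiment_axis

train = np.vstack([fake_activations(30, +1), fake_activations(30, -1)])
train_is_pos = np.array([True] * 30 + [False] * 30)
centroids = fit_two_means(train)

# Name the clusters using the in-sample labels only.
pos_cluster = int(assign(train, centroids)[train_is_pos].mean() > 0.5)

# Evaluate on points that were never seen during fitting.
held_out = np.vstack([fake_activations(10, +1), fake_activations(10, -1)])
held_out_is_pos = np.array([True] * 10 + [False] * 10)
predicted_pos = assign(held_out, centroids) == pos_cluster
held_out_accuracy = float(np.mean(predicted_pos == held_out_is_pos))
```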

A.1.2 Activation addition

We perform activation addition (Turner et al., 2023) on GPT2-small for a single positive simple movie review continuation prompt (from Section 2.1) in order to flip the generated outputs from positive to negative. The “steering coefficient” is the multiple of the sentiment direction which we add to the first layer residual stream. The outputs are extremely negative by the time we reach coefficient -17 and we observe a gradual transition for intermediate coefficients (Figure A.3).
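The arithmetic behind the steering coefficient can be sketched as below. This is an illustration on random arrays, not the experiment itself: the function name is hypothetical, and normalising the direction to unit length is an assumption rather than something stated in the text. In practice the addition would be applied to the residual stream inside a forward hook at the chosen layer.

```python
import numpy as np

def add_steering_vector(resid, direction, coefficient):
    """Add coefficient * unit(direction) at every position.

    resid: (n_positions, d_model) residual-stream activations.
    direction: (d_model,) sentiment direction (normalised here by assumption).
    """
    unit = direction / np.linalg.norm(direction)
    return resid + coefficient * unit

rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 8))      # toy sizes, not GPT2-small's
direction = rng.normal(size=8)
unit = direction / np.linalg.norm(direction)

# The projection onto the direction shifts by exactly the coefficient,
# while components orthogonal to the direction are untouched.
shifts = {}
for coefficient in (0.0, -5.0, -17.0):
    steered = add_steering_vector(resid, direction, coefficient)
    shifts[coefficient] = (steered - resid) @ unit
```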

A.1.3 Multi-lingual sentiment

We use the first few paragraphs of Harry Potter in English and French as a standard text (Elhage et al., 2021b). We find that intermediate layers of pythia-2.8b demonstrate intuitive sentiment activations for the French text (Figure A.4). It is important to note that none of the models are very good at French, but this was the smallest model where we saw hints of generalisation to other languages. The representation was not evident in the first couple of layers, probably due to the poor tokenization of French words.

A.1.4 Interpretability of negations

We visualise the sentiment activations for all 12 layers of GPT2-small simultaneously on the prompt “You never fail. Don’t doubt it. I am not uncertain” (Figure A.5). This allows us to observe how fail, doubt and uncertain shift from negative to positive sentiment during the forward pass of the model.

A.2 Is sentiment really a hyperplane?

In our directional patching experiments, we have somewhat artificially selected just 1 dimension as our hypothesised structure for the sentiment subspace. We can perform DAS with any number of dimensions. Figure A.6 demonstrates that whilst increasing the DAS dimension improves the patching metric in-sample (Figure A.6(a)), the metric does not improve out-of-distribution (Figure A.6(b)).
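For intuition, patching a $k$-dimensional subspace can be written as replacing only the component of the original activation that lies inside the subspace. The sketch below assumes an orthonormal basis for that subspace; here it comes from a QR decomposition of random vectors as a stand-in for a basis learned by DAS, and the function name is hypothetical.

```python
import numpy as np

def patch_subspace(x_orig, x_flipped, basis):
    """Swap in x_flipped's component within span(basis), keep the rest.

    basis: (k, d_model) array with orthonormal rows.
    """
    delta = (x_flipped - x_orig) @ basis.T   # difference in subspace coordinates
    return x_orig + delta @ basis

rng = np.random.default_rng(0)
d_model, k = 16, 3
# Orthonormal basis via QR (stand-in for a learned DAS subspace).
q, _ = np.linalg.qr(rng.normal(size=(d_model, k)))
basis = q.T

x_orig = rng.normal(size=d_model)
x_flipped = rng.normal(size=d_model)
x_patched = patch_subspace(x_orig, x_flipped, basis)
```

With `k = 1` this reduces to directional patching along a single unit vector; larger `k` gives the intervention more degrees of freedom, which is consistent with better in-sample metrics without any guarantee of out-of-distribution improvement.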

A.3 Detailed circuit analysis

In order to build a picture of each circuit, we used the process pioneered in Wang et al. (2022):

1. Identify which model components have the greatest impact on the logit difference when path patching is applied (with the final result of the residual stream set as the receiver).

2. Examine the attention patterns (value-weighted, in some cases) and other behaviors of these components (in practice, attention heads) in order to get a rough idea of what function they are performing.

3. Perform path-patching using these heads (or a distinct cluster of them) as receivers.

4. Repeat the process recursively, performing contextual analyses of each “level” of attention heads in order to understand what they are doing, and continuing to trace the circuit backwards.

In each path-patching experiment, change in logit difference is used as the patching metric. We started with GPT-2 as an example of a classic LLM that displays a wide range of behaviors of interest, and moved to larger models when necessary for the task we wanted to study (choosing, in each case, the smallest model that could do the task).

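We examined the circuit performing the task for the following sentence template (the ToyMovieReview template described in Section A.7): “I thought this movie was ADJ, I VRB it. [NEWLINE] Conclusion: This movie is”.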

Using a threshold of 5%-or-greater damage to the logit difference for our patching experiments, we found that GPT-2 Small contained 4 primary heads contributing to the most proximate level of circuit function–10.4, 9.2, 10.1, and 8.5 (using “layer.head” notation). Examining their value-weighted attention patterns, we found that attention to ADJ and VRB in the sentence was most prominent in the first three heads, but 8.5 attended primarily to the second “movie” token. We also observed that 9.2 attended to this token as well as to ADJ. (Results of activation patching can be seen in Fig. A.7.)
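The thresholding step can be illustrated with a small helper. This is a sketch under assumptions: “damage” is taken here to mean the fractional drop in logit difference relative to the clean run, the function name is hypothetical, and the numbers are toy values chosen for the example rather than results from the paper.

```python
def select_heads(patched_logit_diff, clean_logit_diff, threshold=0.05):
    """Return ((layer, head), damage) pairs with damage >= threshold.

    patched_logit_diff: dict mapping (layer, head) to the logit difference
                        observed when that head is patched.
    Damage is the fractional drop relative to the clean logit difference.
    """
    selected = []
    for (layer, head), patched in patched_logit_diff.items():
        damage = (clean_logit_diff - patched) / clean_logit_diff
        if damage >= threshold:
            selected.append(((layer, head), damage))
    # Most damaging heads first.
    selected.sort(key=lambda item: item[1], reverse=True)
    return selected

# Toy values only (not measurements): three heads, clean logit diff of 4.0.
toy_patched = {(0, 0): 3.0, (0, 1): 3.9, (1, 3): 3.5}
selected = select_heads(toy_patched, clean_logit_diff=4.0)
```

Lowering `threshold` admits heads with weaker effects, which mirrors how repeating the analysis with lower thresholds surfaces additional heads.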

Conducting path-patching with 8.5 and 9.2 as receivers, we identified two heads–7.1 and 7.5–that primarily attend to ADJ and VRB from the “movie” token. We further determined that the output of these heads, when path-patched through 9.2 and 8.5 as receivers, was causally important to the circuit (with patching causing a logit difference shift of 7% and 4% respectively for 7.1 and 7.5). This was not the case for other token positions, which demonstrates that causally relevant information is indeed being specially written to the “movie” position. We thus designated it the SUM token in this circuit, and we label 8.5 a summary-reader head.

Repeating our analysis with lower thresholds yielded more heads with the same behavior but weaker effect sizes, adding 9.10, 11.9, and 6.4 as summary reader, direct sentiment reader, and sentiment summarizer respectively. This gives a total of 9 heads making up the circuit.

A.3.2 Multi-subject mood stories circuit - Pythia 2.8b

We also examined the circuit for this sentence template: Carl hates parties, and avoids them whenever possible. Jack loves parties, and joins them whenever possible. One day, they were invited to a grand gala. Jack feels very [excited/nervous]. We did not attempt to reverse-engineer the entire circuit, but examined it from the perspective of what matters causally for sentiment processing–especially determining to what extent summarization occurred.

Following the same process as with GPT-2 with preference/sentiment-flipped prompts (that is, taking $x_{orig}$ to be “John hates parties,… Mary loves parties,” and $x_{flipped}$ to be “John loves parties,… Mary hates parties”), we initially identified 5 key heads that were most causally important to the logit difference at END: 17.19, 22.5, 14.4, 20.10, and 12.2 (in “layer.head” notation). Examining the value-weighted attention patterns, we observed that the top token receiving attention from END was always the repeated name RNAME (e.g., “John” in “John feels very”) or the “feels” token FEEL, indicating that some summarization may have taken place there.

We also observed that the top token attended to from RNAME and FEEL was in fact the comma at the end of the queried preference phrase (that is, the comma at the end of “John hates parties”). We designate this position COMMASUM.

Interestingly, we observed that most of these heads were multi-functional: that is, they both attended to COMMASUM from RNAME and FEEL, and also attended to RNAME and FEEL from END, producing output in the direction of the logit difference. This is possible because these heads exist at different layers, and later heads can read the summarized information from previous heads as well as writing their own summary information.

Specifically, the direct effect heads were:

Head 17.19 did not attend to commas significantly, but did attend to the periods at the end of each preference sentence in addition to its primary attention to RNAME and FEEL, and did not display COMMASUM-reading behavior.

Head 22.5 attended almost exclusively to FEEL, and did not display COMMASUM-reading behavior.

Other direct effect heads (14.4, 20.10 and 12.2) did show COMMASUM-reading behavior as well as reading from the near-end tokens to produce output in the direction of the logit difference. In each case, we verified with path-patching that information from these positions was causally relevant.

We also found important heads (12.17 being by far the most important) that are only engaged with attending to COMMASUM and producing output at RNAME and FEEL.

We further investigated what circuitry was causally important to task performance mediated through the COMMASUM positions, but did not flesh this out in full detail; after finding initial examples of summarization, we focused on its causal relevance and interaction with the sentiment direction, leaving deeper investigation to future work.

A.4 Additional summarization findings

Though there is overlap between the attention heads involved in the circuitry for processing sentiment from key phrases and that from summarization points, there are also some clear differences, suggesting that the ability to read summaries could be a specific capability developed by the model (rather than the model simply attending to high-sentiment tokens).

As can be seen in Figure A.8, there are distinct groups of attention heads that result in damage to the logit difference in different situations–that is, some react when phrases are patched, some react disproportionately to comma patching, and one head seems to have a strong response for either patching case. This is suggestive of semi-separate summary-reading circuitry, and we hope future work will result in further insights in this direction.

A.5 Neurons writing to sentiment direction in GPT2-small are interpretable

We observed that the cosine similarities of neuron out-directions with the sentiment direction are extremely heavy-tailed (Figure A.10). Thanks to Neuroscope (Nanda, 2023a), we can quickly see whether these neurons are interpretable. Indeed, here are a few examples from the tails of that distribution:

L3N1605 activates on “hesitate” following a negation

Neuron L6N828 seems to be activating on words like “however” or “on the other hand” if they follow something negative

Neuron L5N671 activates on negative words that follow a “not” contraction (e.g. didn’t, doesn’t)

L6N1237 activates strongly on “but” following “not bad”
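The ranking that surfaces neurons like those listed above can be sketched as follows. This is an illustration with random weights and one planted neuron, not GPT2-small's actual MLP weights; the function names are hypothetical, and `w_out` is assumed to hold one output-weight row per neuron.

```python
import numpy as np

def neuron_cosine_similarities(w_out, direction):
    """Cosine similarity of each neuron's out-direction with `direction`.

    w_out: (n_neurons, d_model); row i is what neuron i writes to the
           residual stream per unit of activation.
    direction: (d_model,) sentiment direction.
    """
    unit = direction / np.linalg.norm(direction)
    norms = np.linalg.norm(w_out, axis=1)
    return (w_out @ unit) / norms

def tail_neurons(cosines, n=5):
    """Indices of the n most positively and n most negatively aligned neurons."""
    order = np.argsort(cosines)
    return order[-n:][::-1], order[:n]

rng = np.random.default_rng(0)
d_model, n_neurons = 32, 200
direction = rng.normal(size=d_model)
w_out = rng.normal(size=(n_neurons, d_model))
# Plant one strongly aligned neuron so the tail is non-trivial.
w_out[7] = 3.0 * direction + 0.1 * rng.normal(size=d_model)

cosines = neuron_cosine_similarities(w_out, direction)
most_positive, most_negative = tail_neurons(cosines, n=3)
```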

We take L3N1605, the “not hesitate” neuron, as an extended example and trace backwards through the network using Direct Logit Attribution (this technique decomposes model outputs into the sum of contributions of each component, using the insight from Elhage et al. (2021b) that components are independent and additive). We computed the relative effect of different model components on L3N1605 in the two different cases “I would not hesitate” vs. “I would always hesitate”. The main contributors to this difference are L1H0, L3H10, L3H11 and MLP2. Expanding out MLP2 into individual neurons we find that the contributions to L3N1605 are sparse. For example, L2N1154 activates on words like “don’t”, “not”, “no”, etc. It activates on “not” but not “hesitate” in “I would not hesitate” but activates on “hesitate” in “I would always hesitate”. Visualizing the attention pattern of L1H0 shows that it attends from “hesitate” to the previous token if it is “not”, but not if it is “always”.
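The additive decomposition used in this kind of attribution can be sketched as below. It is a simplified illustration: it assumes a purely linear readout and ignores LayerNorm, the vectors are random, and the component names are labels only. The function name is hypothetical.

```python
import numpy as np

def attribute_to_components(component_outputs, readout):
    """Contribution of each component's output to a linear readout.

    component_outputs: dict mapping a component name to the (d_model,)
                       vector it writes to the residual stream.
    readout: (d_model,) vector, e.g. a neuron's input weights.
    """
    return {name: float(out @ readout)
            for name, out in component_outputs.items()}

rng = np.random.default_rng(0)
d_model = 16
readout = rng.normal(size=d_model)        # stand-in for a neuron's input weights
names = ["embed", "L1H0", "MLP2"]          # labels only; vectors below are random

# One set of component outputs per prompt being compared.
prompt_a = {n: rng.normal(size=d_model) for n in names}
prompt_b = {n: rng.normal(size=d_model) for n in names}

attr_a = attribute_to_components(prompt_a, readout)
attr_b = attribute_to_components(prompt_b, readout)

# Per-component difference between the two prompts.
difference = {n: attr_a[n] - attr_b[n] for n in names}

# Because the readout is linear, the parts sum to the whole.
total_a = float(sum(prompt_a.values()) @ readout)
```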

These anecdotal examples hint at a complex network of machinery for transmitting sentiment information across components of the network, using a single critical axis of the residual stream as a communication channel. We think that exploring these neurons further could be a very interesting avenue of future research, particularly for understanding how the model updates sentiment based on negations, where these neurons seem to play a critical role.

A.6 Detailed description of metrics

Logit Flip: Similar to logit difference, this is the percentage of cases where the logit difference between $T^{\text{positive}}$ and $T^{\text{negative}}$ is inverted after a causal intervention. This is a more discrete measure which is helpful for gauging whether the magnitude of the logit differences is sufficient to actually flip model predictions.

Accuracy: Out of a set of prompts, the percentage for which the logits for the tokens $T^{\text{correct}}$ are greater than those for $T^{\text{incorrect}}$. In practice, each of these sets usually has only one member (e.g., “Positive” and “Negative”).
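The two discrete metrics above can be sketched in a few lines. This is an interpretation under assumptions: “inverted” is read as a change of sign in the logit difference, each token set is taken to have a single member, and the functions return fractions rather than percentages. The function names and the toy numbers are hypothetical.

```python
def logit_flip_rate(diffs_before, diffs_after):
    """Fraction of prompts whose logit difference changes sign.

    diffs_before / diffs_after: per-prompt logit differences
    (positive-token logit minus negative-token logit) before and
    after the causal intervention.
    """
    flips = [before * after < 0
             for before, after in zip(diffs_before, diffs_after)]
    return sum(flips) / len(flips)

def accuracy(correct_logits, incorrect_logits):
    """Fraction of prompts where the correct token's logit is larger."""
    hits = [c > i for c, i in zip(correct_logits, incorrect_logits)]
    return sum(hits) / len(hits)

# Toy values for illustration only.
flip_rate = logit_flip_rate([2.0, 1.0, -0.5, 3.0], [-1.0, 0.5, 0.5, -2.0])
acc = accuracy([1.0, 2.0, 3.0, 0.0], [0.0, 3.0, 1.0, 1.0])
```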

A.7 Toy dataset details

The ToyMovieReview dataset consists of prompts of the form “I thought this movie was ADJ, I VRB it. [NEWLINE] Conclusion: This movie is”. We substituted different adjective and verb tokens into the two variable placeholders to create a prompt for each distinct adjective. We averaged the logit difference across 5 positive and 5 negative completions to determine whether the continuation was positive or negative.

The ToyMoodStories dataset consists of prompts of the form “NAME1 VRB1.1 parties, and VRB1.2 them whenever possible. NAME2 VRB2.1 parties, and VRB2.2 them whenever possible. One day, they were invited to a grand gala. QUERYNAME feels very”. To evaluate the model’s output, we measure the logit difference between the “excited” and “nervous” tokens.

In each case, the two verbs in each sentence will agree in sentiment, and the sentence with NAME1 will always have opposite sentiment to that of NAME2.

Names are sampled from the following list:

Each combination of NAME1, NAME2 and QUERYNAME is included in the dataset (where half the time QUERYNAME matches the first name, and half the time it matches the second). Where necessary for computational tractability, we take a subsample of the first 16 items of this dataset.
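The construction of ToyMoodStories can be sketched as follows. Several details are assumptions made for illustration: the name list is a placeholder built from names that appear in the examples of Section A.3.2 (the full list is not reproduced here), the verb pairs are the ones from the example prompt, both sentiment orderings are generated, and the iteration order is arbitrary.

```python
from itertools import permutations

NAMES = ["John", "Mary", "Carl", "Jack"]   # placeholder list (assumption)
VERBS = {"positive": ("loves", "joins"), "negative": ("hates", "avoids")}
ANSWER = {"positive": "excited", "negative": "nervous"}
TEMPLATE = (
    "{n1} {v11} parties, and {v12} them whenever possible. "
    "{n2} {v21} parties, and {v22} them whenever possible. "
    "One day, they were invited to a grand gala. {query} feels very"
)

def build_prompts(names=NAMES):
    """Return (prompt, expected_completion) pairs for every combination."""
    examples = []
    for n1, n2 in permutations(names, 2):
        # NAME1 and NAME2 always have opposite sentiment.
        for s1, s2 in (("positive", "negative"), ("negative", "positive")):
            # QUERYNAME matches the first name half the time.
            for query, sentiment in ((n1, s1), (n2, s2)):
                prompt = TEMPLATE.format(
                    n1=n1, v11=VERBS[s1][0], v12=VERBS[s1][1],
                    n2=n2, v21=VERBS[s2][0], v22=VERBS[s2][1],
                    query=query,
                )
                examples.append((prompt, ANSWER[sentiment]))
    return examples

examples = build_prompts()
subsample = examples[:16]   # first 16 items, for computational tractability
```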