Neurons in Large Language Models: Dead, N-gram, Positional

Elena Voita, Javier Ferrando, Christoforos Nalmpantis

Introduction

The range of capabilities of language models expands with scale, and at larger scales models become so strong and versatile that a single model can be integrated into various applications and decision-making processes (Brown et al., 2020; Kaplan et al., 2020; Wei et al., 2022; Ouyang et al., 2022; OpenAI, 2023; Anil et al., 2023). This makes it increasingly interesting and important to understand the internal workings of these large language models (LLMs) and, specifically, how they evolve with scale. Unfortunately, scaling also raises the entry threshold for interpretability researchers, since dealing with large models requires a lot of computational resources. In this work, we analyze a family of OPT models up to 66b parameters and deliberately keep our analysis very lightweight so that it can be done using a single GPU.

We focus on neurons inside FFNs, i.e. individual activations in the representation between the two linear layers of the Transformer feedforward blocks (FFNs). Unlike, e.g., neurons in the residual stream, FFN neurons are more likely to represent meaningful features: the elementwise nonlinearity breaks the rotational invariance of this representation and encourages features to align with the basis dimensions (Elhage et al., 2021). When such a neuron is activated, it updates the residual stream by pulling out the corresponding row of the second FFN layer; when it is not activated, it does not update the residual stream (Figure 6). Since OPT models have the ReLU activation function, the notion of “activated” or “not activated” is trivial and means non-zero vs zero. Therefore, we can interpret functions of these FFN neurons in two ways: (i) by understanding when they are activated, and (ii) by interpreting the corresponding updates coming to the residual stream.

First, we find that in the first half of the network, many neurons are “dead”, i.e. they never activate on a large collection of diverse data. Larger models are more sparse in this sense: for example, in the 66b model more than $70\%$ of the neurons in some layers are dead. At the same time, many of the alive neurons in this early part of the network are reserved for discrete features and act as indicator functions for tokens and n-grams: they activate if and only if the input is a certain token or an n-gram. The function of the updates coming from these token detectors to the residual stream is also very surprising: at the same time as they promote concepts related to the potential next token candidate (which is to be expected according to Geva et al. (2021, 2022)), they are explicitly targeted at removing information about the current input, i.e. their triggers. This means that in the bottom-up processing where a representation of the current input token gets gradually transformed into a representation for the next token, current token identity is removed by the model explicitly (rather than ending up implicitly “buried” as a result of additive updates useful for the next token). To the best of our knowledge, this is the first example of mechanisms specialized at removing (rather than adding) information from the residual stream.

Finally, we find that some neurons are responsible for encoding positional information regardless of textual patterns. Similarly to token and n-gram detectors, many of these neurons act as indicator functions of position ranges, i.e. activate for positions within certain ranges and do not activate otherwise. Interestingly, these neurons often collaborate. For example, the second layer of the 125m model has 10 positional neurons whose indicated positional ranges are in agreement: together, they efficiently cover all possible positions and no neuron is redundant. In a broader picture, positional neurons question the key-value memory view of the FFN layers stating that “each key correlates with textual patterns in the training data and each value induces a distribution over the output vocabulary” Geva et al. (2021, 2022). Neurons that rely on position regardless of textual pattern indicate that FFN layers can be used by the model in ways that do not fit the key-value memory view. Overall, we argue that the roles played by these layers are still poorly understood.

Overall, we find neurons that:

are “dead”, i.e. never activate on a large diverse collection of data;

act as token- and n-gram detectors that, in addition to promoting next token candidates, explicitly remove current token information;

encode position regardless of textual content which indicates that the role of FFN layers extends beyond the key-value memory view.

With scale, models have more dead neurons and token detectors and are less focused on absolute position.

Data and Setting

We use OPT (Zhang et al., 2022), a suite of decoder-only pre-trained transformers that are publicly available. We use model sizes ranging from 125M to 66B parameters and take model weights from the HuggingFace model hub (https://huggingface.co/models).

Data.

We use data from diverse sources containing development splits of the datasets used in OPT training as well as several additional datasets. Overall, we used (i) subsets of the validation and test parts of the Pile (Gao et al., 2020), including Wikipedia, DM Mathematics and HackerNews, (ii) Reddit (Baumgartner et al., 2020; Roller et al., 2021; the Pushshift.io Reddit dataset is a previously existing dataset extracted and obtained by a third party that contains preprocessed comments posted on the social network Reddit and hosted by pushshift.io), and (iii) code data from Codeparrot (https://huggingface.co/datasets/codeparrot/codeparrot-clean).

For the experiments in Section 3, when talking about dead neurons, we use several times more data. Specifically, we add more data from Wikipedia, DM Mathematics and Codeparrot, and add new domains from the Pile (https://huggingface.co/datasets/EleutherAI/pile): EuroParl, FreeLaw, PubMed abstracts, Stackexchange.

Overall, the data used in Section 3 contains over 20M tokens; the data used in the rest of the paper contains over 5M tokens.

Single-GPU processing.

We use only sets of neuron values for some data, i.e. we run only forward passes of the full model or of its first several layers. Since large models do not fit on a single GPU, we load one layer at a time, keeping the rest of the layers on the CPU. This allows us to record neuron activations for large models: all the main experiments in this paper were done on a single GPU.

Dead Neurons

Let us start from simple statistics such as neuron activation frequency (Figure 1).

First, we find that many neurons never activate on our diverse data, i.e. they can be seen as “dead”. Figure 1(a) shows that the proportion of dead neurons is very substantial: e.g., for the 66b model, the proportion of dead neurons in some layers is above $70\%$ . We also see that larger models are more sparse because (i) they have more dead neurons and (ii) the ones that are alive activate less frequently (Figure 1(b)).
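The statistics above can be illustrated with a minimal sketch. It assumes that the post-ReLU activations of one layer have already been recorded as a nested list `activations[t][i]` (token position `t`, neuron `i`); the function names and the toy data are hypothetical and are not taken from the paper's code.

```python
from typing import List


def activation_frequency(activations: List[List[float]]) -> List[float]:
    """Fraction of token positions at which each neuron is non-zero."""
    num_tokens = len(activations)
    num_neurons = len(activations[0])
    counts = [0] * num_neurons
    for row in activations:
        for i, value in enumerate(row):
            # ReLU outputs are non-negative, so "activated" means value > 0.
            if value > 0.0:
                counts[i] += 1
    return [c / num_tokens for c in counts]


def dead_neuron_fraction(frequencies: List[float]) -> float:
    """Share of neurons that never activated on the recorded data."""
    return sum(1 for f in frequencies if f == 0.0) / len(frequencies)


# Toy layer with 4 token positions and 3 neurons; neuron 1 never fires.
toy = [
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 1.2],
    [0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
freqs = activation_frequency(toy)
dead = dead_neuron_fraction(freqs)
```

In this toy layer, one neuron out of three is dead, and the two alive neurons fire on half and a quarter of the positions respectively.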

Only first half of the model is sparse.

Next, we notice that this kind of sparsity is specific only to early layers. This leads to a clear distinction between the first and the second halves of the network: while the first half contains a solid proportion of dead neurons, the second half is fully “alive”. Additionally, layers with most dead neurons are the ones where alive neurons activate most rarely.

Packing concepts into neurons.

This difference in sparsity across layers might be explained by the “concept-to-neuron” ratio being much smaller in the early layers than in the higher layers. Intuitively, the model has to represent the set of concepts encoded in a layer by “spreading” them across available neurons. In the early layers, encoded concepts are largely shallow and are likely to be discrete (e.g., lexical), while at the higher layers networks learn high-level semantics and reasoning (Peters et al., 2018; Liu et al., 2019; Jawahar et al., 2019; Tenney et al., 2019; Geva et al., 2021). Since the number of possible shallow patterns is not large and, potentially, enumerable, in the early layers the model can (and, as we will see later, does) assign dedicated neurons to some features. The more neurons are available to the model, the easier it is to do so; this agrees with the results in Figure 1 showing that larger models are more sparse. In contrast, the space of fine-grained semantic concepts is too large compared to the number of available neurons, which makes it hard to reserve many dedicated neuron-concept pairs. (There can, however, be a few specialized neurons in the higher layers. For example, BERT has neurons responsible for relational facts; Dai et al., 2022.)

Are dead neurons completely dead?

Note that the results in Figure 1(a) can mean one of two things: (i) these neurons can never be activated (i.e. they are “completely dead”) or (ii) they correspond to patterns so rare that we never encountered them in our large diverse collection of data. While the latter is possible, note that this does not change the above discussion about sparsity and types of encoded concepts. On the contrary: it further supports the hypothesis of models assigning dedicated neurons to specific concepts.

N-gram-Detecting Neurons

Now, let us look more closely into the patterns encoded in the lower half of the models and try to understand the nature of the sparsity observed above. Specifically, we analyze how neuron activations depend on an input n-gram. For each input text with tokens $x_{1},x_{2},...,x_{S}$ , we record neuron activations at each position and if a neuron is activated (i.e., non-zero) at position $k$ , we say that the n-gram $(x_{k-n+1},\dots,x_{k})$ triggered this neuron.

In Sections 4.1-4.4 we talk about unigrams (i.e., tokens) and come to larger n-grams in Section 4.5.

First, let us see how many n-grams are able to trigger each neuron. For each neuron we evaluate the number of n-grams that cover at least $95\%$ of the neuron’s activations. For the bottom half of the network, Figure 2 shows how neurons in each layer are categorized by the number of n-grams covering them (we show unigrams here and larger n-grams in Appendix A).
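As an illustration of this counting, the sketch below records which n-grams trigger a neuron in one text and then finds how many of the most frequent triggers are needed to reach $95\%$ of the activations. The function names are hypothetical, the greedy most-frequent-first ordering is an assumption about how "covering" n-grams are chosen, and positions with fewer than $n-1$ preceding tokens are simply skipped.

```python
from collections import Counter
from typing import Sequence


def count_triggers(tokens: Sequence[str], activated: Sequence[bool], n: int) -> Counter:
    """Count how often each n-gram ending at position k triggers the neuron."""
    triggers = Counter()
    # Assumption: positions without a full n-gram of context are skipped.
    for k in range(n - 1, len(tokens)):
        if activated[k]:
            triggers[tuple(tokens[k - n + 1 : k + 1])] += 1
    return triggers


def num_covering_ngrams(triggers: Counter, coverage: float = 0.95) -> int:
    """Smallest number of n-grams that together reach `coverage` of activations."""
    total = sum(triggers.values())
    if total == 0:
        return 0  # the neuron never activated on this data
    covered = 0
    for rank, (_, count) in enumerate(triggers.most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return rank
    return len(triggers)


# Toy text: the neuron fires on both occurrences of "the".
tokens = ["the", "cat", "the", "dog"]
activated = [True, False, True, False]
unigram_triggers = count_triggers(tokens, activated, n=1)
bigram_triggers = count_triggers(tokens, activated, n=2)
```

With triggers accumulated over the whole dataset, `num_covering_ngrams` gives the per-neuron quantity by which neurons are binned.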

We see that, as anticipated, neurons in larger models are covered by fewer n-grams. Also, the largest models have a substantial proportion of neurons that are covered by as few as 1 to 5 tokens. This agrees with our hypothesis in the previous section: the model spreads discrete shallow patterns across specifically dedicated neurons. (Note that the 350m model does not follow the same pattern as all the rest: we will discuss this model in Section 6.)

4.2 Token-Detecting Neurons

The presence of neurons that can be triggered by only a few (e.g., 1-5) tokens points to the possibility that some neurons act as token detectors, i.e. activate if and only if the input is one of the corresponding tokens, regardless of the previous context. To find such neurons, we (1) pick neurons that can be triggered by only 1-5 tokens, (2) gather tokens that are covered by this neuron (a token is covered if the neuron activates at least $95\%$ of the time the token is present), and (3) keep the neuron if, altogether, these covered tokens are responsible for at least $95\%$ of the neuron’s activations. We exclude the begin-of-sentence token from these computations because for many neurons, this token is responsible for the majority of the activations.
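The three-step selection can be sketched as follows, assuming per-neuron counts are already available: `triggers[token]` is how often the neuron fired on a token and `occurrences[token]` is how often the token appeared. The names and the `"<bos>"` placeholder for the begin-of-sentence token are hypothetical, and reading step (1) as "at most five tokens reach $95\%$ of activations" is an assumption.

```python
from collections import Counter
from typing import Set


def covered_tokens(triggers: Counter, occurrences: Counter, rate: float = 0.95) -> Set[str]:
    """Tokens on which the neuron fires at least `rate` of the time they occur."""
    return {tok for tok, fired in triggers.items() if fired >= rate * occurrences[tok]}


def is_token_detector(triggers: Counter, occurrences: Counter,
                      bos: str = "<bos>", max_tokens: int = 5,
                      rate: float = 0.95) -> bool:
    # The begin-of-sentence token is excluded from all computations.
    triggers = Counter({t: c for t, c in triggers.items() if t != bos})
    total = sum(triggers.values())
    if total == 0:
        return False
    # Step 1: at most `max_tokens` tokens are needed to reach `rate` of activations.
    running, needed = 0, 0
    for _, count in triggers.most_common():
        running += count
        needed += 1
        if running >= rate * total:
            break
    if needed > max_tokens:
        return False
    # Step 2: tokens that (almost) always activate the neuron when present.
    covered = covered_tokens(triggers, occurrences, rate)
    # Step 3: the covered tokens explain (almost) all of the neuron's activations.
    return sum(triggers[t] for t in covered) >= rate * total


# Toy neuron that fires on two variants of the same word.
detector_triggers = Counter({"cat": 99, " cat": 50, "dog": 1, "<bos>": 1000})
detector_occurrences = Counter({"cat": 100, " cat": 50, "dog": 400, "<bos>": 1000})

# Toy neuron that fires on frequent tokens, but only on a small share of their occurrences.
loose_triggers = Counter({"the": 60, "a": 40})
loose_occurrences = Counter({"the": 1000, "a": 1000})
```

The second toy neuron passes step (1) but fails step (3): it is triggered by few tokens, yet it does not fire reliably whenever they appear, so it is not an indicator function for them.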

Figure 3(a) shows that there are indeed a lot of token-detecting neurons. As expected, larger models have more such neurons and the 66b model has overall 5351 token detectors. Note that each token detector is responsible for a group of several tokens that, in most of the cases, are variants of the same word (e.g., with differences only in capitalization, presence of the space-before-word special symbol, morphological form, etc.). Figure 5 (top) shows examples of groups of tokens detected by token-detecting neurons.

Interestingly, the behavior of the largest models (starting from 13b parameters) differs from that of the rest. While for smaller models the number of token detectors increases and then goes down, larger models operate in three monotonic stages and start having many token-detecting neurons from the very first layer (Figure 3). This already shows qualitative differences between the models: with more capacity, larger models perform more complicated reasoning with more distinct stages.

4.3 Ensemble-Like Behaviour of the Layers

Now, let us look at “detected” tokens, i.e. tokens that have a specialized neuron detecting them. Figure 3(b) shows the number of detected tokens in each layer as well as the number of detected tokens accumulated over layers. We see that, e.g., the 66b model focuses on no more than 1.5k tokens in each layer but over 10k tokens overall. This means that across layers, token-detecting neurons are responsible for largely differing tokens. Indeed, Figure 4 shows that in each following layer, detected tokens mostly differ from all the tokens covered by the layers below. All in all, this points to an ensemble-like (as opposed to sequential) behavior of the layers: layers collaborate so that token-detecting neurons cover largely different tokens in different layers. This divide-and-conquer-style strategy allows larger models to cover many tokens overall and use their capacity more effectively.
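The per-layer and cumulative counts discussed here reduce to simple set operations. The sketch below assumes the detected tokens of each layer have already been collected into sets, ordered from the bottom layer up; the function name and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def layer_statistics(detected_per_layer: List[Set[str]]) -> List[Dict[str, int]]:
    """Per layer: detected tokens, tokens not seen in lower layers, running total."""
    seen: Set[str] = set()
    stats = []
    for tokens in detected_per_layer:
        new = tokens - seen  # tokens not covered by any layer below
        seen |= tokens
        stats.append({"in_layer": len(tokens), "new": len(new), "cumulative": len(seen)})
    return stats


# Toy example with three layers.
layers = [{"a", "b"}, {"b", "c", "d"}, {"e"}]
stats = layer_statistics(layers)
```

Ensemble-like behavior corresponds to `new` staying close to `in_layer` in every layer, so that `cumulative` grows much larger than any single layer's count.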

Originally, such an ensemble-like behavior of deep residual networks was observed in computer vision models Veit et al. (2016). For transformers, previous evidence includes simple experiments showing that e.g. dropping or reordering layers does not influence performance much Fan et al. (2020); Zhao et al. (2021).

4.4 Token Detectors Suppress Their Triggers

Now let us try to understand the role of token-detecting neurons in the model by interpreting how they update the residual stream. Throughout the layers, token representation in the residual stream gets transformed from the token embedding for the current input token (for OPT models, along with an absolute positional embedding) to the representation that encodes a distribution for the next token. This transformation happens via additive updates coming from attention and FFN blocks in each layer. Whenever an FFN neuron is activated, the corresponding row of the second FFN layer (multiplied by this neuron’s value) is added to the residual stream (see illustration in Figure 6). By projecting this FFN row onto vocabulary, we can get an interpretation of this update (and, thus, the role of this neuron) in terms of its influence on the output distribution encoded in the residual stream.
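To make the row-wise view of the FFN update concrete, here is a minimal pure-Python sketch of a ReLU feed-forward block. Biases and layer normalization are omitted for brevity (an assumption of this sketch), and the weight values are toy numbers.

```python
from typing import List, Tuple


def ffn_update(hidden: List[float],
               w_in: List[List[float]],
               w_out: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return neuron activations and the resulting residual-stream update.

    w_in has shape d_model x d_ff; w_out has shape d_ff x d_model, so that
    row i of w_out is the vector contributed by neuron i.
    """
    d_ff = len(w_out)
    pre = [sum(hidden[j] * w_in[j][i] for j in range(len(hidden))) for i in range(d_ff)]
    acts = [max(0.0, v) for v in pre]  # ReLU
    update = [0.0] * len(w_out[0])
    for a, row in zip(acts, w_out):
        if a > 0.0:  # inactive neurons contribute nothing
            for d, w in enumerate(row):
                update[d] += a * w
    return acts, update


hidden = [1.0, -1.0]
w_in = [[1.0, 0.0, 2.0],
        [0.0, 1.0, 3.0]]
w_out = [[0.5, -0.5],
         [9.0, 9.0],
         [7.0, 7.0]]
acts, update = ffn_update(hidden, w_in, w_out)
```

Only the first neuron is active in this toy example, so the update equals the first row of `w_out` scaled by that neuron's activation.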

Previously, this influence was understood only in terms of the top projections, i.e. tokens that are promoted Geva et al. (2021, 2022). This reflects an existing view supporting implicit rather than explicit loss of the current token identity over the course of layers. Namely, the view that the current identity gets “buried” as a result of updates useful for the next token as opposed to being removed by the model explicitly. In contrast, we look not only at the top projections but also at the bottom: if these projections are negative, the corresponding tokens are suppressed by the model (Figure 6).

Explicit token suppression in the model.

We find that token-detecting neurons often deliberately suppress the tokens they detect. Figure 5 shows several examples of token-detecting neurons along with the top promoted and suppressed concepts. While the top promoted concepts are in line with previous work (they are potential next token candidates, which agrees with Geva et al. (2021, 2022)), the top suppressed concepts are rather unexpected: they are exactly the tokens triggering this neuron. This means that vector updates corresponding to these neurons point in the direction of the next token candidates at the same time as they point away from the tokens triggering the neuron. Note that this is not trivial since these updates play two very different roles at the same time. Overall, for over $80\%$ of token-detecting neurons their corresponding updates point in the negative direction from the tokens triggering them (although the triggering tokens are not always at the very top of the suppressed concepts as in the examples in Figure 6).
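A minimal sketch of this analysis is given below. It assumes that a neuron's output row and the output embeddings are available as plain vectors, and it ignores the final layer normalization; the names, the toy vocabulary, and the criterion "every trigger token has a negative projection" are illustrative assumptions rather than the exact procedure behind the reported numbers.

```python
from typing import Dict, List, Sequence, Tuple


def project_onto_vocab(ffn_row: Sequence[float],
                       unembedding: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Dot product of the neuron's output row with each token's output embedding."""
    return {tok: sum(r * e for r, e in zip(ffn_row, emb))
            for tok, emb in unembedding.items()}


def top_and_bottom(scores: Dict[str, float], k: int = 2) -> Tuple[List[str], List[str]]:
    """Most promoted tokens and most suppressed tokens (most extreme first)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:][::-1]


def suppresses_triggers(scores: Dict[str, float], trigger_tokens: Sequence[str]) -> bool:
    """Assumed criterion: every trigger token has a negative projection."""
    return all(scores[t] < 0 for t in trigger_tokens)


# Toy 2-dimensional example for a neuron triggered by the token "cat".
row = [1.0, -1.0]
unembedding = {
    "cat": [-2.0, 0.0],
    " sat": [1.0, -1.0],
    "the": [0.0, 0.5],
    " dog": [0.5, 0.5],
}
scores = project_onto_vocab(row, unembedding)
promoted, suppressed = top_and_bottom(scores, k=2)
```

In this toy setup the same update promotes a plausible continuation (`" sat"`) while giving the trigger token `"cat"` the most negative score.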

Overall, we argue that models can have mechanisms that are targeted at removing information from the residual stream which can be explored further in future work.

4.5 Beyond Unigrams

In Appendix A, we show results for bigrams and trigrams that mirror our observations for unigrams: (i) larger models have more specialized neurons, (ii) in each layer, models cover mostly new n-grams. Interestingly, for larger n-grams we see a more drastic gap between larger and smaller models.

Positional Neurons

When analyzing dead neurons (Section 3), we also noticed some neurons that, consistently across diverse data, never activate except for a few first token positions. This motivates us to look further into how position is encoded in the model and, specifically, whether some neurons are responsible for encoding positional information.

Intuitively, we want to find neurons whose activation patterns are defined by or, at least, strongly depend on token position. Formally, we identify neurons whose activations have high mutual information with position. For each neuron, we evaluate mutual information between two random variables:

$act$ – neuron is activated or not ( $\{Y,N\}$ ),

$pos$ – token position ( $\{1,2,\dots,T\}$ ).

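We gather neuron activations for full-length data (i.e., $T=2048$ tokens) for Wikipedia, DM Mathematics and Codeparrot. Let $fr^{(pos)}_{n}$ be the activation frequency of neuron $n$ at position $pos$ and $fr_{n}$ be the total activation frequency of this neuron. Since all texts are full-length, every position is equally likely, and the desired mutual information is as follows (for more details, see Appendix B.1):

$$I(act,pos)=\frac{1}{T}\sum_{pos=1}^{T}\left[fr^{(pos)}_{n}\log\frac{fr^{(pos)}_{n}}{fr_{n}}+\left(1-fr^{(pos)}_{n}\right)\log\frac{1-fr^{(pos)}_{n}}{1-fr_{n}}\right]$$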
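As a worked illustration, the sketch below computes the mutual information between a binary activation variable and position from per-position activation frequencies. It assumes positions are uniformly distributed (all texts have the same length $T$), takes $fr_{n}$ as the mean of $fr^{(pos)}_{n}$ over positions, uses the convention $0\log 0=0$, and measures information in nats; the logarithm base is an assumption, since it is not stated in the text above.

```python
import math
from typing import Sequence


def mutual_information(fr_pos: Sequence[float]) -> float:
    """I(act, pos) for one neuron, given its activation frequency at each position."""
    T = len(fr_pos)
    fr = sum(fr_pos) / T  # total activation frequency under uniform positions
    if fr <= 0.0 or fr >= 1.0:
        return 0.0  # a constant activation carries no information about position

    def term(p: float, q: float) -> float:
        # Convention: 0 * log(0 / q) = 0.
        return 0.0 if p <= 0.0 else p * math.log(p / q)

    return sum(term(f, fr) + term(1.0 - f, 1.0 - fr) for f in fr_pos) / T


# A perfect indicator of "first half of the positions" carries log(2) nats.
indicator_mi = mutual_information([1.0] * 4 + [0.0] * 4)

# A neuron whose activation frequency does not depend on position carries none.
flat_mi = mutual_information([0.3] * 8)
```

The two toy cases bracket the behavior of interest: a neuron acting as an indicator function of a position range has high mutual information with position, while a neuron with position-independent frequency has (numerically) zero.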

Choosing the neurons.

We pick neurons with $I(act,pos)>0.05$ , i.e. high mutual information with position; this gives neurons whose activation frequency depends on position rather than content. Indeed, if e.g. a neuron is always activated within a certain position range regardless of data domain, we can treat this neuron as responsible for position, at least to a certain extent.

5.2 Types of Positional Neurons

After selecting positional neurons, we categorize them according to their activation pattern, i.e. activation frequency depending on position (Figure 7).

Oscillatory.

These neurons are shown in purple in Figure 7. When the oscillatory pattern is strong (top row), the activation pattern is an indicator function of position ranges. In other words, such a neuron is activated if and only if the position falls into a certain set. Note that since the activation pattern does not change across data domains, it is defined solely by position and not by the presence of some lexical or semantic information.

Both types of activation extremes.

These are the neurons whose activation pattern is not oscillatory but still has intervals where activation frequency reaches both “activation extremes”: 0 (never activated) and 1 (always activated). Most frequently, such a neuron is activated only for positions less than or greater than some value and not activated otherwise. Similarly to oscillatory neurons, when such a pattern is strong (Figure 7, top row), it is also (almost) an indicator function.

Only one type of activation extremes.

Differently from the previous two types, activation patterns for these neurons can reach only one of the extreme values 0 or 1 (Figure 7, green). While this means that they never behave as indicator functions, there are position ranges where a neuron being activated or not depends solely on token position.

Other.

Finally, these are the neurons whose activation patterns strongly depend on position but do not have intervals where activation frequency stays 0 or 1 (Figure 7, yellow). Typically, these activation patterns have lower mutual information with position than the previous three types.

Strong vs weak pattern.

We also distinguish “strong” and “weak” versions of each type, which we will further denote with color intensity (Figure 7, top vs bottom rows). For the first three types of positional neurons, the difference between strong and weak patterns lies in whether, on the corresponding position ranges, activation frequency equals 0 (or 1) or is close, but not equal, to 0 (or 1). For the last type, this difference lies in how well we can predict activation frequency at a certain position knowing this value for the neighboring positions (informally, “thin” vs “thick” graph).
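A rough sketch of the extremes-based part of this categorization is shown below. It only checks whether a neuron's per-position activation frequency stays at 0 and/or at 1 over some interval; it does not separate oscillatory from non-oscillatory patterns, for which no formal criterion is given here. The minimum interval length `min_len` and the tolerance `eps` (0 for "strong", small positive for "weak") are hypothetical parameters chosen for illustration.

```python
from typing import Callable, Sequence


def has_run(freqs: Sequence[float], predicate: Callable[[float], bool], min_len: int) -> bool:
    """True if `predicate` holds on at least `min_len` consecutive positions."""
    run = 0
    for f in freqs:
        run = run + 1 if predicate(f) else 0
        if run >= min_len:
            return True
    return False


def extremes_category(freqs: Sequence[float], eps: float = 0.0, min_len: int = 16) -> str:
    """Categorize an activation pattern by which activation extremes it reaches."""
    low = has_run(freqs, lambda f: f <= eps, min_len)         # (almost) never activated
    high = has_run(freqs, lambda f: f >= 1.0 - eps, min_len)  # (almost) always activated
    if low and high:
        return "both extremes"
    if low or high:
        return "one extreme"
    return "other"


strong_both = extremes_category([1.0] * 20 + [0.0] * 20)
one_sided = extremes_category([0.0] * 20 + [0.5] * 20)
neither = extremes_category([0.5] * 40)
weak_strict = extremes_category([0.98] * 20 + [0.02] * 20)
weak_tolerant = extremes_category([0.98] * 20 + [0.02] * 20, eps=0.05)
```

The last two calls show the strong vs weak distinction: the same pattern reaches no extreme under the exact test but reaches both once a small tolerance is allowed.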

5.3 Positional Neurons Across the Models

For each of the models, Figure 8 illustrates the positional neurons across layers.

First, we notice that smaller models rely substantially on oscillatory neurons: this is the most frequent type of positional neurons for models smaller than 6.7b parameters. In combination with many “red” neurons acting as indicator functions for wider position ranges, the model is able to derive a token’s absolute position rather accurately. Interestingly, larger models do not have oscillatory neurons and rely on more generic patterns shown with red- and green-colored circles. We can also see that from 13b to 66b, the model loses two-sided red neurons and uses the one-sided green ones more. This hints at one of the qualitative differences between smaller and larger models: while the former encode absolute position more accurately, the latter are likely to rely on something more meaningful than absolute position. This complements recent work showing that absolute position encoding is harmful for length generalization in reasoning tasks (Kazemnejad et al., 2023). Unlike their experiments with the same model size but various positional encodings, we track changes with scale. We see that, despite all models being trained with absolute positional encodings, stronger models tend to abstract away from absolute position.

Positional neurons work in teams.

Interestingly, positional neurons seem to collaborate to cover the full set of positions together. For example, let us look more closely at the 10 strongly oscillatory neurons in the second layer of the 125m model (shown with dark purple circles in Figure 8). Since they act as indicator functions, we can plot position ranges indicated by each of these neurons. Figure 9 shows that (i) indicated position ranges for these neurons are similar up to a shift, (ii) the shifts are organized in a “perfect” order in the sense that altogether, these ten neurons efficiently cover all positions such that none of these neurons is redundant.

The two stages within the model.

Finally, Figure 8 reveals two stages of ups and downs of positional information within the model: roughly, the first third of the model and the rest. Interestingly, preferences in positional patterns also change between the stages: e.g., preference for “red” neurons changes to oscillatory purple patterns for the 1.3b and 2.7b models, and “red” patterns become less important in the upper stage for the 13b and 30b models. Note that the first third of the model corresponds to the sparse stage with the dead neurons and n-gram detectors (Sections 3, 4). Therefore, we can hypothesize that in these two stages, positional information is first used locally to detect shallow patterns, and then more globally to use longer contexts and help encode semantic information.

Previously, distinct bottom-up stages of processing inside language models were observed in Voita et al. (2019a). The authors explained that the way representations gain and lose information throughout the layers is defined by the training objective and, among other things, why positional information should (and does) get lost. This agrees with our results in this work: we can see that while there are many positional patterns in the second stage, they are weaker than in the first stage.

5.4 Positional Neurons are Learned Even Without Positional Encoding

Recently, it turned out that even without positional encoding, autoregressive language models still learn positional information Haviv et al. (2022). We hypothesize that the mechanism these “NoPos” models use to encode position is positional neurons. To confirm this, we train two versions of the 125m model, with and without positional encodings, and compare the types of their positional neurons.

We trained 125m models with the standard OPT setup but a smaller training dataset: we used the OpenWebText corpus (Gokaslan and Cohen, 2019), an open clone of the GPT-2 training data (Radford et al., 2019). This dataset contains 3B tokens (compared to 180B for OPT).

Positional neurons without positional encoding.

Figure 10 shows positional neurons in two 125m models: trained with and without positional encoding. We see that, indeed, the model without positional encoding also has many strong positional patterns. Note, however, that the NoPos model does not have oscillatory neurons which, in combination with other positional neurons, allow encoding absolute position rather accurately. This means that the NoPos model relies on more generic patterns, e.g. “red” neurons encoding whether a position is greater/less than some value.

Oscillatory neurons require longer training.

Finally, we found that oscillatory patterns appear only with long training. Figure 11 shows positional patterns learned by the baseline 125m model trained for 50k, 150k and 300k training batches. We see that all models have very strong positional patterns, but only the last of them has oscillatory neurons. Apparently, learning absolute position requires longer training time.

5.5 Doubting FFNs as Key-Value Memories

The current widely held belief is that feed-forward layers in transformer-based language models operate as key-value memories. Specifically, “each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary” (Geva et al. (2021, 2022); Dai et al. (2022); Meng et al. (2022); Ferrando et al. (2023), among others). While in Section 4.4 we confirmed that this is true for some of the neurons, results in this section reveal that FFN layers can be used by the model in ways that do not fit the key-value memory view. In particular, activations of strong positional neurons are defined by position regardless of textual content, and the corresponding values do not seem to encode meaningful distributions over vocabulary. This means that the role of these neurons is different from matching textual patterns to sets of the next token candidates. In a broader context, this means that the roles played by Transformer feed-forward layers are still poorly understood.

The 350m Model: The Odd One Out

As we already mentioned above, the 350m model does not follow the same pattern as the rest of the models. Specifically, it does not have dead neurons (Section 3) and its neuron activations do not seem to be sparse with respect to the n-grams triggering them, as we saw for all the other models in Figure 2 (there are, however, positional neurons; see Figure 16 in Appendix B.2).

This becomes less surprising when noticing that the 350m model is implemented differently from all the rest: it applies LayerNorm after attention and feed-forward blocks, while all the other models apply it before (https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py). Apparently, such seemingly minor implementation details can affect interpretability of model components rather significantly. Indeed, previous work also tried choosing certain modeling aspects to encourage interpretability. Examples of such work include choosing an activation function to increase the number of interpretable neurons (Elhage et al., 2022), a large body of work on sparse softmax variants to make output distributions or attention more interpretable (Martins and Astudillo (2016); Niculae and Blondel (2017); Peters et al. (2019); Correia et al. (2019); Martins et al. (2020), among others), or more extreme approaches with explicit modular structure that is aimed to be interpretable by construction (Andreas et al. (2016); Hu et al. (2018); Kirsch et al. (2018); Khot et al. (2021), to name a few). Intuitively, choosing the ReLU activation function as done in the OPT models can be seen as having the same motivation as developing sparse softmax variants: exact zeros in the model are inherently interpretable.

Additional Related Work

Historically, neurons have been a basic unit of analysis. Early works started from convolutional networks first for images Krizhevsky et al. (2012) and later for convolutional text classifiers Jacovi et al. (2018). Similar to our work, Jacovi et al. (2018) also find n-gram detectors; although, for small convolutional text classifiers this is an almost trivial observation compared to large Transformer-based language models as in our work. For recurrent networks, interpretable neurons include simple patterns such as line lengths, brackets and quotes Karpathy et al. (2015), sentiment neuron Radford et al. (2017) and various neurons in machine translation models, such as tracking brackets, quotes, etc, as well as neurons correlated with higher-level concepts e.g. verb tense Bau et al. (2019). For Transformer-based BERT, Dai et al. (2022) find that some neurons inside feed-forward blocks are responsible for storing factual knowledge. Larger units of analysis include attention blocks (Voita et al. (2018, 2019b); Clark et al. (2019); Kovaleva et al. (2019); Baan et al. (2019); Correia et al. (2019), etc), feed-forward layers Geva et al. (2021, 2022) and circuits responsible for certain tasks Wang et al. (2022); Geva et al. (2023); Hanna et al. (2023).

Acknowledgements

The authors thank Nicola Cancedda, Yihong Chen, Igor Tufanov and FAIR London team for fruitful discussions and helpful feedback.

References

Appendix A N-gram-Detecting Neurons

Figure 12 shows how neurons in each layer are categorized by the number of bigrams covering them, and Figure 13 shows the same for trigrams. As expected, neurons in larger models are covered by fewer n-grams.

A.2 Trigram-Detecting Neurons

Similarly to token-detecting neurons in Section 4.2, we also find neurons that are specialized on 3-grams. Specifically, we (1) pick neurons that are covered by only 1-50 trigrams, (2) gather trigrams that are covered by this neuron (a trigram is covered if the neuron activates at least $95\%$ of the time the trigram is present), and (3) keep the neuron if, altogether, these covered trigrams are responsible for at least $95\%$ of the neuron’s activations.

Figure 14 shows the results. Overall, the results further support our main observations: larger models have more neurons responsible for n-grams. Interestingly, when looking at trigrams rather than tokens, at 30b parameters we see a drastic jump in the number of covered n-grams. This indicates that one of the qualitative differences between larger and smaller models lies in the expansion of the families of features they are able to represent.

A.3 Ensemble-Like Layer Behavior

Figure 15 shows the number of covered trigrams in each layer. We see that in each layer, models cover largely new trigrams.

Appendix B Positional Neurons

For each neuron, we evaluate mutual information between two random variables: