Analyzing Transformers in Embedding Space

Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant

Introduction

Transformer-based models [Vaswani et al., 2017] currently dominate Natural Language Processing [Devlin et al., 2018; Radford et al., 2019; Zhang et al., 2022] as well as many other fields of machine learning [Dosovitskiy et al., 2020; Chen et al., 2020; Baevski et al., 2020]. Consequently, understanding their inner workings has been a topic of great interest. Typically, work on interpreting Transformers relies on feeding inputs to the model and analyzing the resulting activations [Adi et al., 2016; Shi et al., 2016; Clark et al., 2019]. Thus, interpretation involves an expensive forward pass, and sometimes also a backward pass, over multiple inputs. Moreover, such interpretation methods are conditioned on the input and are not guaranteed to generalize to all inputs. In the evolving literature on static interpretation, i.e., without forward or backward passes, Geva et al. [2022b] showed that the value vectors of the Transformer feed-forward module (the second layer of the feed-forward network) can be interpreted by projecting them into the embedding space, i.e., multiplying them by the embedding matrix to obtain a representation over vocabulary items. (We refer to the unique items of the vocabulary as vocabulary items, and to the possibly duplicate elements of a tokenized input as tokens. When clear, we might use the term token for vocabulary item.) Elhage et al. have shown that in a 2-layer attention network, weight matrices can be interpreted in the embedding space as well. Unfortunately, their innovative technique could not be extended any further.

In this work, we extend and unify the theory and findings of Elhage et al. and Geva et al. [2022b]. We present a zero-pass, input-independent framework to understand the behavior of Transformers. Concretely, we interpret all weights of a pretrained language model (LM) in embedding space, including both keys and values of the feed-forward module (Geva et al. [2020, 2022b] considered just FF values) as well as all attention parameters (Elhage et al. analyzed simplified architectures up to two layers of attention with no MLPs).

Our framework relies on a simple observation. Since Geva et al. [2022b] have shown that one can project hidden states to the embedding space via the embedding matrix, we intuit this can be extended to other parts of the model by projecting to the embedding space and then projecting back by multiplying with a right-inverse of the embedding matrix. Thus, we can recast inner products in the model as inner products in embedding space. Viewing inner products this way, we can interpret such products as interactions between pairs of vocabulary items. This applies to (a) interactions between attention queries and keys as well as to (b) interactions between attention value vectors and the parameters that project them at the output of the attention module. Taking this perspective to the extreme, one can view Transformers as operating implicitly in the embedding space. This entails the existence of a single linear space that depends only on the tokenizer, in which parameters of different Transformers can be compared. Thus, one can use the embedding space to compare and transfer information across different models that share a tokenizer.

We provide extensive empirical evidence for the validity of our framework, focusing mainly on GPT-2 medium [Radford et al., 2019]. We use GPT-2 for two reasons. First, we do this for concreteness, as this paper is mainly focused on introducing the new framework and not on analyzing its predictions. Second, and more crucially, unlike many other architectures (such as BERT [Devlin et al., 2018], RoBERTa [Liu et al., 2019], and T5 [Raffel et al., 2019]), the GPT family has a linear language modeling head (LM head) – which is simply the output embedding matrix. All the other architectures’ LM heads are two layer networks that contain non-linearities before the output embedding matrix. Our framework requires a linear language modeling head to work. That being said, we believe in practice this will not be a major obstacle, and we indeed see in the experiments that model alignment works well for BERT in spite of the theoretical difficulties. We leave the non-linearities in the LM head for future work.

On the interpretation front (Fig. 1, Left), we provide qualitative and quantitative evidence that Transformer parameters can be interpreted in embedding space. We also show that when fine-tuning GPT-2 on a sentiment analysis task (over movie reviews), projecting changes in parameters into embedding space yields words that characterize sentiment towards movies. Second (Fig. 1, Center), we show that given two distinct instances of BERT pretrained from different random seeds [Sellam et al., 2022], we can align layers of the two instances by casting their weights into the embedding space. We find that indeed layer i of the first instance aligns well to layer i of the second instance, showing the different BERT instances converge to a semantically similar solution. Last (Fig. 1, Right), we take a model fine-tuned on a sentiment analysis task and “transfer” the learned weights to a different model that was only pretrained by going through the embedding spaces of the two models. We show that in 30% of the cases, this procedure, termed stitching, results in a classifier that reaches an impressive accuracy of 70% on the IMDB benchmark [Maas et al., 2011] without any training.

Overall, our findings suggest that analyzing Transformers in embedding space is valuable both as an interpretability tool and as a way to relate different models that share a vocabulary and that it opens the door to interpretation methods that operate in embedding space only. Our code is available at https://github.com/guyd1995/embedding-space.

Background

We now present the main components of the Transformer [Vaswani et al., 2017] relevant to our analysis. We discuss the residual stream view of Transformers, and recapitulate a view of the attention layer parameters as interaction matrices $W_{\text{VO}}$ and $W_{\text{QK}}$ [Elhage et al., 2021]. Similar to them, we exclude biases and layer normalization from our analysis.

The Transformer consists of a stack of layers, each including an attention module followed by a Feed-Forward (FF) module. All inputs and outputs are sequences of $N$ vectors of dimensionality $d$.

The Residual Stream

We rely on a useful view of the Transformer through its residual connections, popularized by Elhage et al. and originally introduced by nostalgebraist. Specifically, each layer takes a hidden state as input and adds information to the hidden state through its residual connection. Under this view, the hidden state is a residual stream passed along the layers, from which information is read, and to which information is written at each layer. Elhage et al. and Geva et al. [2022b] observed that the residual stream is often barely updated in the last layers, and thus the final prediction is determined in early layers and the hidden state is mostly passed through the later layers.

An exciting consequence of the residual stream view is that we can project hidden states in every layer into embedding space by multiplying the hidden state with the embedding matrix $E$ , treating the hidden state as if it were the output of the last layer. Geva et al. [2022a] used this approach to interpret the prediction of Transformer-based language models, and we follow a similar approach.
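As a minimal sketch of this projection, the snippet below multiplies a hidden state (a row vector of length $d$) by a toy embedding matrix $E$ of shape $d\times e$ and reads off the highest-scoring vocabulary items. The helper names `project_to_vocab` and `top_k_tokens`, the four-word vocabulary, and all numbers are illustrative assumptions, not part of any released code.

```python
def project_to_vocab(hidden, E):
    """Return the row-vector product hidden @ E, i.e. one score per vocabulary item.

    hidden: list of d floats; E: d x e matrix given as a list of d rows.
    """
    d, e = len(E), len(E[0])
    return [sum(hidden[r] * E[r][c] for r in range(d)) for c in range(e)]


def top_k_tokens(scores, vocab, k):
    """Return the k vocabulary items with the largest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Toy setup (assumed values): d = 2 hidden dimensions, e = 4 vocabulary items.
vocab = ["cat", "dog", "car", "road"]
E = [
    [1.0, 0.9, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.8],
]
hidden = [0.2, 1.0]

scores = project_to_vocab(hidden, E)
print(top_k_tokens(scores, vocab, k=2))
```

With a real model, `E` would be the output embedding matrix and `hidden` a residual-stream state taken at any layer.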

Importantly, $W_{\text{QK}}^{i},W_{\text{VO}}^{i}$ are input-independent. Intuitively, $W_{\text{QK}}^{i}$ encodes the amount of attention between pairs of tokens. Similarly, in $W_{\text{VO}}^{i}$, the matrices $W_{\text{V}}$ and $W_{\text{O}}$ can be viewed together as a transition matrix that determines how attending to certain tokens affects the subsequent hidden state.

We can restate the attention equations in terms of the interaction matrices. Recall (Eq. 1) that the output of the $i$ ’th head of the attention module is $A^{i}V_{\text{att}}^{i}$ and the final output of the attention module is (without the residual connection):
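For reference, the restated output can be sketched as follows. This assumes the convention of Elhage et al. [2021], in which $W_{\text{VO}}^{i}:=W_{\text{V}}^{i}W_{\text{O}}^{i}$ is the product of the value and output slices belonging to head $i$, $V_{\text{att}}^{i}=XW_{\text{V}}^{i}$, $X$ is the input to the attention module, and $H$ is the number of heads. The symbols $X$ and $H$ and the per-head slicing are assumptions introduced here for illustration.

```latex
\text{Attn}(X)
  = \text{Concat}\left[A^{1}V_{\text{att}}^{1},\dots,A^{H}V_{\text{att}}^{H}\right]W_{\text{O}}
  = \sum_{i=1}^{H}A^{i}\left(XW_{\text{V}}^{i}\right)W_{\text{O}}^{i}
  = \sum_{i=1}^{H}A^{i}XW_{\text{VO}}^{i}
```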

Similarly, the attention map $A^{i}$ at the $i$ ’th head in terms of $W_{\text{QK}}$ is (softmax is done row-wise):
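A sketch of this expression, under the same assumptions as before: $W_{\text{QK}}^{i}:=W_{\text{Q}}^{i}W_{\text{K}}^{i\,\textrm{T}}$, $X$ is the input to the attention module, $H$ is the number of heads, and $M$ is the attention mask. The scaling factor $\sqrt{d/H}$ and the mask $M$ follow the standard Transformer formulation and are assumptions rather than details stated in the surrounding text.

```latex
A^{i}
  = \text{softmax}\left(\frac{(XW_{\text{Q}}^{i})(XW_{\text{K}}^{i})^{\textrm{T}}}{\sqrt{d/H}}+M\right)
  = \text{softmax}\left(\frac{XW_{\text{QK}}^{i}X^{\textrm{T}}}{\sqrt{d/H}}+M\right)
```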

Parameter Projection

In this section, we propose that Transformer parameters can be projected into embedding space for interpretation purposes. We empirically support our framework’s predictions in §4-§5.

Zooming in on this operation, we see that it takes the previous hidden state in the embedding space ( $\hat{X}$ ) and produces an output in the embedding space which will be incorporated into the next hidden state through the residual stream. Thus, $E^{\prime}W_{\text{VO}}^{i}E$ is a transition matrix that takes a representation of the embedding space and outputs a new representation in the same space.
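Written out, the transition-matrix view amounts to the identity below. It is a sketch that assumes an exact right-inverse $E^{\prime}$ with $EE^{\prime}=I$ and the notation $\hat{X}=XE$, where $X$ denotes the hidden states fed to the head; it holds only approximately once $E^{\prime}$ is replaced by an inexact inverse.

```latex
XW_{\text{VO}}^{i}E
  = \left(XEE^{\prime}\right)W_{\text{VO}}^{i}E
  = \hat{X}\left(E^{\prime}W_{\text{VO}}^{i}E\right)
```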

Similarly, the matrix $W_{\text{QK}}^{i}$ can be viewed as a bilinear map (Eq. 2.3). To interpret it in embedding space, we perform the following operation with $E^{\prime}$ :
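One way to write this operation, assuming an exact right-inverse $E^{\prime}$ with $EE^{\prime}=I$ and $\hat{X}=XE$ (with $X$ the hidden states entering the head), is sketched below. The middle factor is the $e\times e$ matrix over pairs of vocabulary items.

```latex
XW_{\text{QK}}^{i}X^{\textrm{T}}
  = \left(XEE^{\prime}\right)W_{\text{QK}}^{i}\left(XEE^{\prime}\right)^{\textrm{T}}
  = \hat{X}\left(E^{\prime}W_{\text{QK}}^{i}E^{\prime\,\textrm{T}}\right)\hat{X}^{\textrm{T}}
```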

Therefore, the interaction between tokens at different positions is determined by an $e\times e$ matrix that expresses the interaction between pairs of vocabulary items.

Overall, FF keys and values are intimately connected – the $i$ -th key controls the coefficient of the $i$ -th value, so we expect their interpretation to be related. While not central to this work, we empirically show that key-value pairs in the FF module are similar in embedding space in Appendix B.1.

Choosing $E^{\prime}=E^{\textrm{T}}$. In practice, we do not use an exact right inverse (e.g. the pseudo-inverse). We use the transpose of the embedding matrix $E^{\prime}=E^{\textrm{T}}$ instead. The pseudo-inverse does not work well because for interpretation we apply a top-$k$ operation after projecting to embedding space (since it is impractical for humans to read through a sorted list of $50K$ tokens). So, we only keep the list of the vocabulary items that have the $k$ largest logits, for manageable values of $k$. In Appendix A, we explore the exact requirements for $E^{\prime}$ to interact well with top-$k$. We show that the top $k$ entries of a vector projected with the pseudo-inverse do not represent the entire vector well in embedding space. We define keep-$k$ robust invertibility to quantify this. It turns out that empirically $E^{\textrm{T}}$ is a decent keep-$k$ robust inverse for $E$ in the case of GPT-2 medium (and similar models) for plausible values of $k$. We refer the reader to Appendix A for details.

To give intuition as to why $E^{\textrm{T}}$ works in practice, we switch to a different perspective, useful in its own right. Consider the FF keys for example – they are multiplied on the left by the hidden states. In this section, we suggested recasting this as $h^{T}K=(h^{T}E)(E^{\prime}K)$. Our justification was that the hidden state is interpretable in the embedding space. A related perspective (dominant in previous works too; e.g. Mickus et al.) is thinking of the hidden state as an aggregation of interpretable updates to the residual stream. That is, schematically, $h=\sum_{i=1}^{k}\alpha_{i}r_{i}$, where $\alpha_{i}$ are scalars and $r_{i}$ are vectors corresponding to specific concepts in the embedding space (we roughly think of a concept as a list of tokens related to a single topic). Inner product is often used as a similarity metric between two vectors. If the similarity between a column $K_{i}$ and $h$ is large, the corresponding $i$-th output coordinate will be large. Then we can think of $K$ as a detector of concepts where each neuron (column in $K$) lights up if a certain concept (or a superposition of concepts) is “present” in the inner state. To understand which concepts each detector column encodes, we see which tokens it responds to. Doing this for all (input) token embeddings and packaging the inner products into a vector of scores is equivalent to simply multiplying by $E^{\textrm{T}}$ on the left (where $E$ is the input embedding in this case, but for GPT-2 they are the same). A similar argument can be made for the interaction matrices as well. For example, for $W_{\text{VO}}$, to understand if a token embedding $e_{i}$ maps to $e_{j}$ under a certain head, we apply the matrix to $e_{i}$, getting $e_{i}^{T}W_{\text{VO}}$, and use the inner product as a similarity metric to get the score $e_{i}^{T}W_{\text{VO}}e_{j}$.

Interpretability Experiments

In this section, we provide empirical evidence for the viability of our approach as a tool for interpreting Transformer parameters. For our experiments, we use Huggingface Transformers (Wolf et al.; License: Apache-2.0).

Attention Module. We take GPT-2 medium (345M parameters; Radford et al.) and manually analyze its parameters. GPT-2 medium has a total of 384 attention heads (24 layers and 16 heads per layer). We take the embedded transition matrices $E^{\prime}W_{\text{VO}}^{i}E$ for all heads and examine the top-$k$ pairs of vocabulary items. As there are only 384 heads, we manually choose a few heads and present the top-$k$ pairs in Appendix C.1 ($k=50$). We observe that different heads capture different types of relations between pairs of vocabulary items, including word parts, gender, geography, orthography, particular part-of-speech tags, and various semantic topics. In Appendix C.2 we perform a similar analysis for $W_{\text{QK}}$. We supplement this analysis with a few examples from GPT-2 base and large (117M and 762M parameters, respectively) as a proof of concept, which similarly present interpretable patterns.
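The top-$k$ pair extraction can be sketched on toy data as below, using $E^{\prime}=E^{\textrm{T}}$. The helpers `matmul`, `transpose`, and `top_k_pairs`, the three-item vocabulary, and the matrices are hypothetical stand-ins chosen for illustration. The sketch reports pairs as (row item, column item) and takes no position on which of the two is the input token.

```python
import heapq


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def top_k_pairs(M, vocab, k):
    """Return the k largest entries of the e x e matrix M as (row item, column item, score)."""
    entries = ((M[i][j], i, j) for i in range(len(M)) for j in range(len(M[0])))
    return [(vocab[i], vocab[j], score) for score, i, j in heapq.nlargest(k, entries)]


# Toy setup (assumed values): d = 2, e = 3.
vocab = ["a", "b", "c"]
E = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]
W_vo = [
    [0.0, 2.0],
    [1.0, 0.5],
]

# Embedded transition matrix E^T W_VO E, of shape e x e.
M = matmul(matmul(transpose(E), W_vo), E)
for pair in top_k_pairs(M, vocab, k=2):
    print(pair)
```

For a real vocabulary of roughly 50K items the full $e\times e$ matrix is large, so a practical implementation would likely compute it in blocks; that detail is outside this sketch.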

A technical note: $W_{\text{VO}}$ operates on row vectors, which means it operates in a “transposed” way to standard intuition – which places inputs on the left side and outputs on the right side. It does not affect the theory, but when visualizing the top- $k$ tuples, we take the transpose of the projection $(E^{\prime}W_{\text{VO}}^{i}E)^{\textrm{T}}$ to get the “natural” format (input token, output token). Without the transpose, we would get the same tuples, but in the format (output token, input token). Equivalently, in the terminology of linear algebra, it can be seen as a linear transformation that we represent in the basis of row vectors and we transform to the basis of column vectors, which is the standard one.

FF Module. Appendix C.3 provides examples of key-value pairs from the FF modules of GPT-2 medium. We show random pairs $(k,v)$ from the set of those pairs such that when looking at the top-100 vocabulary items for $k$ and $v$, at least 15% overlap. Such pairs account for approximately 5% of all key-value pairs. The examples show how key-value pairs often revolve around similar topics such as media, months, organs, etc. We again include additional examples from GPT-2 base and large.

Knowledge Lookup. Last, we show we can use embeddings to locate FF values (or keys) related to a particular topic. We take a few vocabulary items related to a certain topic, e.g., [‘cm’, ‘kg’, ‘inches’], average their embeddings (we subtract the average embedding $\mu$ from $E$ before averaging, which improves interpretability), and rank all FF values (or keys) based on their dot-product with the average. Appendix C.4 shows a few examples of FF values found with this method that are related to programming, measurements, and animals.
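A minimal sketch of this lookup on toy data is given below. The function name `rank_ff_values`, the two-dimensional embeddings, and the labels `dim_0` to `dim_2` are hypothetical; only the recipe (center the embeddings, average the seed items, rank by dot product) follows the description above.

```python
def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rank_ff_values(seed_tokens, token_embeddings, ff_values):
    """Rank FF value vectors by dot product with the averaged, mean-centered seed embeddings."""
    mu = mean_vector(list(token_embeddings.values()))  # average embedding over the vocabulary
    centered = [
        [x - m for x, m in zip(token_embeddings[tok], mu)] for tok in seed_tokens
    ]
    query = mean_vector(centered)
    scores = {
        name: sum(q * v for q, v in zip(query, vec)) for name, vec in ff_values.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy setup (assumed values): 2-dimensional embeddings for five vocabulary items.
token_embeddings = {
    "cm": [1.0, 0.0],
    "kg": [0.8, 0.2],
    "inches": [0.9, 0.1],
    "cat": [0.0, 1.0],
    "dog": [0.1, 0.9],
}
ff_values = {
    "dim_0": [1.0, 0.0],
    "dim_1": [0.0, 1.0],
    "dim_2": [0.5, 0.5],
}

ranking = rank_ff_values(["cm", "kg", "inches"], token_embeddings, ff_values)
print(ranking)
```

The same function applies to FF keys by passing key vectors instead of value vectors.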

Hidden State and Parameters

One merit of zero-pass interpretation is that it does not require running inputs through the model. Feeding inputs might be expensive and non-exhaustive. In this section and in this section only, we run a forward pass over inputs and examine if the embedding space representations of dynamically computed hidden states are “similar” to the representations of the activated static parameter vectors. Due to the small number of examples we run over, the overall GPU usage is still negligible.

A technical side note: we use GPT-2, which applies LayerNorm to the Transformer output before projecting it to the embedding space with $E$ . Thus, conservatively, LayerNorm should be considered as part of the projection operation. Empirically, however, we observe that projecting parameters directly without LayerNorm works well, which simplifies our analysis in §3. Unlike parameters, we apply LayerNorm to hidden states before projection to embedding space to improve interpretability. This nuance was also present in the code of Geva et al. [2022a].

The $R_{k}$ score is designed to capture whether activated parameter vectors cover the main vocabulary items corresponding to the hidden state.

Figure 2 presents the $R_{k}$ score averaged across tokens per layer. As a baseline, we compare $R_{k}$ of the activated vectors $\{\hat{x}_{i}\}_{i=1}^{m}$ of the correctly-aligned hidden state $\hat{h}$ at the output of the relevant layer (blue bars) against the $R_{k}$ when randomly sampling $\hat{h}_{\text{rand}}$ from all the hidden states (orange bars). We conclude that representations in embedding space induced by activated parameter vectors mirror, at least to some extent, the representations of the hidden states themselves. Appendix §B.2 shows a variant of this experiment, where we compare activated parameters throughout GPT-2 medium’s layers to the last hidden state, which produces the logits used for prediction.
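To make the coverage idea concrete, the sketch below computes a recall-style score on toy vectors. It assumes that $R_{k}$ is the fraction of the hidden state's top-$k$ vocabulary items that also appear in the top-$k$ of at least one activated parameter vector; this reading, the function names, and the numbers are assumptions made for illustration.

```python
def top_k_indices(vec, k):
    """Indices of the k largest entries of vec, as a set."""
    return set(sorted(range(len(vec)), key=lambda i: vec[i], reverse=True)[:k])


def coverage_at_k(param_projections, hidden_projection, k):
    """Fraction of the hidden state's top-k items covered by the parameters' top-k items.

    Assumed reading of R_k: |top-k(h) intersected with the union of top-k(x_i)| / k.
    """
    covered = set().union(*(top_k_indices(x, k) for x in param_projections))
    return len(top_k_indices(hidden_projection, k) & covered) / k


# Toy projections over a 6-item vocabulary (assumed values).
hidden_hat = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
params_hat = [
    [0.0, 9.0, 0.0, 0.0, 8.0, 7.0],
    [9.0, 0.0, 0.0, 8.0, 0.0, 7.0],
]

score = coverage_at_k(params_hat, hidden_hat, k=3)
print(score)
```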

Interpretation of Fine-tuned Models

We now show that we can interpret the changes a model goes through during fine-tuning through the lens of embedding space. We fine-tune the top-3 layers of the 12-layer GPT-2 base (117M parameters) with a sequence classification head on IMDB sentiment analysis (binary classification) and compute the difference between the original parameters and the fine-tuned model. We then project the difference of parameter vectors into embedding space and test if the change is interpretable w.r.t. sentiment analysis.

Appendix D shows examples of projected differences randomly sampled from the fine-tuned layers. Frequently, the difference or its negation is projected to nouns, adjectives, and adverbs that express sentiment for a movie, such as ‘amazing’, ‘masterpiece’, ‘incompetence’, etc. This shows that the differences are indeed projected into vocabulary items that characterize movie reviews’ sentiments. This behavior is present across $W_{\text{Q}},W_{\text{K}},W_{\text{V}},K$ , but not $V$ and $W_{\text{O}}$ , which curiously are the parameters added to the residual stream and not the ones that react to the input directly.

Aligning Models in Embedding Space

The assumption that Transformers operate in embedding space leads to an exciting possibility – we can relate different models to one another so long as they share the vocabulary and tokenizer. In §5.1, we show that we can align the layers of BERT models trained with different random seeds. In §5.2, we show the embedding space can be leveraged to “stitch” the parameters of a fine-tuned model to a model that was not fine-tuned.

Taking our approach to the extreme, the embedding space is a universal space, which depends only on the tokenizer, in which Transformer parameters and hidden states reside. Thus, we can align parameter vectors from different models in this space and compare them even if they come from different models, as long as they share a vocabulary.

Last, to obtain a one-to-one layer alignment, we use the Hungarian algorithm [Kuhn, 1955], which assigns exactly one layer from the first model to a layer from the second model. The algorithm’s objective is to maximize, given a similarity matrix $\mathcal{S}$ , the sum of scores of the chosen pairs, such that each index in one model is matched with exactly one index in the other. We repeat this for all parameter groups ( $W_{\text{Q}},W_{\text{K}},W_{\text{V}},W_{\text{O}},K$ ).
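The assignment objective can be illustrated on a small similarity matrix. The sketch below is not the Hungarian algorithm: it substitutes an exhaustive search over permutations, which returns the same optimum but is only feasible for a handful of layers. For realistic sizes one would likely call `scipy.optimize.linear_sum_assignment(S, maximize=True)`. The function name and the matrix values are assumptions.

```python
from itertools import permutations


def best_one_to_one_alignment(S):
    """Return p such that layer i of model A is matched with layer p[i] of model B.

    Brute-force stand-in for the Hungarian algorithm: maximizes the sum of
    S[i][p[i]] over all one-to-one assignments (factorial cost, small n only).
    """
    n = len(S)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(S[i][p[i]] for i in range(n)),
    )
    return list(best)


# Toy similarity matrix between 4 layers of model A (rows) and model B (columns).
S = [
    [0.9, 0.2, 0.1, 0.0],
    [0.3, 0.8, 0.2, 0.1],
    [0.1, 0.3, 0.45, 0.5],
    [0.0, 0.1, 0.6, 0.7],
]

alignment = best_one_to_one_alignment(S)
print(alignment)
```

Note that a per-row argmax would map both of the last two rows to column 3; the one-to-one constraint resolves this conflict in favor of the diagonal.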

Figure 3 (left) shows the resulting alignment. Clearly, parameters from a certain layer in model $A$ tend to align to the same layer in model $B$ across all parameter groups. This suggests that different layers from different models that were trained separately (but with the same training objective and data) serve a similar function. As further evidence, we show in Figure 3 (right) that if the parameters are not projected, the matching appears essentially random. We show the same results for other seed pairs as well in Appendix B.3.

Zero-shot Stitching

Model stitching [Lenc and Vedaldi, 2015; Csiszárik et al., 2021; Bansal et al., 2021] is a relatively under-explored feature of neural networks, particularly in NLP. The idea is that different models, even with different architectures, can learn representations that can be aligned through a linear transformation, termed stitching. Representations correspond to hidden states, and thus one can learn a transformation matrix from one model’s hidden states to an equivalent hidden state in the other model. Here, we show that going through embedding space one can align the hidden states of two models, i.e., stitch, without training.

Stitching produces models with accuracies that are higher than random on the IMDB evaluation set, but not consistently. Figure 4 shows the accuracy of stitched models against the layer index from model $A$ over which stitching is performed. Out of 11 random seeds, three models obtained accuracy that is significantly higher than the baseline 50% accuracy, reaching an accuracy of roughly 70%, when stitching is done over the top layers.

Related Work

Interpreting Transformers is a broad area of research that has attracted much attention in recent years. A large body of work has focused on analyzing hidden representations, mostly through probing [Adi et al., 2016; Shi et al., 2016; Tenney et al., 2019; Rogers et al., 2020]. Voita et al. [2019a] used statistical tools to analyze the evolution of hidden representations throughout layers. Recently, Mickus et al. proposed to decompose the hidden representations into the contributions of different Transformer components. Unlike these works, we interpret parameters rather than the hidden representations.

Another substantial effort has been to interpret specific network components. Previous work analyzed single neurons [Dalvi et al., 2018; Durrani et al., 2020], attention heads [Clark et al., 2019; Voita et al., 2019b], and feedforward values [Geva et al., 2020; Dai et al., 2021; Elhage et al., 2022]. While these works mostly rely on input-dependent neuron activations, we inspect “static” model parameters, and provide a comprehensive view of all Transformer components.

Our work is most related to efforts to interpret specific groups of Transformer parameters. Cammarata et al. made observations about the interpretability of weights of neural networks. Elhage et al. analyzed 2-layer attention networks. We extend their analysis to multi-layer pre-trained Transformer models. Geva et al. [2020, 2022a, 2022b] interpreted feedforward values in embedding space. We coalesce these lines of work and offer a unified interpretation framework for Transformers in embedding space.

Discussion

While our work has limitations (see §8), we think its benefits outweigh them. We provide a simple approach and a new set of tools to interpret Transformer models and compare them. The realm of input-independent interpretation methods is still nascent and it might provide a fresh perspective on the internals of the Transformer, one that allows us to glimpse intrinsic properties of specific parameters, disentangling their dependence on the input. Moreover, many models are prohibitively large for practitioners to run. Our method requires only a fraction of the compute and memory requirements, and allows interpreting a single parameter in isolation.

Importantly, our framework allows us to view parameters from different models as residents of a canonical embedding space, where they can be compared in model-agnostic fashion. This has interesting implications. We demonstrate two consequences of this observation (model alignment and stitching) and argue future work can yield many more use cases.

Limitations

Our work has a few limitations that we care to highlight. First, it focuses on interpreting models through the vocabulary lens. While we have shown evidence for this, it does not preclude other factors from being involved. Second, we used $E^{\prime}=E^{\textrm{T}}$, but future research may find variants of $E$ that improve performance. Additionally, most of the work focused on GPT-2. This is due to shortcomings in the current state of our framework, as well as for clear presentation. We believe non-linearities in the language modeling head are resolvable, as is indicated in the experiment with BERT.

In terms of potential bias in the framework, some parameters might consider terms to be related to each other due to stereotypes learned from the corpus.

References

Appendix A Rethinking Interpretation

The process of interpreting a vector $v$ in Geva et al. [2022b] proceeds in two steps: first the projection of the vector to the embedding space ( $vE$ ); then, we use the list of the tokens that were assigned the largest values in the projected vector, i.e.: $\texttt{top-k}(vE)$ , as the interpretation of the projected vector. This is reasonable since (a) the most activated coordinates contribute the most when added to the residual stream, and (b) this matches how we eventually decode: we project to the embedding space and consider the top-1 token (or one of the few top tokens, when using beam search).

This is a stronger notion of inverse – not only is $EE^{\prime}\approx I$ , but even when truncating the vector in the embedding space we can still reconstruct it with $E^{\prime}$ .
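The truncation involved can be sketched as follows. The snippet assumes that keep-$k$ zeroes every coordinate except the $k$ largest by value, and compares the exact inner product $x^{\textrm{T}}y$ with its truncated counterpart $\texttt{keep-k}(x^{\textrm{T}}E)\,\texttt{keep-k}(E^{\prime}y)$ on a toy $E$ whose transpose happens to be an exact right-inverse. All names and values are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def keep_k(vec, k):
    """Zero every entry except the k largest ones (assumed: largest by value)."""
    keep = set(sorted(range(len(vec)), key=lambda i: vec[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def truncated_inner_product(x, y, E, E_prime, k):
    """Compute keep-k(x^T E) . keep-k(E' y) for E of shape d x e and E' of shape e x d."""
    x_proj = [dot(x, col) for col in zip(*E)]   # x^T E, length e
    y_proj = [dot(row, y) for row in E_prime]   # E' y, length e
    return dot(keep_k(x_proj, k), keep_k(y_proj, k))


# Toy setup (assumed values): d = 2, e = 3, and E E^T = I, so E^T is an exact right-inverse.
E = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
E_T = [list(col) for col in zip(*E)]
x, y = [3.0, 1.0], [2.0, 1.0]

print(dot(x, y))                                    # exact inner product
print(truncated_inner_product(x, y, E, E_T, k=2))   # nothing lost at k = 2
print(truncated_inner_product(x, y, E, E_T, k=1))   # truncation drops one term
```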

We claim that $E^{\textrm{T}}$ is a decent instantiation of $E^{\prime}$ and provide some empirical evidence. While a substantive line of work [Ethayarajh, 2019; Gao et al., 2019; Wang et al., 2020; Rudman et al., 2021] has shown that embedding matrices are not isotropic (an isotropic matrix $E$ has to satisfy $EE^{\textrm{T}}=\alpha I$ for some scalar $\alpha$), we show that it is isotropic enough to make $E^{\textrm{T}}$ a legitimate compromise. We randomly sample 300 vectors drawn from the normal distribution $\mathcal{N}(0,1)$, and compute for every pair $x,y$ the cosine similarity between $x^{\textrm{T}}y$ and $\texttt{keep-k}(x^{\textrm{T}}E)\texttt{keep-k}(E^{\prime}y)$ for $k=1000$, and then average over all pairs. We repeat this for $E^{\prime}\in\{E^{+},E^{\textrm{T}}\}$ and obtain a score of $0.10$ for $E^{+}$, and $0.83$ for $E^{\textrm{T}}$, showing that $E^{\textrm{T}}$ is better when using top-$k$. More globally, we compare $E^{\prime}\in\{E^{+},E^{\textrm{T}}\}$ for $k\in\{10,50,100,200,300,500\}$ with three distributions:

- $x,y$ drawn from the normal $\mathcal{N}(0,1)$ distribution.

- $x,y$ drawn from hidden states along Transformer computations.

In Figure 5 we show the results, where dashed lines represent $E^{+}$ and solid lines represent $E^{\textrm{T}}$ . The middle row shows the plots for GPT-2 medium, which is the main concern of this paper. For small values of $k$ (which are more appropriate for interpretation), $E^{\textrm{T}}$ is superior to $E^{+}$ across all distributions. Interestingly, the hidden state distribution is the only distribution where $E^{+}$ has similar performance to $E^{\textrm{T}}$ . Curiously, when looking at higher values of $k$ the trend is reversed ( $k=\{512,1024,2048,4096,10000,15000,20000,30000\}$ ) - see Figure 5 (Right).

This settles the deviation from findings showing embedding matrices are not isotropic, as we see that indeed as $k$ grows, $E^{\textrm{T}}$ becomes an increasingly bad approximate right-inverse of the embedding matrix. The only distribution that keeps high performance with $E^{\textrm{T}}$ is the hidden state distribution, which is an interesting direction for future investigation.

For completeness, we provide the same analysis for GPT-2 base and large in Figure 5. We can see that GPT-2 base gives similar conclusions. GPT-2 large, however, seems to show a violent zigzag movement for $E^{+}$ but for most values it seems to be superior to $E^{\textrm{T}}$ . It is however probably best to use $E^{\textrm{T}}$ since it is more predictable. This zigzag behavior is very counter-intuitive and we leave it for future work to decipher.

Appendix B Additional Material

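We define the following metric, applied to vectors after projecting them into the embedding space:

$$\text{Sim}_{k}(\hat{x},\hat{y})=\frac{|\texttt{top-k}(\hat{x})\cap\texttt{top-k}(\hat{y})|}{|\texttt{top-k}(\hat{x})\cup\texttt{top-k}(\hat{y})|}$$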

where $\texttt{top-k}(v)$ is the set of $k$ top activated indices in the vector $v$ (which correspond to tokens in the embedding space). This metric is the Jaccard index [Jaccard, 1912] applied to the top- $k$ tokens from each vector. In Figure 6, Left, we demonstrate that FF key vectors and their corresponding value vectors are more similar (in embedding space) than two random key and value vectors. In Figure 6, Right, we show a similar result for attention value and output vectors. In Figure 6, Bottom, the same analysis is done for attention query and key vectors. This shows that there is a much higher-than-chance relation between corresponding FF keys and values (and the same for attention values and outputs).
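The Jaccard index over top-$k$ sets described above can be computed as in the following sketch. The function names and toy vectors are assumptions, and $k\geq 1$ is assumed so that the union is non-empty.

```python
def top_k_set(vec, k):
    """Set of indices of the k largest entries of vec."""
    return set(sorted(range(len(vec)), key=lambda i: vec[i], reverse=True)[:k])


def jaccard_top_k(x_hat, y_hat, k):
    """Jaccard index between the top-k index sets of two projected vectors (k >= 1)."""
    a, b = top_k_set(x_hat, k), top_k_set(y_hat, k)
    return len(a & b) / len(a | b)


# Toy projected vectors over a 5-item vocabulary (assumed values).
key_hat = [5.0, 4.0, 3.0, 0.0, 0.0]
value_hat = [0.0, 4.0, 5.0, 3.0, 0.0]

print(jaccard_top_k(key_hat, value_hat, k=3))
```

Here the top-3 sets share two of four distinct indices, so the score is one half; identical top-$k$ sets give 1 and disjoint sets give 0.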

B.2 Final Prediction and Parameters

We show that the final prediction of the model is correlated in embedding space with the most activated parameters from each layer. This implies that these objects are germane to the analysis of the final prediction in the embedding space, which in turn suggests that the embedding space is a viable choice for interpreting these vectors. Figure 7 shows that just like §4.2, correspondence is better when hidden states are not randomized, suggesting their parameter interpretations have an impact on the final prediction.

B.3 Parameter Alignment Plots for Additional Model Pairs

Alignment in embedding space of layers of pairs of BERT models trained with different random seeds for additional model pairs.

Seed 2 VS Seed 3

Seed 3 VS Seed 4

Seed 4 VS Seed 5

Appendix C Example Cases

Below we show output-value pairs from different heads of GPT-2 medium. For each head, we show the 50 pairs with the largest values in the $e\times e$ transition matrix. There are 384 attention heads in GPT-2 medium from which we manually choose a subset. Throughout the section some lists are marked with asterisks indicating the way this particular list was created:

* - pairs of the form $(x,x)$ were excluded from the list

** - pairs where both items are present in the corpus (we use the IMDB training set).

Along with GPT-2 medium, we also provide a few examples from GPT-2 base and GPT-2 large.

C.1.2 Gender

C.1.3 Geography

C.1.4 British Spelling

C.1.5 Related Words

C.2 Query-Key Matrices

GPT-2 Medium - Layer 22 Head 5 (names and parts of names seem to attend to each other here)

C.3 Feedforward Keys and Values

Key-value pairs, $(k_{i},v_{i})$ , where at least 15% of the top- $k$ vocabulary items overlap, with $k=100$ . We follow our forerunner’s convention of calling the index of the value in the layer “dimension” (Dim).

Here again we use two asterisks (**) to represent lists where we discarded tokens outside the corpus vocabulary.

GPT-2 Medium - Layer 0 Dim 116

C.4 Knowledge Lookup

Given a few seed embeddings of vocabulary items we find related FF values by taking a product of the average embeddings with FF values.

Seed vectors: ["python", "java", "javascript"] Layer 14 Dim 1215 (ranked 3rd)

Seed vectors: ["cm", "kg", "inches"] Layer 20 Dim 2917 (ranked 1st)

Appendix D Sentiment Analysis Fine-Tuning Vector Examples

Below we show the finetuning vector of the classifier weight. “POSITIVE” designates the vector corresponding to the label “POSITIVE”, and similarly for “NEGATIVE”.

In the following sub-sections, we sample 4 difference vectors per each parameter group (FF keys, FF values; attention query, key, value, and output subheads), and each one of the fine-tuned layers (layers 9-11). We present the ones that seemed to contain relevant patterns upon manual inspection. We also report the number of “good” vectors among the four sampled vectors for each layer and parameter group.