Repeat After Me: Transformers are Better than State Space Models at Copying

Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

Introduction

Transformers (Vaswani et al., 2017) are the workhorse of modern sequence modeling, achieving remarkable performance on a variety of tasks, but they have unavoidable inefficiencies. Specifically, they require Ω(L)\Omega(L) memory111In some naive implementations of transformers, it is common to allocate a L×LL\times L matrix to compute the attention. However, memory efficient implementations, such as FlashAttention (Dao et al., 2022), compute the attention with O(L)O(L) memory. and compute to predict the next token of a sequence of length LL.

This has spurred a boom in attempts to create architectures that can achieve similar performance as transformers, but with O(1)O(1) memory to predict each token. This class of models includes state space models like S4 (Gu et al., 2021) or Mamba (Gu & Dao, 2023), as well as traditional RNN models (Hochreiter & Schmidhuber, 1997) and models that can be trained in parallel like linear attention (Katharopoulos et al., 2020; Choromanski et al., 2020) and parallel RNNs (Bradbury et al., 2016; Peng et al., 2023; Sun et al., 2023). In this paper, we will refer to this entire class of models that use a fixed-size memory as “generalized state space models” or GSSMs (see a formal definition in Section 2).

Recent work has demonstrated impressive performance of GSSMs, but it is not yet clear what these models sacrifice for their improved efficiency, if anything. In this paper, we find that one particular capability that is sacrificed is the ability to retrieve and repeat parts of the input context. As a result, transformers are better than GSSMs at a variety of tasks that require accessing arbitrary parts of the context.

To understand this gap in capabilities, we begin by presenting a theoretical analysis of the copying task222Note that we study copying of the input and not copying of training data (McCoy et al., 2023; Carlini et al., 2022). First, we show via construction that a simple transformer model can copy strings of length that is exponential in the number of heads of the transformer. This construction relies on the ability of the transformer to implement a mechanism of “storage” and retrieval of sequences of n tokens (n-grams), where the n-grams are used to track where to copy from. In contrast, we show that, trivially, GSSMs cannot accurately copy strings with more bits than the size of the latent state.

Pythia: 410M 1.4B 2.8B Mamba: 360M 1.4B 2.8B

Our theory studies representation expressivity, but not whether these representations will be learned. Moreover, in practice a large GSSM may have enough capacity to represent the entire input in the latent state, at least in theory. To resolve these concerns, we conduct a variety of synthetic experiments with models of \sim160M parameters. We find that transformers are both much more efficient at learning to copy (Figure 1(a)) and also generalize better to longer inputs (Figure 1(b)). Additionally, we verify experimentally that the copy “algorithm” learned by transformers indeed relies on n-grams to perform a lookup of where to copy from (Figure 3), similarly to our theoretical construction.

Finally, we present a variety of experiments on pre-trained models to test their ability to remember and access the input context. In particular, we show that Pythia transformers (Biderman et al., 2023) outperform Mamba GSSMs (Gu & Dao, 2023) of similar size at a variety of memory-intensive tasks including copying and retrieving information from the context (Figure 1(c)). This is especially notable since the Mamba models achieve lower perplexity than the Pythia models at language modeling on the Pile (Gao et al., 2020). These experiments illustrate the practical relevance of the memory issues that we raise, and hint at one way that architectual choices can impact the downstream performance of LLMs above and beyond training perplexity.

Theory: Representational Capacity

In this section we use the copy task for a theoretical comparison between state space models and transformers. We prove two main results. First, we construct a small transformer that solves the copy task for sequences lengths that are exponential in the transformer size. Second, we show that any state space model fails to solve the copy task, unless its latent state grows linearly with the sequence length.

Let D{\mathbb{D}} be a dictionary, which contains DD “alphabet” tokens. A sequence-to-sequence model is a function H:DDH:{\mathbb{D}}^{*}\to{\mathbb{D}}^{*}, which maps an input sequence of tokens to an output sequence. We think of the input x1,,xix_{1},\dots,x_{i} as the “prompt” to the model, and of the output sequence H(x1,,xi)H(x_{1},\dots,x_{i}) as the generated “answer”.

A sequence-to-token mapping is a function h:DDh:{\mathbb{D}}^{*}\to{\mathbb{D}}. Any sequence-to-token model hh naturally defines a sequence-to-sequence model HH by auto-regressive inference. Namely, for every input sequence x1,,xiDx_{1},\dots,x_{i}\in{\mathbb{D}} we define recursively xi+j=h(x1,,xi+j1)x_{i+j}=h(x_{1},\dots,x_{i+j-1}) and let H(x1,,xi)=(xi+1,xi+2,)H(x_{1},\dots,x_{i})=(x_{i+1},x_{i+2},\dots).

A state space S{\mathcal{S}} is some finite set. We denote by mem(S)\mathrm{mem}({\mathcal{S}}) the number of bits required to encode the states of S{\mathcal{S}}, namely mem(S)=log(S)\mathrm{mem}({\mathcal{S}})=\log(\left\lvert{\mathcal{S}}\right\rvert). A generalized state space model (GSSM) is a sequence model defined by an update rule u:S×DSu:{\mathcal{S}}\times{\mathbb{D}}\to{\mathcal{S}} and some output function r:SDr:{\mathcal{S}}\to{\mathbb{D}}. Let s0Ss_{0}\in{\mathcal{S}} be some initial state. Given some sequence x1,,xLx_{1},\dots,x_{L}, the state of the model at iteration ii is denoted by Si(x1,,xi)S_{i}(x_{1},\dots,x_{i}) and the output token is denoted by Ri(x1,,xi)R_{i}(x_{1},\dots,x_{i}). The state and output are defined recursively: 1) S0()=s0S_{0}(\emptyset)=s_{0}, 2) Si(x1,,xi)=u(Si1(x1,,xi1),xi)S_{i}(x_{1},\dots,x_{i})=u(S_{i-1}(x_{1},\dots,x_{i-1}),x_{i}), 3) Ri(x1,,xi)=r(Si(x1,,xi))R_{i}(x_{1},\dots,x_{i})=r(S_{i}(x_{1},\dots,x_{i})).

It is important to note that for any sequence model, there are two types of memory considerations: 1) input-independent memory (parameters) and 2) input-dependent memory (activations). The GSSM definition constraints the input-dependent memory (activations), which corresponds to mem(S)\mathrm{mem}({\mathcal{S}}), and does not restrict in any way the amount of input-independent memory (parameters) or the run-time of state updates. Since our main goal is to show a lower bound on the state space memory, leaving all other considerations unconstrained only strengthens our results.

Transformers.

Given some input of length LL and dimension dd, denoted x1,,xLRd{\bm{x}}_{1},\dots,{\bm{x}}_{L}\in{\mathbb{R}}^{d}, an attention head is parameterized by Wk,Wq,WvRd×dW_{k},W_{q},W_{v}\in{\mathbb{R}}^{d\times d}. We denote ki=Wkxi,qi=Wqxi,vi=Wvxi{\bm{k}}_{i}=W_{k}{\bm{x}}_{i},{\bm{q}}_{i}=W_{q}{\bm{x}}_{i},{\bm{v}}_{i}=W_{v}{\bm{x}}_{i} and denote Ki=[k1,,ki]Rd×iK_{i}=[{\bm{k}}_{1},\dots,{\bm{k}}_{i}]\in{\mathbb{R}}^{d\times i} and Vi=[v1,,vi]Rd×iV_{i}=[{\bm{v}}_{1},\dots,{\bm{v}}_{i}]\in{\mathbb{R}}^{d\times i}. We denote the output of the head at token ii by oiRd{\bm{o}}_{i}\in{\mathbb{R}}^{d}, where oi=Visoftmax(Kiqi){\bm{o}}_{i}=V_{i}\cdot\mathrm{softmax}(K_{i}\cdot{\bm{q}}_{i}).

We consider a transformer with ll attention heads, each one of dimension dd so that the full dimension of the Transformer is dldl. An embedding is some mapping Ψ:DRd\Psi:{\mathbb{D}}\to{\mathbb{R}}^{d}. An MLP is a function f:RdlRdlf:{\mathbb{R}}^{dl}\to{\mathbb{R}}^{dl} s.t. f(x)=U1σ(U2x)f({\bm{x}})=U_{1}\sigma(U_{2}{\bm{x}}), for some activation function σ\sigma. Both the embedding and the MLP layer are assumed to be applied on the token level. An attention-block is a set of ll heads applied in parallel, and a transformer-block is an attention-block followed by an MLP which operates on the concatenated output of the ll heads. The output of the model is sampled based on the output of the final layer. For simplicity, we study the arg max\operatorname*{arg\,max} “sampling” (i.e., predicting the most probable token).

The copy task.

To define the copy task, we add two special tokens to D{\mathbb{D}}: (1) beginning-of-sequence token, denoted \textscBOS\left\langle{\tiny\textsc{BOS}}\right\rangle, and (2) copy token, denoted \textscCOPY\left\langle{\tiny\textsc{COPY}}\right\rangle. So now D=D+2|{\mathbb{D}}|=D+2. A length-LL copy distribution DL{\mathcal{D}}_{L} over DL+2{\mathbb{D}}^{L+2} generates strings of the form: “\textscBOS,x1,,xL,\textscCOPY\left\langle{\tiny\textsc{BOS}}\right\rangle,x_{1},\dots,x_{L},\left\langle{\tiny\textsc{COPY}}\right\rangle”, where x(D{\textscBOS,\textscCOPY})L{\bm{x}}\in({\mathbb{D}}\setminus\{\left\langle{\tiny\textsc{BOS}}\right\rangle,\left\langle{\tiny\textsc{COPY}}\right\rangle\})^{L}.

For some sequence-to-sequence model H:DDH:{\mathbb{D}}^{*}\to{\mathbb{D}}^{*}, we denote the error of HH on a copy distribution DL{\mathcal{D}}_{L} by

where H1:L()H_{1:L}(\cdot) denotes the first LL tokens generated by HH. That is, we expect the model to output an exact copy of x{\bm{x}}.

2 Transformers can copy inputs of exponential length

In this section we show that transformers can implement the copy operation for input sequences with length exponential in the number of heads. Namely, we construct a transformer with two blocks that gets small error on the copy task.

The key idea in the construction is to first “hash” sequences of nn tokens (nn-grams), then at each iteration of the auto-regression attend to the previous occurrence of the most recent nn-gram, and output the succeeding token. That is, we show that a transformer can implement the copying algorithm illustrated in Figure 3 (and see also Algorithm 1 in the Appendix).

Positional embedding: Hard-ALiBi.

To perform the hashing described in the algorithm, we need to be able to leverage local positional information to define a hash, and also to apply this hash function globally on the entire input. To do this, we use a hard version of ALiBi (Press et al., 2021), which we call Hard-ALiBi. Just as in ALiBi, we add a bias bib_{i} to the ii-th attention head as follows: oi=Visoftmax(Kiqi+bi){\bm{o}}_{i}=V_{i}\cdot\mathrm{softmax}(K_{i}\cdot{\bm{q}}_{i}+b_{i}). Specifically, we set bib_{i} s.t. bi,j=b_{i,j}=-\infty for jimj\leq i-m and bi,j=0b_{i,j}=0 for j>imj>i-m. We allow different heads with different choices of mm and also allow for m=m=\infty which corresponds to softmax attention with no positional embedding. This is illustrated in Figure 8(c) (Appendix). While the Hard-ALiBi is introduced for our theoretical construction, we observe it also offers significant benefits empirically, as discussed in Section 3.

Guarantees.

The copy algorithm given in Algorithm 1 (and similarly, our transformer construction) can perfectly copy the input sequence, as long as there are no repeated nn-gram patterns in the input. Therefore, the error of the algorithm depends on the probability of repeated nn-grams:

Let DL{\mathcal{D}}_{L} be some copy distribution. For some nNn\in{\mathbb{N}}, let pngram(DL)p_{\mathrm{n-gram}}({\mathcal{D}}_{L}) be the probability that x1,,xLx_{1},\dots,x_{L} contains two repeated sequences of nn tokens. Namely:

Below we state the main theoretical result on copying with transformers, showing that transformers can copy their input, with error bounded by the probability of repeated nn-grams:

For all nn, there exists a depth-2 transformer T{\mathcal{T}} of dimension O(nlog(D))O(n\log(D)) s.t. for all 2nLDn2n\leq L\leq D^{n}, and for any copy distribution DL{\mathcal{D}}_{L}, errDL(T)<pngram(DL)\mathrm{err}_{{\mathcal{D}}_{L}}({\mathcal{T}})<p_{\mathrm{n-gram}}({\mathcal{D}}_{L}).

Intuitively, the probability of repeated nn-grams decays quickly when increasing the value of nn. Indeed, we show that for the uniform distribution over sequences, this probability decays exponentially with nn:

Let DL{\mathcal{D}}_{L} be the copy distribution generated by sampling x{\bm{x}} from the uniform distribution over the “alphabet” (non-special) tokens. Then, pngram(DL)<L2Dnp_{\mathrm{n-gram}}({\mathcal{D}}_{L})<L^{2}D^{-n}.

Combining the above results, we get that transformers can copy sequences of tokens drawn from the uniform distribution, using a number of parameters that depends only logarithmically on the input sequence length.

Fix some ϵ(0,1/2)\epsilon\in(0,1/2) and some LΩ(log(1/ϵ))L\geq\Omega(\log(1/\epsilon)). There exists a depth-2 transformer T{\mathcal{T}} of dimension O(log(L/ϵ)log(D))O(\log(L/\epsilon)\log(D)) s.t. for the uniform copy distribution DL{\mathcal{D}}_{L}, errDL(T)<ϵ\mathrm{err}_{{\mathcal{D}}_{L}}({\mathcal{T}})<\epsilon.

For simplicity we do not limit the precision of the parameters or activations, but note that our results hold for finite-precision transormers, using O(log(log(L)))O(\log(\log(L))) bits.

3 State Space Models cannot copy inputs beyond memory size

We saw that transformers are able to copy uniform sequences of tokens, with parameter count logarithmic in the sequence length. We now show that GSSMs cannot copy uniform input sequences, unless the capacity of their state space grows linearly with the size of the sequence length. This is intuitive: to be able to copy the entire input sequence, the model needs to store it in its state space, which requires the memory to grow linearly with the sequence length.

Fix some GSSM HH over state space S{\mathcal{S}}. Then, for all LL, for the uniform copy distribution DL{\mathcal{D}}_{L}, the model HH has error errDL(H)>1SDL\mathrm{err}_{{\mathcal{D}}_{L}}(H)>1-\frac{\left\lvert{\mathcal{S}}\right\rvert}{D^{L}}.

Given Theorem 2.7, the following Corollary is immediate:

Fix some LNL\in{\mathbb{N}}. Then, every GSSM HH with state space S{\mathcal{S}} s.t. mem(S)<Llog(D)1\mathrm{mem}({\mathcal{S}})<L\log(D)-1 has error errDL(H)>1/2\mathrm{err}_{{\mathcal{D}}_{L}}(H)>1/2 for the uniform copy distribution DL{\mathcal{D}}_{L}.

As mentioned previously, the input-dependent memory of transformers grows linearly with the sequence length, which is less memory-efficient compared to GSSMs. However, it is interesting to note that from the above result, at least for the copy task, transformers are almost optimal in terms of their input-dependent memory. More specifically, an implication of Theorem 2.3 is that there exists a transformer which can copy inputs of length LL using O~(L)\tilde{O}(L) input-dependent memory333We use O~\tilde{O} to hide logarithmic factors., and due to Corollary 2.8 this is indeed optimal (up to logarithmic factors).

Learning to Copy

In the previous section, we proved that transformers can represent the copy operation for exponentially long sequences, while GSSMs fail to copy long sequences due to their limited memory. While these results show that in theory, transformers can outperform GSSMs, our theoretical results do not establish that such a gap will be observed in practice for two reasons. First, it is not clear that transformers can indeed learn to copy from examples. Second, GSSMs in practice may use a large latent state memory, so that our bounds only hold for very long sequences of tokens. For example, a latent state of 1000 32-bit floating point numbers has enough bits to store at least 2000 tokens from a 50K token vocabulary. However, even though a GSSM could fit the context into memory, it may not learn to do so.

Our goal in this section is to verify that our theoretical analysis bears out experimentally when training models from scratch on synthetic data, before moving on to study pretrained models in the next section. Specifically, we train transformers and GSSMs (LSTM (Hochreiter & Schmidhuber, 1997) and Mamba (Gu & Dao, 2023)) on variants of the copy task shown in Figure 2.

We now provide a brief overview of our experimental setup. Further details may be found in Appendix A.

In all our experiments, we set the model hyperparameters so that the Mamba and transformers have a similar number of parameters (160\approx 160 million parameters). Since we find that large LSTMs are hard to train (as confirmed in Pascanu et al. (2013)), we use the largest LSTM we managed to train which has 40\approx 40 million parameters.

Dataset.

During training, we generate in an online manner a batch of 64 examples at each epoch. At test time, we evaluate our models on 1010 batches of 128128 examples. We report the mean and standard-deviation over these 10 batches. If not specified otherwise, our token space V\mathcal{V} is of size 30 and made of the alphabet letters i.e. V={a,,z,\textscBOS,\textscEOS,\textscCOPY}\mathcal{V}=\{a,\dots,z,\left\langle{\tiny\textsc{BOS}}\right\rangle,\left\langle{\tiny\textsc{EOS}}\right\rangle,\left\langle{\tiny\textsc{COPY}}\right\rangle\} where \textscBOS\left\langle{\tiny\textsc{BOS}}\right\rangle is the beginning of sentence token, \textscEOS\left\langle{\tiny\textsc{EOS}}\right\rangle the end of sentence token and \textscCOPY\left\langle{\tiny\textsc{COPY}}\right\rangle the separator token. All the strings are sampled uniformly i.e. we first sample the length of the sequence and then independently sample each position of the string from V\mathcal{V}. Finally, we “pack the context” with i.i.d. sequences during training similarly to (Zhou et al., 2023): we fill the context with multiple independent samples of the task.

Positional information.

Positional information also plays an important role in the length generalization capacity of Transformers (Jelassi et al., 2023; Kazemnejad et al., 2023; Shen et al., 2023). Previously popular methods of input-layer positional embeddings (e.g. sinusoidal (Vaswani et al., 2017) or learned (Radford et al., 2019)) have been replaced by relative positional encodings at each attention layer (e.g. RoPE (Su et al., 2023), Alibi (Press et al., 2021), or NoPE (Kazemnejad et al., 2023)). Below, we experiment these positional encodings along with the Hard-Alibi encoding introduced in Section 2.

2 Data efficiency on the copy task

We begin by training our models on the simple task of copying a sequence of input tokens described in Figure 2. The model gets an input of L\leq L tokens followed by a Separator (\textscCOPY\left\langle{\tiny\textsc{COPY}}\right\rangle) token, and needs to output the same sequence again from the beginning. In this section, we focus on in-distribution learning: we train on strings of random length L=300\leq L=300 and record the string-level accuracy on evaluation strings sampled from the training distribution.

Results for this experiment are shown in 1(a). Clearly, there is a large gap between the transformers and GSSMs. We observe that the transformers need 100x less samples than the best GSSMs to learn the copy task.

Note that the sharp changes in accuracy displayed in 1(a) are due to the log-scaled x-axis and choice of string-level accuracy as a metric. In 9(a), we report the character-level accuracy, which yields smoother curves demonstrating the learning process of GSSMs. Regarding LSTMs, we find that they do not manage to learn on length-300 strings even at the character level. In 9(b), we show that LSTMs are able to learn to copy on shorter strings and that string length is the bottleneck.

3 Length generalization on the copy task

The prior experiment demonstrates superior efficiency of learning in-distribution. Now, we test the ability of the learned functions to generalize out-of-distribution. Specifically, we consider generalization from short sequences to longer sequences. Testing this sort of generalization can help us to better understand which function the model has learned, i.e. whether the model has truly learned the “correct” copy operation or whether it just learned to copy sequences of the particular size it was trained on.

Here, we train all models on sequences of 50\leq 50 tokens, and test them on sequences of up to 10001000 tokens, reporting string-level accuracy. As seen in 1(b), all models are able to (eventually) solve the task in-distribution on lengths of 50\leq 50, but transformer-based models display much better generalization to longer inputs compared to GSSMs. Namely, we observe that the performance of the GSSMs (LSTM and MAMBA) drops to zero almost immediately when increasing the input length, while the performance of transformers decays much more gradually with length.

When looking at the relative performance of different transformer models in 1(b), it becomes clear that the positional encoding is important to length generalization. Specifically, the ALiBi and NoPE transformers dramatically outperform the RoPE model on longer inputs. This is likely because the sinusoidal embeddings of RoPE create a more dramatic change than the decay of ALiBi or NoPE when we go to longer inputs.

Improved generalization with Hard-ALiBi.

To test our understanding of how transformers learn to copy, we now consider swapping in the Hard-ALiBi positional encoding that we used in our theoretical construction of hash-based copying (introduces in Subsection 2.2 and illustrated in Figure 8 in the Appendix). 1(b) shows that a transformer trained with Hard-ALiBi embedding on sequences of length 50\leq 50 achieves almost perfect length generalization up to sequences of length 1000. Note that this is well beyond the context length ever encountered in training.

4 Transformers learn to use n-gram hashing

Next, we attempt to determine whether the transformer trained on the copy task indeed applies the mechanism of storage and retrieval of n-grams. To do this, we evaluate the performance of a transformer with Hard-ALiBi positional encoding trained on the copy task when tested on a distribution of examples that intentionally contains duplicate n-grams. That is, we draw uniform sequences of tokens, and then randomly replace some n-gram with another n-gram that already appears in the sequence, such that each example always contains two copies of the same n-gram (typically followed by a different token). We use the Hard-Alibi model here since it performs the best for the copy task as showed in 1(a). Figure 4 shows the performance of the transformer for different choices of nn. We observe that the transformer maintains roughly the same accuracy for n4n\leq 4, but that its accuracy starts dropping when the inputs contains duplicate sequences of 5 or more tokens. This suggests that the transformer relies on something like 5-gram retrieval to do the copy task.

5 GSSMs cannot arbitrarily retrieve from context

Transformer: NoPE Alibi HAlibi GSSM: LSTM Mamba

Transformer: NoPE Alibi HAlibi GSSM: LSTM Mamba

We now introduce another task to probe the mechanisms that the models use to copy from the context: the n-gram lookup task. In this task the model needs to use a given n-gram as a key to look up the k-token key that follows the query. We consider two variants of the task: suffix keys and prefix keys. In both variants, we assess length generalization to understand the function that the models have learned.

First, we consider the suffix key version of n-gram lookup. In this task, the model is given a sequence LL of input tokens, a separator, and then an n-gram from the input sequence. The model then needs to output a sequence of kk tokens following the chosen n-gram (see Figure 5 for an illustration). This task is closely related to induction heads (Olsson et al., 2022). This task requires the model to be able to “store” the entire context in order to effectively find the correct key to access it’s query. We train all models on sequences of at most 30 tokens and show results in Figure 5. Transformers perform well on this task, with a relatively small drop in performance when increasing the sequence length up to 100. This suggests that transformers can learn to perform n-gram storage and retrieval. GSSMs, however, perform poorly beyond their training distribution. Intuitively, this task still requires the models to store the entire input sequence, something that GSSMs struggle to do.

Next, we try the prefix key version of n-gram lookup. Here we provide the n-gram key at the beginning and then the full input sequence (illustrated in Figure 6). In this version of the task the model does not need to store the entire input since it can look for the key on the fly as the sequence is processed. This is good for the GSSMs, since they can write the key into the state and then ignore inputs that do not match. Indeed, GSSMs achieve perfect length-generalization on this variant. Interestingly, the GSSMs even outperform the NoPE and ALiBi transformers (although not the Hard-Alibi model). We hypothesize that this may be an issue where these positional embeddings make it more difficult to effectively perform the hashing lookup over a long distance in relative positions. Taken together, these results illustrate how GSSMs seem to be memory limited, but can be effective when the tasks only require a summary of the inputs rather than storing the entire context.

Pre-trained Models

In this section, we compare the performance of pre-trained transformers and pre-trained GSSMs on memory-intensive tasks such as copying long strings, retrieval and few-shot question answering. We show that transformers outperform GSSMs of similar scale on such memory-intensive tasks, even when the GSSM has lower perplexity as a language model. These results confirm that the limitation of GSSMs raised in previous sections apply to large scale models trained on real pretraining data.

In the experiments below, we compare Pythia transformer models (Biderman et al., 2023) of sizes ranging from 410M to 2.8B against Mamba models (Gu & Dao, 2023) of similar sizes. All these models have been pre-trained on the Pile (Gao et al., 2020) and use the same tokenizer. The Mamba models generally have slightly lower perplexity on the training set for a given size. The main difference between the Pythia and the Mamba models is their architectural design.

We compare these models by measuring their performance while varying the input instance length and consider two types of tasks: copy-based and information retrieval tasks. The copy-based tasks consist of presenting a random text to the model and asking it to copy the text. In the information retrieval tasks, we provide a text to the model and ask it a related question. These retrieval tasks can be seen as ”selective copy”, since the model needs to copy a small chunk of the input text in order to respond to the question. To measure performance, we use the string-level accuracy in all the experiments except in 7(c) where we consider question answering and thus report the F1 score. We evaluate the models over 10 batches of size 64 for all the tasks except for question answering where we evaluate over 50 questions because the number of questions with a given context length is limited. Further details are in Appendix A.

Pythia: 410M 1.4B 2.8B Mamba: 360M 1.4B 2.8B

Pythia: 410M 1.4B 2.8B Mamba: 360M 1.4B 2.8B

2 Copying the input context

We first observe that pre-trained transformers outperform pre-trained GSSMs at copying long natural language strings. In 7(a), we randomly sample strings from the C4 dataset (Raffel et al., 2020) with varying number of tokens. Our prompt consists of two copies of the sampled string plus the first word of the string and we expect the model to complete the third copy. Even the smallest transformer model dramatically outperforms the largest GSSM. This happens even though the large GSSMs have enough bits in the state variable to potentially store the context. This confirms the idea that this is an architectual bias of transformers that makes it easier for them to copy from the context.

Unlike strings of tokens sampled uniformly at random, natural text can often be compressed, possibly allowing language models to copy longer strings even with limited memory. To test whether this matters, in 7(b) we conduct the same experiment as above but randomly shuffle the order of the words in the strings. We find that when we shuffle the words, both GSSMs and transformers perform worse on the task, but the effect is more stark for GSSMs. Even the largest GSSM now gets zero accuracy on strings of length 300. This suggests that when the input is more difficult to compress, the GSSM suffers due to its fixed size state.

3 Retrieval from the input context

While copying provides a clear task to separate the model classes, it is not a particularly realistic task. That said, it presents an extreme case of a type of behavior that is highly relevant for many tasks of interest. In particular, many tasks require retrieving specific information from the context that is relevant to the desired output. This subsection presents examples of how our results transfer to more practical tasks.

We first consider a “phone-book” experiment where we provide a synthetic phone-book to the model and ask it to return the phone number when given a name. We generate the phone-book by randomly sampling LL names and their associated phone number. One line of this phone-book looks like “John Powell: 609-323-7777”. Our prompt to the model consists of the phone-book, two few-shot examples and a question asking for the phone number of a randomly sampled name from the phone-book. 1(c) reports the accuracy obtained by the pretrained transformers and GSSMs while varying the size of the phone-book L.L. We observe that even the smallest transformer (410M parameters) outperforms the largest GSSMs (2.8B parameters) when the phone-book size is long enough (L70L\geq 70). This shows that in retrieval tasks which require access to the whole context, GSSMs struggle to store the relevant information in their fixed-size state.

Question-Answering.

In this experiment, we compare the 2.8B parameter Mamba and transformer models444In our experiments, smaller models were unable to achieve reasonable and consistent performance on this dataset., on the SQuAD question-answering dataset (Rajpurkar et al., 2018). This dataset provides text paragraphs together with a few questions regarding the text. We probe the models to answer the question by providing a single demonstration of a question/answer pair (corresponding to the same text) before giving the target question. We bin the paragraphs according to their lengths, and report the F1 score as a function of the paragraph length for both models in 7(c). We observe that while for short paragraphs, both the Pythia transformer and Mamba achieve comparable performance, the performance of Mamba degrades more quickly with the paragraph length, while the transformer-based model maintains a similar accuracy even for longer texts. This result shows that the fixed-memory of GSSMs also limits their performance on standard natural tasks.

Related Work

There exists a broad body of prior work on the representational capacity of GSSMs like RNNs (Merrill, 2019; Merrill et al., 2020) as well as transformers (Weiss et al., 2021; Merrill et al., 2022; Wei et al., 2022; Sanford et al., 2023; Edelman et al., 2022). Previous works that study transformers do so through comparison to other complexity classes, such as threshold circuits (Merrill et al., 2022), RASP language (Weiss et al., 2021) or first-order logic (Chiang et al., 2023) (see Strobl et al. (2023) for a thorough review). These works do not provide insights into how transformers implement algorithms for solving specific problems. In contrast, our theoretical result constructs a transformer for the copy task, which illustrates the mechanism and provides tight bounds on the model size. Together with the result showing that GSSMs cannot copy long sequences, our theory characterizes the power of different sequence models on the copy task. Other theoretical separation results between transformers and RNNs (Sanford et al., 2023; Merrill, 2019) use more complex tasks of less practical relevance.

Other papers have previously demonstrated the capacity of transformers to leverage the entire input context for tasks like retrieval, question answering, and in-context learning (Devlin et al., 2018; Raffel et al., 2020; Petroni et al., 2020; Brown et al., 2020; Liu et al., 2023b). Another line of work has studied the “induction head” mechanism in transformers that performs a retrieval operation much like the one we observe for copying (Olsson et al., 2022). But, to our knowledge, there is not a comparison in related work between transformers and GSSMs of similar quality on these tasks.

Several of our experiments study length generalization as a way to assess whether the model found the “right way” to solve the task. Prior work on length generalization in transformers has focused on the data distribution (Anil et al., 2022), positional embeddings (Kazemnejad et al., 2023), and arithmetic tasks (Delétang et al., 2022; Ruoss et al., 2023; Jelassi et al., 2023; Zhou et al., 2023). We extend many of these ideas to the copying task.

Finally, we note that while we focus on tasks where transformers outperform GSSMs, there are also tasks where GSSMs outperform transformers. For example, Liu et al. (2023a) shows that transformers fail to generalize out of distribution for “flip-flop language modeling”, while LSTMs do so easily. These tasks require tracking a small O(1)O(1) state variable over time. Another benefit of GSSMs is the ability to input long contexts like DNA sequences that may be impractical for transformers (Nguyen et al., 2023).

Discussion

We have demonstrated through theory and experiments that transformers are better than GSSMs at copying from their input context. However, we emphasize that state space models have many advantages over transformers. The memory and computational complexity of GSSMs does not increase with the input length, which is ideal for training and inference on long inputs. Additionally, state space models such as RNNs are better at tracking state variables across long sequences (Liu et al., 2023a), which may be useful for generating long consistent text. Importantly, language processing in the human brain appears to be much more similar to how state space models process language (Tikochinski et al., 2024). We therefore believe that future work should focus on building hybrid architectures that endow state space models with an attention-like mechanism, allowing them to retrieve relevant pieces of text from their input. Indeed, humans have an incredibly limited capacity for memorizing sequences (Miller, 1956), but can translate entire novels if we allow them to look back at the text (Shelton, 1612).

Acknowledgements

We thank Boaz Barak for helpful discussions. Kempner Institute computing resources enabled this work. Samy Jelassi acknowledges funding supported by the Center of Mathematical Sciences and Applications. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. Sham Kakade acknowledges funding from the Office of Naval Research under award N00014-22-1-2377.

References

Appendix A Experimental setup

In this section, we provide additional details about our experimental setup. We first give a description of the positional encodings used in our transformers experiments (Subsection A.1) and then give details about the training and evaluation procedures (Subsection A.2).

We consider multiple positional encoding schemes in our experiments in Section 3:

the NoPE scheme (Kazemnejad et al., 2023) where no positional information is added to any of the attention scores (8(a)). This architecture choice helps to get better length generalization in multiple tasks including the copy task.

the ALiBi scheme (Press et al., 2021) which biases the attention scores with a penalty that is proportional to their distance (8(b)). mm is a head-specific slope fixed before training.

the Hard-ALiBi scheme introduced in Section 2 which has MM masked attention heads where we explicitly force the model to attend to their directly previous tokens and HMH-M heads set to be NoPE attention heads. In 8(c), we display the case where we have M=4M=4 masked heads: in the first head, the tokens just attend to themselves; in the second head, the tokens attend to themselves and to previous ones; in the third head, the tokens attend to themselves, the previous ones and the second preceding tokens. The remaining HMH-M heads are set to NoPE.

A.2 Pretraining and evaluation details

We implement all of our training in Pytorch (Paszke et al., 2019). We use the HuggingFace library (Wolf et al., 2019) and the Mamba GitHub repository (Gu & Dao, 2023).

Architectures.

In our experiments in Section 3, the backbone of our transformers is the GPT-NeoX architecture. We set the number of layers to 12, the hidden size to 1024 and the number of heads H=16H=16. We consider the different positional encodings that are described in Subsection A.1. For Alibi, we set the head-specific scalar as in the original paper i.e. mh=2h/2m_{h}=2^{-h/2} for h{1,,H}.h\in\{1,\dots,H\}. For the Hard-Alibi model, we sweep over the number of masked heads M{2,,10}M\in\{2,\dots,10\} and found that the best model corresponds to M=6M=6. Regarding the Mamba models, we set the number of layers to 24 and the hidden size 1024. We also sweep over the state space dimension S{16,32,64,128,256}S\in\{16,32,64,128,256\} and found the best model is S=32S=32. This choice of hyperparameters ensures that both transformers and Mamba models have a comparable number of parameters. Lastly, our LSTM is made of 4 layers and width 1024.

Training hyperparameters.

In Section 3, at each epoch, we sample online a batch size of size 64. We fill the context with examples so we choose a context length (C=420C=420 for all the experiments except 1(a) where we set C=620C=620) and pack as many examples as possible to fit this context. So in our case, one sample contains many instances. We run the experiments for 15 epochs for both transformers and Mamba while for LSTMs we need 300 epochs. All methods are trained with the AdamW optimizer (Loshchilov & Hutter, 2017) with learning rate 5e-5, a linear rate decay schedule, 300 steps of warmup and default weight decay of 1e-1. Finally, to train all the models, we use the next-token prediction loss but we apply a mask on the input instance so that we only penalize the model whenever it makes a mistake on the labels (and not on the inputs and labels jointly).

Compute resources.

Pretraining was all done on an internal cluster using RTX8000 GPUs. We estimate that the final training run needed to produce the results in the paper took approximately 600 GPU hours.

Evaluation algorithm.

We evaluate the models over 10 batches of size 64 for all the tasks except for the question answering one where we evaluate over 50 questions because the number of questions with a given context length is limited.

Decoding algorithm.

At inference, all our models use greedy decoding for generation and we set the temperature to 0.

Appendix B Additional Experiments

In Subsection B.1, we focus on the in-distribution learning of the copy task and show that the number of samples needed by GSSMs is much higher than the one for transformers. In Subsection B.2, we study the performance of pre-trained models on the copy task in the case where the strings are sampled uniformly. This experiment shows that when the text to copy is totally random, the gap between pre-trained transformers and GSSMs is even larger.

Transformer: RoPE NoPE Alibi HAlibi GSSM: LSTM Mamba

In this section, we provide additional plots to complement the data efficiency experiment from 1(a). We want to highlight the following points:

in 1(a), we see a sharp transition for the Mamba learning curve. However, 9(a) shows that the learning process is more smooth at the character level. Besides, LSTMs are not able to learn the copy on length-300 strings even at the character level.

We consider the experiment of learning to copy much shorter strings namely strings with length 30\leq 30. 9(b) shows that the gap in terms of training examples between transformers and Mamba is much smaller i.e. Mamba only needs 10x more data. Besides, we see that the LSTM is able to learn the copy task but it needs 100x more data than transformers.

B.2 Pre-trained models on the uniform copy task

In this section, we provide an additional experiment that shows the superiority of pre-trained Pythia over pre-trained Mamba models in the copy task.

Pythia: 410M 1.4B 2.8B Mamba: 360M 1.4B 2.8B

We consider the same setup as in Section 3: we sample uniform strings of alphabet characters with a fixed length and ask the model to copy it by using the same prompt format as the one described in Subsection 4.2.

This setting is a more extreme version of 7(b) since the strings are more random: in 7(b), the order of the nouns were random but the nouns were English nouns while in 7(b), the strings are totally random. In Figure 10, we see a clear separation between the transformers and Mamba models with the smallest Pythia outperforming the largest Mamba. However, compared to 7(b), the Pythia performance is much higher since the 1.4B model able to get almost 100% accuracy.

Appendix C Proofs - Upper Bound

This section gives a detailed proof of Theorem 2.3 and Lemma 2.4.

We begin by introducing some technical lemmas that we use in the proof of Theorem 2.3.

𝑛2…𝐿i=n+2,\dots,L do kis(xin,xin+1,,xi1)k_{i}\leftarrow s(x_{i-n},x_{i-n+1},\dots,x_{i-1}) vixiv_{i}\leftarrow x_{i} end for for j=1,,Lj=1,\dots,L do if jnj\leq n then yjxjy_{j}\leftarrow x_{j} else qjs(yjn,,yj1)q_{j}\leftarrow s(y_{j-n},\dots,y_{j-1}) Let i[L]i\in[L] s.t. ki=qjk_{i}=q_{j}, and set yjxiy_{j}\leftarrow x_{i} end if end for Output: sequence y1,,yLy_{1},\dots,y_{L} Lemma C.1. Let ht(x1,,xi)=1min(t,i)j=max(1,it+1)ixjh_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i})=\frac{1}{\min(t,i)}\sum_{j=\max(1,i-t+1)}^{i}{\bm{x}}_{j}. Then, hth_{t} can be computed using a hard-ALiBi attention head.

Let Wk,Wq=0W_{k},W_{q}=0 (zero matrix) and let Wv=IdW_{v}=I_{d} (indentity matrix). We choose bi{0,}ib_{i}\in\{0,-\infty\}^{i} s.t.

Assume that d=log(D)+2d=\left\lceil\log(D)\right\rceil+2. Then, there exists an embedding Ψ\Psi s.t.

For every xDx\in{\mathbb{D}} it holds that Ψ(x)2=1\left\lVert\Psi(x)\right\rVert_{2}=1 and Ψ(x)1\left\lVert\Psi(x)\right\rVert_{\infty}\leq 1.

For xxx^{\prime}\neq x it holds that x,x<11d\left\langle x,x^{\prime}\right\rangle<1-\frac{1}{d}.

For every x\textscBOSx\neq\left\langle{\tiny\textsc{BOS}}\right\rangle, Ψ(x),Ψ(\textscBOS)=0\left\langle\Psi(x),\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle)\right\rangle=0, and for every x\textscCOPYx\neq\left\langle{\tiny\textsc{COPY}}\right\rangle, Ψ(x),Ψ(\textscCOPY)=0\left\langle\Psi(x),\Psi(\left\langle{\tiny\textsc{COPY}}\right\rangle)\right\rangle=0.

Denote d=log(D)d^{\prime}=\left\lceil\log(D)\right\rceil, and observe that we can encode all DD “non-special” tokens as vectors in {±1d}d\left\{\pm\frac{1}{\sqrt{d}}\right\}^{d^{\prime}}, and denote this encoding by Ψ\Psi^{\prime}. Now, define:

Let zRK{\bm{z}}\in{\mathbb{R}}^{K} be some vector such that, for some constants a>b>0a>b>0, there exists i[K]i\in[K] s.t. zi=az_{i}=a and for all jij\neq i we have zjb\left\lvert z_{j}\right\rvert\leq b. Denote s=softmax(z){\bm{s}}=\mathrm{softmax}({\bm{z}}). Then si11+Kexp(ba)s_{i}\geq\frac{1}{1+K\exp(b-a)} and sjexp(ba)s_{j}\leq\exp(b-a) for all jij\neq i.

subscript𝑧𝑖𝐾1𝑏𝑎𝐾𝑏𝑎1𝐾𝑏𝑎\displaystyle\exp(a)=\exp(z_{i})\leq\sum_{j=1}^{K}\exp(z_{j})\leq\exp(z_{i})+(K-1)\exp(b)\leq\exp(a)+K\exp(b)=\exp(a)(1+K\exp(b-a)) Observe the following:

1𝐾𝑏𝑎11𝐾𝑏𝑎\displaystyle s_{i}=\frac{\exp(z_{i})}{\sum_{j=1}^{K}\exp(z_{j})}\geq\frac{\exp(a)}{\exp(a)(1+K\exp(b-a))}=\frac{1}{1+K\exp(b-a)} Finally, for every jij\neq i:

C.2 Proof of Theorem 2.3

We begin by constructing the first block of the transformer, which computes the “lookup-table” for the copy algorithm. This lookup-table consists of pairs of (key,values) for each position ii, where the key encodes the nn-gram preceding the ii-th token, and the value is the ii-th token. Namely, if the sequence is x1,,xix_{1},\dots,x_{i}, then keyi=(xin1,,xi)\mathrm{key}_{i}=(x_{i-n-1},\dots,x_{i}) and valuei=xi\mathrm{value}_{i}=x_{i}. Additionally, the transformer block also computes a query, which is just the “current” nn-gram, i.e. queryi=(xin,,xi)\mathrm{query}_{i}=(x_{i-n},\dots,x_{i}). The copy algorithm matches the current query\mathrm{query} with previous key\mathrm{key}-s, retrieving the matching value\mathrm{value}.

The following theorem shows that by using a combination of nn hard-ALiBi attention heads (with different choice of mm for each head), together with an MLP layer, can compute the correct (keyi,valuei,queryi)(\mathrm{key}_{i},\mathrm{value}_{i},\mathrm{query}_{i}) for each position. We use a slightly modified keyi,queryi\mathrm{key}_{i},\mathrm{query}_{i} to handle cases where the ini\leq n (or, ii is one of the first nn tokens after the \textscCOPY\left\langle{\tiny\textsc{COPY}}\right\rangle token).

Let Ψ\Psi be the one-hot embedding. Then, there exists a hard-ALiBi transformer block with 3 outputs, denoted Tkey,Tquery,TvalueT^{\mathrm{key}},T^{\mathrm{query}},T^{\mathrm{value}}, which correspond to 3 blocks of the output dimension, s.t. Tkey:Rd×R(d+1)n×T^{\mathrm{key}}:{\mathbb{R}}^{d\times*}\to{\mathbb{R}}^{(d+1)n\times*}, Tquery:Rd×R(d+1)n×T^{\mathrm{query}}:{\mathbb{R}}^{d\times*}\to{\mathbb{R}}^{(d+1)n\times*} and Tvalue:Rd×Rd×T^{\mathrm{value}}:{\mathbb{R}}^{d\times*}\to{\mathbb{R}}^{d\times*} satisfying, for all x{\bm{x}} sampled from a length-nn copy distribution,

𝑡1𝑑1𝑡𝑑𝑖Ψsubscript𝑥1…Ψsubscript𝑥𝑖Ψsubscript𝑥𝑖𝑡T^{\mathrm{key}}_{(t-1)d+1:td,i}(\Psi(x_{1}),\dots,\Psi(x_{i}))=\Psi(x_{i-t}) and if ini\leq n

𝑡1𝑑1𝑡𝑑𝑖Ψsubscript𝑥1…Ψsubscript𝑥𝑖0T^{\mathrm{key}}_{(t-1)d+1:td,i}(\Psi(x_{1}),\dots,\Psi(x_{i}))=0 • Additionally, for t=1,,nt=1,\dots,n, for all ii

𝑛𝑑𝑡𝑖Ψsubscript𝑥1…Ψsubscript𝑥𝑖1𝑖𝑡1T^{\mathrm{key}}_{nd+t,i}(\Psi(x_{1}),\dots,\Psi(x_{i}))=\bm{1}\{i=t+1\} 3. Query output:

𝑡1𝑑1𝑡𝑑𝑖Ψsubscript𝑥1…Ψsubscript𝑥𝑖Ψsubscript𝑥𝑖𝑡1T^{\mathrm{query}}_{(t-1)d+1:td,i}(\Psi(x_{1}),\dots,\Psi(x_{i}))=\Psi(x_{i-t+1}) and if i<ni<n

𝑡1𝑑1𝑡𝑑𝑖Ψsubscript𝑥1…Ψsubscript𝑥𝑖0T^{\mathrm{query}}_{(t-1)d+1:td,i}(\Psi(x_{1}),\dots,\Psi(x_{i}))=0 • Additionally, for t=1,,nt=1,\dots,n, for all ii

𝑛𝑑𝑡𝑖Ψsubscript𝑥1…Ψsubscript𝑥𝑖⋅𝑛1𝑖𝐿𝑡T^{\mathrm{key}}_{nd+t,i}(\Psi(x_{1}),\dots,\Psi(x_{i}))=n\cdot\bm{1}\{i=L+t\} Proof. We prove the following:

For the value output, we simply take Tvalue=h1T^{\mathrm{value}}=h_{1} as defined in Lemma C.1.

𝑡1subscriptℎ𝑡1subscript𝒙1…subscript𝒙𝑖⋅𝑡subscriptℎ𝑡subscript𝒙1…subscript𝒙𝑖g_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i})=(t+1)\cdot h_{t+1}({\bm{x}}_{1},\dots,{\bm{x}}_{i})-t\cdot h_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i}) where we define h00h_{0}\equiv 0. Observe that if i>ti>t then:

𝑡11𝑡1superscriptsubscript𝑗𝑖𝑡𝑖subscript𝒙𝑗⋅𝑡1𝑡superscriptsubscript𝑗𝑖𝑡1𝑖subscript𝒙𝑗subscript𝒙𝑖𝑡g_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i})=(t+1)\cdot\frac{1}{t+1}\sum_{j=i-t}^{i}{\bm{x}}_{j}-t\cdot\frac{1}{t}\sum_{j=i-t+1}^{i}{\bm{x}}_{j}={\bm{x}}_{i-t} and if iti\leq t then:

Claim: g^t(Ψ(x1),,Ψ(xi))=1{i>n}Ψ(xit)\hat{g}_{t}(\Psi(x_{1}),\dots,\Psi(x_{i}))=\bm{1}\{i>n\}\cdot\Psi(x_{i-t})

Proof: Fix some j[d]j\in[d]. Observe that for all ii, ejgt(Ψ(x1),,Ψ(xi))1\left\lvert e_{j}\cdot g_{t}(\Psi(x_{1}),\dots,\Psi(x_{i}))\right\rvert\leq 1.

If ini\leq n, we have gn(Ψ(x1),,Ψ(xi))=1ij=1iΨ(xj)g_{n}(\Psi(x_{1}),\dots,\Psi(x_{i}))=\frac{1}{i}\sum_{j^{\prime}=1}^{i}\Psi(x_{j^{\prime}}) and so Ψ(\textscBOS)gn(Ψ(x1),,Ψ(xi))=1\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle)\cdot g_{n}(\Psi(x_{1}),\dots,\Psi(x_{i}))=1 where we use the properties of Ψ\Psi and the fact that x1=\textscBOSx_{1}=\left\langle{\tiny\textsc{BOS}}\right\rangle. Therefore, g^t,j(Ψ(x1),,Ψ(xi))=0\hat{g}_{t,j}(\Psi(x_{1}),\dots,\Psi(x_{i}))=0.

where we use the fact that xit\textscBOSx_{i-t}\neq\left\langle{\tiny\textsc{BOS}}\right\rangle and therefore Ψ(\textscBOS)Ψ(xit)=0\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle)\cdot\Psi(x_{i-t})=0.

Claim: g~t(Ψ(x1),,Ψ(xi))=1{i=t+1}\tilde{g}_{t}(\Psi(x_{1}),\dots,\Psi(x_{i}))=\bm{1}\{i=t+1\}

Proof: Denote gt,i=gt(Ψ(x1),,Ψ(xi))g_{t,i}=g_{t}(\Psi(x_{1}),\dots,\Psi(x_{i})) and h1,i=h1(Ψ(x1),,Ψ(xi))h_{1,i}=h_{1}(\Psi(x_{1}),\dots,\Psi(x_{i})). Observe:

If i=t+1i=t+1, then gt,i=Ψ(x1)=Ψ(\textscBOS)g_{t,i}=\Psi(x_{1})=\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle) and h1,i=Ψ(xi)Ψ(\textscBOS)h_{1,i}=\Psi(x_{i})\perp\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle) and therefore g~t,i=1\tilde{g}_{t,i}=1.

If i>t+1i>t+1 then gt,i=Ψ(xit)Ψ(\textscBOS)g_{t,i}=\Psi(x_{i-t})\perp\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle) and h1,i=Ψ(xi)Ψ(\textscBOS)h_{1,i}=\Psi(x_{i})\perp\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle) and so g~t,i=0\tilde{g}_{t,i}=0.

If 1<it1<i\leq t then Ψ(\textscBOS)gt,i=1i12\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle)\cdot g_{t,i}=\frac{1}{i}\leq\frac{1}{2} and h1,i=Ψ(xi)Ψ(\textscBOS)h_{1,i}=\Psi(x_{i})\perp\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle) and so g~t,i=0\tilde{g}_{t,i}=0.

If i=1i=1 then gt,i=h1,i=Ψ(\textscBOS)g_{t,i}=h_{1,i}=\Psi(\left\langle{\tiny\textsc{BOS}}\right\rangle) and therefore g~t,i=0\tilde{g}_{t,i}=0.

Finally, we can take Tkey=[g^1,,g^q,g~1,,g~q]T^{\mathrm{key}}=[\hat{g}_{1},\dots,\hat{g}_{q},\tilde{g}_{1},\dots,\tilde{g}_{q}].

For all t=1,,nt=1,\dots,n, define gt(x1,,xi)=σ(Ψ(\textscCOPY)gt1(x1,,xi))g^{*}_{t}({\bm{x}}_{1},\dots,{\bm{x}}_{i})=\sigma(\Psi(\left\langle{\tiny\textsc{COPY}}\right\rangle)\cdot g_{t-1}({\bm{x}}_{1},\dots,{\bm{x}}_{i})).

Claim: gt(Ψ(x1),,Ψ(xi))=1{i=L+t}g^{*}_{t}(\Psi(x_{1}),\dots,\Psi(x_{i}))=\bm{1}\{i=L+t\}

Proof: Denote gt,i=gt(Ψ(x1),,Ψ(xi))g_{t,i}=g_{t}(\Psi(x_{1}),\dots,\Psi(x_{i})). Observe:

If i=L+ti=L+t then gt1,i=Ψ(xit+1)=Ψ(xL+1)=Ψ(\textscCOPY)g_{t-1,i}=\Psi(x_{i-t+1})=\Psi(x_{L+1})=\Psi(\left\langle{\tiny\textsc{COPY}}\right\rangle) and therefore gt,i=1g^{*}_{t,i}=1.

If iL+ti\neq L+t and i>t1i>t-1 then gt1,i=Ψ(xit+1)Ψ(\textscCOPY)g_{t-1,i}=\Psi(x_{i-t+1})\perp\Psi(\left\langle{\tiny\textsc{COPY}}\right\rangle) and therefore gt,i=0g_{t,i}^{*}=0.

If iti\leq t then since x1,,xi\textscCOPYx_{1},\dots,x_{i}\neq\left\langle{\tiny\textsc{COPY}}\right\rangle we get Ψ(\textscCOPY)gt1,i=0\Psi(\left\langle{\tiny\textsc{COPY}}\right\rangle)\cdot g_{t-1,i}=0 and therefore gt,i=0g_{t,i}^{*}=0.

Therefore, we can take Tquery=[g^0,,g^q1,ng1,,ngq]T^{\mathrm{query}}=[\hat{g}_{0},\dots,\hat{g}_{q-1},n\cdot g^{*}_{1},\dots,n\cdot g^{*}_{q}].

Now, we prove Theorem 2.3 by showing that using a single attention head with no positional embedding on top of the construction in Lemma C.4 realizes the copy algorithm. Since the first block computes the correct choice of keyi,queryi,valuei\mathrm{key}_{i},\mathrm{query}_{i},\mathrm{value}_{i}, by correctly scaling of the attention matrix we verify that the output of the second layer at position ii corresponds to valuej\approx\mathrm{value}_{j} for jj s.t. keyj=queryi\mathrm{key}_{j}=\mathrm{query}_{i}.

Let Tvalue,Tkey,TqueryT^{\mathrm{value}},T^{\mathrm{key}},T^{\mathrm{query}} be the outputs of the Transformer block guaranteed by Lemma C.4. Observe that, for some temprature τR\tau\in{\mathbb{R}}, the following function can be computed by a softmax-attention layer on-top of this block:

where e.g. TvalueT^{\mathrm{value}} denotes Tvalue(Ψ(x1),,Ψ(xi))T^{\mathrm{value}}(\Psi(x_{1}),\dots,\Psi(x_{i})).

For now, assume that all the nn-grams in x{\bm{x}} are unique, and that the length of the input satisfies 2L+2K2L+2\leq K for K=DnK=D^{n}.

Claim: Fix some i>Li>L, denote z=TkeyTiquery{\bm{z}}=T^{\mathrm{key}}\cdot T^{\mathrm{query}}_{i}. Then, ziL+1=nz_{i-L+1}=n and zj<n1d\left\lvert z_{j}\right\rvert<n-\frac{1}{d} for all jiL+1j\neq i-L+1.

Proof: We separate to the following cases:

𝑖𝑛1\displaystyle=\bm{1}\{j>n\}\cdot[\Psi(x_{j-1}),\dots,\Psi(x_{j-n})]^{\top}[\Psi(x_{i}),\dots,\Psi(x_{i-n+1})] =1{j>n}t=1nΨ(xjt)Ψ(xit+1)\displaystyle=\bm{1}\{j>n\}\cdot\sum_{t=1}^{n}\Psi(x_{j-t})\Psi(x_{i-t+1}) Now, if j=iL+1j=i-L+1 then xjt=xiL+1t=xit+1x_{j-t}=x_{i-L+1-t}=x_{i-t+1} and since j>nj>n we get

𝑖𝑡1𝑛T_{j}^{\mathrm{key}}\cdot T_{i}^{\mathrm{query}}=\sum_{t=1}^{n}\left\lVert\Psi(x_{i-t+1})\right\rVert=n If jiL+1j\neq i-L+1, since there are no repeated nn-grams, there is at least some t[n]t\in[n] s.t. Ψ(xjt)Ψ(xit+1)\Psi(x_{j-t})\neq\Psi(x_{i-t+1}) and by the choice of the embedding Ψ(xjt)Ψ(xit+1)11d\Psi(x_{j-t})\cdot\Psi(x_{i-t+1})\leq 1-\frac{1}{d}. In this case, we get TjkeyTiqueryn1d\left\lvert T_{j}^{\mathrm{key}}\cdot T_{i}^{\mathrm{query}}\right\rvert\leq n-\frac{1}{d}.

𝑖𝐿1T_{j}^{\mathrm{key}}\cdot T_{i}^{\mathrm{query}}=ne_{j-1}\cdot e_{i-L}=n\cdot\bm{1}\{j=i-L+1\} which satisfies the required.

𝑖𝑡1T_{j}^{\mathrm{key}}\cdot T_{i}^{\mathrm{query}}=\sum_{t=1}^{n}\Psi(x_{j-t})\Psi(x_{i-t+1}) and as before, since there are no repeated nn-grams, we get TjkeyTiqueryn1d\left\lvert T_{j}^{\mathrm{key}}\cdot T_{i}^{\mathrm{query}}\right\rvert\leq n-\frac{1}{d}

Claim: Fix some ϵ(0,1)\epsilon\in(0,1) and some i>Li>L, denote s=softmax(τTkeyTiquery)=softmax(τz){\bm{s}}=\mathrm{softmax}(\tau T^{\mathrm{key}}\cdot T^{\mathrm{query}}_{i})=\mathrm{softmax}(\tau\cdot{\bm{z}}). If τ=dln(2Kϵ)\tau=d\ln(\frac{2K}{\epsilon}), then siL+11ϵs_{i-L+1}\geq 1-\epsilon and sjϵ2Ks_{j}\leq\frac{\epsilon}{2K} for all jiL+1j\neq i-L+1.

Proof: Using the previous claim, togehter with Lemma C.3, we get that:

siL+111+iexp(τ/d)11+Kexp(τ/d)11+ϵ/2=1ϵ/21+ϵ/21ϵs_{i-L+1}\geq\frac{1}{1+i\exp(-\tau/d)}\geq\frac{1}{1+K\exp(-\tau/d)}\geq\frac{1}{1+\epsilon/2}=1-\frac{\epsilon/2}{1+\epsilon/2}\geq 1-\epsilon

Claim: Fix some ϵ(0,1)\epsilon\in(0,1) and some i>Li>L. Then, for τdln(2Kϵ)\tau\geq d\ln(\frac{2K}{\epsilon}), it holds that:

𝑖𝐿1italic-ϵ\left\lVert H(\Psi(x_{1}),\dots,\Psi(x_{i}))-\Psi(x_{i-L+1})\right\rVert\leq\epsilon Proof: Let s{\bm{s}} as defined in the previous claim. Then:

𝑖𝐿1\displaystyle\left\lVert H(\Psi(x_{1}),\dots,\Psi(x_{i}))-\Psi(x_{i-L+1})\right\rVert =j=1isjΨ(xj)Ψ(xiL+1)\displaystyle=\left\lVert\sum_{j=1}^{i}s_{j}\Psi(x_{j})-\Psi(x_{i-L+1})\right\rVert (1siL+1)Ψ(xiL+1)+jiL+1sjΨ(xj)\displaystyle\leq(1-s_{i-L+1})\left\lVert\Psi(x_{i-L+1})\right\rVert+\sum_{j\neq i-L+1}s_{j}\left\lVert\Psi(x_{j})\right\rVert =(1siL+1)+jiL+1sjϵ+(i1)ϵ2K2ϵ\displaystyle=(1-s_{i-L+1})+\sum_{j\neq i-L+1}s_{j}\leq\epsilon+(i-1)\frac{\epsilon}{2K}\leq 2\epsilon Now, denote by Φ:RdD\Phi:{\mathbb{R}}^{d}\to{\mathbb{D}} the output map given by Φ(z)=argmaxxDzΨ(x)\Phi({\bm{z}})=\arg\max_{x\in{\mathbb{D}}}{\bm{z}}\cdot\Psi(x) (which can be computed by an arg max\operatorname*{arg\,max} over a linear function).

Claim: If τdln(8Kd)\tau\geq d\ln(8Kd), then for all i>Li>L we have Φ(H(Ψ(x1),,Ψ(xi)))=xiL+1\Phi(H(\Psi(x_{1}),\dots,\Psi(x_{i})))=x_{i-L+1}.

Proof: Denote yi=H(Ψ(x1),,Ψ(xi)){\bm{y}}_{i}=H(\Psi(x_{1}),\dots,\Psi(x_{i})). First, using the previous claim, we observe that

𝑖𝐿1\displaystyle{\bm{y}}_{i}\cdot\Psi(x_{i-L+1}) =(yiΨ(xiL+1))Ψ(xiL+1)+Ψ(xiL+1)\displaystyle=({\bm{y}}_{i}-\Psi(x_{i-L+1}))\cdot\Psi(x_{i-L+1})+\left\lVert\Psi(x_{i-L+1})\right\rVert 1yiΨ(xiL+1)114d\displaystyle\geq 1-\left\lVert{\bm{y}}_{i}-\Psi(x_{i-L+1})\right\rVert\geq 1-\frac{1}{4d} Next, observe that for all jiL+1j\neq i-L+1 we have

⋅subscript𝒚𝑖Ψsubscript𝑥𝑖𝐿1Ψsubscript𝑥𝑗⋅Ψsubscript𝑥𝑗Ψsubscript𝑥𝑖𝐿1\displaystyle=({\bm{y}}_{i}-\Psi(x_{i-L+1}))\cdot\Psi(x_{j})+\Psi(x_{j})\cdot\Psi(x_{i-L+1}) yiΨ(xiL+1)+11d134d<yiΨ(xiL+1)\displaystyle\leq\left\lVert{\bm{y}}_{i}-\Psi(x_{i-L+1})\right\rVert+1-\frac{1}{d}\leq 1-\frac{3}{4d}<{\bm{y}}_{i}\cdot\Psi(x_{i-L+1}) From the above claim, the Transformer construction outputs the correct token at each step of the auto-regressive generation. ∎

C.3 Proof of Lemma 2.4

Fix some i<j[L]i<j\in[L]. Let I:={i,,i+n}I:=\{i,\dots,i+n\} and J:={j,,j+n}J:=\{j,\dots,j+n\}. We first bound the probability of drawing some x{\bm{x}} s.t. xI=xJ{\bm{x}}_{I}={\bm{x}}_{J}. Note that there are DIJD^{\left\lvert I\cup J\right\rvert} choices for xIJ{\bm{x}}_{I\cup J}. We count the number of choices for xIJ{\bm{x}}_{I\cup J} s.t. xI=xJ{\bm{x}}_{I}={\bm{x}}_{J}. Notice that in this case, xIJ{\bm{x}}_{I\cup J} is determined by xIJ{\bm{x}}_{I\setminus J}, therefore there are DIJD^{\left\lvert I\setminus J\right\rvert} possible choices. We conclude that

Appendix D Proofs - Lower Bound

In this section, we prove Theorem 2.7. We begin by showing that, for every input, the output of the model in each iteration is a deterministic function of the state of the model after observing the input:

Let Hu,r:DnDnH_{u,r}:{\mathbb{D}}^{n^{\prime}}\to{\mathbb{D}}^{n} be some fixed-state sequence-to-sequence model. Then, there exists map G:SDnG:{\mathcal{S}}\to{\mathbb{D}}^{n} s.t. for all xDn{\bm{x}}\in{\mathbb{D}}^{n^{\prime}}

Let xn+1,,xn+nx_{n^{\prime}+1},\dots,x_{n^{\prime}+n} be the outputs of Hu,rH_{u,r}. We need to show that there exist functions G1,,GnG_{1},\dots,G_{n} s.t. Hu,r(x1,,xn)=G(Sn(x1,,xn))H_{u,r}(x_{1},\dots,x_{n^{\prime}})=G(S_{n^{\prime}}(x_{1},\dots,x_{n})). We give the following recurive definition:

G1(s)=r(s)G_{1}(s)=r(s), G~1(s)=u(s,G1(s))\tilde{G}_{1}(s)=u(s,G_{1}(s)).

Gi(s)=r(G~i1(s))G_{i}(s)=r(\tilde{G}_{i-1}(s)), G~i(s)=u(G~i1(s),Gi(s))\tilde{G}_{i}(s)=u(\tilde{G}_{i-1}(s),G_{i}(s)).

Denote s=Sn(x1,,xn)s=S_{n^{\prime}}(x_{1},\dots,x_{n^{\prime}}) We prove by induction that Gi(s)=xn+iG_{i}(s)=x_{n^{\prime}+i} and also that G~i(s)=Sn+i(x1,,xn+i)\tilde{G}_{i}(s)=S_{n^{\prime}+i}(x_{1},\dots,x_{n^{\prime}+i}).

G1(s)=r(s)=Rn(x1,,xn)=xn+1G_{1}(s)=r(s)=R_{n^{\prime}}(x_{1},\dots,x_{n^{\prime}})=x_{n^{\prime}+1}.

G~1(s)=u(s,G1(s))=u(s,xn+1)=Sn+1(x1,,xn+1)\tilde{G}_{1}(s)=u(s,G_{1}(s))=u(s,x_{n^{\prime}+1})=S_{n^{\prime}+1}(x_{1},\dots,x_{n^{\prime}+1})

Gi(s)=r(G~i1(s))=r(Sn+i1(x1,,xn+i1))=Rn+i1(x1,,xn+i1)=xn+iG_{i}(s)=r(\tilde{G}_{i-1}(s))=r(S_{n^{\prime}+i-1}(x_{1},\dots,x_{n^{\prime}+i-1}))=R_{n^{\prime}+i-1}(x_{1},\dots,x_{n^{\prime}+i-1})=x_{n^{\prime}+i}

G~i(s)=u(G~i1(s,Gi(s)))=u(Sn+i1(x1,,xn+i1),xn+i)=Sn+i(x1,,xn+i)\tilde{G}_{i}(s)=u(\tilde{G}_{i-1}(s,G_{i}(s)))=u(S_{n^{\prime}+i-1}(x_{1},\dots,x_{n^{\prime}+i-1}),x_{n^{\prime}+i})=S_{n^{\prime}+i}(x_{1},\dots,x_{n^{\prime}+i})

Given the previous Lemma, we bound the error of the model by comparing the number of possible states to the number of possible inputs.

From Lemma D.1, there exists some function G:SDnG:{\mathcal{S}}\to{\mathbb{D}}^{n} s.t. Hu,r=GSnH_{u,r}=G\circ S_{n^{\prime}}. For each x{\bm{x}}, we denote by x~\tilde{{\bm{x}}} the sequence \textscBOS,x,\textscCOPY\left\langle{\tiny\textsc{BOS}}\right\rangle,{\bm{x}},\left\langle{\tiny\textsc{COPY}}\right\rangle. Now, observe the following:

𝑛21~𝒙1𝐺subscript𝑆superscript𝑛′2~𝒙𝒙\displaystyle=\frac{1}{D^{n}}\sum_{s\in{\mathcal{S}}}\sum_{{\bm{x}}\in S_{n+2}^{-1}(\tilde{{\bm{x}}})}\bm{1}\{G\circ S_{n^{\prime}+2}(\tilde{{\bm{x}}})={\bm{x}}\} =1DnsSxSn+21(x~)1{G(s)=x}SDn\displaystyle=\frac{1}{D^{n}}\sum_{s\in{\mathcal{S}}}\sum_{{\bm{x}}\in S_{n+2}^{-1}(\tilde{{\bm{x}}})}\bm{1}\{G(s)={\bm{x}}\}\leq\frac{\left\lvert{\mathcal{S}}\right\rvert}{D^{n}}