Are Transformers universal approximators of sequence-to-sequence functions?

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Introduction

Self-attention based Transformer networks (Vaswani et al., 2017) have been at the center of the recent progress on various natural language processing (NLP) tasks, including machine translation (Vaswani et al., 2017), language modeling (Radford et al., 2018; 2019), and question answering (Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019). All these tasks involve learning models that map an input sequence of tokens to an output sequence of tokens. Transformers make it feasible to train large models to approximate these sequence-to-sequence functions due to their ability to process the input tokens in a parallel way, as opposed to the sequential nature of RNNs and LSTMs.

A Transformer block consists of two kinds of layers: a self-attention layer and a token-wise feed-forward layer, with skip connections present in both layers. The self-attention layer transforms each input token embedding using a weighted combination of the embeddings of all tokens in the input sequence, where weights are generated by pairwise dot-products among the input token embeddings. The token-wise feed-forward layer then independently processes each of these modified input token embeddings without any interaction among them. Notably, Transformers employ parameter reuse across tokens, as both layers use the same parameters to process each token. Moreover, Transformers have to rely solely on the pairwise dot-products to capture interaction between the input tokens.

Given the parameter sharing and limited interactions between tokens, it is natural to wonder: what class of sequence-to-sequence functions can the Transformer networks represent? Also, what is the role of the two different kinds of layers? Are both layers needed to obtain the representation power of Transformers? In the existing literature, the advantage of Transformers has often been attributed to their capability of computing contextual embeddings/mappings of the input, as opposed to fixed word embeddings as in word2vec (Mikolov et al., 2013). Is it possible to formalize the notion of contextual mappings? If yes, can Transformers actually compute such mappings? Such questions still remain elusive.

In this paper, we provide a mathematical definition of contextual mappings and show that multi-head self-attention layers can indeed compute contextual mappings of the input sequences. We further show that this ability to compute contextual mappings coupled with the value mapping ability of the feed-forward layers makes Transformers universal approximators of any permutation equivariant sequence-to-sequence function. We also improve this result using positional encodings, and show that Transformers can represent any sequence-to-sequence function; i.e., the restriction of permutation equivariance can be removed by positional encodings.

These results on universal approximation of sequence-to-sequence functions raise a natural question: is it possible to have a more efficient architecture to compute contextual mappings, consequently, preserving the ability to universally approximate sequence-to-sequence functions? Towards this, we explore other architectures that can implement contextual mappings (to some extent), and experimentally evaluate their performance. In our experiments, we notice that the models that combine these simpler architectures with Transformers have better performance, compared to the standalone Transformers. We conclude the paper by presenting more discussion and interesting future research directions along these lines.

We prove that Transformers are universal approximators of continuous and permutation equivariant sequence-to-sequence functions with compact support (Theorem 2). We also show that, if Transformers have trainable positional encodings added to the input, then they are universal approximators of continuous sequence-to-sequence functions on a compact domain (Theorem 3).

We formalize the notion of contextual mappings and show that the attention layers can compute contextual mappings, where each unique context is mapped to a unique vector (Lemma 6).

We experimentally evaluate other simpler layers that can compute contextual mappings to some extent, such as bi-linear projections and separable convolutions, and show that substituting some of the self-attention layers with these layers can result in better performance (Section 5).

2 Related works & notation

Analysis of attention-based models. Given the popularity of Transformers, there have been numerous works trying to understand the role of attention layers in natural language processing models. One such line of work focuses on probing the output of attention layers to understand the attention mechanism and internal language representation (Hewitt & Manning, 2019; Clark et al., 2019; Coenen et al., 2019; Vig & Belinkov, 2019). Although these results give valuable insights, a consistent theoretical analysis corroborating these findings is missing.

Universal approximation theorems. Universal approximation theorems are classical results in neural network theory, dating back many decades (Cybenko, 1989; Hornik, 1991). These results show that given unbounded width, a one-hidden-layer neural network can approximate arbitrary continuous function with compact support, up to any accuracy. Other results focusing on depth appeared more recently (Lu et al., 2017; Hanin & Sellke, 2017; Lin & Jegelka, 2018). In particular, Lu et al. (2017); Hanin & Sellke (2017) consider fully-connected ReLU networks whose input dimension is dd, and show that networks with width d+1d+1 and unbounded depth are universal approximators of scalar-valued continuous functions. Lin & Jegelka (2018) show that a residual network with one hidden neuron per residual block is a universal approximator of scalar-valued functions, given unbounded depth. Although Transformer networks do have residual connections, due to their heavy parameter sharing, the existing analyses for residual networks do not extend to Transformers. Sannai et al. (2019) consider universally approximating permutation invariant/equivariant functions using fully-connected ReLU networks.

Turing completeness results on Transformers. Recently, Pérez et al. (2019) have shown that Transformers with infinite precision are Turing complete, which is not the case in finite precision setting (Dehghani et al., 2018). We note that Turing completeness deals with computation on formal languages (thus discrete objects), while universal approximation focuses on functions on a continuum. In other words, these are two different concepts; and one does not imply another.

Transformer networks

We define the Transformer networks as the composition of Transformer blocks. The family of the sequence-to-sequence functions corresponding to the Transformers can be defined as:

As seen in above, both layers (cf. (1) and (2)) of a Transformer block employ parameter reuse/sharing, because each token/column undergoes the same transformations (e.g., WQi{\bm{W}}_{Q}^{i}, WKi{\bm{W}}_{K}^{i}, or W1{\bm{W}}_{1}) regardless of its position. Moreover, interactions between tokens can only be captured through pairwise dot-products in the softmax operator σ[]\sigma[\cdot] (cf. (1)). Given such limitations in a single Transformer block’s representation power, it is not obvious what kinds of sequence-to-sequence functions Th,m,r\mathcal{T}^{h,m,r} can approximate; we provide the answer to this question in the next section.

Transformers are universal approximators of sequence-to-sequence functions

The following result shows that a Transformer network with a constant number of heads hh, head size mm, and hidden layer of size rr can approximate any function in FPE\mathcal{F}_{\rm PE}.

Let 1p<1\leq p<\infty and ϵ>0\epsilon>0, then for any given fFPEf\in\mathcal{F}_{\rm PE}, there exists a Transformer network gT2,1,4g\in\mathcal{T}^{2,1,4}, such that dp(f,g)ϵ\mathsf{d}_{p}(f,g)\leq\epsilon.

Let 1p<1\leq p<\infty and ϵ>0\epsilon>0, then for any given fFCDf\in\mathcal{F}_{\rm CD}, there exists a Transformer network gTP2,1,4g\in\mathcal{T}^{2,1,4}_{\rm P} such that we have dp(f,g)ϵ\mathsf{d}_{p}(f,g)\leq\epsilon.

Theorems 2 and 3 provide an interesting characterization of the representation power of fixed-width Transformer networks. Since the function classes Th,m,r\mathcal{T}^{h,m,r} and TPh,m,r\mathcal{T}^{h,m,r}_{\rm P} become richer as we increase the values of (h,m,r)(h,m,r), our results establish that general Transformer networks are also universal approximators of sequence-to-sequence functions. Remarkably, none of the parameters (h,m,r)(h,m,r) depend on the input sequence length nn or embedding dimension dd.

Here, we would like to again point out that Theorems 2 and 3 appear quite surprising at a first glance, given the parameter sharing across all the tokens in a sequence, e.g., feed-forward layers are applied token-wise and the projection matrices in the self-attention layers are the same across different tokens. Furthermore, attention layers can only capture pairwise interaction between different tokens in the sequence. In the next subsection, we briefly describe one of our key steps in overcoming the aforementioned restrictions and proving universal approximation power of Transformers.

Let us consider a setting where we are interested in embedding two sentences: 1) I am happy; and 2) I am Bob. These sentences are fed to a sequence-to-sequence model as

where vI,vam,vhappy,{\bm{v}}_{\rm I},{\bm{v}}_{\rm am},{\bm{v}}_{\rm happy}, and vBob{\bm{v}}_{\rm Bob} denote dd-dimensional embedding for the tokens ‘I’, ‘am’, ‘happy’, and ‘Bob’, respectively. Since the word ‘I’ occurs in different contexts in these sentences, in order to implement arbitrary sequence-to-sequence functions, the sequence-to-sequence model should map the two occurrences of ‘I’ to different values. We formally define this requirement below.

At the first thought, we can consider getting a contextual mapping by simply averaging all the tokens, because this can capture the one-word difference (e.g., “happy” vs. “Bob”) in two different contexts. However, if there are multiple words that are different, it is not guaranteed that the average will be different. Indeed, requiring unique mappings for all the tokens for any change in any number of tokens, is a steep requirement.

While the self-attention layer does consider pair-wise interactions among different input tokens, it is not clear if this weak form of pair-wise interaction with shared projection weights is sufficient to extract the underlying context. The following result, which we sketch here, shows that self-attention layers can implement a permutation equivariant contextual mapping over almost all elements of a grid in d×n^{d\times n}. We defer the full statement to Section 4.2.

Lemma 6 shows that a series of self-attention layers can implement contextual mappings, despite the apparent restriction that each of them can only capture pair-wise interaction. However, the restriction of permutation equivarance still exists because attention layers are inherently permutation equivariant. Coupled with the ability of token-wise feed-forward layers to map different values in q(L)q({\bm{L}}) to arbitrary output values, we can prove universal approximation capability of Transformers.

2 Proof of the universal approximation theorem (Theorem 2)

Next, we outline the proof of Theorem 2 in greater detail. We refer the reader to Section C for the proof of Theorem 3, since it is a modification of Theorem 2. Even though Theorems 2 and 3 do not specifically mention the required depth for approximation, our proof techniques do characterize it, and we show that our construction is tight in the number of parameters. We defer the discussion of depth to Section 4.4.

Recall that we want to show that given a function fFPEf\in\mathcal{F}_{\rm PE}, we can find a Transformer network gT2,1,4g\in\mathcal{T}^{2,1,4} such that dp(f,g)ϵ\mathsf{d}_{p}(f,g)\leq\epsilon. Without loss of generality, we can assume that the compact support of ff is contained in d×n^{d\times n}. We achieve our desired objective in three key steps:

Step 1. Approximate FPE\mathcal{F}_{\rm PE} with piece-wise constant functions. We first use (a variant of) the classical result that any continuous function can be approximated up to arbitrary accuracy by piece-wise constant functions. For δ>0\delta>0, we define the following class of piece-wise constant functions.

Step 2. Approximate FPE(δ)\overline{\mathcal{F}}_{\rm PE}(\delta) with modified Transformers. We then consider a slightly modified architecture for Transformer networks, where the softmax operator σ[]\sigma[\cdot] and ReLU(){\rm ReLU}(\cdot) are replaced by the hardmax operator σH[]\sigma_{\rm H}[\cdot] and an activation function ϕΦ\phi\in\Phi, respectively. Here, the set of allowed activations Φ\Phi consists of all piece-wise linear functions with at most three pieces, where at least one piece is constant. Let Th,m,r\overline{\mathcal{T}}^{h,m,r} denote the function class corresponding to the sequence-to-sequence functions defined by the modified Transformer networks. The following result establishes that the modified Transformer networks in T2,1,1\overline{\mathcal{T}}^{2,1,1} can closely approximate functions in FPE(δ)\overline{\mathcal{F}}_{\rm PE}(\delta).

For each fFPE(δ)\overline{f}\in\overline{\mathcal{F}}_{\rm PE}(\delta) and 1p<1\leq p<\infty, \exists gT2,1,1\overline{g}\in\overline{\mathcal{T}}^{2,1,1} such that dp(f,g)=O(δd/p)\mathsf{d}_{p}(\overline{f},\overline{g})=O(\delta^{d/p}).

Step 3. Approximate modified Transformers with (original) Transformers. Finally, we show that gT2,1,1\overline{g}\in\overline{\mathcal{T}}^{2,1,1} can be approximated by T2,1,4\mathcal{T}^{2,1,4}. Let gT2,1,4g\in{\mathcal{T}}^{2,1,4} be such that dp(g,g)ϵ/3\mathsf{d}_{p}(\overline{g},g)\leq\epsilon/3.

Theorem 2 now follows from these three steps, because we have

Choosing δ\delta small enough ensures that dp(f,g)ϵ\mathsf{d}_{p}(f,g)\leq\epsilon. ∎

We refer the reader to Sections B.1 and B.2 in the supplementary material for the formal statements and proofs of Steps 11 and 33, respectively. As for Step 22, which is the most critical step in establishing the universal approximation property of Transformers, we provide a sketch of the proof of Proposition 4 in the next section, and refer the reader to Section B.3 for the complete proof.

Proof sketch of Proposition 4: different roles of two layers

As mentioned earlier, the heavy parameter sharing in Transformers makes the goal of universally approximating sequence-to-sequence functions seemingly difficult. Both the self-attention and the feed-forward layer weights inside a Transformer block are fixed across nn tokens. In this section, we show that Transformers are able to overcome this architectural constraint, and compute contextual mappings of the entire input sequence just based on the pair-wise interactions. The token-wise feedforward layers then transform these contextual mappings to the desired output sequence.

We highlight these inner workings of Transformers en route to proving Proposition 4. We want to show that given a piece-wise constant function fFPE(δ)\overline{f}\in\overline{\mathcal{F}}_{PE}(\delta), there exists a modified Transformer network gT2,1,1\overline{g}\in\overline{\mathcal{T}}^{2,1,1} that closely approximates f\overline{f}. We achieve this goal by establishing the following three claims, which correspond to Lemmas 5, 6, and 7.

Next, a series of self-attention layers in the modified Transformer network can take the input L{\bm{L}} and implement a contextual mapping qq such that, for L{\bm{L}} and L{\bm{L}}^{\prime} that are not permutation of each other, all the elements in q(L)q({\bm{L}}) and q(L)q({\bm{L}}^{\prime}) are distinct.

Finally, a series of feed-forward layers in the modified Transformer network can map elements of the contextual embedding q(L)q({\bm{L}}) to the desired output value of fFPE\overline{f}\in\overline{\mathcal{F}}_{\rm PE} at the input X{\bm{X}}.

Before discussing these three claims in detail, we note that even though a Transformer network stacks self-attention and feed-forward layers in an alternate manner, the skip connections enable these networks to employ a composition of multiple self-attention or feed-forward layers. Furthermore, as alluded earlier, these three steps clearly highlight the different roles that self-attention and feed-forward layers play in realizing the ability to universally approximate sequence-to-sequence functions: 1) self-attention layers compute precise contextual maps; and 2) feed-forward layers then assign the results of these contextual maps to the desired output values.

2 Contextual mapping by self-attention layers

If we define an attention layer of the form ZZ+Ψ(Z;b,b){\bm{Z}}\mapsto{\bm{Z}}+\Psi({\bm{Z}};b,b^{\prime}), then any entry Z1,j{Z}_{1,j} in (b,b)(b,b^{\prime}) is shifted up by maxkZ1,kminkZ1,k\max_{k}{Z}_{1,k}-\min_{k}{Z}_{1,k}, while all the other entries stay untouched. We can choose bb and bb^{\prime} to selectively shift certain entries, hence the name selective shift operation.

3 Function value mapping by feed-forward layers

This brings us to the final step, which demonstrates the key utility of the feed-forward layers. After the contextual mapping by self-attention layers, each token captures the entire context available in the input sequence. The following result shows that token-wise application of a composition of feed-forward layers can map these tokens to the desired output values required by the function f\overline{f}.

4 Tightness of constructions

We showed in this section that Theorem 2 requires O(n(1/δ)dn/n!)O(n(1/\delta)^{dn}/n!) Transformer blocks for approximation, where δ\delta is the width of the cubes. Each transformer block is of constant width, so it has O(d)O(d) parameters; this means that the total number of parameters is O(dn(1/δ)dn/n!)O(dn(1/\delta)^{dn}/n!). We note that this exponential dependence cannot be avoided in the worse case. If we assume continuity without any additional smoothness, quantizing the domain to cubes and approximating the function with constants require memorizing (output dim)×(num cubes)/n!(\text{output dim})\times(\text{num cubes})/n! real numbers, where the factor of 1/n!1/n! is due to permutation equivariance. Thus, Theorem 2 is optimal in the order of parameters.

If we compare with the residual network result (Lin & Jegelka, 2018), we can consider “flattening” X{\bm{X}} into a dndn-dimensional vector and fitting the function. The proof technique in (Lin & Jegelka, 2018) requires O((1/δ)dn)O((1/\delta)^{dn}) layers, where each layer has O(dn)O(dn) parameters: the total parameter requirement is O(dn(1/δ)dn)O(dn(1/\delta)^{dn}). This shows that Transformers can approximate permutation equivariant functions in a more efficient way than residual networks.

In Section C, our proof of Theorem 3 shows that we require O(n(1/δ)dn)O(n(1/\delta)^{dn}) layers to approximate continuous (not permutation equivariant) sequence-to-sequence functions. As seen from the argument above, this construction is also optimal in the order of parameters.

Discussion and Experiments

As detailed in Section 4, the ability of the self-attention layers to compute contextual mappings plays a crucial role in the universal approximation property. Interestingly, our analysis shows that replacing the dot-product attention in Transformers with any other component capable of computing contextual mappings should preserve this universal approximation property. This leads naturally to questions about the alternative architectures that realize certain kinds of contextual mappings at different computational and memory costs. We explore and discuss some examples of such alternatives in this section. Our preliminary empirical study demonstrates their practical utility.

Given token embeddings X{\bm{X}} as input, the bi-linear projection layer computes the following update.

This layer advantageously incurs smaller number of matrix multiplications as compared to the dot-product attention. That said, the number of parameters in this layer depend on the sequence length, making it harder to reuse the model across tasks with different input sequence lengths. Moreover, the weights used to compute the contextual embeddings (WP{\bm{W}}_{P}) are independent of the inputs (X{\bm{X}}), whereas in self-attention the weights (σ[(WKiX)TWQiX])(\sigma[({\bm{W}}_{K}^{i}{\bm{X}})^{T}{\bm{W}}_{Q}^{i}{\bm{X}}]) depend on X{\bm{X}}. The first drawback can be addressed by replacing the linear projection with a depth-wise separable convolution layer, which is discussed in the next subsection.

2 Depth-wise separable convolutions

A depth-wise convolution layer (Sifre & Mallat, 2014; Chollet, 2017; Kaiser et al., 2017) involves convolving each dimension of X{\bm{X}} with a corresponding convolution filter of size kk:

3 Experiments

We now present our experiments with these other architectures, with the goal of understanding the extent to which computing contextual mappings can capture the performance of Transformers. As discussed earlier, BProj{\rm BProj} and SepConv{\rm SepConv} do not implement contextual mappings (cf. Definition 3.1), so we do not expect that either BProj{\rm BProj} or SepConv{\rm SepConv} based models to have the same performance as the expensive Transformers. These models do not use input dependent weights to compute attention, and hence have weaker representation power. Instead, our goal is to see if we can use these cheaper layers to replace (some of) the expensive self-attention layers.

We follow the experimental setting from Devlin et al. (2018) to train the Transformers, with the masked language model pre-training followed by a task specific fine-tuning, and work with a 1212 layer architecture based on BERTBASE\text{BERT}_{\text{BASE}}. We present our results on a question answering task (SQuAD) (Rajpurkar et al., 2016) and a sentence entailment task (MNLI) (Williams et al., 2018). In our first set of experiments we train models that employ BProj{\rm BProj} and SepConv{\rm SepConv} layers, instead of the self-attention layer in eq.(1). We notice that, as expected, these simpler models have weaker performance than the self-attention layer. See Table 1 in Section D for a comparison of these models on MNLI.

Next, we swap a varying number of the first few self-attention layers in BERTBASE\text{BERT}_{\text{BASE}} with SepConv{\rm SepConv}, implemented with filter reuse across dimensions (Wu et al., 2019)We refer to Section D for a complete description of the setup.. Fig. 1 illustrates the performance of these hybrid models. Interestingly, models with 11 or 22 convolution layers and rest the self-attention layers, perform better than models with only the self-attention layers. Note that, replacing self-attention layer with SepConv{\rm SepConv} also reduces the computational cost and the number of parameters. One explanation we have is that the first few attention layers tend to attend broadly to the whole sequence (as empirically observed in (Clark et al., 2019)), and the cheaper convolution layers can perform this job more efficiently. A detailed evaluation of such hybrid architectures will be interesting future research.

Our experiments also call for a deeper understanding of the exact nature of the embeddings computed by practical attention models. Since Transformers in practice have fixed depth, we believe that they might not be able to exactly implement contextual mappings as we defined in Definition 3.1. However, there is some preliminary empirical evidence that Transformers do implement some sort of “contextual mappings.” For example, Fig. 4 of Coenen et al. (2019) presents visualizations of embeddings of a single word in different contexts (sentences). They experimentally notice that Transformers, in addition to computing contextual mappings, also map a word into semantic clusters. Formalizing and evaluating this property of Transformers is an interesting direction for future work. We again note that Wu et al. (2019) have proposed an alternative way to compute such embeddings based on dynamic convolution layers. Evaluating the mappings computed by these models should shed more light on the workings of attention models and inspire efficient and better performing architectures.

References

Appendix A Proof of Claim 1

Suppose XP{\bm{X}}{\bm{P}} was given as input, where P{\bm{P}} is a permutation matrix. First note that

where we used PPT=I{\bm{P}}{\bm{P}}^{T}={\bm{I}}. Permutation equivariance of the token-wise feed-forward layer can be shown similarly:

where ReLU(XP)=ReLU(X)P{\rm ReLU}({\bm{X}}{\bm{P}})={\rm ReLU}({\bm{X}}){\bm{P}} was used. This analysis shows that the function class Th,m,r()\mathcal{T}^{h,m,r}(\cdot) is restricted to permutation equivariant functions.

Appendix B Proof details of Theorem 2

For any given fFPEf\in\mathcal{F}_{\rm PE} and 1p<1\leq p<\infty, one can find a δ>0\delta^{*}>0 such that \exists fFPE(δ)\overline{f}\in\overline{\mathcal{F}}_{\rm PE}(\delta^{*}) which satisfies dp(f,f)ϵ/3\mathsf{d}_{p}(f,\overline{f})\leq\epsilon/3.

Thus, the approximation f\overline{f} is also permutation equivariant. This proves the lemma. ∎

For each gT2,1,1\overline{g}\in\overline{\mathcal{T}}^{2,1,1} and 1p<1\leq p<\infty, \exists gT2,1,4g\in\mathcal{T}^{2,1,4} such that dp(g,g)ϵ/3\mathsf{d}_{p}(\overline{g},g)\leq\epsilon/3.

Proof Recall that Th,m,rT^{h,m,r} refers to the class of functions representable with composition of Transformer blocks with hh heads of size mm in self-attention layers and rr hidden nodes in feed-forward layers. The same notation holds for the modified Transformers Th,m,r\overline{\mathcal{T}}^{h,m,r}.

Note that the softmax operator on a matrix A{\bm{A}} can be made arbitrarily close to hardmax by scaling up A{\bm{A}}. That is,

This means that by scaling up parameters inside σ\sigma, we can approximate σH\sigma_{\rm H} arbitrarily closely. Thus, the modified self-attention layers can be approximated with the original self-attention layers of the same number of heads hh and head size mm.

Also, any arbitrary (possibly discontinuous) piecewise linear function ϕΦ\phi\in\Phi can be approximated arbitrarily closely by four ReLU{\rm ReLU}’s. Note that ϕΦ\phi\in\Phi as at most three pieces, and at least one of the pieces is constant. For example, consider the following function ϕΦ\phi\in\Phi:

This function can be approximated by four ReLU{\rm ReLU}’s, as claimed by the lemma:

Also, as we make ϵ0\epsilon\rightarrow 0, we can approximate ϕ\phi as closely as possible using ϕ~\widetilde{\phi}. The cases where the second or third piece is constant can be shown similarly. This means that the modified feed-forward layers (whose activation is ϕΦ\phi\in\Phi) with single hidden node can be approximated with the original feed-forward layers (ReLU{\rm ReLU}) with four hidden nodes.

Thus, given any gT2,1,1\overline{g}\in\overline{\mathcal{T}}^{2,1,1}, there exists a function gT2,1,4g\in{\mathcal{T}}^{2,1,4} arbitrarily close to g\overline{g}, by appropriately choosing the parameters to be large enough. This finishes the proof. ∎

B.3 Finishing proof of Proposition 4

As we have already discussed in Section 4, we establish Proposition 4 in three steps:

Finally, a group of feed-forward layers in the modified Transformer network can map elements of the contextual embedding q(L)q({\bm{L}}) to the desirable values, i.e., the output of fFPE\overline{f}\in\overline{\mathcal{F}}_{\rm PE} on the input X{\bm{X}}.

These steps are formally stated in Lemmas 5, 6, and 7 in the main text. We present the proofs of these lemmas in the subsequent sections.

With the results established in these lemmas, we are now equipped with all the tools necessary to complete the proof of Proposition 4. Let us recall the functions gq,gcg_{\rm q},g_{\rm c}, and gvg_{\rm v} from Lemma 5, 6, and 7, respectively. We now show that the (modified) Transformer network g=gvgcgq\overline{g}=g_{\rm v}\circ g_{\rm c}\circ g_{\rm q} approximates the underlying peicewise constant function fFPE\overline{f}\in\overline{\mathcal{F}}_{\rm PE} over all points in its support except for a set of of measure O(δd)O(\delta^{d}).

B.4 Proof of Lemma 5

The proof strategy is simple; using 1δ+1\frac{1}{\delta}+1 token-wise feed-forward layers, we implement the quantization function gqentg_{\rm q}^{\rm ent} that works on the first row of the input. Then stack another 1δ+1\frac{1}{\delta}+1 layers that quantizes the second row, and so on.

Given input X{\bm{X}}, we first start by clipping X1,:{\bm{X}}_{1,:} in the set (,0)[1,+)(-\infty,0)\cup[1,+\infty) and mapping the intervals to δnd-\delta^{-nd}. This can be done by the following layer:

Next, add 1/δ1/\delta layers of the following form, for k=0,δ,,1δk=0,\delta,\dots,1-\delta.

Each layer quantizes X1,:{\bm{X}}_{1,:} in [kδ,kδ+δ)[k\delta,k\delta+\delta) to kδk\delta, without modifying other intervals.

B.5 Proof of Lemma 6

Before starting the proof, we first describe the key component of our proof, which we refer to the selective shift operation. Consider the following function, which can be expressed with a multiplicative attention head, with head size m=1m=1 and hardmax σH\sigma_{\rm H}:

for j[n]j\in[n]. Note that due to e(1){\bm{e}}^{(1)}, all rows of ψ(Z;bQ)\psi({\bm{Z}};b_{Q}) except the first row are zero. From this observation, one can define a function parametrized by bQb_{Q} and bQb^{\prime}_{Q}, where bQ<bQb_{Q}<b^{\prime}_{Q}, which consists of two attention heads:

What this means is that, if we define an attention layer of the form ZZ+Ψ(Z;bQ,bQ){\bm{Z}}\mapsto{\bm{Z}}+\Psi({\bm{Z}};b_{Q},b^{\prime}_{Q}), then any column Z:,j{\bm{Z}}_{:,j} satisfying uTZ:,j(bQ,bQ){\bm{u}}^{T}{\bm{Z}}_{:,j}\in(b_{Q},b^{\prime}_{Q}) is shifted up in its first coordinate Z1,j{\bm{Z}}_{1,j} by maxkuTZ:,kminkuTZ:,k\max_{k}{\bm{u}}^{T}{\bm{Z}}_{:,k}-\min_{k}{\bm{u}}^{T}{\bm{Z}}_{:,k}, while all the other coordinates stay untouched. We call this the selective shift operation, because we can choose bQb_{Q} and bQb^{\prime}_{Q} to selectively shift certain entries of the input.

For any j[n]j\in[n], it is easy to check two following facts:

If Li,jδnd{\bm{L}}_{i,j}\neq-\delta^{-nd} for all i[d]i\in[d], i.e., L:,j{0,δ,,1δ}d{\bm{L}}_{:,j}\in\{0,\delta,\dots,1-\delta\}^{d}, then uTL:,j[0:δ:δd+1δ]{\bm{u}}^{T}{\bm{L}}_{:,j}\in[0:\delta:\delta^{-d+1}-\delta], and the map L:,juTL:,j{\bm{L}}_{:,j}\mapsto{\bm{u}}^{T}{\bm{L}}_{:,j} from {0,δ,,1δ}d\{0,\delta,\dots,1-\delta\}^{d} to [0:δ:δd+1δ][0:\delta:\delta^{-d+1}-\delta] is a bijection.

If there exists i[d]i\in[d] such that Li,j=δnd{\bm{L}}_{i,j}=-\delta^{-nd}, then uTL:,jδnd+δd+11<0{\bm{u}}^{T}{\bm{L}}_{:,j}\leq-\delta^{-nd}+\delta^{-d+1}-1<0.

Therefore, one can say that uTL:,j{\bm{u}}^{T}{\bm{L}}_{:,j} gives the “column id” for each possible value of L:,j{0,δ,,1δ}d{\bm{L}}_{:,j}\in\{0,\delta,\dots,1-\delta\}^{d}.

The rough idea of the construction is to apply the selective shift operation to each column id, by setting u{\bm{u}} in the definition of Ψ()\Psi(\cdot) to be (1,δ1,δ2,,δd+1)(1,\delta^{-1},\delta^{-2},\dots,\delta^{-d+1}) and choosing bQ=lδ/2b_{Q}=l-\delta/2 and bQ=l+δ/2b^{\prime}_{Q}=l+\delta/2 for each l[0:δ:δd+1δ]l\in[0:\delta:\delta^{-d+1}-\delta]. More concretely, we stack (1/δ)d(1/\delta)^{d} attention layers, with attention parts δdΨ(;lδ/2,l+δ/2)\delta^{-d}\Psi(\cdot;l-\delta/2,l+\delta/2) for each l[0:δ:δd+1δ]l\in[0:\delta:\delta^{-d+1}-\delta], in increasing order of ll. After that, we add an extra single-head attention layer with attention part δ(n+1)dψ(;0)\delta^{-(n+1)d}\psi(\cdot;0).

B.5.1 Category 1

In the first selective shift operation, the (1,1)(1,1)-th entry of L{\bm{L}} (L1,1{L}_{1,1}) is shifted by the operation, while the other entries are left untouched. The updated value L~1,1\widetilde{{L}}_{1,1} is

Therefore, after the operation, the output of the layer is [L~:,1L:,2L:,n]\begin{bmatrix}\widetilde{{\bm{L}}}_{:,1}&{\bm{L}}_{:,2}&\dots&{\bm{L}}_{:,n}\end{bmatrix}, and the new value of the first column L~:,1\widetilde{{\bm{L}}}_{:,1} results in

Let us denote the updated “column id” uTL~:,1{\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,1} as l~1\widetilde{l}_{1}. We can show that ln<l~1l_{n}<\widetilde{l}_{1}, because

The second selective shift operation is applied to l2l_{2}, by which only one entry L1,2{L}_{1,2} will be shifted. The updated value L~1,2\widetilde{{L}}_{1,2} is

After updating, the new inner product of u{\bm{u}} and L~:,2\widetilde{{\bm{L}}}_{:,2} results in

We can show that l~1<l~2\widetilde{l}_{1}<\widetilde{l}_{2}, because

and the last inequality is true because δd>1\delta^{-d}>1 and ln>l2l_{n}>l_{2}. Since we have l~1<l~2\widetilde{l}_{1}<\widetilde{l}_{2}, and the new maximum in uT[L~:,1L~:,2L:,3L:,n]{\bm{u}}^{T}\begin{bmatrix}\widetilde{{\bm{L}}}_{:,1}&\widetilde{{\bm{L}}}_{:,2}&{\bm{L}}_{:,3}&\dots&{\bm{L}}_{:,n}\end{bmatrix} is now l~2\widetilde{l}_{2}, and the new minimum is l3l_{3}.

More generally, we can repeat this process, and show that the jj-th shift operation shifts L1,j{L}_{1,j} by δd(l~j1lj)\delta^{-d}(\widetilde{l}_{j-1}-l_{j}), and results in the new column id

In the general case, l~j1<l~j\widetilde{l}_{j-1}<\widetilde{l}_{j} holds j=[2:n]j=[2:n], because

Therefore, after the jj-th selective shift operation, l~j\widetilde{l}_{j} is the new maximum among {l~1,,l~j,lj+1,,ln}\{\widetilde{l}_{1},\dots,\widetilde{l}_{j},l_{j+1},\dots,l_{n}\} and lj+1l_{j+1} is the new minimum, which makes us possible to continue the process until the nn-th operation.

As a result, after the whole sweep from to δd+1δ\delta^{-d+1}-\delta by the first (1/δ)d(1/\delta)^{d} layers, a total of nn shift operations are applied, and the input L{\bm{L}} is mapped to a new point L~\widetilde{{\bm{L}}}, where uTL~=[l~1l~2l~n]{\bm{u}}^{T}\widetilde{{\bm{L}}}=\begin{bmatrix}\widetilde{l}_{1}&\widetilde{l}_{2}&\dots&\widetilde{l}_{n}\end{bmatrix} and l~1<l~2<<l~n\widetilde{l}_{1}<\widetilde{l}_{2}<\dots<\widetilde{l}_{n}.

We can now prove the following technical lemma, whose proof is deferred to Appendix B.5.4:

After nn shift operations, l~n=uTL~:,n\widetilde{l}_{n}={\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,n} satisfies the following bounds:

Also, the map from [l1l2ln][0:δ:δd+1δ]\begin{bmatrix}l_{1}&l_{2}&\cdots&l_{n}\end{bmatrix}\in[0:\delta:\delta^{-d+1}-\delta] (where l1<l2<<lnl_{1}<l_{2}<\dots<l_{n}) to l~n\widetilde{l}_{n} is one-to-one.

As mentioned earlier, after this sweep, there is another attention layer with attention part δ(n+1)dψ(;0)\delta^{-(n+1)d}\psi(\cdot;0). Since 0<l~1<<l~n0<\widetilde{l}_{1}<\cdots<\widetilde{l}_{n}, what it does to L~\widetilde{{\bm{L}}} is that it adds δ(n+1)dmaxkuTL~:,k=δ(n+1)dl~n\delta^{-(n+1)d}\max_{k}{\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,k}=\delta^{-(n+1)d}\widetilde{l}_{n} to each entry in the first row of L~\widetilde{{\bm{L}}}. The output of this layer is defined to be the function gc(L)g_{\rm c}({\bm{L}}).

Given this result so far, it is now left to check if the constructed network is really a permutation equivariant contextual mapping, i.e., if it satisfies Properties 6.1 and 6.2 in Lemma 6.

B.5.2 Category 2

In the extreme case of n=1n^{\prime}=1, we have maxjlj=minjlj\max_{j}l_{j}=\min_{j}l_{j}, so the selective shift operation applied at ljl_{j} does not shift the entry at all; therefore, at the end of the first (1/δ)d(1/\delta)^{d} attention layers, L~=L\widetilde{{\bm{L}}}={\bm{L}}.

When 1<nn11<n^{\prime}\leq n-1, let the nn^{\prime} distinct values of ljl_{j}’s be l1,,lnl^{\prime}_{1},\dots,l^{\prime}_{n^{\prime}}. The shift operation is applied nn^{\prime} times, to l1,,lnl^{\prime}_{1},\dots,l^{\prime}_{n^{\prime}}, and shifts one or more entries at a time. After the first (1/δ)d(1/\delta)^{d} layers, the output L~\widetilde{{\bm{L}}} has nn^{\prime} distinct l~j=uTL~:,j\widetilde{l}_{j}={\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,j}, 0l~1l~2l~n0\leq\widetilde{l}_{1}\leq\widetilde{l}_{2}\leq\dots\leq\widetilde{l}_{n}, whose distinct values are the same as the numbers we get when we apply shift operations to a length-nn^{\prime} sequence [l1ln]\begin{bmatrix}l^{\prime}_{1}&\dots&l^{\prime}_{n^{\prime}}\end{bmatrix}. Then, applying the same calculations from Category 1 shows that

and it follows from the upper bound in Lemma 10 that

After the global shifting by the last layer with attention part δ(n+1)dψ(;0)\delta^{-(n+1)d}\psi(\cdot;0), we get the output gc(L)g_{\rm c}({\bm{L}}) which satisfies

B.5.3 Category 3

Recall that the selective shift operation is applied to each element of [0:δ:δd+1δ][0:\delta:\delta^{-d+1}-\delta], not to negative values. In case of Category 3, we have minkuTL:,k=l1<0\min_{k}{\bm{u}}^{T}{\bm{L}}_{:,k}=l_{1}<0, and l1l_{1} never gets shifted upwards, so it remains as the minimum for the whole time.

In case where all ljl_{j}’s are negative, selective shift operation never changes the input L{\bm{L}}, so we get L~=L\widetilde{{\bm{L}}}={\bm{L}}. Since we have uTL~<0nT{\bm{u}}^{T}\widetilde{{\bm{L}}}<{\bm{0}}_{n}^{T} (entry-wise), the last layer with attention part δ(n+1)dψ(;0)\delta^{-(n+1)d}\psi(\cdot;0) adds δ(n+1)dminkuTL~:,k<0\delta^{-(n+1)d}\min_{k}{\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,k}<0 to each entry in the first row of L~\widetilde{{\bm{L}}}, further pushing it to the negative side. Therefore, the final output gc(L)g_{\rm c}({\bm{L}}) satisfies uTgc(L)<0nT<tl1nT{\bm{u}}^{T}g_{\rm c}({\bm{L}})<{\bm{0}}_{n}^{T}<t_{l}{\bm{1}}_{n}^{T}.

Now consider the case where at least one ljl_{j} is positive. Let ii be the index that satisfies li1<0lil_{i-1}<0\leq l_{i}. Then, selective shift operation does not affect l1,,li1l_{1},\dots,l_{i-1}, and then it shifts lil_{i} by

where we used δ12\delta^{-1}\geq 2 at the last inequality. The next shift operations shift li+1,,lnl_{i+1},\dots,l_{n} by even larger amount, so at the end of the first (1/δ)d(1/\delta)^{d} layers, we have δ(n+1)d+1l~il~n\delta^{-(n+1)d+1}\leq\widetilde{l}_{i}\leq\dots\leq\widetilde{l}_{n}, while l~j=lj<0\widetilde{l}_{j}=l_{j}<0 for j[i1]j\in[i-1].

Here, the last layer with attention part δ(n+1)dψ(;0)\delta^{-(n+1)d}\psi(\cdot;0) acts differently for negative and positive l~j\widetilde{l}_{j}’s. For negative l~j\widetilde{l}_{j}’s, it adds δ(n+1)dminkl~k=δ(n+1)dl1<0\delta^{-(n+1)d}\min_{k}\widetilde{l}_{k}=\delta^{-(n+1)d}l_{1}<0 to l~1,,l~i1\widetilde{l}_{1},\dots,\widetilde{l}_{i-1}, pushing them further to the negative side. For positive l~j\widetilde{l}_{j}’s, the layer adds δ(n+1)dmaxkl~k=δ(n+1)dl~nδ(2n+2)d+1\delta^{-(n+1)d}\max_{k}\widetilde{l}_{k}=\delta^{-(n+1)d}\widetilde{l}_{n}\geq\delta^{-(2n+2)d+1} to l~i,,l~n\widetilde{l}_{i},\dots,\widetilde{l}_{n}, so that they are all greater than or equal to δ(2n+2)d+1\delta^{-(2n+2)d+1}. Note that δ(2n+2)d+1>tr\delta^{-(2n+2)d+1}>t_{r}.

Therefore, in both cases, we can see that the final output gc(L)g_{\rm c}({\bm{L}}) satisfies uTgc(L):,j[tl,tr]{\bm{u}}^{T}g_{\rm c}({\bm{L}})_{:,j}\notin[t_{l},t_{r}], for all j[n]j\in[n]. This completes the verification of Property 6.4.

B.5.4 Proof of Lemma 10

Proof of lower and upper bounds on l~n\widetilde{l}_{n} are straightforward:

For one-to-one property of the map, consider [l1l2ln]\begin{bmatrix}l_{1}&l_{2}&\cdots&l_{n}\end{bmatrix} and [l1l2ln]\begin{bmatrix}l^{\prime}_{1}&l^{\prime}_{2}&\cdots&l^{\prime}_{n}\end{bmatrix} with increasing entries, which are mapped to l~n\widetilde{l}_{n} and l~n\widetilde{l}^{\prime}_{n}, respectively. Suppose l~n=l~n\widetilde{l}_{n}=\widetilde{l}^{\prime}_{n}. By definition,

Now assume for contradiction that lnlnl_{n}\neq l^{\prime}_{n}. Then, we have δd+1+δlnlnδd+1δ-\delta^{-d+1}+\delta\leq l_{n}-l^{\prime}_{n}\leq\delta^{-d+1}-\delta. However, the remaining terms have “coarse resolution”, and they can never cancel lnlnl_{n}-l^{\prime}_{n} and make the sum zero, because for example, δd(ln1lnln1+ln)\delta^{-d}(l_{n-1}-l_{n}-l^{\prime}_{n-1}+l^{\prime}_{n}) can only have values 0,δd+1,δd+1,2δd+1,2δd+1,0,\delta^{-d+1},-\delta^{-d+1},2\delta^{-d+1},-2\delta^{-d+1},\dots. Thus, ln=lnl_{n}=l^{\prime}_{n} must hold and the first term must be zero.

Similarly, assume that ln1ln1l_{n-1}\neq l^{\prime}_{n-1}. Then, the second term is in the interval [δ2d+1+δd+1,δ2d+1δd+1][-\delta^{-2d+1}+\delta^{-d+1},\delta^{-2d+1}-\delta^{-d+1}]. Again, the remaining terms cannot cancel the second term, hence ln1=ln1l_{n-1}=l^{\prime}_{n-1} must hold. We can proceed this way, and show that lj=ljl_{j}=l^{\prime}_{j} must hold for all j[n]j\in[n], hence proving that the map is one-to-one.

B.6 Proof of Lemma 7

so gc(L)g_{\rm c}({\bm{L}}) is left untouched.

If some other L{\bm{L}} is a permutation of L\overline{\bm{L}}, and L:,i=L:,j{\bm{L}}_{:,i}=\overline{\bm{L}}_{:,j}, then

so ii-th column of gc(L)g_{\rm c}({\bm{L}}) will turn to

which is the desired output. In conclusion, this layer maps the column gc(L):,jg_{\rm c}(\overline{\bm{L}})_{:,j} to (AL):,j({\bm{A}}_{\overline{\bm{L}}})_{:,j}, without affecting any other columns.

Appendix C Proof of Theorem 3

Proof of Theorem 3 can be done in a similar way as Theorem 2. As in the proof of Theorem 2, there are three parts: Lemma 8, Proposition 4, and Lemma 9. The statement and proof of Lemmas 8 and 9 can be done in almost the same way, this time without permutation equivariance.

For the proof of the second part, which corresponds to Proposition 4, we construct the network in a similar way. Recall that we can assume without loss of generality that Xd×n{\bm{X}}\in^{d\times n}. Choose

Then, the first column of X+E{\bm{X}}+{\bm{E}} is in d^{d}, second is in d^{d}, and so on; this means that for all rows, the coordinates are monotonically increasing. So we can use the same technique as the proof of Proposition 4 to divide the input values into cubes, quantize them to L{\bm{L}}, apply contextual mapping, and then value mapping. We describe each step in the following.

In a similar way as Lemma 5, the goal of this step is to quantize the input in d×d××[n1,n]d^{d}\times^{d}\times\dots\times[n-1,n]^{d} to its discrete version:

This can be done by dn/δdn/\delta feed-forward layers. We add dn/δdn/\delta layers of the following form, for k=0,δ,,nδk=0,\delta,\dots,n-\delta and i=1,,di=1,\dots,d:

After dn/δdn/\delta layers, any input entry of X+E{\bm{X}}+{\bm{E}} in [kδ,kδ+δ)[k\delta,k\delta+\delta) is quantized to kδk\delta.

C.2 Contextual mapping by attention layers

By Step 1, we quantized any input X+E{\bm{X}}+{\bm{E}} to its quantized version. We call this quantized version L{\bm{L}}:

As done in Lemma 6, we define u:=(1,δ1,,δd+1){\bm{u}}:=(1,\delta^{-1},\dots,\delta^{-d+1}) and lj:=uTL:,jl_{j}:={\bm{u}}^{T}{\bm{L}}_{:,j}, for all j[n]j\in[n]. Note that, because L:,j[j1:δ:jδ]d{\bm{L}}_{:,j}\in[j-1:\delta:j-\delta]^{d}, we have

and l1<l2<<lnl_{1}<l_{2}<\dots<l_{n}. Notice that this corresponds to the Category 1 in the proof of Lemma 6.

For simplicity of notation, let sj=(j1)k=0d1δks_{j}=(j-1)\sum_{k=0}^{d-1}\delta^{-k}. We stack n(1/δ)dn(1/\delta)^{d} attention layers, with attention parts δdΨ(;lδ/2,l+δ/2)\delta^{-d}\Psi(\cdot;l-\delta/2,l+\delta/2) for each lj=1n[sj:δ:sj+δd+1δ]l\in\bigcup_{j=1}^{n}[s_{j}:\delta:s_{j}+\delta^{-d+1}-\delta], in increasing order of ll.

These n(1/δ)dn(1/\delta)^{d} attention layers perform selective shift operations on ljl_{j}’s, in increasing order of jj. As seen in Appendix B.5.1, shift operations result in l~1<l~2<<l~n\widetilde{l}_{1}<\widetilde{l}_{2}<\dots<\widetilde{l}_{n}. Also, the map from L{\bm{L}} to l~n\widetilde{l}_{n} is one-to-one, which can be shown in the same way as Appendix B.5.4. Since the range of ljl_{j}’s are a bit different, we have a different upper bound on l~n\widetilde{l}_{n}:

Finally, we add an extra single-head attention layer with attention part nδ(n+1)d1ψ(;0)n\delta^{-(n+1)d-1}\psi(\cdot;0). We define the output of this layer as gc(L)g_{\rm c}({\bm{L}}). In a similar way as Appendix B.5.1, this layer shifts all the layers by nδ(n+1)d1l~nn\delta^{-(n+1)d-1}\widetilde{l}_{n}, thus making the intervals corresponding to different values of l~n\widetilde{l}_{n} disjoint from each other. This ensures that different contexts L{\bm{L}} are mapped to distinct numbers in uTgc(L){\bm{u}}^{T}g_{\rm c}({\bm{L}}), thus implementing a contextual mapping.

C.3 Function value mapping by feed-forward layers

Now, it is left to map gc(L)g_{\rm c}({\bm{L}}) to the desired output. As seen in the last step, each different context L{\bm{L}} maps to nn unique numbers uTgc(L){\bm{u}}^{T}g_{\rm c}({\bm{L}}), which are at least δ\delta apart from each other. The value mapping step can be done in a similar way as Lemma 7. The construction now requires O(n(1/δ)dn)O(n(1/\delta)^{dn}) layers because there is no permutation equivariance.

Appendix D Experimental setup

For our experiments we follow the same setting as in BERT (Devlin et al., 2018). We first pre-train the models on the masked language modeling task and the next sentence prediction task. We use English Wikipedia corpus and BooksCorpus dataset (Zhu et al., 2015) for this pre-training. We use BERTBASE\text{BERT}_{\text{BASE}}, a 12 layer Transformer model as the baseline. This model uses an embedding size of 768 and has 12 head self-attention layers and 3072 wide feed forward layers. We train it with the Adam optimizer, with .01.01 dropout and weight decay. We do pre-training for 250k steps with a batch size of 1024 and a max sequence length of 512. Pre-training takes around 2 days on 16 TPUv3 chips. We take the pre-train models and finetune them on the MNLI and SQuAD datasets separately using the same hyper-parameters as in Devlin et al. (2018). MNLI is a sentence entailment task in which, given a premise sentence, requires us to classify a hypothesis sentence into neutral, contradiction or entailment classes. We report the classification accuracy on this task. SQuAD is a question answering task, in which given a paragraph and a question, requires us to identify the answer as a span of the words in the paragraph. For this task we report both the F1 score and the Exact Match (EM) percentage. The metrics are reported on the dev sets of these datasets.

For our experiments with the depth-wise separable convolution layers, we follow the implementation in (Wu et al., 2019). We first use a GLU layer followed by the convolution layer. We use 16 separable convolution filters, of filter length 128, and reuse them, with each filter operating on 48 of the 768 dimensions of the input. This layer also has a skip connection and the output is normalized using layer normalization, similar to the self-attention layer. In our experiments, we replace the self-attention layers of the Transformers, in the lower layers, with this convolution layer. We keep the feed forward layer of the Transformer block the same.

For the experiments performed in this paper, one might consider an alternate explanation that the tasks considered maybe are easy, and do not require any advanced architecture to solve them, and even a simple architecture (bi-linear projection or separable convolution) might solve these tasks. To rule out this case we consider an even simpler architecture, namely average attention, as a baseline for our experiments.

Average attention. An average attention layer replaces the self-attention layer, and just computes the average of projections of all the other tokens. That is, we replace σ[(WKiX)TWQiX]\sigma[({\bm{W}}_{K}^{i}{\bm{X}})^{T}{\bm{W}}_{Q}^{i}{\bm{X}}] in (1) with a matrix full of 1/n1/n. The model still has the skip connections and the feed-forward layers like Transformer.