Are Transformers universal approximators of sequence-to-sequence functions?

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Introduction

Self-attention based Transformer networks (Vaswani et al., 2017) have been at the center of the recent progress on various natural language processing (NLP) tasks, including machine translation (Vaswani et al., 2017), language modeling (Radford et al., 2018; 2019), and question answering (Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019). All these tasks involve learning models that map an input sequence of tokens to an output sequence of tokens. Transformers make it feasible to train large models to approximate these sequence-to-sequence functions due to their ability to process the input tokens in a parallel way, as opposed to the sequential nature of RNNs and LSTMs.

A Transformer block consists of two kinds of layers: a self-attention layer and a token-wise feed-forward layer, with skip connections present in both layers. The self-attention layer transforms each input token embedding using a weighted combination of the embeddings of all tokens in the input sequence, where weights are generated by pairwise dot-products among the input token embeddings. The token-wise feed-forward layer then independently processes each of these modified input token embeddings without any interaction among them. Notably, Transformers employ parameter reuse across tokens, as both layers use the same parameters to process each token. Moreover, Transformers have to rely solely on the pairwise dot-products to capture interaction between the input tokens.

Given the parameter sharing and limited interactions between tokens, it is natural to wonder: what class of sequence-to-sequence functions can the Transformer networks represent? Also, what is the role of the two different kinds of layers? Are both layers needed to obtain the representation power of Transformers? In the existing literature, the advantage of Transformers has often been attributed to their capability of computing contextual embeddings/mappings of the input, as opposed to fixed word embeddings as in word2vec (Mikolov et al., 2013). Is it possible to formalize the notion of contextual mappings? If yes, can Transformers actually compute such mappings? Such questions still remain elusive.

In this paper, we provide a mathematical definition of contextual mappings and show that multi-head self-attention layers can indeed compute contextual mappings of the input sequences. We further show that this ability to compute contextual mappings coupled with the value mapping ability of the feed-forward layers makes Transformers universal approximators of any permutation equivariant sequence-to-sequence function. We also improve this result using positional encodings, and show that Transformers can represent any sequence-to-sequence function; i.e., the restriction of permutation equivariance can be removed by positional encodings.

These results on universal approximation of sequence-to-sequence functions raise a natural question: is it possible to have a more efficient architecture that computes contextual mappings and consequently preserves the ability to universally approximate sequence-to-sequence functions? Towards this, we explore other architectures that can implement contextual mappings (to some extent), and experimentally evaluate their performance. In our experiments, we notice that the models that combine these simpler architectures with Transformers perform better than the standalone Transformers. We conclude the paper by presenting more discussion and interesting future research directions along these lines.

We prove that Transformers are universal approximators of continuous and permutation equivariant sequence-to-sequence functions with compact support (Theorem 2). We also show that, if Transformers have trainable positional encodings added to the input, then they are universal approximators of continuous sequence-to-sequence functions on a compact domain (Theorem 3).

We formalize the notion of contextual mappings and show that the attention layers can compute contextual mappings, where each unique context is mapped to a unique vector (Lemma 6).

We experimentally evaluate other simpler layers that can compute contextual mappings to some extent, such as bi-linear projections and separable convolutions, and show that substituting some of the self-attention layers with these layers can result in better performance (Section 5).

2 Related works & notation

Analysis of attention-based models. Given the popularity of Transformers, there have been numerous works trying to understand the role of attention layers in natural language processing models. One such line of work focuses on probing the output of attention layers to understand the attention mechanism and internal language representation (Hewitt & Manning, 2019; Clark et al., 2019; Coenen et al., 2019; Vig & Belinkov, 2019). Although these results give valuable insights, a consistent theoretical analysis corroborating these findings is missing.

Universal approximation theorems. Universal approximation theorems are classical results in neural network theory, dating back many decades (Cybenko, 1989; Hornik, 1991). These results show that given unbounded width, a one-hidden-layer neural network can approximate any continuous function with compact support, up to any accuracy. Other results focusing on depth appeared more recently (Lu et al., 2017; Hanin & Sellke, 2017; Lin & Jegelka, 2018). In particular, Lu et al. (2017) and Hanin & Sellke (2017) consider fully-connected ReLU networks whose input dimension is $d$ , and show that networks with width $d+1$ and unbounded depth are universal approximators of scalar-valued continuous functions. Lin & Jegelka (2018) show that a residual network with one hidden neuron per residual block is a universal approximator of scalar-valued functions, given unbounded depth. Although Transformer networks do have residual connections, due to their heavy parameter sharing, the existing analyses for residual networks do not extend to Transformers. Sannai et al. (2019) consider universally approximating permutation invariant/equivariant functions using fully-connected ReLU networks.

Turing completeness results on Transformers. Recently, Pérez et al. (2019) have shown that Transformers with infinite precision are Turing complete, which is not the case in the finite-precision setting (Dehghani et al., 2018). We note that Turing completeness deals with computation on formal languages (thus discrete objects), while universal approximation focuses on functions on a continuum. In other words, these are two different concepts, and neither implies the other.

Transformer networks

We define the Transformer networks as the composition of Transformer blocks. The family of the sequence-to-sequence functions corresponding to the Transformers can be defined as:

As seen above, both layers (cf. (1) and (2)) of a Transformer block employ parameter reuse/sharing, because each token/column undergoes the same transformations (e.g., ${\bm{W}}_{Q}^{i}$ , ${\bm{W}}_{K}^{i}$ , or ${\bm{W}}_{1}$ ) regardless of its position. Moreover, interactions between tokens can only be captured through pairwise dot-products in the softmax operator $\sigma[\cdot]$ (cf. (1)). Given such limitations in a single Transformer block’s representation power, it is not obvious what kinds of sequence-to-sequence functions $\mathcal{T}^{h,m,r}$ can approximate; we provide the answer to this question in the next section.
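
As a rough illustration of this parameter sharing, the following Python sketch implements a single-head block and checks numerically that permuting the input columns permutes the output columns in the same way. It is a sketch under stated assumptions, not the exact definition used in the paper: the attention weights follow the form $\sigma[({\bm{W}}_{K}{\bm{X}})^{T}{\bm{W}}_{Q}{\bm{X}}]$ with a column-wise softmax, while the value and output projections (`W_V`, `W_O`), the helper names, and all numeric weights are hypothetical choices made for this example.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

def softmax_columns(A):
    """Column-wise softmax: every column of the result sums to 1."""
    cols = []
    for col in transpose(A):
        top = max(col)
        exps = [math.exp(v - top) for v in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)

def attention(X, W_Q, W_K, W_V, W_O):
    """Single-head self-attention with a skip connection; X is d x n."""
    scores = matmul(transpose(matmul(W_K, X)), matmul(W_Q, X))  # n x n dot-products
    mixed = matmul(matmul(W_V, X), softmax_columns(scores))     # m x n
    return add(X, matmul(W_O, mixed))

def feed_forward(X, W_1, b_1, W_2, b_2):
    """Token-wise ReLU layer with a skip connection; the same weights act on every column."""
    hidden = [[max(0.0, v + b_1[i]) for v in row]
              for i, row in enumerate(matmul(W_1, X))]
    update = [[v + b_2[i] for v in row]
              for i, row in enumerate(matmul(W_2, hidden))]
    return add(X, update)

def block(X):
    # Hypothetical weights for d = 2, head size m = 1, hidden size r = 2.
    Z = attention(X, W_Q=[[0.5, -1.0]], W_K=[[1.0, 0.3]],
                  W_V=[[0.2, 0.7]], W_O=[[1.0], [-0.5]])
    return feed_forward(Z, W_1=[[1.0, -1.0], [0.5, 0.5]], b_1=[0.1, -0.2],
                        W_2=[[0.3, 0.0], [-0.4, 1.0]], b_2=[0.0, 0.05])

def permute_columns(X, perm):
    return [[row[p] for p in perm] for row in X]

X = [[0.1, 0.9, 0.4], [0.7, 0.2, 0.5]]  # three tokens, one per column
perm = [2, 0, 1]
lhs = block(permute_columns(X, perm))    # permute, then apply the block
rhs = permute_columns(block(X), perm)    # apply the block, then permute
max_gap = max(abs(a - b) for row_a, row_b in zip(lhs, rhs)
              for a, b in zip(row_a, row_b))
print(max_gap < 1e-9)
```

Because no weight depends on the column index, the two orders of operation agree up to floating-point rounding.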

Transformers are universal approximators of sequence-to-sequence functions

The following result shows that a Transformer network with a constant number of heads $h$ , head size $m$ , and hidden layer of size $r$ can approximate any function in $\mathcal{F}_{\rm PE}$ .

Theorem 2. Let $1\leq p<\infty$ and $\epsilon>0$ , then for any given $f\in\mathcal{F}_{\rm PE}$ , there exists a Transformer network $g\in\mathcal{T}^{2,1,4}$ , such that $\mathsf{d}_{p}(f,g)\leq\epsilon$ .

Theorem 3. Let $1\leq p<\infty$ and $\epsilon>0$ , then for any given $f\in\mathcal{F}_{\rm CD}$ , there exists a Transformer network $g\in\mathcal{T}^{2,1,4}_{\rm P}$ such that we have $\mathsf{d}_{p}(f,g)\leq\epsilon$ .

Theorems 2 and 3 provide an interesting characterization of the representation power of fixed-width Transformer networks. Since the function classes $\mathcal{T}^{h,m,r}$ and $\mathcal{T}^{h,m,r}_{\rm P}$ become richer as we increase the values of $(h,m,r)$ , our results establish that general Transformer networks are also universal approximators of sequence-to-sequence functions. Remarkably, none of the parameters $(h,m,r)$ depend on the input sequence length $n$ or embedding dimension $d$ .

Here, we would like to again point out that Theorems 2 and 3 appear quite surprising at first glance, given the parameter sharing across all the tokens in a sequence, e.g., feed-forward layers are applied token-wise and the projection matrices in the self-attention layers are the same across different tokens. Furthermore, attention layers can only capture pairwise interaction between different tokens in the sequence. In the next subsection, we briefly describe one of our key steps in overcoming the aforementioned restrictions and proving universal approximation power of Transformers.

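Let us consider a setting where we are interested in embedding two sentences: 1) I am happy; and 2) I am Bob. These sentences are fed to a sequence-to-sequence model as $\begin{bmatrix}{\bm{v}}_{\rm I}&{\bm{v}}_{\rm am}&{\bm{v}}_{\rm happy}\end{bmatrix}$ and $\begin{bmatrix}{\bm{v}}_{\rm I}&{\bm{v}}_{\rm am}&{\bm{v}}_{\rm Bob}\end{bmatrix}$ ,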

where ${\bm{v}}_{\rm I},{\bm{v}}_{\rm am},{\bm{v}}_{\rm happy},$ and ${\bm{v}}_{\rm Bob}$ denote $d$ -dimensional embeddings for the tokens ‘I’, ‘am’, ‘happy’, and ‘Bob’, respectively. Since the word ‘I’ occurs in different contexts in these sentences, in order to implement arbitrary sequence-to-sequence functions, the sequence-to-sequence model should map the two occurrences of ‘I’ to different values. We formally define this requirement below.

At first thought, we can consider getting a contextual mapping by simply averaging all the tokens, because this can capture the one-word difference (e.g., “happy” vs. “Bob”) in two different contexts. However, if there are multiple words that are different, it is not guaranteed that the average will be different. Indeed, requiring unique mappings for all the tokens for any change in any number of tokens is a steep requirement.
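
A tiny numerical example may make the failure of averaging concrete. The sketch below uses hypothetical scalar embeddings (the values and the helper name `mean_context` are chosen only for illustration): two sequences share their first token and differ in the other two, yet have the same average, so the average alone cannot tell the two contexts of the shared token apart.

```python
def mean_context(tokens):
    """Summarize a sequence of scalar token embeddings by its average."""
    return sum(tokens) / len(tokens)

# Two hypothetical sequences: same first token, two differing tokens.
seq_a = [0.0, 2.0, 4.0]
seq_b = [0.0, 3.0, 3.0]

# The summaries coincide although the sequences differ.
same_summary = mean_context(seq_a) == mean_context(seq_b)
print(seq_a != seq_b, same_summary)
```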

While the self-attention layer does consider pair-wise interactions among different input tokens, it is not clear if this weak form of pair-wise interaction with shared projection weights is sufficient to extract the underlying context. The following result, which we sketch here, shows that self-attention layers can implement a permutation equivariant contextual mapping over almost all elements of a grid in $[0,1]^{d\times n}$ . We defer the full statement to Section 4.2.

Lemma 6 shows that a series of self-attention layers can implement contextual mappings, despite the apparent restriction that each of them can only capture pair-wise interaction. However, the restriction of permutation equivariance still exists because attention layers are inherently permutation equivariant. Coupled with the ability of token-wise feed-forward layers to map different values in $q({\bm{L}})$ to arbitrary output values, we can prove universal approximation capability of Transformers.

2 Proof of the universal approximation theorem (Theorem 2)

Next, we outline the proof of Theorem 2 in greater detail. We refer the reader to Section C for the proof of Theorem 3, since it is a modification of Theorem 2. Even though Theorems 2 and 3 do not specifically mention the required depth for approximation, our proof techniques do characterize it, and we show that our construction is tight in the number of parameters. We defer the discussion of depth to Section 4.4.

Recall that we want to show that given a function $f\in\mathcal{F}_{\rm PE}$ , we can find a Transformer network $g\in\mathcal{T}^{2,1,4}$ such that $\mathsf{d}_{p}(f,g)\leq\epsilon$ . Without loss of generality, we can assume that the compact support of $f$ is contained in $[0,1]^{d\times n}$ . We achieve our desired objective in three key steps:

Step 1. Approximate $\mathcal{F}_{\rm PE}$ with piece-wise constant functions. We first use (a variant of) the classical result that any continuous function can be approximated up to arbitrary accuracy by piece-wise constant functions. For $\delta>0$ , we define the following class of piece-wise constant functions.

Step 2. Approximate $\overline{\mathcal{F}}_{\rm PE}(\delta)$ with modified Transformers. We then consider a slightly modified architecture for Transformer networks, where the softmax operator $\sigma[\cdot]$ and ${\rm ReLU}(\cdot)$ are replaced by the hardmax operator $\sigma_{\rm H}[\cdot]$ and an activation function $\phi\in\Phi$ , respectively. Here, the set of allowed activations $\Phi$ consists of all piece-wise linear functions with at most three pieces, where at least one piece is constant. Let $\overline{\mathcal{T}}^{h,m,r}$ denote the function class corresponding to the sequence-to-sequence functions defined by the modified Transformer networks. The following result establishes that the modified Transformer networks in $\overline{\mathcal{T}}^{2,1,1}$ can closely approximate functions in $\overline{\mathcal{F}}_{\rm PE}(\delta)$ .

Proposition 4. For each $\overline{f}\in\overline{\mathcal{F}}_{\rm PE}(\delta)$ and $1\leq p<\infty$ , $\exists$ $\overline{g}\in\overline{\mathcal{T}}^{2,1,1}$ such that $\mathsf{d}_{p}(\overline{f},\overline{g})=O(\delta^{d/p})$ .

Step 3. Approximate modified Transformers with (original) Transformers. Finally, we show that $\overline{g}\in\overline{\mathcal{T}}^{2,1,1}$ can be approximated by $\mathcal{T}^{2,1,4}$ . Let $g\in{\mathcal{T}}^{2,1,4}$ be such that $\mathsf{d}_{p}(\overline{g},g)\leq\epsilon/3$ .

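Theorem 2 now follows from these three steps, because by the triangle inequality we have $\mathsf{d}_{p}(f,g)\leq\mathsf{d}_{p}(f,\overline{f})+\mathsf{d}_{p}(\overline{f},\overline{g})+\mathsf{d}_{p}(\overline{g},g)\leq 2\epsilon/3+O(\delta^{d/p})$ .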

Choosing $\delta$ small enough ensures that $\mathsf{d}_{p}(f,g)\leq\epsilon$ . ∎

We refer the reader to Sections B.1 and B.2 in the supplementary material for the formal statements and proofs of Steps $1$ and $3$ , respectively. As for Step $2$ , which is the most critical step in establishing the universal approximation property of Transformers, we provide a sketch of the proof of Proposition 4 in the next section, and refer the reader to Section B.3 for the complete proof.

Proof sketch of Proposition 4: different roles of two layers

As mentioned earlier, the heavy parameter sharing in Transformers makes the goal of universally approximating sequence-to-sequence functions seemingly difficult. Both the self-attention and the feed-forward layer weights inside a Transformer block are fixed across $n$ tokens. In this section, we show that Transformers are able to overcome this architectural constraint, and compute contextual mappings of the entire input sequence just based on the pair-wise interactions. The token-wise feed-forward layers then transform these contextual mappings to the desired output sequence.

We highlight these inner workings of Transformers en route to proving Proposition 4. We want to show that given a piece-wise constant function $\overline{f}\in\overline{\mathcal{F}}_{PE}(\delta)$ , there exists a modified Transformer network $\overline{g}\in\overline{\mathcal{T}}^{2,1,1}$ that closely approximates $\overline{f}$ . We achieve this goal by establishing the following three claims, which correspond to Lemmas 5, 6, and 7.

First, a series of feed-forward layers in the modified Transformer network can quantize the input ${\bm{X}}$ to an element ${\bm{L}}$ of the extended grid $\{-\delta^{-nd},0,\delta,\dots,1-\delta\}^{d\times n}$ .

Next, a series of self-attention layers in the modified Transformer network can take the input ${\bm{L}}$ and implement a contextual mapping $q$ such that, for ${\bm{L}}$ and ${\bm{L}}^{\prime}$ that are not permutations of each other, all the elements in $q({\bm{L}})$ and $q({\bm{L}}^{\prime})$ are distinct.

Finally, a series of feed-forward layers in the modified Transformer network can map elements of the contextual embedding $q({\bm{L}})$ to the desired output value of $\overline{f}\in\overline{\mathcal{F}}_{\rm PE}$ at the input ${\bm{X}}$ .

Before discussing these three claims in detail, we note that even though a Transformer network stacks self-attention and feed-forward layers in an alternating manner, the skip connections enable these networks to employ a composition of multiple self-attention or feed-forward layers. Furthermore, as alluded to earlier, these three steps clearly highlight the different roles that self-attention and feed-forward layers play in realizing the ability to universally approximate sequence-to-sequence functions: 1) self-attention layers compute precise contextual maps; and 2) feed-forward layers then assign the results of these contextual maps to the desired output values.

4.2 Contextual mapping by self-attention layers

If we define an attention layer of the form ${\bm{Z}}\mapsto{\bm{Z}}+\Psi({\bm{Z}};b,b^{\prime})$ , then any entry ${Z}_{1,j}$ in $(b,b^{\prime})$ is shifted up by $\max_{k}{Z}_{1,k}-\min_{k}{Z}_{1,k}$ , while all the other entries stay untouched. We can choose $b$ and $b^{\prime}$ to selectively shift certain entries, hence the name selective shift operation.

4.3 Function value mapping by feed-forward layers

This brings us to the final step, which demonstrates the key utility of the feed-forward layers. After the contextual mapping by self-attention layers, each token captures the entire context available in the input sequence. The following result shows that token-wise application of a composition of feed-forward layers can map these tokens to the desired output values required by the function $\overline{f}$ .

4.4 Tightness of constructions

We showed in this section that Theorem 2 requires $O(n(1/\delta)^{dn}/n!)$ Transformer blocks for approximation, where $\delta$ is the width of the cubes. Each Transformer block is of constant width, so it has $O(d)$ parameters; this means that the total number of parameters is $O(dn(1/\delta)^{dn}/n!)$ . We note that this exponential dependence cannot be avoided in the worst case. If we assume continuity without any additional smoothness, quantizing the domain to cubes and approximating the function with constants require memorizing $(\text{output dim})\times(\text{num cubes})/n!$ real numbers, where the factor of $1/n!$ is due to permutation equivariance. Thus, Theorem 2 is optimal in the order of parameters.

If we compare with the residual network result (Lin & Jegelka, 2018), we can consider “flattening” ${\bm{X}}$ into a $dn$ -dimensional vector and fitting the function. The proof technique in Lin & Jegelka (2018) requires $O((1/\delta)^{dn})$ layers, where each layer has $O(dn)$ parameters, so the total parameter requirement is $O(dn(1/\delta)^{dn})$ . This shows that Transformers can approximate permutation equivariant functions in a more efficient way than residual networks.
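
To get a feel for these orders of growth, the following Python sketch evaluates both parameter counts with all hidden constants set to one (an assumption made purely for illustration; the function names are hypothetical). Under that simplification the two counts differ exactly by the factor $n!$ that comes from permutation equivariance.

```python
from math import factorial

def transformer_param_order(n, d, delta):
    """d * n * (1/delta)^(d*n) / n!, with the O(.) constant taken to be 1."""
    cubes = round(1 / delta) ** (d * n)
    blocks = n * cubes / factorial(n)   # number of Transformer blocks
    return d * blocks                   # O(d) parameters per block

def resnet_param_order(n, d, delta):
    """d * n * (1/delta)^(d*n), with the O(.) constant taken to be 1."""
    cubes = round(1 / delta) ** (d * n)  # number of layers
    return d * n * cubes                 # O(d*n) parameters per layer

n, d, delta = 3, 2, 0.5
ratio = resnet_param_order(n, d, delta) / transformer_param_order(n, d, delta)
print(ratio == factorial(n))
```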

In Section C, our proof of Theorem 3 shows that we require $O(n(1/\delta)^{dn})$ layers to approximate continuous (not permutation equivariant) sequence-to-sequence functions. As seen from the argument above, this construction is also optimal in the order of parameters.

Discussion and Experiments

As detailed in Section 4, the ability of the self-attention layers to compute contextual mappings plays a crucial role in the universal approximation property. Interestingly, our analysis shows that replacing the dot-product attention in Transformers with any other component capable of computing contextual mappings should preserve this universal approximation property. This leads naturally to questions about the alternative architectures that realize certain kinds of contextual mappings at different computational and memory costs. We explore and discuss some examples of such alternatives in this section. Our preliminary empirical study demonstrates their practical utility.

Given token embeddings ${\bm{X}}$ as input, the bi-linear projection layer computes the following update.

This layer advantageously incurs a smaller number of matrix multiplications as compared to the dot-product attention. That said, the number of parameters in this layer depends on the sequence length, making it harder to reuse the model across tasks with different input sequence lengths. Moreover, the weights used to compute the contextual embeddings ( ${\bm{W}}_{P}$ ) are independent of the inputs ( ${\bm{X}}$ ), whereas in self-attention the weights $(\sigma[({\bm{W}}_{K}^{i}{\bm{X}})^{T}{\bm{W}}_{Q}^{i}{\bm{X}}])$ depend on ${\bm{X}}$ . The first drawback can be addressed by replacing the linear projection with a depth-wise separable convolution layer, which is discussed in the next subsection.

2 Depth-wise separable convolutions

A depth-wise convolution layer (Sifre & Mallat, 2014; Chollet, 2017; Kaiser et al., 2017) involves convolving each dimension of ${\bm{X}}$ with a corresponding convolution filter of size $k$ :

3 Experiments

We now present our experiments with these other architectures, with the goal of understanding the extent to which computing contextual mappings can capture the performance of Transformers. As discussed earlier, ${\rm BProj}$ and ${\rm SepConv}$ do not implement contextual mappings (cf. Definition 3.1), so we do not expect either ${\rm BProj}$ or ${\rm SepConv}$ based models to have the same performance as the expensive Transformers. These models do not use input dependent weights to compute attention, and hence have weaker representation power. Instead, our goal is to see if we can use these cheaper layers to replace (some of) the expensive self-attention layers.

We follow the experimental setting from Devlin et al. (2018) to train the Transformers, with the masked language model pre-training followed by a task specific fine-tuning, and work with a $12$ layer architecture based on $\text{BERT}_{\text{BASE}}$ . We present our results on a question answering task (SQuAD) (Rajpurkar et al., 2016) and a sentence entailment task (MNLI) (Williams et al., 2018). In our first set of experiments we train models that employ ${\rm BProj}$ and ${\rm SepConv}$ layers, instead of the self-attention layer in eq.(1). We notice that, as expected, these simpler models have weaker performance than models with the self-attention layer. See Table 1 in Section D for a comparison of these models on MNLI.

Next, we swap a varying number of the first few self-attention layers in $\text{BERT}_{\text{BASE}}$ with ${\rm SepConv}$ , implemented with filter reuse across dimensions (Wu et al., 2019); we refer to Section D for a complete description of the setup. Fig. 1 illustrates the performance of these hybrid models. Interestingly, models with $1$ or $2$ convolution layers, with the rest being self-attention layers, perform better than models with only self-attention layers. Note that replacing a self-attention layer with ${\rm SepConv}$ also reduces the computational cost and the number of parameters. One explanation we have is that the first few attention layers tend to attend broadly to the whole sequence (as empirically observed in Clark et al. (2019)), and the cheaper convolution layers can perform this job more efficiently. A detailed evaluation of such hybrid architectures will be interesting future research.

Our experiments also call for a deeper understanding of the exact nature of the embeddings computed by practical attention models. Since Transformers in practice have fixed depth, we believe that they might not be able to exactly implement contextual mappings as we defined in Definition 3.1. However, there is some preliminary empirical evidence that Transformers do implement some sort of “contextual mappings.” For example, Fig. 4 of Coenen et al. (2019) presents visualizations of embeddings of a single word in different contexts (sentences). They experimentally notice that Transformers, in addition to computing contextual mappings, also map a word into semantic clusters. Formalizing and evaluating this property of Transformers is an interesting direction for future work. We again note that Wu et al. (2019) have proposed an alternative way to compute such embeddings based on dynamic convolution layers. Evaluating the mappings computed by these models should shed more light on the workings of attention models and inspire efficient and better performing architectures.

References

Appendix A Proof of Claim 1

Suppose ${\bm{X}}{\bm{P}}$ was given as input, where ${\bm{P}}$ is a permutation matrix. First note that

where we used ${\bm{P}}{\bm{P}}^{T}={\bm{I}}$ . Permutation equivariance of the token-wise feed-forward layer can be shown similarly:

where ${\rm ReLU}({\bm{X}}{\bm{P}})={\rm ReLU}({\bm{X}}){\bm{P}}$ was used. This analysis shows that the function class $\mathcal{T}^{h,m,r}(\cdot)$ is restricted to permutation equivariant functions.

Appendix B Proof details of Theorem 2

For any given $f\in\mathcal{F}_{\rm PE}$ and $1\leq p<\infty$ , one can find a $\delta^{*}>0$ such that $\exists$ $\overline{f}\in\overline{\mathcal{F}}_{\rm PE}(\delta^{*})$ which satisfies $\mathsf{d}_{p}(f,\overline{f})\leq\epsilon/3$ .

Thus, the approximation $\overline{f}$ is also permutation equivariant. This proves the lemma. ∎

For each $\overline{g}\in\overline{\mathcal{T}}^{2,1,1}$ and $1\leq p<\infty$ , $\exists$ $g\in\mathcal{T}^{2,1,4}$ such that $\mathsf{d}_{p}(\overline{g},g)\leq\epsilon/3$ .

Proof. Recall that $\mathcal{T}^{h,m,r}$ refers to the class of functions representable with composition of Transformer blocks with $h$ heads of size $m$ in self-attention layers and $r$ hidden nodes in feed-forward layers. The same notation holds for the modified Transformers $\overline{\mathcal{T}}^{h,m,r}$ .

Note that the softmax operator on a matrix ${\bm{A}}$ can be made arbitrarily close to hardmax by scaling up ${\bm{A}}$ . That is,

This means that by scaling up parameters inside $\sigma$ , we can approximate $\sigma_{\rm H}$ arbitrarily closely. Thus, the modified self-attention layers can be approximated with the original self-attention layers of the same number of heads $h$ and head size $m$ .
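
The scaling argument can be checked numerically. The sketch below is a minimal illustration on one hypothetical score vector: it compares `softmax(scale * scores)` with a hardmax that puts all weight on the largest entry. Splitting weight equally among tied maxima is an assumption of this sketch; the example vector has no ties.

```python
import math

def softmax(values, scale=1.0):
    """Softmax of scale * values, computed stably."""
    top = max(values)
    exps = [math.exp(scale * (v - top)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax(values):
    """One-hot on the maximum entry (equal split among ties, by assumption)."""
    top = max(values)
    winners = [1.0 if v == top else 0.0 for v in values]
    count = sum(winners)
    return [w / count for w in winners]

scores = [0.2, 1.0, 0.7]  # hypothetical attention scores for one column
gaps = []
for scale in (1, 10, 100):
    gap = max(abs(s - h) for s, h in zip(softmax(scores, scale), hardmax(scores)))
    gaps.append(gap)
    print(scale, gap)
```

The largest entry-wise gap shrinks as the scale grows, which is the behaviour the proof relies on.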

Also, any arbitrary (possibly discontinuous) piecewise linear function $\phi\in\Phi$ can be approximated arbitrarily closely by four ${\rm ReLU}$ ’s. Note that $\phi\in\Phi$ has at most three pieces, and at least one of the pieces is constant. For example, consider the following function $\phi\in\Phi$ :

This function can be approximated by four ${\rm ReLU}$ ’s, as claimed by the lemma:

Also, as we make $\epsilon\rightarrow 0$ , we can approximate $\phi$ as closely as possible using $\widetilde{\phi}$ . The cases where the second or third piece is constant can be shown similarly. This means that the modified feed-forward layers (whose activation is $\phi\in\Phi$ ) with a single hidden node can be approximated with the original feed-forward layers ( ${\rm ReLU}$ ) with four hidden nodes.
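
The following Python sketch illustrates the four-ReLU idea on one concrete function. The particular $\phi$ below (constant first piece, jumps at two breakpoints) and all constant names are hypothetical choices for this example, not necessarily the function used in the proof. Each jump is replaced by a steep ramp of width `eps`, and each ramp costs two ReLUs, one per slope change.

```python
def relu(x):
    return max(0.0, x)

# Hypothetical three-piece function: B1 on (-inf, C1), A2*x + B2 on [C1, C2),
# and A3*x + B3 on [C2, inf). The first piece is constant.
B1 = 0.0
C1, A2, B2 = 0.0, 1.0, 1.0
C2, A3, B3 = 1.0, 0.0, 3.0

def phi(x):
    if x < C1:
        return B1
    if x < C2:
        return A2 * x + B2
    return A3 * x + B3

def phi_tilde(x, eps):
    """Continuous surrogate of phi built from four ReLUs (requires eps < C2 - C1)."""
    s1 = (A2 * C1 + B2 - B1) / eps                      # ramp slope on [C1 - eps, C1]
    s2 = (A3 * C2 + B3 - (A2 * (C2 - eps) + B2)) / eps  # ramp slope on [C2 - eps, C2]
    return (B1
            + s1 * relu(x - (C1 - eps))         # start the first ramp
            + (A2 - s1) * relu(x - C1)          # switch to the middle slope
            + (s2 - A2) * relu(x - (C2 - eps))  # start the second ramp
            + (A3 - s2) * relu(x - C2))         # switch to the final slope

eps = 1e-3
for x in (-2.0, -0.5, 0.0, 0.5, 1.0, 2.5):
    print(x, phi(x), phi_tilde(x, eps))
```

The two functions agree everywhere except on the two ramps of width `eps`, so the set on which they differ shrinks as `eps` goes to zero.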

Thus, given any $\overline{g}\in\overline{\mathcal{T}}^{2,1,1}$ , there exists a function $g\in{\mathcal{T}}^{2,1,4}$ arbitrarily close to $\overline{g}$ , by appropriately choosing the parameters to be large enough. This finishes the proof. ∎

B.3 Finishing proof of Proposition 4

As we have already discussed in Section 4, we establish Proposition 4 in three steps:

Finally, a group of feed-forward layers in the modified Transformer network can map elements of the contextual embedding $q({\bm{L}})$ to the desirable values, i.e., the output of $\overline{f}\in\overline{\mathcal{F}}_{\rm PE}$ on the input ${\bm{X}}$ .

These steps are formally stated in Lemmas 5, 6, and 7 in the main text. We present the proofs of these lemmas in the subsequent sections.

With the results established in these lemmas, we are now equipped with all the tools necessary to complete the proof of Proposition 4. Let us recall the functions $g_{\rm q},g_{\rm c}$ , and $g_{\rm v}$ from Lemmas 5, 6, and 7, respectively. We now show that the (modified) Transformer network $\overline{g}=g_{\rm v}\circ g_{\rm c}\circ g_{\rm q}$ approximates the underlying piecewise constant function $\overline{f}\in\overline{\mathcal{F}}_{\rm PE}$ over all points in its support except for a set of measure $O(\delta^{d})$ .

B.4 Proof of Lemma 5

The proof strategy is simple; using $\frac{1}{\delta}+1$ token-wise feed-forward layers, we implement the quantization function $g_{\rm q}^{\rm ent}$ that works on the first row of the input. We then stack another $\frac{1}{\delta}+1$ layers that quantize the second row, and so on.

Given input ${\bm{X}}$ , we first start by clipping ${\bm{X}}_{1,:}$ in the set $(-\infty,0)\cup[1,+\infty)$ and mapping the intervals to $-\delta^{-nd}$ . This can be done by the following layer:

Next, add $1/\delta$ layers of the following form, for $k=0,1,\dots,1/\delta-1$ .

Each layer quantizes ${\bm{X}}_{1,:}$ in $[k\delta,k\delta+\delta)$ to $k\delta$ , without modifying other intervals.
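
The effect of these $\frac{1}{\delta}+1$ layers on a single entry can be sketched in Python as follows. This is only an illustration of the behaviour described above (one clipping step, then one step per interval), written directly on scalars rather than as actual feed-forward layers; the function names are hypothetical, and $\delta$ is taken to be a power of two so that the arithmetic is exact in floating point.

```python
def clip_layer(x, delta, n, d):
    """Send an entry outside [0, 1) to the sentinel value -delta^(-n*d)."""
    if x < 0 or x >= 1:
        return -(delta ** (-n * d))
    return x

def quantize_layer(x, k, delta):
    """Move an entry in [k*delta, k*delta + delta) to k*delta; leave others alone."""
    t = x - k * delta
    return k * delta if 0 <= t < delta else x

def quantize_entry(x, delta, n, d):
    """Apply the clipping layer followed by the 1/delta quantization layers."""
    x = clip_layer(x, delta, n, d)
    for k in range(round(1 / delta)):
        x = quantize_layer(x, k, delta)
    return x

delta, n, d = 0.25, 2, 1
for x in (-0.1, 0.1, 0.6, 0.99, 1.2):
    print(x, quantize_entry(x, delta, n, d))
```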

B.5 Proof of Lemma 6

Before starting the proof, we first describe the key component of our proof, which we refer to as the selective shift operation. Consider the following function, which can be expressed with a multiplicative attention head, with head size $m=1$ and hardmax $\sigma_{\rm H}$ :

for $j\in[n]$ . Note that due to ${\bm{e}}^{(1)}$ , all rows of $\psi({\bm{Z}};b_{Q})$ except the first row are zero. From this observation, one can define a function parametrized by $b_{Q}$ and $b^{\prime}_{Q}$ , where $b_{Q}<b^{\prime}_{Q}$ , which consists of two attention heads:

What this means is that, if we define an attention layer of the form ${\bm{Z}}\mapsto{\bm{Z}}+\Psi({\bm{Z}};b_{Q},b^{\prime}_{Q})$ , then any column ${\bm{Z}}_{:,j}$ satisfying ${\bm{u}}^{T}{\bm{Z}}_{:,j}\in(b_{Q},b^{\prime}_{Q})$ is shifted up in its first coordinate ${\bm{Z}}_{1,j}$ by $\max_{k}{\bm{u}}^{T}{\bm{Z}}_{:,k}-\min_{k}{\bm{u}}^{T}{\bm{Z}}_{:,k}$ , while all the other coordinates stay untouched. We call this the selective shift operation, because we can choose $b_{Q}$ and $b^{\prime}_{Q}$ to selectively shift certain entries of the input.
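
The selective shift operation can be simulated directly. The Python sketch below is written under an assumption about the form of each head: the score of key column $k$ for query column $j$ is taken to be $({\bm{u}}^{T}{\bm{Z}}_{:,k})({\bm{u}}^{T}{\bm{Z}}_{:,j}-b_{Q})$ , and the value is ${\bm{u}}^{T}{\bm{Z}}_{:,k}$ written into the first row. With hardmax this returns the maximum column id when ${\bm{u}}^{T}{\bm{Z}}_{:,j}>b_{Q}$ and the minimum otherwise, which matches the behaviour described above. Function names are hypothetical.

```python
def column_ids(Z, u):
    """Return u^T Z[:, j] for every column j of the d x n matrix Z (list of rows)."""
    return [sum(u[i] * Z[i][j] for i in range(len(u))) for j in range(len(Z[0]))]

def psi_first_row(Z, u, b_q):
    """First row of one hardmax attention head with head size 1."""
    ids = column_ids(Z, u)
    out = []
    for id_j in ids:
        scores = [id_k * (id_j - b_q) for id_k in ids]  # score of key k for query j
        best = max(scores)
        winners = [id_k for id_k, s in zip(ids, scores) if s == best]
        out.append(sum(winners) / len(winners))          # hardmax-weighted value
    return out

def selective_shift(Z, u, b_q, b_q_prime):
    """Apply Z -> Z + Psi(Z; b_q, b_q_prime); only the first row can change."""
    upper = psi_first_row(Z, u, b_q)        # max id for columns with id > b_q
    lower = psi_first_row(Z, u, b_q_prime)  # max id only for columns with id > b_q_prime
    shifted = [row[:] for row in Z]
    shifted[0] = [z + a - b for z, a, b in zip(Z[0], upper, lower)]
    return shifted

Z = [[0.0, 0.25, 0.75]]   # d = 1, n = 3, so u = (1) and ids equal the entries
print(selective_shift(Z, [1.0], 0.125, 0.375))
```

Only the column whose id lies in $(b_{Q},b^{\prime}_{Q})$ moves, and it moves by the gap between the largest and smallest column ids.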

For any $j\in[n]$ , it is easy to check the following two facts:

If ${\bm{L}}_{i,j}\neq-\delta^{-nd}$ for all $i\in[d]$ , i.e., ${\bm{L}}_{:,j}\in\{0,\delta,\dots,1-\delta\}^{d}$ , then ${\bm{u}}^{T}{\bm{L}}_{:,j}\in[0:\delta:\delta^{-d+1}-\delta]$ , and the map ${\bm{L}}_{:,j}\mapsto{\bm{u}}^{T}{\bm{L}}_{:,j}$ from $\{0,\delta,\dots,1-\delta\}^{d}$ to $[0:\delta:\delta^{-d+1}-\delta]$ is a bijection.

If there exists $i\in[d]$ such that ${\bm{L}}_{i,j}=-\delta^{-nd}$ , then ${\bm{u}}^{T}{\bm{L}}_{:,j}\leq-\delta^{-nd}+\delta^{-d+1}-1<0$ .

Therefore, one can say that ${\bm{u}}^{T}{\bm{L}}_{:,j}$ gives the “column id” for each possible value of ${\bm{L}}_{:,j}\in\{0,\delta,\dots,1-\delta\}^{d}$ .

The rough idea of the construction is to apply the selective shift operation to each column id, by setting ${\bm{u}}$ in the definition of $\Psi(\cdot)$ to be $(1,\delta^{-1},\delta^{-2},\dots,\delta^{-d+1})$ and choosing $b_{Q}=l-\delta/2$ and $b^{\prime}_{Q}=l+\delta/2$ for each $l\in[0:\delta:\delta^{-d+1}-\delta]$ . More concretely, we stack $(1/\delta)^{d}$ attention layers, with attention parts $\delta^{-d}\Psi(\cdot;l-\delta/2,l+\delta/2)$ for each $l\in[0:\delta:\delta^{-d+1}-\delta]$ , in increasing order of $l$ . After that, we add an extra single-head attention layer with attention part $\delta^{-(n+1)d}\psi(\cdot;0)$ .
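
The bijection between quantized columns and column ids is easy to verify by brute force for small cases. The sketch below enumerates every column in $\{0,\delta,\dots,1-\delta\}^{d}$ for one hypothetical choice of $\delta$ and $d$ , using ${\bm{u}}=(1,\delta^{-1},\dots,\delta^{-d+1})$ as in the construction; the helper name `column_id` is an assumption of this example.

```python
from itertools import product

def column_id(column, delta):
    """Compute u^T column with u = (1, delta^-1, ..., delta^-(d-1))."""
    return sum(value * delta ** (-i) for i, value in enumerate(column))

delta, d = 0.25, 2
levels = [k * delta for k in range(round(1 / delta))]        # {0, delta, ..., 1 - delta}
ids = sorted(column_id(col, delta) for col in product(levels, repeat=d))
expected = [k * delta for k in range(round(delta ** (-d)))]  # [0 : delta : delta^(-d+1) - delta]

print(ids == expected)
```

Each of the $(1/\delta)^{d}$ columns receives its own id, and the ids fill the range $[0:\delta:\delta^{-d+1}-\delta]$ with no gaps.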

B.5.1 Category 1

In the first selective shift operation, the $(1,1)$ -th entry of ${\bm{L}}$ ( ${L}_{1,1}$ ) is shifted by the operation, while the other entries are left untouched. The updated value $\widetilde{{L}}_{1,1}$ is

Therefore, after the operation, the output of the layer is $\begin{bmatrix}\widetilde{{\bm{L}}}_{:,1}&{\bm{L}}_{:,2}&\dots&{\bm{L}}_{:,n}\end{bmatrix}$ , and the new value of the first column $\widetilde{{\bm{L}}}_{:,1}$ results in

Let us denote the updated “column id” ${\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,1}$ as $\widetilde{l}_{1}$ . We can show that $l_{n}<\widetilde{l}_{1}$ , because

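The second selective shift operation is applied to $l_{2}$ , by which only one entry ${L}_{1,2}$ will be shifted. Since the maximum column id is now $\widetilde{l}_{1}$ and the minimum is $l_{2}$ , the updated value $\widetilde{{L}}_{1,2}$ is

$$\widetilde{{L}}_{1,2}={L}_{1,2}+\delta^{-d}(\widetilde{l}_{1}-l_{2}).$$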

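After updating, the new inner product of ${\bm{u}}$ and $\widetilde{{\bm{L}}}_{:,2}$ results in

$$\widetilde{l}_{2}:={\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,2}=l_{2}+\delta^{-d}(\widetilde{l}_{1}-l_{2})=l_{2}+\delta^{-d}(l_{1}-l_{2})+\delta^{-2d}(l_{n}-l_{1}).$$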

We can show that $\widetilde{l}_{1}<\widetilde{l}_{2}$ , because

$$\widetilde{l}_{1}=l_{1}+\delta^{-d}(l_{n}-l_{1})<l_{2}+\delta^{-d}(l_{1}-l_{2})+\delta^{-2d}(l_{n}-l_{1})=\widetilde{l}_{2}\iff(\delta^{-d}-1)(l_{2}-l_{1})<\delta^{-d}(\delta^{-d}-1)(l_{n}-l_{1}),$$

and the last inequality is true because $\delta^{-d}>1$ and $l_{n}>l_{2}$ . Since $\widetilde{l}_{1}<\widetilde{l}_{2}$ , the new maximum in ${\bm{u}}^{T}\begin{bmatrix}\widetilde{{\bm{L}}}_{:,1}&\widetilde{{\bm{L}}}_{:,2}&{\bm{L}}_{:,3}&\dots&{\bm{L}}_{:,n}\end{bmatrix}$ is now $\widetilde{l}_{2}$ , and the new minimum is $l_{3}$ .

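More generally, we can repeat this process, and show that the $j$ -th shift operation shifts ${L}_{1,j}$ by $\delta^{-d}(\widetilde{l}_{j-1}-l_{j})$ , and results in the new column id

$$\widetilde{l}_{j}:={\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,j}=l_{j}+\delta^{-d}(\widetilde{l}_{j-1}-l_{j})=l_{j}+\sum_{k=1}^{j-1}\delta^{-kd}(l_{j-k}-l_{j-k+1})+\delta^{-jd}(l_{n}-l_{1}).$$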

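In the general case, $\widetilde{l}_{j-1}<\widetilde{l}_{j}$ holds for all $j\in[2:n]$ , because

$$\widetilde{l}_{j}-\widetilde{l}_{j-1}=l_{j}+\delta^{-d}(\widetilde{l}_{j-1}-l_{j})-\widetilde{l}_{j-1}=(\delta^{-d}-1)(\widetilde{l}_{j-1}-l_{j})>0,$$

where the inequality uses $\delta^{-d}>1$ and the fact that $\widetilde{l}_{j-1}$ is the maximum before the $j$ -th operation, so $\widetilde{l}_{j-1}>l_{j}$ .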

Therefore, after the $j$ -th selective shift operation, $\widetilde{l}_{j}$ is the new maximum among $\{\widetilde{l}_{1},\dots,\widetilde{l}_{j},l_{j+1},\dots,l_{n}\}$ and $l_{j+1}$ is the new minimum, which makes it possible to continue the process until the $n$ -th operation.

As a result, after the whole sweep from $0$ to $\delta^{-d+1}-\delta$ by the first $(1/\delta)^{d}$ layers, a total of $n$ shift operations are applied, and the input ${\bm{L}}$ is mapped to a new point $\widetilde{{\bm{L}}}$ , where ${\bm{u}}^{T}\widetilde{{\bm{L}}}=\begin{bmatrix}\widetilde{l}_{1}&\widetilde{l}_{2}&\dots&\widetilde{l}_{n}\end{bmatrix}$ and $\widetilde{l}_{1}<\widetilde{l}_{2}<\dots<\widetilde{l}_{n}$ .
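The whole sweep can be simulated at the level of column ids, because each shift changes only the first coordinate, whose weight in ${\bm{u}}$ is $1$ . The following sketch is illustrative only: the function name `sweep` and the toy ids are hypothetical, and each layer is modeled by its idealized effect (add $\delta^{-d}$ times the current max-minus-min gap to every id inside the window around $l$ ).

```python
from fractions import Fraction

def sweep(ids, delta, d):
    """Apply the (1/delta)^d selective shifts, in increasing order of l."""
    ids = list(ids)
    scale = 1 / delta**d                         # delta^{-d}
    for k in range(int(1 / delta) ** d):         # l = 0, delta, ..., delta^{-d+1} - delta
        l = k * delta
        gap = max(ids) - min(ids)                # computed on the layer's input
        ids = [x + scale * gap if l - delta / 2 < x < l + delta / 2 else x
               for x in ids]
    return ids

delta, d = Fraction(1, 2), 2
l = [Fraction(0), Fraction(1, 2), Fraction(3, 2)]    # l_1 < l_2 < l_3
lt = sweep(l, delta, d)                              # shifted ids
```

For this input the shifted ids are $6$ , $45/2$ , and $171/2$ , which are strictly increasing and agree with the recursion $\widetilde{l}_{j}=l_{j}+\delta^{-d}(\widetilde{l}_{j-1}-l_{j})$ started from the initial maximum $l_{n}$ .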

We can now prove the following technical lemma, whose proof is deferred to Appendix B.5.4:

Lemma 10. After $n$ shift operations, $\widetilde{l}_{n}={\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,n}$ satisfies the following bounds:

Also, the map from $\begin{bmatrix}l_{1}&l_{2}&\cdots&l_{n}\end{bmatrix}\in[0:\delta:\delta^{-d+1}-\delta]$ (where $l_{1}<l_{2}<\dots<l_{n}$ ) to $\widetilde{l}_{n}$ is one-to-one.
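The one-to-one claim can be sanity-checked by enumeration for small parameters. The sketch below is not a proof and rests on assumptions: the helper `final_id` is hypothetical, it evaluates $\widetilde{l}_{n}$ through the recursion $\widetilde{l}_{j}=l_{j}+\delta^{-d}(\widetilde{l}_{j-1}-l_{j})$ with the initial maximum $l_{n}$ in place of $\widetilde{l}_{0}$ , and it only covers $\delta=1/2$ , $d=2$ , $n=3$ .

```python
from fractions import Fraction
from itertools import combinations

def final_id(ls, delta, d):
    """Return l~_n for an increasing tuple ls = (l_1, ..., l_n)."""
    scale = 1 / delta**d
    prev = ls[-1]                      # before any shift, the maximum is l_n
    for l in ls:
        prev = l + scale * (prev - l)  # l~_j = l_j + delta^{-d} (l~_{j-1} - l_j)
    return prev

delta, d, n = Fraction(1, 2), 2, 3
grid = [k * delta for k in range(int(1 / delta) ** d)]   # [0 : delta : delta^{-d+1} - delta]
# combinations() of a sorted grid yields exactly the strictly increasing tuples.
outputs = {ls: final_id(ls, delta, d) for ls in combinations(grid, n)}
injective = len(set(outputs.values())) == len(outputs)
```

With these parameters there are four increasing tuples, and they map to four distinct values of $\widetilde{l}_{n}$ .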

As mentioned earlier, after this sweep, there is another attention layer with attention part $\delta^{-(n+1)d}\psi(\cdot;0)$ . Since $0<\widetilde{l}_{1}<\cdots<\widetilde{l}_{n}$ , what it does to $\widetilde{{\bm{L}}}$ is that it adds $\delta^{-(n+1)d}\max_{k}{\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,k}=\delta^{-(n+1)d}\widetilde{l}_{n}$ to each entry in the first row of $\widetilde{{\bm{L}}}$ . The output of this layer is defined to be the function $g_{\rm c}({\bm{L}})$ .

Given this result so far, it is now left to check if the constructed network is really a permutation equivariant contextual mapping, i.e., if it satisfies Properties 6.1 and 6.2 in Lemma 6.

B.5.2 Category 2

In the extreme case of $n^{\prime}=1$ , we have $\max_{j}l_{j}=\min_{j}l_{j}$ , so the selective shift operation applied at $l_{j}$ does not shift the entry at all; therefore, at the end of the first $(1/\delta)^{d}$ attention layers, $\widetilde{{\bm{L}}}={\bm{L}}$ .

When $1<n^{\prime}\leq n-1$ , let the $n^{\prime}$ distinct values of $l_{j}$ ’s be $l^{\prime}_{1},\dots,l^{\prime}_{n^{\prime}}$ . The shift operation is applied $n^{\prime}$ times, to $l^{\prime}_{1},\dots,l^{\prime}_{n^{\prime}}$ , and shifts one or more entries at a time. After the first $(1/\delta)^{d}$ layers, the output $\widetilde{{\bm{L}}}$ has $n^{\prime}$ distinct $\widetilde{l}_{j}={\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,j}$ , $0\leq\widetilde{l}_{1}\leq\widetilde{l}_{2}\leq\dots\leq\widetilde{l}_{n}$ , whose distinct values are the same as the numbers we get when we apply shift operations to a length- $n^{\prime}$ sequence $\begin{bmatrix}l^{\prime}_{1}&\dots&l^{\prime}_{n^{\prime}}\end{bmatrix}$ . Then, applying the same calculations from Category 1 shows that

and it follows from the upper bound in Lemma 10 that

After the global shifting by the last layer with attention part $\delta^{-(n+1)d}\psi(\cdot;0)$ , we get the output $g_{\rm c}({\bm{L}})$ which satisfies

B.5.3 Category 3

Recall that the selective shift operation is applied to each element of $[0:\delta:\delta^{-d+1}-\delta]$ , not to negative values. In case of Category 3, we have $\min_{k}{\bm{u}}^{T}{\bm{L}}_{:,k}=l_{1}<0$ , and $l_{1}$ never gets shifted upwards, so it remains as the minimum for the whole time.

In the case where all $l_{j}$ ’s are negative, the selective shift operation never changes the input ${\bm{L}}$ , so we get $\widetilde{{\bm{L}}}={\bm{L}}$ . Since we have ${\bm{u}}^{T}\widetilde{{\bm{L}}}<{\bm{0}}_{n}^{T}$ (entry-wise), the last layer with attention part $\delta^{-(n+1)d}\psi(\cdot;0)$ adds $\delta^{-(n+1)d}\min_{k}{\bm{u}}^{T}\widetilde{{\bm{L}}}_{:,k}<0$ to each entry in the first row of $\widetilde{{\bm{L}}}$ , further pushing it to the negative side. Therefore, the final output $g_{\rm c}({\bm{L}})$ satisfies ${\bm{u}}^{T}g_{\rm c}({\bm{L}})<{\bm{0}}_{n}^{T}<t_{l}{\bm{1}}_{n}^{T}$ .

Now consider the case where at least one $l_{j}$ is nonnegative. Let $i$ be the index that satisfies $l_{i-1}<0\leq l_{i}$ . Then, the selective shift operation does not affect $l_{1},\dots,l_{i-1}$ , and then it shifts $l_{i}$ by

where we used $\delta^{-1}\geq 2$ at the last inequality. The next shift operations shift $l_{i+1},\dots,l_{n}$ by even larger amounts, so at the end of the first $(1/\delta)^{d}$ layers, we have $\delta^{-(n+1)d+1}\leq\widetilde{l}_{i}\leq\dots\leq\widetilde{l}_{n}$ , while $\widetilde{l}_{j}=l_{j}<0$ for $j\in[i-1]$ .

Here, the last layer with attention part $\delta^{-(n+1)d}\psi(\cdot;0)$ acts differently for negative and positive $\widetilde{l}_{j}$ ’s. For negative $\widetilde{l}_{j}$ ’s, it adds $\delta^{-(n+1)d}\min_{k}\widetilde{l}_{k}=\delta^{-(n+1)d}l_{1}<0$ to $\widetilde{l}_{1},\dots,\widetilde{l}_{i-1}$ , pushing them further to the negative side. For positive $\widetilde{l}_{j}$ ’s, the layer adds $\delta^{-(n+1)d}\max_{k}\widetilde{l}_{k}=\delta^{-(n+1)d}\widetilde{l}_{n}\geq\delta^{-(2n+2)d+1}$ to $\widetilde{l}_{i},\dots,\widetilde{l}_{n}$ , so that they are all greater than or equal to $\delta^{-(2n+2)d+1}$ . Note that $\delta^{-(2n+2)d+1}>t_{r}$ .
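The two-sided behavior of the last layer can be sketched at the level of column ids. This is a hedged illustration rather than the construction itself: `global_shift` is a hypothetical helper, the toy ids are chosen by hand and are not outputs of the sweep, and the treatment of an id exactly equal to $0$ is an assumption made only so that the function is total.

```python
from fractions import Fraction

def global_shift(ids, delta, d, n):
    """Idealized effect of the layer delta^{-(n+1)d} * psi(.; 0) on column ids."""
    scale = 1 / delta ** ((n + 1) * d)           # delta^{-(n+1)d}
    hi, lo = max(ids), min(ids)
    # Negative ids receive scale * min, the others receive scale * max.
    # Assumption: an id equal to 0 is handled like a positive one.
    return [x + scale * (hi if x >= 0 else lo) for x in ids]

delta, d, n = Fraction(1, 2), 1, 2
ids = [Fraction(-3), Fraction(2), Fraction(5)]   # toy ids: one negative, two positive
out = global_shift(ids, delta, d, n)
```

Here the scale is $\delta^{-(n+1)d}=8$ , so the negative id moves from $-3$ to $-27$ while the positive ids move from $2$ and $5$ to $42$ and $45$ ; the two groups are pushed in opposite directions.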

Therefore, in both cases, we can see that the final output $g_{\rm c}({\bm{L}})$ satisfies ${\bm{u}}^{T}g_{\rm c}({\bm{L}})_{:,j}\notin[t_{l},t_{r}]$ , for all $j\in[n]$ . This completes the verification of Property 6.4.

B.5.4 Proof of Lemma 10

The proofs of the lower and upper bounds on $\widetilde{l}_{n}$ are straightforward:

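For the one-to-one property of the map, consider $\begin{bmatrix}l_{1}&l_{2}&\cdots&l_{n}\end{bmatrix}$ and $\begin{bmatrix}l^{\prime}_{1}&l^{\prime}_{2}&\cdots&l^{\prime}_{n}\end{bmatrix}$ with increasing entries, which are mapped to $\widetilde{l}_{n}$ and $\widetilde{l}^{\prime}_{n}$ , respectively. Suppose $\widetilde{l}_{n}=\widetilde{l}^{\prime}_{n}$ . By definition,

$$\widetilde{l}_{n}-\widetilde{l}^{\prime}_{n}=(l_{n}-l^{\prime}_{n})+\delta^{-d}(l_{n-1}-l_{n}-l^{\prime}_{n-1}+l^{\prime}_{n})+\dots+\delta^{-(n-1)d}(l_{1}-l_{2}-l^{\prime}_{1}+l^{\prime}_{2})+\delta^{-nd}(l_{n}-l_{1}-l^{\prime}_{n}+l^{\prime}_{1})=0.$$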

Now assume for contradiction that $l_{n}\neq l^{\prime}_{n}$ . Then, we have $-\delta^{-d+1}+\delta\leq l_{n}-l^{\prime}_{n}\leq\delta^{-d+1}-\delta$ . However, the remaining terms have “coarse resolution”, and they can never cancel $l_{n}-l^{\prime}_{n}$ and make the sum zero, because for example, $\delta^{-d}(l_{n-1}-l_{n}-l^{\prime}_{n-1}+l^{\prime}_{n})$ can only have values $0,\delta^{-d+1},-\delta^{-d+1},2\delta^{-d+1},-2\delta^{-d+1},\dots$ . Thus, $l_{n}=l^{\prime}_{n}$ must hold and the first term must be zero.

Similarly, assume that $l_{n-1}\neq l^{\prime}_{n-1}$ . Then, the second term is in the interval $[-\delta^{-2d+1}+\delta^{-d+1},\delta^{-2d+1}-\delta^{-d+1}]$ . Again, the remaining terms cannot cancel the second term, hence $l_{n-1}=l^{\prime}_{n-1}$ must hold. We can proceed this way, and show that $l_{j}=l^{\prime}_{j}$ must hold for all $j\in[n]$ , hence proving that the map is one-to-one.

B.6 Proof of Lemma 7

so $g_{\rm c}({\bm{L}})$ is left untouched.

If some other ${\bm{L}}$ is a permutation of $\overline{\bm{L}}$ , and ${\bm{L}}_{:,i}=\overline{\bm{L}}_{:,j}$ , then

so $i$ -th column of $g_{\rm c}({\bm{L}})$ will turn to

which is the desired output. In conclusion, this layer maps the column $g_{\rm c}(\overline{\bm{L}})_{:,j}$ to $({\bm{A}}_{\overline{\bm{L}}})_{:,j}$ , without affecting any other columns.

Appendix C Proof of Theorem 3

The proof of Theorem 3 proceeds in a similar way to that of Theorem 2. As in the proof of Theorem 2, there are three parts: Lemma 8, Proposition 4, and Lemma 9. Lemmas 8 and 9 can be stated and proved in almost the same way, this time without permutation equivariance.

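For the proof of the second part, which corresponds to Proposition 4, we construct the network in a similar way. Recall that we can assume without loss of generality that ${\bm{X}}\in[0,1]^{d\times n}$ . Choose

$${\bm{E}}=\begin{bmatrix}0&1&2&\cdots&n-1\\0&1&2&\cdots&n-1\\\vdots&\vdots&\vdots&&\vdots\\0&1&2&\cdots&n-1\end{bmatrix}.$$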

Then, the first column of ${\bm{X}}+{\bm{E}}$ is in $[0,1]^{d}$ , the second is in $[1,2]^{d}$ , and so on; this means that for all rows, the coordinates are monotonically increasing. So we can use the same technique as the proof of Proposition 4 to divide the input values into cubes, quantize them to ${\bm{L}}$ , apply contextual mapping, and then value mapping. We describe each step in the following.

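C.1 Quantization by feed-forward layers

In a similar way as Lemma 5, the goal of this step is to quantize the input in $[0,1]^{d}\times[1,2]^{d}\times\dots\times[n-1,n]^{d}$ to its discrete version:

$$[0:\delta:1-\delta]^{d}\times[1:\delta:2-\delta]^{d}\times\dots\times[n-1:\delta:n-\delta]^{d}.$$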

This can be done by $dn/\delta$ feed-forward layers. We add $dn/\delta$ layers of the following form, for $k=0,\delta,\dots,n-\delta$ and $i=1,\dots,d$ :

After $dn/\delta$ layers, any input entry of ${\bm{X}}+{\bm{E}}$ in $[k,k+\delta)$ is quantized to $k$ , for each $k=0,\delta,\dots,n-\delta$ .
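The net effect of this quantization step is to round each entry down to the nearest multiple of $\delta$ . The sketch below reproduces only that net effect with a floor operation; it does not implement the $dn/\delta$ feed-forward layers, and the name `quantize` and the sample entries are hypothetical.

```python
from fractions import Fraction
from math import floor

def quantize(x, delta):
    """Round x down to the nearest multiple of delta."""
    return floor(x / delta) * delta

delta = Fraction(1, 4)
entries = [Fraction(7, 8), Fraction(9, 4), Fraction(11, 8)]   # sample entries of X + E
quantized = [quantize(x, delta) for x in entries]
```

For $\delta=1/4$ the entries $7/8$ , $9/4$ , and $11/8$ become $3/4$ , $9/4$ , and $5/4$ , respectively.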

C.2 Contextual mapping by attention layers

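By Step 1, we quantized any input ${\bm{X}}+{\bm{E}}$ to its quantized version. We call this quantized version ${\bm{L}}$ :

$${\bm{L}}\in[0:\delta:1-\delta]^{d}\times[1:\delta:2-\delta]^{d}\times\dots\times[n-1:\delta:n-\delta]^{d}.$$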

As done in Lemma 6, we define ${\bm{u}}:=(1,\delta^{-1},\dots,\delta^{-d+1})$ and $l_{j}:={\bm{u}}^{T}{\bm{L}}_{:,j}$ , for all $j\in[n]$ . Note that, because ${\bm{L}}_{:,j}\in[j-1:\delta:j-\delta]^{d}$ , we have

$$(j-1)\sum_{i=0}^{d-1}\delta^{-i}\leq l_{j}\leq(j-1)\sum_{i=0}^{d-1}\delta^{-i}+\delta^{-d+1}-\delta,$$

and $l_{1}<l_{2}<\dots<l_{n}$ . Notice that this corresponds to Category 1 in the proof of Lemma 6.
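The separation of the column ids induced by the positional offsets can be verified numerically for small parameters. This is a sketch under assumptions: `id_range` is a hypothetical helper that enumerates every ${\bm{L}}_{:,j}\in[j-1:\delta:j-\delta]^{d}$ and reports the smallest and largest id, and the check is run only for $\delta=1/2$ , $d=2$ , $n=3$ .

```python
from fractions import Fraction
from itertools import product

def id_range(j, delta, d):
    """Smallest and largest u^T L[:, j] over L[:, j] in [j-1 : delta : j-delta]^d."""
    grid = [(j - 1) + k * delta for k in range(int(1 / delta))]
    ids = [sum(x / delta**i for i, x in enumerate(col))
           for col in product(grid, repeat=d)]
    return min(ids), max(ids)

delta, d, n = Fraction(1, 2), 2, 3
ranges = [id_range(j, delta, d) for j in range(1, n + 1)]
# Ranges of consecutive positions must not overlap for l_1 < ... < l_n to hold.
separated = all(ranges[j][1] < ranges[j + 1][0] for j in range(n - 1))
```

The computed ranges are $[0,3/2]$ , $[3,9/2]$ , and $[6,15/2]$ , so ids from different positions never overlap and $l_{1}<l_{2}<l_{3}$ holds for every quantized input.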

For simplicity of notation, let $s_{j}=(j-1)\sum_{k=0}^{d-1}\delta^{-k}$ . We stack $n(1/\delta)^{d}$ attention layers, with attention parts $\delta^{-d}\Psi(\cdot;l-\delta/2,l+\delta/2)$ for each $l\in\bigcup_{j=1}^{n}[s_{j}:\delta:s_{j}+\delta^{-d+1}-\delta]$ , in increasing order of $l$ .

These $n(1/\delta)^{d}$ attention layers perform selective shift operations on $l_{j}$ ’s, in increasing order of $j$ . As seen in Appendix B.5.1, shift operations result in $\widetilde{l}_{1}<\widetilde{l}_{2}<\dots<\widetilde{l}_{n}$ . Also, the map from ${\bm{L}}$ to $\widetilde{l}_{n}$ is one-to-one, which can be shown in the same way as Appendix B.5.4. Since the range of the $l_{j}$ ’s is a bit different, we have a different upper bound on $\widetilde{l}_{n}$ :

Finally, we add an extra single-head attention layer with attention part $n\delta^{-(n+1)d-1}\psi(\cdot;0)$ . We define the output of this layer as $g_{\rm c}({\bm{L}})$ . In a similar way as Appendix B.5.1, this layer shifts every entry in the first row by $n\delta^{-(n+1)d-1}\widetilde{l}_{n}$ , thus making the intervals corresponding to different values of $\widetilde{l}_{n}$ disjoint from each other. This ensures that different contexts ${\bm{L}}$ are mapped to distinct numbers in ${\bm{u}}^{T}g_{\rm c}({\bm{L}})$ , thus implementing a contextual mapping.

C.3 Function value mapping by feed-forward layers

Now, it is left to map $g_{\rm c}({\bm{L}})$ to the desired output. As seen in the last step, each different context ${\bm{L}}$ maps to $n$ unique numbers ${\bm{u}}^{T}g_{\rm c}({\bm{L}})$ , which are at least $\delta$ apart from each other. The value mapping step can be done in a similar way as Lemma 7. The construction now requires $O(n(1/\delta)^{dn})$ layers because there is no permutation equivariance.

Appendix D Experimental setup

For our experiments we follow the same setting as in BERT (Devlin et al., 2018). We first pre-train the models on the masked language modeling task and the next sentence prediction task. We use the English Wikipedia corpus and the BooksCorpus dataset (Zhu et al., 2015) for this pre-training. We use $\text{BERT}_{\text{BASE}}$ , a 12-layer Transformer model, as the baseline. This model uses an embedding size of 768 and has 12-head self-attention layers and 3072-wide feed-forward layers. We train it with the Adam optimizer, with $0.01$ dropout and weight decay. We pre-train for 250k steps with a batch size of 1024 and a max sequence length of 512. Pre-training takes around 2 days on 16 TPUv3 chips. We take the pre-trained models and fine-tune them on the MNLI and SQuAD datasets separately, using the same hyper-parameters as in Devlin et al. (2018). MNLI is a sentence entailment task in which, given a premise sentence, we must classify a hypothesis sentence into the neutral, contradiction, or entailment class. We report the classification accuracy on this task. SQuAD is a question answering task in which, given a paragraph and a question, we must identify the answer as a span of words in the paragraph. For this task we report both the F1 score and the Exact Match (EM) percentage. The metrics are reported on the dev sets of these datasets.

For our experiments with the depth-wise separable convolution layers, we follow the implementation of Wu et al. (2019). We first use a GLU layer followed by the convolution layer. We use 16 separable convolution filters, of filter length 128, and reuse them, with each filter operating on 48 of the 768 dimensions of the input. This layer also has a skip connection and the output is normalized using layer normalization, similar to the self-attention layer. In our experiments, we replace the self-attention layers of the Transformers, in the lower layers, with this convolution layer. We keep the feed-forward layer of the Transformer block the same.

For the experiments performed in this paper, one might consider an alternative explanation: the tasks considered may be easy and may not require any advanced architecture, so that even a simple architecture (bi-linear projection or separable convolution) might solve them. To rule out this case, we consider an even simpler architecture, namely average attention, as a baseline for our experiments.

Average attention. An average attention layer replaces the self-attention layer, and just computes the average of projections of all the other tokens. That is, we replace $\sigma[({\bm{W}}_{K}^{i}{\bm{X}})^{T}{\bm{W}}_{Q}^{i}{\bm{X}}]$ in (1) with a matrix full of $1/n$ . The model still has the skip connections and the feed-forward layers like Transformer.
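A minimal sketch of this baseline follows, assuming a single head and plain nested lists for matrices; the names `matmul` and `average_attention`, the argument layout, and the identity weights in the example are all hypothetical. Replacing the softmax matrix by a matrix full of $1/n$ means every position receives the same update, namely the projected mean of all token embeddings, added through the skip connection.

```python
def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def average_attention(X, W_V, W_O):
    """Single-head average attention with a skip connection.

    X is d x n (one column per token), W_V is m x d, W_O is d x m.
    Computes X + W_O W_V X A, where every entry of the n x n matrix A is 1/n.
    """
    n = len(X[0])
    mean_col = [[sum(row) / n] for row in X]        # d x 1 average token
    update = matmul(W_O, matmul(W_V, mean_col))     # d x 1, identical for all positions
    return [[X[i][j] + update[i][0] for j in range(n)] for i in range(len(X))]

X = [[1.0, 3.0],
     [2.0, 6.0]]
I2 = [[1.0, 0.0],
      [0.0, 1.0]]
out = average_attention(X, I2, I2)
```

With identity projections the average token is $(2,4)$ , so the two columns $(1,2)$ and $(3,6)$ become $(3,6)$ and $(5,10)$ . The layer has no input-dependent weighting, which is what makes it a deliberately weak baseline.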