Adaptive Attention Span in Transformers

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin

Introduction

Language models are at the core of many NLP applications, such as machine translation or dialogue. Recently, much progress has been made by a new neural network called the Transformer (Vaswani et al., 2017). Part of its success is due to its ability to capture long-term dependencies. This is achieved by taking long sequences as inputs and explicitly computing the relations between every pair of tokens via a mechanism called the “self-attention” layer (Al-Rfou et al., 2019).

While this layer allows information to propagate across long distances, it has a computational and memory cost that scales quadratically with the size of the input sequence. As a consequence, Transformers hardly scale to sequences of more than a thousand tokens. This is particularly problematic in the case of character level language modeling, where dependencies are often spread over a few thousand time steps.

In this work, we propose an alternative to the self-attention layer to reduce the computational burden of a Transformer. Our layer learns its optimal context size, resulting in a network where each attention layer gathers information on its own context. In practice, we observe that this leads to Transformers with small contexts in the low-level layers and very large ones in the last layers. With this modification, we are able to scale input sequences to more than $8$k tokens with no loss of performance and no additional computational or memory cost. We validate our approach on the task of character level language modeling, where we reach state-of-the-art performance while reducing the number of FLOPS. The code to reproduce our results is publicly available at https://github.com/facebookresearch/adaptive-span.

Approach

Language modeling is the problem of assigning a probability to a sequence of tokens $(w_{1},\dots,w_{T})$:

$$P(w_{1},\dots,w_{T})=\prod_{t=1}^{T}P(w_{t}\mid w_{t-1},\dots,w_{1}).$$

Recent progress was made with a new auto-regressive model called Sequential Transformer (Vaswani et al., 2017). A Transformer is made of a sequence of layers, each composed of a block of parallel self-attention layers followed by a feedforward network. We refer to Vaswani et al. (2017) for the details of the structure. In this paper, we make a couple of modifications to the Transformer model: we use the relative position embeddings of Shaw et al. (2018) and the caching mechanism of Dai et al. (2019) to speed up training and testing.

A core mechanism of a Transformer network is the self-attention layer, which consists of multiple attention heads working in parallel. Each attention head applies the attention mechanism of Bahdanau et al. (2015) to its own input. Given a token $t$ in a sequence, the head first computes similarities with its past, i.e., any token $r$ in the span $[t-S,t)$:

$$s_{tr}=\mathbf{x}_{t}^{\top}\mathbf{W}_{q}^{\top}\left(\mathbf{W}_{k}\mathbf{x}_{r}+\mathbf{p}_{t-r}\right),\tag{1}$$

where $\mathbf{W}_{k}$ and $\mathbf{W}_{q}$ are the “key” and “query” matrices, and $\mathbf{p}_{t-r}$ is the relative position embedding. The attention weights are then obtained by applying a softmax function on these similarities:

$$a_{tr}=\frac{\exp(s_{tr})}{\sum_{q=t-S}^{t-1}\exp(s_{tq})}.\tag{2}$$

Finally, the head outputs a vector $\mathbf{y}_{t}$ by taking the average of the past representations weighted by their attention weights:

$$\mathbf{y}_{t}=\sum_{r=t-S}^{t-1}a_{tr}\mathbf{W}_{v}\mathbf{x}_{r},\tag{3}$$

where $\mathbf{W}_{v}$ is called the “value” matrix. Outputs from different heads are then concatenated together and multiplied by an output matrix $\mathbf{W}_{o}$ before feeding to the next layer.
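As a rough illustration, the computation of a single attention head can be sketched in plain Python. The sketch assumes the similarity takes the form $s_{tr}=\mathbf{x}_{t}^{\top}\mathbf{W}_{q}^{\top}(\mathbf{W}_{k}\mathbf{x}_{r}+\mathbf{p}_{t-r})$. The helper names (`matvec`, `dot`, `attention_head`) and the `pos` lookup table are hypothetical and are not taken from the released code, which batches these operations on GPU.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def attention_head(xs, t, S, Wq, Wk, Wv, pos):
    """Attention weights and output y_t for tokens r in [t-S, t).

    xs  : list of token vectors x_0 .. x_{T-1}
    pos : hypothetical table mapping a distance d = t - r to p_d
    Requires t >= 1 so that the span is not empty.
    """
    start = max(0, t - S)
    q = matvec(Wq, xs[t])  # W_q x_t
    # Similarity with each past token: (W_q x_t)^T (W_k x_r + p_{t-r})
    sims = []
    for r in range(start, t):
        k = [a + b for a, b in zip(matvec(Wk, xs[r]), pos[t - r])]
        sims.append(dot(q, k))
    # Softmax over the span, shifted by the max for numerical stability
    shift = max(sims)
    exps = [math.exp(s - shift) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the value projections W_v x_r
    y = [0.0] * len(Wv)
    for a, r in zip(weights, range(start, t)):
        v = matvec(Wv, xs[r])
        y = [yi + a * vi for yi, vi in zip(y, v)]
    return weights, y
```

The inner loop runs once per token in the span, which is where the linear dependence on $S$ per token and per head comes from.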

Similar to the memory access mechanisms of Sukhbaatar et al. (2015), an attention head pulls information from the past to update the current token representation. Repeating this mechanism in consecutive layers allows information to flow over long distances. However, for each input token, the memory and time cost of each attention head scales linearly with the context size, or attention span. There are typically $12$ layers with $8$ heads each that process $512$ tokens simultaneously. This drastically limits the maximum attention span used in Transformers.
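To give a sense of scale, the number of attention weights computed for one block of tokens can be counted directly from the figures above. The function name is illustrative, and the count deliberately ignores the key, query and value projections.

```python
def attention_weight_count(layers, heads, tokens, span):
    """Number of attention weights a_tr computed for one block of tokens."""
    return layers * heads * tokens * span

for span in (256, 1024, 8192):
    print(span, attention_weight_count(12, 8, 512, span))
# 256 12582912
# 1024 50331648
# 8192 402653184
```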

2 Adaptive attention span

All attention heads of a Transformer share the same attention span $S$. This assumes that every head requires the same span to form its representation. As shown in Figure 1, this assumption does not hold in the context of character level language modeling: some heads (e.g., Head A) focus on the recent history, while others take information from the whole available context (e.g., Head B). In this section, we propose to learn the attention span of each head independently to reduce their computational and memory cost.

For each head, we add a masking function to control the span of the attention. A masking function is a non-increasing function that maps a distance to a value in $[0,1]$. We take the following soft masking function $m_{z}$ parametrized by a real value $z$ in $[0,S]$:

$$m_{z}(x)=\min\left[\max\left[\frac{1}{R}\left(R+z-x\right),0\right],1\right],$$

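where $R$ is a hyper-parameter that controls its softness. This soft masking function is inspired by Jernite et al. (2017). In Figure 2, we show the shape of this piecewise function as a function of the distance. Writing $s_{tr}$ for the similarity between tokens $t$ and $r$, the attention weights from Eq. 2 are then computed on the masked span, i.e.,

$$a_{tr}=\frac{m_{z}(t-r)\exp(s_{tr})}{\sum_{q=t-S}^{t-1}m_{z}(t-q)\exp(s_{tq})}.$$

We add an $\ell_{1}$ penalization on the parameters $z_{i}$ for each attention head $i$ of the model to the loss function:

$$L=-\log P(w_{1},\dots,w_{T})+\frac{\lambda}{M}\sum_{i}z_{i},$$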

where $\lambda>0$ is the regularization hyper-parameter, and $M$ is the number of heads in each layer. Our formulation is differentiable in the parameters $z_{i}$ and we learn them jointly with the rest of the model.
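The masking and the span penalty can be sketched as follows. This is a minimal illustration under two assumptions: the mask is the clipped linear ramp $m_{z}(x)=\min[\max[(R+z-x)/R,0],1]$, and the penalty is $\frac{\lambda}{M}\sum_{i}z_{i}$. The function names are hypothetical and do not come from the released code.

```python
import math

def soft_mask(x, z, R):
    """Clipped linear ramp: 1 for x <= z, 0 for x >= z + R."""
    return min(max((R + z - x) / R, 0.0), 1.0)

def masked_attention(sims, z, R):
    """Masked attention weights, where sims[d - 1] is the similarity
    with the token d steps back (d = 1 .. len(sims)).

    Assumes z + R > 1 so that at least the nearest token is unmasked.
    """
    shift = max(sims)  # shift for numerical stability
    scores = [soft_mask(d, z, R) * math.exp(s - shift)
              for d, s in enumerate(sims, start=1)]
    total = sum(scores)
    return [v / total for v in scores]

def span_penalty(spans, lam):
    """Assumed L1 penalty (lambda / M) * sum_i z_i over M heads."""
    return lam / len(spans) * sum(spans)
```

Tokens farther than $z+R$ steps back receive a weight of exactly zero, so their similarities need not be computed at all. This is the source of the savings once $z$ has been learned.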

As an extension, we consider a dynamic computation approach (Graves, 2016) where the attention span dynamically changes based on the current input (Luong et al., 2015; Shu and Nakayama, 2017). At a time step $t$, the span parameter $z_{t}$ of an attention head is then a function of the input parametrized by a vector $\mathbf{v}$ and a scalar $b$, i.e., $z_{t}=S\sigma(\mathbf{v}^{T}\mathbf{x}_{t}+b)$. We penalize $z_{t}$ in the same way as before and learn the parameters $\mathbf{v}$, $b$ jointly with the rest of the parameters.
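The dynamic span formula $z_{t}=S\sigma(\mathbf{v}^{T}\mathbf{x}_{t}+b)$ is simple enough to write out directly. The sketch below uses hypothetical function names and a numerically stable sigmoid; the input values in the usage lines are made up for illustration.

```python
import math

def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def dynamic_span(x_t, v, b, S):
    """Span z_t = S * sigmoid(v^T x_t + b) for the current input x_t."""
    activation = sum(vi * xi for vi, xi in zip(v, x_t)) + b
    return S * sigmoid(activation)

# With v^T x_t = 0, the bias alone sets the span.
print(dynamic_span([0.0, 0.0], [1.0, 1.0], 0.0, 1024))   # 512.0
print(dynamic_span([0.0, 0.0], [1.0, 1.0], -4.0, 1024))  # roughly 18.4
```

A strongly negative bias therefore keeps the span at a small fraction of $S$ until the input pushes the activation up.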

Experiments

In this section, we evaluate the impact of our adaptive attention mechanism in the experimental setting of Al-Rfou et al. (2019) for character level language modeling.

We use the text8 and enwik8 datasets of Mahoney (2011). Both datasets have $100$M tokens. We report bits per character (bpc) on the dev and test sets.
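For reference, bits per character is the average negative base-2 log probability that the model assigns to each character of the evaluation text. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
import math

def bits_per_character(char_probs):
    """Average of -log2 p over the probabilities assigned to the
    characters that actually occurred."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

print(bits_per_character([0.5, 0.25]))  # 1.5
```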

Implementation details.

We experiment with two sizes of models. Our small models have $12$ layers and a hidden size of $d_{h}=512$ , except for the feedforward ReLU layers, which have $2048$ units. The large models have $24$ layers with a hidden size of $d_{h}=768$ , and a ReLU size of $4096$ . All models have $8$ attention heads in each layer. Token and position embedding parameters are initialized from $\mathcal{N}(0,1)$ , and the projection matrices $\mathbf{W}_{\{q,k,v,o\}}$ are initialized from $\mathcal{U}(-1/\sqrt{d_{h}},1/\sqrt{d_{h}})$ . A single set of position embeddings $\mathbf{p}_{t}$ is shared across all the heads.

In adaptive-span models, we reparameterized the span parameter $z$ by $z=Sz^{\prime}$, where $z^{\prime}\in[0,1]$ is initialized to $0$. In dynamic-span models, the bias term $b$ is initialized to $-4$ to make initial spans small. We set the hyperparameters $\lambda=2\times 10^{-6}$ and $R=32$ for both types of models, except that $\lambda$ is reduced to $0.5\times 10^{-6}$ when $S=8192$ because $z$ was not growing longer than 4000.

We use Adagrad with a batch size of $64$, a fixed learning rate of $0.07$, and $32$k warm-up steps. Our warm-up strategy differs from Vaswani et al. (2017): we linearly increase the learning rate from zero to the final learning rate. Gradients of each module are clipped at $0.03$ for better stability. At train time, we use a block of $512$ consecutive characters and compute the loss and gradient for each of those $512$ characters.
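The warm-up schedule described above amounts to a linear ramp followed by a constant rate. A minimal sketch, assuming the ramp is measured in optimizer steps; the function name and default arguments are illustrative:

```python
def warmup_lr(step, final_lr=0.07, warmup_steps=32000):
    """Learning rate at a given step: linear from 0 to final_lr over
    warmup_steps, then held constant."""
    return final_lr * min(1.0, step / warmup_steps)

print(warmup_lr(0))      # 0.0
print(warmup_lr(64000))  # 0.07
```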

In small models, we apply dropout with a rate of $0.3$ to the attention and the feedforward ReLU activations. We train small models for $600K$ steps ( $900K$ steps when $S=8192$ ), which takes about $2\sim 3$ days on $8$ V100 GPUs depending on the attention span limit. Large models are trained with a dropout rate of $0.4$ until the validation performance stopped improving ( $250K$ steps for text8 and $150K$ steps for enwik8), and then further trained for $20K$ steps with a learning rate divided by $10$ .

Results.

In Table 1, we compare our sequential Transformer with the adaptive spans (“Adaptive-Span”) of Sec. 2.2 to the models of Al-Rfou et al. (2019) and Dai et al. (2019). For small models, our model outperforms the other Transformers by $0.07$ bpc while significantly reducing the memory usage for large attention spans. Interestingly, even with the span limit set to $8192$, the average span is only $314$. Similar results are obtained on enwik8 as shown in Table 2, where the adaptive-span model outperforms similarly sized models with a significantly smaller average span. Our large models achieve state-of-the-art performance on both datasets with fewer parameters and FLOPS.

In Figure 3, we compare the fixed and adaptive span small Transformers as we increase the attention span limit $S$. The performance of both models improves as the limit increases (see Figure 3(left)), but the adaptive-span model benefits more from a longer span. As shown in Figure 3(center), a Transformer with adaptive spans controls its average span, leading to a reduction of up to $70\%$ in the number of FLOPS for inference with large spans (see Figure 3(right)).

Impact on the attention span.

In Figure 4, we show the final attention spans of every attention head of our small adaptive-span model with $S=4096$. Even though all the span sizes are initialized to the same value, we see large variation in their final values. The lowest 5 layers have the smallest possible attention span, which is $R=32$ of the masking function. This indicates that lower layers in a Transformer model do not really require a long attention span in this particular task. In contrast, a few attention heads in the higher layers have very long spans, exceeding several thousand. Although there is a general tendency for higher layers to have longer attention spans, the span is not a simple monotonic function of the layer height.

Impact on the number of FLOPS.

Having a smaller attention span has a direct impact on the total number of FLOPS necessary for computing a one-step prediction. In a standard fixed-span model, the total number of FLOPS is mostly controlled by the feed-forward layer (accounting for 62% of FLOPS when $S=256$). However, as the span increases, the attention layer dominates the computation (82% of FLOPS when $S=8192$), making it hard to scale to longer sequences. In contrast, learning the attention span keeps computation at a relatively constant level even as $S$ increases, as shown in Figure 3(right).

The memory usage is also dominated by the attention layer as the attention span increases. Thus, reducing the average span will also reduce the memory usage. However, because all heads in a single layer attend to common state vectors, the maximum span within each layer will determine the memory usage. The same is true for the number of FLOPS if all heads of a layer are computed together, as is often done for better efficiency.
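The gap between the average and the maximum span of a layer can be made concrete with a small example. The span values below are invented for illustration and are not measurements from our models; the function name is likewise hypothetical.

```python
def layer_span_stats(head_spans):
    """Return (average span, maximum span) over the heads of one layer."""
    return sum(head_spans) / len(head_spans), max(head_spans)

# Seven short-span heads and a single long-span head in the same layer.
average, maximum = layer_span_stats([32] * 7 + [4000])
print(average, maximum)  # 528.0 4000
```

In this hypothetical layer, the cached state vectors must cover 4000 steps even though the average head only looks back 528 steps.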

In practice, the largest fixed-span model that could fit in memory for training had a span of $S=2048$ (batches had to be split when $S=4096$), and it took about 550ms per batch. In contrast, an adaptive-span model with a 4 times longer span of $S=8192$ fit in memory and took a similar time per batch.

Dynamic span.

In Table 3, we show that the adaptive and dynamic spans achieve the same performance with comparable average spans on text8. Figure 5 shows how the average dynamic span adapts to the input sequence. The span increases at the beginning of words and in the middle of compound words, e.g., to predict the “l” in “overlook”.

Conclusion

In this work, we present a novel self-attention layer with an adaptive span. This mechanism allows for models with longer context, and thus with the capability to capture longer dependencies. We have shown the importance of this feature in the context of character level modeling, where information is spread over great distances.