Mega: Moving Average Equipped Gated Attention

Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer

Introduction

Designing a single unified model to capture long-range dependencies in sequential data across a diverse range of modalities, such as language, audio, image and video, is a central and challenging problem in sequence modeling. A number of different architectures have been developed, including convolutional neural networks (CNNs) (Kim, 2014; Strubell et al., 2017), recurrent neural networks (RNNs) (Goller and Kuchler, 1996; Hochreiter and Schmidhuber, 1997; Cho et al., 2014), Transformers (Vaswani et al., 2017) and recent state space models (SSMs) (Gu et al., 2022a; Mehta et al., 2022). Among these models, the Transformer architecture (Vaswani et al., 2017) has stood out for its impressive empirical success on a wide range of language and vision tasks, including machine translation (Vaswani et al., 2017; Ott et al., 2018), language understanding (Devlin et al., 2019; Liu et al., 2019), image recognition (Dosovitskiy et al., 2020; Touvron et al., 2021) and genetic sequence modeling (Madani et al., 2020; Jumper et al., 2021), mainly because of the conceptually attractive attention mechanism (Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017) which directly models interactions between each pair of input tokens.

Attention provides the key mechanism that captures contextual information from the entire sequence by modeling pairwise interactions between the inputs at every timestep. However, the attention mechanism has two common drawbacks: i) weak inductive bias; and ii) quadratic computational complexity. First, the attention mechanism does not assume prior knowledge of the patterns of dependencies between tokens (e.g. positional inductive bias), instead learning to predict the pairwise attention weights directly from data. Second, the cost to compute and store the attention weights is quadratic in the length of the input sequences. Recent studies have shown the limitations of applying Transformers to long sequence tasks, with respect to both accuracy and efficiency (Tay et al., 2020).
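Both drawbacks can be seen in a few lines of code. The following is a minimal NumPy sketch of single-head softmax self-attention, not the implementation used in this work; the function name `softmax_attention`, the weight matrices and the toy sizes are assumptions made for illustration. The weight matrix `A` has one entry per pair of tokens, and without positional information the layer carries no prior about token order: permuting the inputs simply permutes the outputs.

```python
import numpy as np

def softmax_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n, n): one score per token pair
    scores -= scores.max(axis=-1, keepdims=True)   # subtract the row max for stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # each row is a distribution over positions
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 8, 4                                        # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = softmax_attention(X, Wq, Wk, Wv)

# Shuffling the tokens shuffles the outputs in the same way: no positional prior.
perm = rng.permutation(n)
out_perm, _ = softmax_attention(X[perm], Wq, Wk, Wv)
```

Storing `A` takes $n^{2}$ entries, which is the quadratic cost discussed above.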

In this work, we propose a moving average equipped gated attention mechanism (Mega) to solve the two weaknesses simultaneously. The key idea is to incorporate inductive biases into the attention mechanism across the timestep dimension, by leveraging the classic exponential moving average (EMA) approach (Hunter, 1986). EMA captures local dependencies that exponentially decay over time (see Figure 1), and has been widely used in time series data modeling (§2). We introduce a multi-dimensional damped form of EMA with learnable coefficients (§3.1), and subsequently develop the moving average equipped gated attention mechanism by integrating the EMA with a variant of the single-head gated attention (Hua et al., 2022) (§3.2). Theoretically, we show that the single-head gated attention is as expressive as the most commonly used multi-head attention (§3.3). Benefiting from the incorporated moving average mechanism, we further propose a variant of Mega with linear complexity, named Mega-chunk, which simply chunks input sequences into fixed blocks with minimal loss of contextual information (§3.5).

Experimentally, through five sequence modeling tasks across various data types, including long-context sequence modeling, neural machine translation, auto-regressive language modeling, and image and speech classification, we demonstrate that Mega significantly outperforms a variety of strong baseline models, in terms of both effectiveness and efficiency (§4) (see Table 1). These improvements illustrate the importance of modeling long- and short-term dependencies via different patterns of inductive biases.

Background

In this section, we set up notations, briefly review two widely used approaches for sequence modeling—the self-attention mechanism and exponential moving average (EMA)—and discuss the motivation for combining them.

The traditional self-attention mechanism is a function:

2.2 Exponential Moving Average (EMA)

The moving average is a classic approach for sequential data modeling, which has been widely used in time series data to smooth out short-term fluctuations and highlight long-term trends or cycles. The Exponential Moving Average (EMA) (Winters, 1960; Hunter, 1986), a special case of moving average, applies weighting factors that decrease exponentially. Formally, an EMA recursively calculates the output sequence $\boldsymbol{Y}$:

$$\mathbf{y}_{t}=\boldsymbol{\alpha}\odot\mathbf{x}_{t}+(1-\boldsymbol{\alpha})\odot\mathbf{y}_{t-1},\qquad(2)$$

where $\boldsymbol{\alpha}\in(0,1)^{d}$ is the EMA coefficient representing the degree of weighting decrease, and $\odot$ is the element-wise product. A higher $\boldsymbol{\alpha}$ discounts older observations faster (see Figure 1).

Using an EMA places a strong inductive bias on the learning of pairwise dependencies: the dependency weight between two tokens decreases exponentially over time with an input-agnostic decay factor $\boldsymbol{\alpha}$ . This property favors local dependencies, and limits long-distance dependencies. Despite the recurrent formulation in (2), the computation of EMA can be represented as $n$ individual convolutions, which can be computed efficiently using fast Fourier transforms (FFTs) (see Appendix A for details).
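The equivalence between the recurrence and a convolution can be checked numerically. The sketch below assumes the standard EMA recurrence $\mathbf{y}_{t}=\boldsymbol{\alpha}\odot\mathbf{x}_{t}+(1-\boldsymbol{\alpha})\odot\mathbf{y}_{t-1}$ with a zero initial value; the function names are hypothetical. Each dimension is convolved with its own kernel $\alpha(1-\alpha)^{t}$, and the convolution is evaluated with NumPy's real FFT.

```python
import numpy as np

def ema_recurrent(x, alpha):
    """y_t = alpha * x_t + (1 - alpha) * y_{t-1}, per dimension, with y_0 = 0."""
    y = np.empty_like(x)
    prev = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        prev = alpha * x[t] + (1.0 - alpha) * prev
        y[t] = prev
    return y

def ema_fft(x, alpha):
    """The same EMA written as one causal convolution per dimension, computed with FFTs."""
    n = x.shape[0]
    t = np.arange(n)[:, None]
    kernel = alpha * (1.0 - alpha) ** t            # (n, d): weight on the input t steps back
    size = 2 * n                                   # zero-pad so the circular convolution is linear
    spectrum = np.fft.rfft(x, size, axis=0) * np.fft.rfft(kernel, size, axis=0)
    return np.fft.irfft(spectrum, size, axis=0)[:n]

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3))                       # n = 16 steps, d = 3 dimensions
alpha = np.array([0.1, 0.5, 0.9])
y_rec, y_fft = ema_recurrent(x, alpha), ema_fft(x, alpha)
```

The two outputs agree up to floating-point error, while the FFT route costs $O(n\log n)$ per dimension and avoids the sequential loop.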

2.3 Why Combine Attention with EMA?

As discussed in Sections 2.1 and 2.2, EMA and attention mechanisms each have their own limitations, despite their wide applications and impressive successes in sequence modeling. By leveraging their properties to complement each other, we propose to embed an EMA into the calculation of the attention matrix $\boldsymbol{A}$. The resulting model benefits from a strong inductive bias while maintaining the capacity to learn complex dependency patterns. Moreover, this integration enables the design of a computationally efficient chunk-wise attention mechanism with linear complexity w.r.t. sequence length (§3.5).

Moving Average Equipped Gated Attention (Mega)

In this section, we describe in detail our proposed method, moving average equipped gated attention (Mega). We first introduce multi-dimensional damped EMA (§3.1), which is a key component combined with the single-head gated attention in Mega (§3.2), and discuss the relationship between Mega and three closely related models: GRU (Cho et al., 2014), Flash (Hua et al., 2022) and S4 (Gu et al., 2022a). We also provide theoretical justification for the design of single-head gated attention (§3.3). Then, we describe the detailed architecture of each Mega block, including feed-forward and normalization layers (§3.4). Finally, we present Mega-chunk, a variant of Mega that simply splits input sequences into fixed chunks, reducing time and space complexity from quadratic to linear (§3.5).

Mega introduces a modification of the standard EMA, named multi-dimensional damped EMA, to improve its flexibility and capacity.

Previous studies (McKenzie and Gardner Jr, 2010; Svetunkov, 2016) have shown that relaxing the coupled weights of the previous and current observations ($\boldsymbol{\alpha}$ vs. $1-\boldsymbol{\alpha}$ in (2)) produces robust dependency modeling. Inspired by this, Mega allows the damping of the influence of the previous time step:

$$\mathbf{y}_{t}=\boldsymbol{\alpha}\odot\mathbf{x}_{t}+(1-\boldsymbol{\alpha}\odot\boldsymbol{\delta})\odot\mathbf{y}_{t-1},$$

where $\boldsymbol{\delta}\in(0,1)^{d}$ is the damping factor.
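A short worked unrolling for the scalar case ($d=1$) may help to see what the damping factor changes. It assumes the damped recurrence $y_{t}=\alpha x_{t}+(1-\alpha\delta)\,y_{t-1}$ and an initial value $y_{0}$:

$$y_{t}=\alpha\sum_{k=0}^{t-1}(1-\alpha\delta)^{k}\,x_{t-k}+(1-\alpha\delta)^{t}\,y_{0}.$$

For example, with $\alpha=0.5$, the limiting case $\delta\to 1$ recovers the standard EMA, with weights $0.5$, $0.25$, $0.125$ on $x_{t}$, $x_{t-1}$, $x_{t-2}$. With $\delta=0.5$ the per-step decay becomes $1-\alpha\delta=0.75$ and the weights are $0.5$, $0.375$, $0.28125$: the current input keeps the same weight, while older observations fade more slowly.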

Multi-dimensional Damped EMA.

3.2 Moving Average Equipped Gated Attention

The gated attention mechanism in Mega adopts the Gated Recurrent Unit (GRU; Cho et al. (2014)) and Gated Attention Unit (GAU; Hua et al. (2022)) as the backbone architectures, with an EMA-based sub-layer embedded into the calculation of the attention matrix. Formally, we first use the output from (3.1) to compute the shared representation in GAU:

Subsequently, Mega introduces the reset gate $\boldsymbol{\gamma}$ , the update gate $\boldsymbol{\varphi}$ , and computes the candidate activation output $\boldsymbol{\hat{H}}$ :

The final output $\boldsymbol{Y}$ is computed with the update gate $\boldsymbol{\varphi}$ :

The graphical architecture of a Mega sub-layer is visualized in Figure 2 (b).
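The data flow described in this subsection can be summarized as a rough NumPy sketch. It is an illustration under stated assumptions rather than the reference implementation: the activation choices (SiLU for the reset gate and candidate, sigmoid for the update gate), the softmax attention function, the omission of bias terms and positional bias, the simplified one-dimensional damped EMA, and all parameter names in the dictionary `p` are assumptions introduced here.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def damped_ema(X, alpha, delta):
    """Per-dimension damped EMA with zero initial state (stand-in for the EMA sub-layer)."""
    out, prev = np.empty_like(X), np.zeros(X.shape[1])
    for t in range(X.shape[0]):
        prev = alpha * X[t] + (1.0 - alpha * delta) * prev
        out[t] = prev
    return out

def mega_sublayer(X, p):
    """Hypothetical forward pass; parameter names and activations are illustrative."""
    Xp = damped_ema(X, p["alpha"], p["delta"])        # EMA output X', shape (n, d)
    Z = silu(Xp @ p["Wz"])                            # shared representation from X', (n, z)
    Q = Z * p["kq"] + p["mq"]                         # queries: per-dimension scale and offset of Z
    K = Z * p["kk"] + p["mk"]                         # keys: same construction
    V = silu(X @ p["Wv"])                             # values from the original input X, (n, v)
    O = softmax(Q @ K.T / np.sqrt(Z.shape[1])) @ V    # single-head attention output, (n, v)
    gamma = silu(Xp @ p["Wg"])                        # reset gate, (n, v)
    phi = sigmoid(Xp @ p["Wf"])                       # update gate in (0, 1), (n, d)
    H = silu(Xp @ p["Wh"] + (gamma * O) @ p["Uh"])    # candidate activation, (n, d)
    return phi * H + (1.0 - phi) * X                  # gated residual output Y, (n, d)

def init_params(d, z, v, rng):
    """Random illustrative parameters with the shapes noted in mega_sublayer."""
    w = lambda *shape: rng.normal(scale=0.1, size=shape)
    return {
        "alpha": rng.uniform(0.1, 0.9, size=d), "delta": rng.uniform(0.1, 0.9, size=d),
        "Wz": w(d, z), "kq": w(z), "mq": w(z), "kk": w(z), "mk": w(z),
        "Wv": w(d, v), "Wg": w(d, v), "Wf": w(d, d), "Wh": w(d, d), "Uh": w(v, d),
    }

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))                         # n = 10 tokens, d = 16
Y = mega_sublayer(X, init_params(d=16, z=8, v=32, rng=rng))
```

The last line of `mega_sublayer` shows the role of the update gate: where it is close to zero the sub-layer passes the input through unchanged, and where it is close to one the output is the candidate activation.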

Relation to and Differences from GRU, Flash and S4.

The computation of the reset gate $\boldsymbol{\gamma}$, the update gate $\boldsymbol{\varphi}$, and the candidate activation output $\boldsymbol{\hat{H}}$ in (12-14) is reminiscent of GRUs (Cho et al., 2014). The main difference is that in a GRU the two gates are applied between the hidden states of the current and previous timesteps, while in Mega they are applied between the outputs from EMA and gated attention sub-layers. In addition, the output gating mechanism in (15) is similar to the gated residual connection proposed in Parisotto et al. (2020); Xu et al. (2020) to reduce the variance of output $\boldsymbol{Y}$.

The computation of the shared representation $\boldsymbol{Z}$, together with the sequences of queries, keys and values in (7-10), is inspired by GAU in Flash (Hua et al., 2022). Mega integrates EMA into GAU by computing $\boldsymbol{Z}$ in (7) from the EMA output $\boldsymbol{X}^{\prime}$ rather than the original input $\boldsymbol{X}$, and combining the GAU output with $\boldsymbol{X}^{\prime}$ for the candidate activation $\boldsymbol{\hat{H}}$ in (14). Experimental gains over Flash demonstrate the effectiveness of this design choice (§4.1).

The multi-dimensional damped EMA can be seen as a simplified variant of a state space model. From this perspective, Mega is also closely related to S4 (Gu et al., 2022a), a state space model with structured state matrices. S4 leverages the HiPPO framework (Gu et al., 2020) to initialize its low-rank structured state matrices, and the computation of the convolutional kernel in S4 requires complex fast Fourier transforms (FFTs). The EMA sub-layer in Mega applies diagonalization on the state matrix and restricts the diagonal elements to the range $(0,1)$. Thus, the convolution kernel is a Vandermonde product, which can be computed in an efficient and numerically stable way. Similar diagonalization has been used in a concurrent work, S4D (Gu et al., 2022b). Moreover, unlike S4 and S4D, the parameter initialization in Mega does not rely on the HiPPO framework.

3.3 Theoretical Justification of Single-head Gated Attention

Single-head gated attention has been empirically shown to be as performant as vanilla multi-head attention (Liu et al., 2021; Hua et al., 2022), but without any discussion of its theoretical grounding. In this section, we provide theoretical justifications of the expressiveness of single-head gated attention. To facilitate subsequent analysis, we simplify the notations of the multi-head attention. Specifically, we denote the sequences of queries, keys and values as the outputs of three transformations of the input sequence:

For multi-head attention, a common implementation is to split the query into $h$ heads across the model dimension:

Suppose the transformation $\mathcal{G}$ is a universal approximator. Then, for each $\boldsymbol{X}$ there exists $\boldsymbol{\gamma}=\mathcal{G}(\boldsymbol{X})$ such that

3.4 Mega Blocks

The Mega layer (moving average equipped gated attention) is used as a drop-in replacement for regular attention in the Transformer. It is followed by position-wise feed-forward networks (FFNs) and normalization layers to compose one Mega block. As the gated residual connection has already been included in (15), we omit the original residual connection and directly apply a normalization layer to $\boldsymbol{Y}$. Concretely,

3.5 Mega-chunk: Mega with Linear Complexity

So far we have only focused on introducing stronger inductive bias into the attention mechanism, which still has quadratic computational complexity. In this section, we propose Mega-chunk, a variant of Mega with linear complexity, which simply applies attention to each local chunk of fixed length.

Specifically, we first split the sequences of queries, keys and values in (8-10) into chunks of length $c$, e.g. $\boldsymbol{Q}=\{\boldsymbol{Q}_{1},\ldots,\boldsymbol{Q}_{k}\}$, where $k=n/c$ is the number of chunks. Keys and values are split in the same way. The attention operation in (11) is individually applied to each chunk, yielding linear complexity $O(kc^{2})=O(nc)$ w.r.t. $n$. However, this method suffers from the critical limitation of losing contextual information from other chunks. Fortunately, the EMA sub-layer in Mega mitigates this problem by capturing local contextual information near each token, whose outputs are used as the inputs to the attention sub-layer. As a result, the effective context being exploited by chunk-wise attention can go beyond the chunk boundary. Figure 3 illustrates the largest possible dependency length captured by one Mega-chunk block.
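The chunking step itself can be sketched in NumPy as follows. This is a simplified illustration that assumes plain scaled softmax attention, a sequence length divisible by $c$, and hypothetical function names; it leaves out the EMA sub-layer and gating, so it shows only the attention part. The chunked version materializes $k$ score blocks of size $c\times c$, and it matches full attention in which every position outside the chunk is masked out.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def chunked_attention(Q, K, V, c):
    """Apply attention independently inside each chunk of length c (n divisible by c)."""
    n, d = Q.shape
    k = n // c
    Qc, Kc, Vc = (M.reshape(k, c, -1) for M in (Q, K, V))
    scores = np.einsum("kid,kjd->kij", Qc, Kc) / np.sqrt(d)   # (k, c, c): n * c entries in total
    out = np.einsum("kij,kjv->kiv", softmax(scores), Vc)
    return out.reshape(n, -1), scores.size

def block_masked_attention(Q, K, V, c):
    """Reference: full (n, n) attention with everything outside the chunk masked out."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    chunk_id = np.arange(n) // c
    same_chunk = chunk_id[:, None] == chunk_id[None, :]
    return softmax(np.where(same_chunk, scores, -np.inf)) @ V

rng = np.random.default_rng(0)
n, d, c = 12, 8, 4                                            # toy sizes: k = 3 chunks
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out_chunked, n_scores = chunked_attention(Q, K, V, c)
out_reference = block_masked_attention(Q, K, V, c)
```

Here `n_scores` equals $nc=48$ rather than $n^{2}=144$, which is the source of the linear complexity for a fixed chunk length.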

Experiments

To evaluate Mega, we conduct experiments on five benchmark sequence modeling tasks across various data types, comparing with current state-of-the-art models on each task. All the numbers with ${\ddagger}$ indicate results from the baseline models replicated by us. More detailed descriptions, results and analysis are provided in Appendix D.

We begin our experiments with an evaluation on the Long Range Arena (LRA) benchmark recently introduced by Tay et al. (2021), which is designed for evaluating sequence models under the long-context scenario. The benchmark comprises six tasks: ListOps (Nangia and Bowman, 2018), byte-level text classification (Text; Maas et al. (2011)), byte-level document retrieval (Retrieval; Radev et al. (2013)), image classification on sequences of pixels (Image; Krizhevsky et al. (2009)), Pathfinder (Linsley et al., 2018) and its extremely long version (Path-X; Tay et al. (2021)). These tasks consist of input sequences ranging from 1K to 16K tokens and span a variety of data types and modalities.

Table 2 compares Mega against several baselines, including the Transformer and its efficient variants, and the state-of-the-art S4 models (both version 1 (Gu et al., 2022a) and version 2 (Gu et al., 2022b)). (S4-v2 used larger model sizes and better-tuned hyper-parameters than S4-v1. We also experimented with SRU++ (Lei, 2021) on Pathfinder, but it failed to converge on this dataset after hyper-parameter tuning.) To ensure a fair comparison, we adjust the number of layers and model dimensions on each task so that Mega has a similar number of parameters to S4-v1. For each experiment, we report the average over 5 runs with different random seeds. The tuning information and the model details are provided in Appendix D.1.

On all six tasks, Mega substantially outperforms all the baselines. We also evaluate Mega-chunk on each task, setting the chunk size $c=128$ for all the tasks except Path-X, where $c=4096$. We observe that Mega-chunk consistently performs well, particularly on the three language tasks. We also examine the speed and memory efficiency of Mega on the byte-level classification task with an input length of 4K. Mega-chunk is highly efficient: it is about $5.5$ times faster and consumes only $13\%$ as much memory as the vanilla Transformer. It is interesting to see that Mega with a full attention field is also much more efficient than the Transformer, benefiting from single-head gated attention.

To demonstrate the effectiveness of the multi-dimensional damped EMA component in Mega, we perform ablation studies on two LRA tasks: byte-level text classification (Text) and image classification on sequences of pixels (Image). We train Mega models with EMA dimension $h\in\{0,1,2,4,8,16,32\}$, where $h=0$ indicates removing the EMA component. From the left panel of Figure 4, we see that without the EMA component, model accuracy on both tasks declines rapidly. Meanwhile, with a single-dimensional EMA ($h=1$), Mega obtains significant improvements, demonstrating the importance of incorporating inductive bias via EMA.

Analysis of Chunk Size.

We further analyze the impact of chunk size $c$ on the same two tasks, by varying $c\in\{16,32,64,128,256,512,\infty\}$ , where $\infty$ indicates the original Mega without chunking. The right figure in Figure 4 shows that image data is more sensitive to chunk size than text data. On the Text task, Mega-chunk with even a small chunk size $c=16$ is able to achieve around 90% accuracy. On the Image task, Mega-chunk with $c=16$ achieves around 75% accuracy, which is still much better than the vanilla Transformer model.

Analysis of Attention Functions.

Finally, we evaluate performance with different attention functions. Table 3 shows the accuracy of the three attention functions on the same two tasks. On text data softmax obtains the best accuracy, while on image data it performs the worst. The laplace function achieves the best accuracy on image data and competitive results on text data, being consistently better than relu2. In the following experiments we use softmax for language tasks and laplace for vision and speech tasks.

4.2 Raw Speech Classification

To evaluate the capability of Mega on the long-range modeling of speech signals, we apply Mega to classify raw speech (with length 16000), rather than using traditional preprocessing (e.g. converting to MFCC features). Following Gu et al. (2022a), we perform speech classification on the SC10 subset of the Speech Commands dataset (Warden, 2018). We experiment with the Mega-chunk variant with $c=1000$, since the computation of Mega and Transformer cannot fit in GPU memory. As shown in Table 4, our Mega-chunk (base) model with 300K parameters achieves an accuracy of 96.92, slightly worse than the 97.50 of the state-of-the-art method S4, while by adding 0.18M parameters our Mega-chunk (big) model performs comparably with S4. (Our S4 number is obtained by directly running the official S4 code and is a bit worse than the originally reported number (98.32) because of different data splits: the file reading order with os.listdir is not deterministic across machines.)

4.3 Auto-regressive Language Modeling

We evaluate Mega on two established language modeling benchmarks, WikiText-103 (Merity et al., 2017) and enwik8 (Hutter, 2006), both of which are next-token prediction tasks. WikiText-103 is a word-level language modeling dataset containing 103M training tokens from Wikipedia articles. Following previous work (Baevski and Auli, 2018; Dai et al., 2019), we adopt adaptive softmax and input embeddings and use a vocabulary of 260K tokens. Enwik8 is a character-level language modeling benchmark that has 100M tokens of unprocessed Wikipedia articles and a vocabulary size of about 200. At test time, we split the test data into segments and process each segment sequentially. In Table 5, we compare with previous top-performing models that are designed to take advantage of longer context, including Transformers (Baevski and Auli, 2018; Al-Rfou et al., 2019) (XFM-adaptive), Transformer-XL (Dai et al., 2019) (XFM-XL) and S4 (Gu et al., 2022a). On both WikiText-103 and enwik8, we obtain very competitive results, outperforming baselines by a large margin while enjoying much faster (9 $\times$) inference speed compared to the Transformer model. Mega can also naturally achieve length extrapolation at inference time to sequences longer than those seen during training, due to the recurrent design of the EMA layer. In addition, we can extrapolate to a longer chunk size for Mega attention, due to the use of rotary positional embeddings for training (Su et al., 2021). We describe both in detail and provide complete results for various test-time chunk sizes and segment lengths in Appendix D.3.

4.4 Neural Machine Translation

To evaluate Mega on sequence-to-sequence modeling, we conduct experiments on a standard machine translation benchmark, WMT 2016 English-German news translation (WMT’16), consisting of 4.5M sentence pairs of training data. Following Ott et al. (2018), we validate on newstest13 and test on newstest14. The Mega models closely follow the architecture of Transformer-base: 6 encoder and decoder layers with model dimension $d=512$ .

Table 6 presents the BLEU scores on the test sets of WMT’16 from two directions: EN $\rightarrow$ DE and DE $\rightarrow$ EN. For each experiment, we report the average of both tokenized and SacreBLEU (Post, 2018) scores over 5 different random seeds (SacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:1.5.1). Mega-base significantly outperforms Transformer-base by over $1.1$ BLEU. We also report results of Mega with the Laplace attention function, which slightly but consistently underperforms Softmax.

4.5 Image Classification

To evaluate Mega on a large-scale image classification task, we conduct experiments on the ImageNet-1k (Deng et al., 2009) dataset, which consists of 1.28M training images and 50K validation images from 1000 classes. Top-1 accuracy on the validation set is reported in Table 7 to assess various models. Mega obtains about $0.5\%$ accuracy improvement over DeiT-B (Touvron et al., 2021). We mostly follow DeiT’s approach of applying several data augmentation and regularization methods that facilitate the training process, including Cutmix (Yun et al., 2019), Mixup (Zhang et al., 2017), stochastic depth (Huang et al., 2016), repeated augmentation (Hoffer et al., 2020), Rand-Augment (Cubuk et al., 2020), and random erasing (Zhong et al., 2020). These methods were highly tuned towards optimizing the performance of DeiT, and might be sub-optimal for Mega. Exploring the optimal data augmentation and regularization methods for Mega is an interesting direction for future work. More training details are presented in Appendix D.5.

Related Work

A number of techniques have been recently introduced to address the two issues of Transformer models; we only mention a few here due to space limits.

To incorporate stronger inductive bias into the attention mechanism, one research direction focuses on injecting position information via advanced positional encoding methods, including absolute and relative positional embeddings (Vaswani et al., 2017; Huang et al., 2020; Ke et al., 2020), and relative positional biases (Su et al., 2021; Press et al., 2021). Another line of research combines the attention mechanism with other neural architectures with intrinsic strong inductive bias, such as convolution (Gehring et al., 2017; Dai et al., 2021) and recurrence (Dai et al., 2019; Rae et al., 2020; Lei, 2021).

Computational Efficiency.

Many advanced variants of Transformer models (‘xformers’) (Tay et al., 2020, 2021) have recently emerged to improve the time and memory efficiency. Popular techniques include sparse attention patterns (Parmar et al., 2018; Beltagy et al., 2020; Kitaev et al., 2020), low-rank approximations of the attention matrix (Wang et al., 2020; Ma et al., 2021), and approximations through kernelization (Choromanski et al., 2020; Peng et al., 2021). Although these models demonstrate better asymptotic complexity for long sequences, their efficiency gains are less prominent for moderate length sequences and their performance remains behind that of Transformers with regular attention.

Convolutional Neural Networks with Continuous Kernels.

As EMA and more general state space models such as S4 can be regarded as a convolution transform with kernel size equal to the sequence length, Mega is also related to CNNs with continuous kernels, including CKConv (Romero et al., 2021), FlexConv (Romero et al., 2022a) and CCNN (Romero et al., 2022b).

Conclusion

We have introduced Mega, a simple, efficient and effective neural architecture used as a drop-in replacement for regular multi-head attention. By leveraging the classic exponential moving average (EMA) approach, Mega is capable of incorporating stronger inductive biases into the attention mechanism. Moreover, the EMA approach enables the design of Mega-chunk, an efficient variant of Mega with linear complexity. On five sequence modeling tasks across various data types, Mega achieves impressive improvements over a variety of strong baselines, including previous state-of-the-art systems. These improvements lead to a potential direction of future work to apply Mega for multi-modality modeling.

References

Appendix: Mega: Moving Average Equipped Gated Attention

A Efficient Computation of Multi-dimensional Damped EMA

Note that the computations of the multi-dimensional damped EMA for different dimensions are entirely independent of each other. Without loss of generality, we set $d=1$ and omit the dimension index $j$ in the following formulations. We denote the initial hidden state as $\boldsymbol{h}_{0}$. The multi-dimensional damped EMA defined in (3.1) can be vectorized into the following formulation:

Let $\boldsymbol{\phi}=1-\boldsymbol{\alpha}\odot\boldsymbol{\delta}$. Then, unrolling the above two equations explicitly yields:

This can be written into a vectorized formula:

In the proposed multi-dimensional damped EMA, $\mathcal{K}$ can be efficiently computed by the Vandermonde product. With $\mathcal{K}$ provided, the output $\mathbf{y}$ in (25) can be computed efficiently with FFTs.
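A small numerical sketch of this computation follows, for a single input dimension ($d=1$) with an $h$-dimensional hidden state. The exact parameterization is an assumption made here for illustration: the input is expanded by a vector `beta`, the hidden state follows $\boldsymbol{h}_{t}=\boldsymbol{\alpha}\odot\boldsymbol{\beta}\,x_{t}+\boldsymbol{\phi}\odot\boldsymbol{h}_{t-1}$ with $\boldsymbol{h}_{0}=0$, and the output is the projection `eta @ h`. Under these assumptions the kernel entry for lag $t$ is $\sum_{j}\alpha_{j}\beta_{j}\eta_{j}\phi_{j}^{t}$, which is a vector multiplied by a Vandermonde matrix in $\boldsymbol{\phi}$.

```python
import numpy as np

def damped_ema_recurrent(x, alpha, delta, beta, eta):
    """Sequential reference: h-dimensional hidden state, scalar input and output per step."""
    phi = 1.0 - alpha * delta
    h = np.zeros_like(alpha)                        # h_0 = 0
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = alpha * beta * x_t + phi * h
        y[t] = eta @ h
    return y

def damped_ema_kernel(n, alpha, delta, beta, eta):
    """Kernel of length n as a Vandermonde product in phi = 1 - alpha * delta."""
    phi = 1.0 - alpha * delta
    vander = np.vander(phi, N=n, increasing=True)   # (h, n): row j is phi_j^0 ... phi_j^(n-1)
    return (alpha * beta * eta) @ vander            # (n,)

def causal_conv_fft(x, kernel):
    """Linear (non-circular) causal convolution of x with kernel via zero-padded FFTs."""
    n = x.shape[0]
    size = 2 * n
    return np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(kernel, size), size)[:n]

rng = np.random.default_rng(0)
n, h = 32, 4
alpha, delta = rng.uniform(0.1, 0.9, size=h), rng.uniform(0.1, 0.9, size=h)
beta, eta = rng.normal(size=h), rng.normal(size=h)
x = rng.normal(size=n)
y_rec = damped_ema_recurrent(x, alpha, delta, beta, eta)
y_fft = causal_conv_fft(x, damped_ema_kernel(n, alpha, delta, beta, eta))
```

Because every entry of $\boldsymbol{\phi}$ lies in $(0,1)$, the powers in the Vandermonde matrix stay bounded, which is consistent with the numerical stability noted for this kernel.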

B Proof of Theorem 1

Proof. We split $\boldsymbol{\gamma}$ into $h$ heads in the same way as $\boldsymbol{Q}$, $\boldsymbol{K}$, and $\boldsymbol{V}$:

To prove Theorem 1, we need to find $\boldsymbol{\gamma}$ such that

where $\oslash$ is the element-wise divide operation. Since $\mathcal{G}(\boldsymbol{X})$ is a universal approximator and $\boldsymbol{Q}$ , $\boldsymbol{K}$ , $\boldsymbol{V}$ and $\boldsymbol{a}$ are all transformed from $\boldsymbol{X}$ , $\boldsymbol{\gamma}$ can theoretically recover ${\boldsymbol{a}^{(i)}}^{T}\boldsymbol{V}^{(i)}\oslash\boldsymbol{a}^{T}\boldsymbol{V}^{(i)},\,\,\forall\boldsymbol{X}$ .
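The role of the gate in this argument can be illustrated numerically. The sketch below is only a sanity check on one random input, under the assumption of scaled softmax attention and of heads formed by splitting the model dimension evenly; it computes the element-wise ratio that the gate has to reproduce and confirms that gating the single-head output with it recovers the multi-head output. It says nothing about whether a particular network learns this gate, and it requires the single-head output to have no zero entries.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, h = 6, 8, 2                                   # toy sizes: 6 tokens, 2 heads of width 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Single-head attention: one (n, n) weight matrix shared by all d value dimensions.
A = softmax(Q @ K.T / np.sqrt(d))
O_sha = A @ V

# Multi-head attention: split Q, K, V into h heads along the model dimension.
dh = d // h
heads = []
for i in range(h):
    cols = slice(i * dh, (i + 1) * dh)
    A_i = softmax(Q[:, cols] @ K[:, cols].T / np.sqrt(dh))
    heads.append(A_i @ V[:, cols])
O_mha = np.concatenate(heads, axis=-1)

# The target gate: element-wise ratio of the multi-head and single-head outputs.
gamma = O_mha / O_sha
recovered = gamma * O_sha
```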

C Laplace Attention Function

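To approximate the squared ReLU function with the Laplace function in (16), we need to select proper coefficients $\mu$ and $\sigma$. We derive the values of $\mu$ and $\sigma$ by matching the value and the first derivative of the two functions at $x=\sqrt{1/2}$:

$$f_{\mathrm{relu2}}(x)=f_{\mathrm{laplace}}(x)\qquad(27)$$

$$f^{\prime}_{\mathrm{relu2}}(x)=f^{\prime}_{\mathrm{laplace}}(x)\qquad(28)$$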

Eq. (27) gives $\mu=\sqrt{1/2}$, and Eq. (28) subsequently gives $\sigma=\sqrt{1/(4\pi)}$. Figure 5 visualizes the two functions.
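These coefficients can be checked numerically. The sketch below assumes that the Laplace function has the Gaussian-CDF form $\frac{1}{2}\left[1+\mathrm{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$ and that the squared ReLU is $\max(0,x)^{2}$; both forms are assumptions, since the definitions are not restated in this appendix. Under them, with $\mu=\sqrt{1/2}$ and $\sigma=\sqrt{1/(4\pi)}$ the two functions share the value $1/2$ and the slope $\sqrt{2}$ at $x=\mu$.

```python
import math

MU = math.sqrt(1 / 2)
SIGMA = math.sqrt(1 / (4 * math.pi))

def relu2(x):
    """Squared ReLU."""
    return max(0.0, x) ** 2

def laplace(x, mu=MU, sigma=SIGMA):
    """Assumed form: Gaussian CDF with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def slope(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x0 = MU                                             # matching point
value_gap = abs(relu2(x0) - laplace(x0))            # both values are 1/2
slope_gap = abs(slope(relu2, x0) - slope(laplace, x0))   # both slopes are sqrt(2)
```

Away from this point the functions differ: the squared ReLU grows without bound, whereas the Laplace function saturates at 1.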

Besides performance improvements, we also investigate the stability of the two attention functions. We conduct experiments on the LRA Pathfinder task with Mega models with the two functions. Figure 5 presents the accuracy on the validation set across training epochs. We observe that Laplace is much more stable than ReLU2.

D Experimental Details

For all tasks, we closely follow Tay et al. (2020) for details such as data preprocessing, data split, etc. The hyper-parameters of Mega models on these tasks are listed in Table 8.

D.2 Raw Speech Classification

Following Gu et al. (2022a), we perform speech classification on the SC10 subset of the Speech Commands dataset (Warden, 2018), which is a 10-class classification task. The chunk size of Mega-chunk is 1000. Other hyper-parameters are listed in Table 8.

D.3 Language Modeling

We use the data of WikiText-103 and enwik8 and their splits provided by Dai et al. (2019). At training time, we split the training data into segments; each segment contains $m$ consecutive chunks, where the chunk size is the effective attention length. $m$ is a random integer variable uniformly sampled from $[cl,ch]$ . We use $[cl,ch]=$ for WikiText-103 and $[cl,ch]=$ for enwik8. Other training hyperparameters including optimizer, learning rate scheduler and architecture are presented in Table 9.
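The segment construction can be sketched as follows. The function name, the token count, and the bounds `cl=2`, `ch=4` are hypothetical placeholders chosen for illustration; they are not the settings used in the experiments.

```python
import random

def split_into_segments(num_tokens, chunk_size, cl, ch, seed=0):
    """Return (start, end) token spans; each span covers m chunks, m uniform in {cl, ..., ch}."""
    rng = random.Random(seed)
    spans, start = [], 0
    while start < num_tokens:
        m = rng.randint(cl, ch)                         # inclusive on both ends
        end = min(start + m * chunk_size, num_tokens)   # the last segment may be shorter
        spans.append((start, end))
        start = end
    return spans

# Hypothetical values, for illustration only.
spans = split_into_segments(num_tokens=10_000, chunk_size=1024, cl=2, ch=4)
```

Varying the number of chunks per segment exposes the model to a range of context lengths during training, while the attention length stays fixed at the chunk size.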

Length extrapolation at inference time

We employ Mega-chunk (§3.5) for training and set the attention chunk size to be 1024 and 2048 for WikiText-103 and enwik8 respectively. To use a longer Mega attention length at inference time than the one used at training time (i.e. 1024 or 2048), we apply rotary positional embedding (Su et al., 2021) to the attention sublayer. At test time, we split the test data into $K$ segments and sequentially process each segment by $m$ chunks, i.e. the maximum context length of each segment is $\frac{\#\text{test tokens}}{K}$ . In Table 5, we report test results that use longer chunk sizes (attention lengths) of 2048 and 4096 for WikiText-103 and enwik8 respectively. Mega can naturally extrapolate at inference time to sequences longer than those seen during training due to the recurrent design of the EMA layer. That design enables the inputs of each chunk to access the historic context through EMA as illustrated in Figure 3. On the other hand, due to the use of rotary positional embeddings, attention can be performed on longer chunk sizes at test time than those seen during training. We hope these two types of length extrapolation are clear to readers. We provide the ablation studies on these two types of length extrapolation below, i.e. extrapolation to longer context by increasing input sequence lengths and extrapolation to longer attention lengths by increasing the chunk size.

Ablations on context lengths

First, we fix the chunk size to be 2048 and vary $K$ within $ $, corresponding to maximum context tokens of $[2.5\text{K},3.3\text{K},4.9\text{K},9.8\text{K},16\text{K},25\text{K},49\text{K}]$. We plot the test PPL as we increase the context length in the left of Figure 6. Although at training time, the maximum context length the model has seen is 6144, Mega can extrapolate to longer context lengths. The plot shows that PPL decreases as the context length is increased and the improvements saturate when the context length is longer than 25K. This is consistent with the observations in Press et al. (2021).

Ablations on attention chunk sizes

Next, we fix the context length to be 25K and increase the chunk size from 512 to 3072. As shown in the right side of Figure 6, Mega consistently improves as we increase the attention length, although it only uses an attention length of 1024 during training. This contradicts the findings of ALiBi (Press et al., 2021), which reports that rotary embeddings do not generalize to longer lengths and result in higher PPL.

D.4 Machine Translation

The WMT 2016 English-German dataset contains 4.5M parallel sentence pairs for training. We follow the standard setting (Ott et al., 2018), using Newstest2013 as the validation set and Newstest2014 as the test set. The dataset is pre-processed following Ma (2020), using the scripts from the FairSeq package (Ott et al., 2019; https://github.com/pytorch/fairseq). We share the source and target vocabularies within the language pair, with 32K byte pair encoding (BPE) types (Sennrich et al., 2016). The hyper-parameters of Transformer and Mega models are listed in Table 10.

D.5 Image Classification

Hyper-parameters are listed in Table 11. We closely follow Touvron et al. (2021) by reusing most of their hyper-parameters.