Very Deep Transformers for Neural Machine Translation

Xiaodong Liu, Kevin Duh, Liyuan Liu, Jianfeng Gao

Introduction

The capacity of a neural network influences its ability to model complex functions. In particular, it has been argued that deeper models are conducive to more expressive features Bengio (2009). Very deep neural network models have proved successful in computer vision He et al. (2016); Srivastava et al. (2015) and text classification Conneau et al. (2017); Minaee et al. (2020). In neural machine translation (NMT), however, current state-of-the-art models such as the Transformer typically employ only 6-12 layers Bawden et al. (2019); Junczys-Dowmunt (2019); Ng et al. (2019).

Previous work has shown that it is difficult to train deep Transformers, such as those with more than 12 layers Bapna et al. (2018). This is due to optimization challenges: the variance of the output at each layer compounds as the network gets deeper, leading to unstable gradients and ultimately to diverged training runs.

In this empirical study, we re-investigate whether deeper Transformer models are useful for NMT. We apply a recent initialization technique called ADMIN Liu et al. (2020a), which remedies the variance problem. This enables us to train Transformers that are significantly deeper, e.g. with 60 encoder layers and 12 decoder layers. (We choose to focus on this layer size since it results in the maximum model size that can fit within a single GPU system.) The purpose of this study is to show that it is feasible for most researchers to experiment with very deep models; access to massive GPU budgets is not a requirement.

In contrast to previous research, we show that it is indeed feasible to train the standard Transformer Vaswani et al. (2017) with many layers. (Note that there are architectural variants that enable deeper models Wang et al. (2019); Nguyen and Salazar (2019), discussed in Sec 2. We focus on the standard architecture here.) These deep models significantly outperform their 6-layer baseline, with up to 2.5 BLEU improvement. Further, they obtain state-of-the-art results on the WMT’14 EN-FR and WMT’14 EN-DE benchmarks.

Background

We focus on the Transformer model Vaswani et al. (2017), shown in Figure 1. The encoder consists of $N$ layers/blocks of attention + feed-forward components. The decoder consists of $M$ layers/blocks of masked-attention, attention, and feed-forward components. To illustrate, the input tensor $\mathbf{x_{i-1}}$ at the encoder is first transformed by a multi-head attention mechanism to generate the tensor $f_{ATT}(\mathbf{x_{i-1}})$ . This result is added back with $\mathbf{x_{i-1}}$ as a residual connection, then layer-normalization ( $f_{LN}(\cdot)$ ) is applied to generate the output: $\mathbf{x_{i}}=f_{LN}(\mathbf{x_{i-1}}+f_{ATT}(\mathbf{x_{i-1}}))$ . Continuing onto the next component, $\mathbf{x_{i}}$ is passed through a feed-forward network $f_{FF}(\cdot)$ , and is again added and layer-normalized to generate the output tensor: $\mathbf{x_{i+1}}=f_{LN}(\mathbf{x_{i}}+f_{FF}(\mathbf{x_{i}}))$ . Abstractly, the output tensor at each Add+Norm component in the Transformer (Figure 1) can be expressed as:

$$\mathbf{x_{i}}=f_{LN}(\mathbf{x_{i-1}}+f_{i}(\mathbf{x_{i-1}})) \qquad (1)$$

where $f_{i}$ represents an attention, masked-attention, or feed-forward subnetwork. This process repeats $2\times N$ times for an $N$ -layer encoder and $3\times M$ times for an $M$ -layer decoder. The final output of the decoder is passed through a softmax layer which predicts the probabilities of output words, and the entire network is optimized via back-propagation.

Optimization difficulty has been attributed to vanishing gradients, despite layer normalization Xu et al. (2019) providing some mitigation. The lack of gradient flow between the decoder and the lower layers of the encoder is especially problematic; this can be addressed with short-cut connections Bapna et al. (2018); He et al. (2018). An orthogonal solution is to swap the positions of layer-wise normalization $f_{LN}$ and subnetworks $f_{i}$ within each block Nguyen and Salazar (2019); Domhan (2018); Chen et al. (2018) by: $\mathbf{x_{i}}=\mathbf{x_{i-1}}+f_{i}(f_{LN}(\mathbf{x_{i-1}}))$ . This is known as pre-LN (contrasted with post-LN in Eq. 1), and has been effective in training networks up to 30 layers Wang et al. (2019). (The 96-layer GPT-3 Brown et al. (2020) uses pre-LN.)
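
To make the contrast between the two orderings concrete, the following is a minimal sketch in plain Python. It assumes the usual definitions: post-LN normalizes the sum of the input and the subnetwork output, while pre-LN normalizes the input to the subnetwork and leaves the residual path untouched. The names `layer_norm`, `post_ln_step`, `pre_ln_step`, and `toy_sublayer` are hypothetical, the layer normalization has no learned gain or bias, and `toy_sublayer` is only a stand-in for a real attention or feed-forward subnetwork.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_step(x, f):
    """Post-LN: x_i = LN(x_{i-1} + f_i(x_{i-1}))."""
    fx = f(x)
    return layer_norm([a + b for a, b in zip(x, fx)])

def pre_ln_step(x, f):
    """Pre-LN: x_i = x_{i-1} + f_i(LN(x_{i-1}))."""
    fx = f(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]

def toy_sublayer(x):
    """Hypothetical stand-in for f_i: simply scales its input."""
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_step(x, toy_sublayer)  # output is re-normalized at every step
pre = pre_ln_step(x, toy_sublayer)    # input passes through the residual path unnormalized
```

In this sketch the post-LN output always has zero mean and unit variance, whereas the pre-LN output keeps the scale of its input on the residual path.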

However, it has been shown that post-LN, if trained well, can outperform pre-LN Liu et al. (2020a). Ideally, we hope to train a standard Transformer without additional architecture modifications. In this sense, our motivation is similar to that of Wu et al. (2019b), which grows the depth of a standard Transformer in a stage-wise fashion.

Initialization Technique

The initialization technique ADMIN Liu et al. (2020a) we will apply here reformulates Eq. 1 as:

$$\mathbf{x_{i}}=f_{LN}(\mathbf{x_{i-1}}\cdot\mathbf{\omega_{i}}+f_{i}(\mathbf{x_{i-1}})) \qquad (2)$$

where $\mathbf{\omega_{i}}$ is a constant vector that is element-wise multiplied with $\mathbf{x_{i-1}}$ in order to balance its contribution against $f_{i}(\mathbf{x_{i-1}})$ . The observation is that, in addition to vanishing gradients, the unequal magnitudes of the two terms $\mathbf{x_{i-1}}$ and $f_{i}(\mathbf{x_{i-1}})$ are the main cause of instability in training. Refer to Liu et al. (2020a) for theoretical details. (Note that that paper presents results of 18-layer Transformers on WMT’14 En-De, which we also use here. Our contribution is a more comprehensive evaluation.)

ADMIN initialization involves two phases. In the Profiling phase, we randomly initialize the model parameters using default initialization, set $\mathbf{\omega_{i}}=1$ , and perform a single forward pass in order to compute the output variance of the residual branch $Var[f(\mathbf{x_{i-1}})]$ at each layer. (We estimate the variance with one batch of 8k tokens.) In the Training phase, we fix $\mathbf{\omega_{i}}=\sqrt{\sum_{j<i}{Var[f(\mathbf{x_{j-1}})]}}$ , and then train the model using standard back-propagation. After training finishes, $\mathbf{\omega_{i}}$ can be folded back into the model parameters to recover the standard Transformer architecture. This simple initialization method is effective in ensuring that training does not diverge, even in deep networks.
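
The two phases can be illustrated with a small sketch in plain Python. This is not the authors' implementation: the subnetworks are hypothetical random linear maps (`make_toy_sublayer`), $\mathbf{\omega_{i}}$ is treated as a scalar rather than a vector for brevity, and the variance is estimated over the elements of a single toy vector rather than over a batch of 8k tokens. All function names are assumptions made for illustration.

```python
import math
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = variance(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def make_toy_sublayer(dim, rng):
    """Hypothetical stand-in for f_i: a random linear map, not real attention/FFN."""
    scale = 1.0 / math.sqrt(dim)
    w = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def profile_branch_variances(sublayers, x):
    """Profiling phase: omega_i = 1, one forward pass, record Var[f_i(x_{i-1})]."""
    variances = []
    for f in sublayers:
        fx = f(x)
        variances.append(variance(fx))
        x = layer_norm([a + b for a, b in zip(x, fx)])  # post-LN step with omega_i = 1
    return variances

def admin_omegas(variances):
    """Training-phase constants: omega_i = sqrt(sum over j < i of Var_j)."""
    omegas, running = [], 0.0
    for var in variances:
        # The sum covers earlier sub-layers only, so the formula as written
        # gives 0.0 for the first sub-layer; how an implementation treats
        # that first sub-layer is not specified here.
        omegas.append(math.sqrt(running))
        running += var
    return omegas

def admin_forward(sublayers, omegas, x):
    """Forward pass with the rescaled residual: x_i = LN(omega_i * x_{i-1} + f_i(x_{i-1}))."""
    for f, omega in zip(sublayers, omegas):
        fx = f(x)
        x = layer_norm([omega * a + b for a, b in zip(x, fx)])
    return x

rng = random.Random(0)
dim, depth = 16, 8
sublayers = [make_toy_sublayer(dim, rng) for _ in range(depth)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
variances = profile_branch_variances(sublayers, x0)
omegas = admin_omegas(variances)
out = admin_forward(sublayers, omegas, x0)
```

Because the constants accumulate variance over earlier sub-layers, they grow with depth, so deeper sub-layers place more weight on the residual input relative to the subnetwork output.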

Experiments

Experiments are conducted on standard WMT’14 English-French (FR) and English-German (DE) benchmarks. For FR, we mimic the setup of Ott et al. (2018) (https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-wmt14en2fr.sh), with 36M training sentences and a 40k subword vocabulary. We use the provided ’valid’ file for development and newstest14 for test. For DE, we mimic the setup of So et al. (2019) (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/translate_ende.py), with 4.5M training sentences, a 32K subword vocabulary, newstest2013 for dev, and newstest2014 for test.

We adopt the hyper-parameters of the Transformer-based model Vaswani et al. (2017) as implemented in FAIRSEQ Ott et al. (2019), i.e. 512-dim word embedding, 2048 feed-forward model size, and 8 heads, but vary the number of layers. RAdam Liu et al. (2019) is our optimizer. For FR, the number of warmup steps is 8000, the maximum number of epochs is 50, and the peak learning rate is 0.0007. For DE, the number of warmup steps is 4000, the maximum number of epochs is 50, and the learning rate is 0.001. The maximum number of tokens in each batch is set to 3584 following Ott et al. (2019).

Our goal is to explore whether very deep Transformers are feasible and effective. We compare: (a) 6L-6L: a baseline Transformer Base with a 6-layer encoder and a 6-layer decoder, vs. (b) 60L-12L: a deep Transformer with 60 encoder layers and 12 decoder layers. (We use “(N)L-(M)L” to denote that a model has N encoder layers and M decoder layers. N and M are chosen based on the GPU (16G) memory constraint. For reproducibility and simplicity, we focused on models that fit easily on a single GPU system. Taking FR as an example, it takes 2.5 days to train 60L-12L using one DGX-2 (16 V100’s), and 2 days to train a 6L-6L using 4 V100’s.) For each architecture, we train with either default initialization Glorot and Bengio (2010) or ADMIN initialization.

The results in terms of BLEU Papineni et al. (2002), TER Snover et al. (2006), and METEOR Lavie and Agarwal (2007) are reported in Table 1. Similar to previous work Bapna et al. (2018), we observe that deep 60L-12L Default diverges during training. But the same deep model with ADMIN successfully trains and impressively achieves 2.5 BLEU improvement over the baseline 6L-6L Default in both datasets. The improvements are also seen in terms of other metrics: in EN-FR, 60L-12L ADMIN outperforms the 6L-6L models in TER (40.3 vs 42.2) and in METEOR (62.4 vs 60.5). All results are statistically significant ( $p<0.05$ ) with a 1000-sample bootstrap test Clark et al. (2011).
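
For readers unfamiliar with bootstrap significance testing, the following is a rough sketch of paired bootstrap resampling with 1000 samples. It is a generic illustration, not the exact procedure of Clark et al. (2011): it assumes a per-sentence quality score that can be averaged, whereas BLEU is a corpus-level metric that would have to be recomputed on each resampled test set. The function name and the toy scores are hypothetical and are not data from this paper.

```python
import random

def paired_bootstrap_p_value(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A fails to beat system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    failures = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; both systems are
        # evaluated on the same resampled sentences (hence "paired").
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a <= mean_b:
            failures += 1
    return failures / n_samples

# Hypothetical per-sentence scores for a baseline and a uniformly better system.
rng = random.Random(1)
baseline = [rng.uniform(0.2, 0.6) for _ in range(200)]
deep = [s + 0.05 for s in baseline]
p = paired_bootstrap_p_value(deep, baseline)
```

A small returned value (for example below 0.05) would suggest that the improvement of the first system is unlikely to be an artifact of the particular test sentences.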

These results indicate that it is feasible to train standard (post-LN) Transformers that are very deep. (Note: the pre-LN version does train successfully on 60L-12L and achieves 29.3 BLEU in DE and 43.2 in FR. It is better than 6L-6L but worse than 60L-12L ADMIN.) The deep ADMIN models achieve state-of-the-art results on both datasets. The top results in the literature are compared in Table 2. (The table does not include systems that use extra data.) We list BLEU scores computed with multi-bleu.perl on the tokenization of the downloaded data (commonly done in previous work), and with sacrebleu.py (version: tok.13a+version.1.2.10), which allows for a safer token-agnostic evaluation Post (2018).

Learning Curve:

We would like to understand why 60L-12L ADMIN is doing better from the optimization perspective. Figure 2 (a) plots the learning curve comparing ADMIN to Default initialization. We see that Default has difficulty decreasing the training perplexity; its gradients hit NaN, and the resulting model is not better than a random model. In Figure 2 (b), we see that larger models (60L-12L, 36L-36L) are able to obtain lower dev perplexities than 6L-6L, implying that the increased capacity does lead to better generalization.

Fine-grained error analysis:

We are also interested in understanding how BLEU improvements are reflected in terms of more nuanced measures. For example, do the deeper models particularly improve translation of low frequency words? Do they work better for long sentences? The answer is that the deeper models appear to provide improvements generally across the board (Figure 3; computed by compare-mt Neubig et al. (2019)).

Ablation Studies:

We experimented with different numbers of encoder and decoder layers, given the constraint of a 16GB GPU. Table 3 shows the pairwise comparison of models. We observe that 60L-12L, 48L-12L, and 36L-36L are statistically tied for best BLEU performance. It appears that deeper encoders are more worthwhile than deeper decoders, when comparing 60L-12L to 12L-60L, despite the latter having more parameters. (Recall from Figure 1 that each encoder layer has 2 subnetwork components and each decoder layer has 3 components.)
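
The component counts behind this comparison follow directly from the $2\times N$ and $3\times M$ rule given in the Background section. The short sketch below works them out for the configurations named above; the helper `count_add_norm_components` is a hypothetical name, and the count is only a rough proxy for model size, not an actual parameter count.

```python
import re

def count_add_norm_components(spec):
    """Parse an '(N)L-(M)L' label and return 2*N encoder + 3*M decoder components."""
    match = re.fullmatch(r"(\d+)L-(\d+)L", spec)
    if match is None:
        raise ValueError(f"expected a label of the form '(N)L-(M)L', got {spec!r}")
    n_enc, n_dec = int(match.group(1)), int(match.group(2))
    return 2 * n_enc + 3 * n_dec

for spec in ["6L-6L", "48L-12L", "60L-12L", "36L-36L", "12L-60L"]:
    print(spec, count_add_norm_components(spec))
```

Under this rule 60L-12L has 2·60 + 3·12 = 156 components while 12L-60L has 2·12 + 3·60 = 204, which is consistent with the latter being the larger model.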

We also experiment with wider networks, starting with a 6L-6L Transformer-Big (1024-dim word embedding, 4096 feed-forward size, 16 heads) and doubling its layers to 12L-12L. The BLEU score on EN-FR improved from 43.2 to 43.6 (statistically significant, $p<0.05$ ). A 24L-12L Transformer with BERT-Base like settings (768-dim word embedding, 3072 feed-forward size, 12 heads) obtains a BLEU score of 44.0 on WMT’14 EN-FR. This shows that increased depth also helps models that are already relatively wide.

Back-translation

We investigate whether deeper models also benefit when trained on large but potentially noisy data such as back-translations. We follow the back-translation settings of Edunov et al. (2018) and generate an additional 21.8M translation pairs for EN-FR. The hyperparameters are the same as in the setting without back-translation, as introduced in Edunov et al. (2018), except for an up-sampling rate of 1 for EN-FR.

Table 4 compares the ADMIN 60L-12L and ADMIN 36L-12L-768D models (the latter uses a BERT-base setting with 768-dim word embedding, 3072 feed-forward size and 12 heads) with the default big Transformer architecture (6L-6L), which obtains state-of-the-art results Edunov et al. (2018). We see that with back-translation, both ADMIN 60L-12L + BT and ADMIN 36L-12L-768D still significantly outperform the baseline ADMIN 60L-12L. Furthermore, ADMIN 36L-12L-768D achieves new state-of-the-art benchmark results on WMT’14 English-French (46.4 BLEU and 44.4 sacreBLEU; sacreBLEU signature: BLEU+case.mixed+lang.en-fr+numrefs.1+smooth.exp+test.wmt14+tok.13a+version.1.2.10).

Conclusion

We show that it is feasible to train Transformers at a depth that was previously believed to be difficult. Using ADMIN initialization, we build Transformer-based models of 60 encoder layers and 12 decoder layers. On WMT’14 EN-FR and WMT’14 EN-DE, these deep models outperform the conventional 6-layer Transformers by up to 2.5 BLEU, and obtain state-of-the-art results.

We believe that the ability to train very deep models may open up new avenues of research in NMT, including: (a) Training on extremely large but noisy data, e.g. back-translation Edunov et al. (2018) and adversarial training Cheng et al. (2019); Liu et al. (2020b), to see if it can be exploited by the larger model capacity. (b) Analyzing the internal representations, to see if deeper networks can indeed extract higher-level features in syntax and semantics Belinkov and Glass (2019). (c) Compressing the very deep model via e.g. knowledge distillation Kim and Rush (2016), to study the trade-offs between size and translation quality. (d) Analyzing how deep models work Allen-Zhu and Li (2020) in theory.

Acknowledgments

We thank Hao Cheng, Akiko Eriguchi, Hany Hassan Awadalla and Zeyuan Allen-Zhu for valuable discussions.