Learning Deep Transformer Models for Machine Translation

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, Lidia S. Chao

Introduction

Neural machine translation (NMT) models have advanced the previous state-of-the-art by learning mappings between sequences via neural networks and attention mechanisms Sutskever et al. (2014); Bahdanau et al. (2015). The earliest of these read and generate word sequences using a series of recurrent neural network (RNN) units, and the improvement continues when 4-8 layers are stacked for a deeper model Luong et al. (2015); Wu et al. (2016). More recently, the system based on multi-layer self-attention (call it Transformer) has shown strong results on several large-scale tasks Vaswani et al. (2017). In particular, approaches of this kind benefit greatly from a wide network with more hidden states (a.k.a. Transformer-Big), whereas simply deepening the network has not been found to outperform the “shallow” counterpart Bapna et al. (2018). Do deep models help Transformer? It is still an open question for the discipline.

For vanilla Transformer, learning deeper networks is not easy because there is already a relatively deep model in useFor example, a standard Transformer encoder has 6 layers. Each of them consists of two sub-layers. More sub-layers are involved on the decoder side.. It is well known that such deep networks are difficult to optimize due to the gradient vanishing/exploding problem Pascanu et al. (2013); Bapna et al. (2018). We note that, despite the significant development effort, simply stacking more layers cannot benefit the system and leads to a disaster of training in some of our experiments.

A promising attempt to address this issue is Bapna et al. (2018)’s work. They trained a 16-layer Transformer encoder by using an enhanced attention model. In this work, we continue the line of research and go towards a much deeper encoder for Transformer. We choose encoders to study because they have a greater impact on performance than decoders and require less computational cost Domhan (2018). Our contributions are threefold:

We show that the proper use of layer normalization is the key to learning deep encoders. The deep network of the encoder can be optimized smoothly by relocating the layer normalization unit. While the location of layer normalization has been discussed in recent systems Vaswani et al. (2018); Domhan (2018); Klein et al. (2017), as far as we know, its impact has not been studied in deep Transformer.

Inspired by the linear multi-step method in numerical analysis Ascher and Petzold (1998), we propose an approach based on dynamic linear combination of layers (DLCL) to memorizing the features extracted from all preceding layers. This overcomes the problem with the standard residual network where a residual connection just relies on the output of one-layer ahead and may forget the earlier layers.

We successfully train a 30-layer encoder, far surpassing the deepest encoder reported so far Bapna et al. (2018). To our best knowledge, this is the deepest encoder used in NMT.

On WMT’16 English-German, NIST OpenMT’12 Chinese-English, and larger WMT’18 Chinese-English translation tasks, we show that our deep system (30/25-layer encoder) yields a BLEU improvement of 1.3\sim2.4 points over the base model (Transformer-Base with 6 layers). It even outperforms Transformer-Big by 0.4\sim0.6 BLEU points, but requires 1.6X fewer model parameters and 3X less training time. More interestingly, our deep model is 10% faster than Transformer-Big in inference speed.

Post-Norm and Pre-Norm Transformer

The Transformer system and its variants follow the standard encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer. The attention model used in Transformer is multi-head attention, and its output is fed into a fully connected feed-forward network. Likewise, the decoder has another stack of identical layers. It has an encoder-decoder attention sub-layer in addition to the two sub-layers used in each encoder layer. In general, because the encoder and the decoder share a similar architecture, we can use the same method to improve them. In the section, we discuss a more general case, not limited to the encoder or the decoder.

For Transformer, it is not easy to train stacked layers on neither the encoder-side nor the decoder-side. Stacking all these sub-layers prevents the efficient information flow through the network, and probably leads to the failure of training. Residual connections and layer normalization are adopted for a solution. Let F\mathcal{F} be a sub-layer in encoder or decoder, and θl\theta_{l} be the parameters of the sub-layer. A residual unit is defined to be He et al. (2016b):

where xlx_{l} and xl+1x_{l+1} are the input and output of the ll-th sub-layer, and yly_{l} is the intermediate output followed by the post-processing function f()f(\cdot). In this way, xlx_{l} is explicitly exposed to yly_{l} (see Eq. (2)).

Moreover, layer normalization is adopted to reduce the variance of sub-layer output because hidden state dynamics occasionally causes a much longer training time for convergence. There are two ways to incorporate layer normalization into the residual network.

Post-Norm. In early versions of Transformer Vaswani et al. (2017), layer normalization is placed after the element-wise residual addition (see Figure 1(a)), like this:

where LN()\textrm{LN}(\cdot) is the layer normalization function, whose parameter is dropped for simplicity. It can be seen as a post-processing step of the output (i.e., f(x)=LN(x)f(x)=\textrm{LN}(x)).

Pre-Norm. In recent implementations Klein et al. (2017); Vaswani et al. (2018); Domhan (2018), layer normalization is applied to the input of every sub-layer (see Figure 1(b)):

Eq. (4) regards layer normalization as a part of the sub-layer, and does nothing for post-processing of the residual connection (i.e., f(x)=xf(x)=x).We need to add an additional function of layer normalization to the top layer to prevent the excessively increased value caused by the sum of unnormalized output.

Both of these methods are good choices for implementation of Transformer. In our experiments, they show comparable performance in BLEU for a system based on a 6-layer encoder (Section § 5.1).

2 On the Importance of Pre-Norm for Deep Residual Network

The situation is quite different when we switch to deeper models. More specifically, we find that pre-norm is more efficient for training than post-norm if the model goes deeper. This can be explained by seeing back-propagation which is the core process to obtain gradients for parameter update. Here we take a stack of LL sub-layers as an example. Let E\mathcal{E} be the loss used to measure how many errors occur in system prediction, and xLx_{L} be the output of the topmost sub-layer. For post-norm Transformer, given a sub-layer ll, the differential of E\mathcal{E} with respect to xlx_{l} can be computed by the chain rule, and we have

where k=lL1LN(yk)yk\prod_{k=l}^{L-1}\frac{\partial\textrm{LN}(y_{k})}{\partial y_{k}} means the backward pass of the layer normalization, and k=lL1(1+F(xk;θk)xk)\prod_{k=l}^{L-1}(1+\frac{\partial{\mathcal{F}(x_{k};\theta_{k})}}{\partial{x_{k}}}) means the backward pass of the sub-layer with the residual connection. Likewise, we have the gradient for pre-norm For a detailed derivation, we refer the reader to Appendix A.:

Obviously, Eq. (6) establishes a direct way to pass error gradient ExL\frac{\partial{\mathcal{E}}}{\partial{x_{L}}} from top to bottom. Its merit lies in that the number of product items on the right side does not depend on the depth of the stack.

In contrast, Eq. (5) is inefficient for passing gradients back because the residual connection is not a bypass of the layer normalization unit (see Figure 1(a)). Instead, gradients have to be passed through LN()\textrm{LN}(\cdot) of each sub-layer. It in turn introduces term k=lL1LN(yk)yk\prod_{k=l}^{L-1}\frac{\partial\textrm{LN}(y_{k})}{\partial y_{k}} into the right hand side of Eq. (5), and poses a higher risk of gradient vanishing or exploring if LL goes larger. This was confirmed by our experiments in which we successfully trained a pre-norm Transformer system with a 20-layer encoder on the WMT English-German task, whereas the post-norm Transformer system failed to train for a deeper encoder (Section § 5.1).

Dynamic Linear Combination of Layers

The residual network is the most common approach to learning deep networks, and plays an important role in Transformer. In principle, residual networks can be seen as instances of the ordinary differential equation (ODE), behaving like the forward Euler discretization with an initial value Chang et al. (2018); Chen et al. (2018b). Euler’s method is probably the most popular first-order solution to ODE. But it is not yet accurate enough. A possible reason is that only one previous step is used to predict the current value Some of the other single-step methods, e.g. the Runge-Kutta method, can obtain a higher order by taking several intermediate steps Butcher (2003). Higher order generally means more accurate.Butcher (2003). In MT, the single-step property of the residual network makes the model “forget” distant layers Wang et al. (2018b). As a result, there is no easy access to features extracted from lower-level layers if the model is very deep.

Here, we describe a model which makes direct links with all previous layers and offers efficient access to lower-level representations in a deep stack. We call it dynamic linear combination of layers (DLCL). The design is inspired by the linear multi-step method (LMM) in numerical ODE Ascher and Petzold (1998). Unlike Euler’s method, LMM can effectively reuse the information in the previous steps by linear combination to achieve a higher order. Let {y0,...,yl}\{y_{0},...,y_{l}\} be the output of layers 0l0\sim l. The input of layer l+1l+1 is defined to be

where G()\mathcal{G}(\cdot) is a linear function that merges previously generated values {y0,...,yl}\{y_{0},...,y_{l}\} into a new value. For pre-norm Transformer, we define G()\mathcal{G}(\cdot) to be

Comparison to LMM. DLCL differs from LMM in two aspects, though their fundamental model is the same. First, DLCL learns weights in an end-to-end fashion rather than assigning their values deterministically, e.g. by polynomial interpolation. This offers a more flexible way to control the model behavior. Second, DLCL has an arbitrary size of the past history window, while LMM generally takes a limited history into account Lóczi (2018). Also, recent work shows successful applications of LMM in computer vision, but only two previous steps are used in their LMM-like system Lu et al. (2018).

Comparison to existing neural methods. Note that DLCL is a very general approach. For example, the standard residual network is a special case of DLCL, where Wll+1=1W_{l}^{l+1}=1, and Wkl+1=0W_{k}^{l+1}=0 for k<lk<l. Figure (2) compares different methods of connecting a 3-layer network. We see that the densely residual network is a fully-connected network with a uniform weighting schema Britz et al. (2017); Dou et al. (2018). Multi-layer representation fusion Wang et al. (2018b) and transparent attention (call it TA) Bapna et al. (2018) methods can learn a weighted model to fuse layers but they are applied to the topmost layer only. The DLCL model can cover all these methods. It provides ways of weighting and connecting layers in the entire stack. We emphasize that although the idea of weighting the encoder layers by a learnable scalar is similar to TA, there are two key differences: 1) Our method encourages earlier interactions between layers during the encoding process, while the encoder layers in TA are combined until the standard encoding process is over; 2) For an encoder layer, instead of learning a unique weight for each decoder layer like TA, we make a separate weight for each successive encoder layers. In this way, we can create more connections between layersLet the encoder depth be MM and the decoder depth be NN (M>NM>N for a deep encoder model). Then TA newly adds O(M×N)\mathcal{O}(M\times N) connections, which are fewer than ours of O(M2)\mathcal{O}(M^{2}).

Experimental Setup

We first evaluated our approach on WMT’16 English-German (En-De) and NIST’12 Chinese-English (Zh-En-Small) benchmarks respectively. To make the results more convincing, we also experimented on a larger WMT’18 Chinese-English dataset (Zh-En-Large) with data augmentation by back-translation Sennrich et al. (2016a).

For the En-De task, to compare with Vaswani et al. (2017)’s work, we use the same 4.5M pre-processed data https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8, which has been tokenized and jointly byte pair encoded (BPE) Sennrich et al. (2016b) with 32k merge operations using a shared vocabulary The tokens with frequencies less than 5 are filtered out from the shared vocabulary.. We use newstest2013 for validation and newstest2014 for test.

For the Zh-En-Small task, we use parts of the bitext provided within NIST’12 OpenMTLDC2000T46, LDC2000T47, LDC2000T50, LDC2003E14, LDC2005T10, LDC2002E18, LDC2007T09, LDC2004T08. We choose NIST MT06 as the validation set, and MT04, MT05, MT08 as the test sets. All the sentences are word segmented by the tool provided within NiuTrans Xiao et al. (2012). We remove the sentences longer than 100 and end up with about 1.9M sentence pairs. Then BPE with 32k operations is used for both sides independently, resulting in a 44k Chinese vocabulary and a 33k English vocabulary respectively.

For the Zh-En-Large task, we use exactly the same 16.5M dataset as Wang et al. (2018a), composing of 7.2M-sentence CWMT corpus, 4.2M-sentence UN and News-Commentary combined corpus, and back-translation of 5M-sentence monolingual data from NewsCraw2017. We refer the reader to Wang et al. (2018a) for the details.

For evaluation, we first average the last 5 checkpoints, each of which is saved at the end of an epoch. And then we use beam search with a beam size of 4/6 and length penalty of 0.6/1.0 for En-De/Zh-En tasks respectively. We measure case-sensitive/insensitive tokenized BLEU by multi-bleu.perl for En-De and Zh-En-Small respectively, while case-sensitive detokenized BLEU is reported by the official evaluation script mteval-v13a.pl for Zh-En-Large. Unless noted otherwise we run each experiment three times with different random seeds and report the mean of the BLEU scores across runsDue to resource constraints, all experiments on Zh-En-Large task only run once..

2 Model and Hyperparameters

All experiments run on fairseq-pyhttps://github.com/pytorch/fairseq with 8 NVIDIA Titan V GPUs. For the post-norm Transformer baseline, we replicate the model setup of Vaswani et al. (2017). All models are optimized by Adam Kingma and Ba (2014) with β1\beta_{1} = 0.9, β2\beta_{2} = 0.98, and ϵ\epsilon = 10-8. In training warmup (warmup = 4000 steps), the learning rate linearly increases from 10-7 to lr =7×\times10-4/5×\times10-4 for Transformer-Base/Big respectively, after which it is decayed proportionally to the inverse square root of the current step. Label smoothing εls\varepsilon_{ls}=0.1 is used as regularization.

For the pre-norm Transformer baseline, we follow the setting as suggested in tensor2tensorhttps://github.com/tensorflow/tensor2tensor. More specifically, the attention dropout Patt = 0.1 and feed-forward dropout Pff = 0.1 are additionally added. And some hyper-parameters for optimization are changed accordingly: β2\beta_{2} = 0.997, warmup = 8000 and lr = 10-3/7×\times10-4 for Transformer-Base/Big respectively.

For both the post-norm and pre-norm baselines, we batch sentence pairs by approximate length and restrict input and output tokens per batch to batch = 4096 per GPU. We set the update steps according to corresponding data sizes. More specifically, the Transformer-Base/Big is updated for 100k/300k steps on the En-De task as Vaswani et al. (2017), 50k/100k steps on the Zh-En-Small task, and 200k/500k steps on the Zh-En-Large task.

In our model, we use the dynamic linear combination of layers for both encoder and decoder. For efficient computation, we only combine the output of a complete layer rather than a sub-layer. It should be noted that for deep models (e.g. LL \geq 20), it is hard to handle a full batch in a single GPU due to memory size limitation. We solve this issue by accumulating gradients from two small batches (e.g. batch = 2048) before each update Ott et al. (2018). In our primitive experiments, we observed that training with larger batches and learning rates worked well for deep models. Therefore all the results of deep models are reported with batch = 8192, lr = 2×\times10-3 and warmup = 16,000 unless otherwise stated. For fairness, we only use half of the updates of baseline (e.g. update = 50k) to ensure the same amount of data that we actually see in training. We report the details in Appendix B.

In Table 4, we first report results on WMT En-De where we compare to the existing systems based on self-attention. Obviously, while almost all previous results based on Transformer-Big (marked by Big) have higher BLEU than those based on Transformer-Base (marked by Base), larger parameter size and longer training epochs are required.

As for our approach, considering the post-norm case first, we can see that our Transformer baselines are superior to Vaswani et al. (2017) in both Base and Big cases. When increasing the encoder depth, e.g. LL = 20, the vanilla Transformer failed to train, which is consistent with Bapna et al. (2018). We attribute it to the vanishing gradient problem based on the observation that the gradient norm in the low layers (e.g. embedding layer) approaches 0. On the contrary, post-norm DLCL solves this issue and achieves the best result when LL = 25.

The situation changes when switching to pre-norm. While it slightly underperforms the post-norm counterpart in shallow networks, pre-norm Transformer benefits more from the increase in encoder depth. More concretely, pre-norm Transformer achieves optimal result when LL=20 (see Figure 3(a)), outperforming the 6-layer baseline by 1.8 BLEU points. It indicates that pre-norm is easier to optimize than post-norm in deep networks. Beyond that, we successfully train a 30-layer encoder by our method, resulting in a further improvement of 0.4 BLEU points. This is 0.6 BLEU points higher than the pre-norm Transformer-Big. It should be noted that although our best score of 29.3 is the same as Ott et al. (2018), our approach only requires 3.5X fewer training epochs than theirs.

To fairly compare with transparent attention (TA) Bapna et al. (2018), we separately list the results using a 16-layer encoder in Table 4.2. It can be seen that pre-norm Transformer obtains the same BLEU score as TA without the requirement of complicated attention design. However, DLCL in both post-norm and pre-norm cases outperform TA. It should be worth that TA achieves the best result when encoder depth is 16, while we can further improve performance by training deeper encoders.

Seen from the En-De task, pre-norm is more effective than the post-norm counterpart in deep networks. Therefore we evaluate our method in the case of pre-norm on the Zh-En task. As shown in Table 4.2, firstly DLCL is superior to the baseline when the network’s depth is shallow. Interestingly, both Transformer and DLCL achieve the best results when we use a 25-layer encoder. The 25-layer Transformer can approach the performance of Transformer-Big, while our deep model outperforms it by about 0.5 BLEU points under the equivalent parameter size. It confirms that our approach is a good alternative to Transformer no matter how deep it is.

3 Results on the Zh-En-Large Task

While deep Transformer models, in particular the deep pre-norm DLCL, show better results than Transformer-Big on En-De and Zh-En-Small tasks, both data sets are relatively small, and the improved performance over Transformer-Big might be partially due to over-fitting in the wider model. For a more challenging task , we report the results on Zh-En-Large task in Table 5.1. We can see that the 25-layer pre-norm DLCL slightly surpassed Transformer-Big, and the superiority is bigger when using a 30-layer encoder. This result indicates that the claiming of the deep network defeating Transformer-Big is established and is not affected by the size of the data set.

Analysis

In Figure 3, we plot BLEU score as a function of encoder depth for pre-norm Transformer and DLCL on En-De and Zh-En-Small tasks. First of all, both methods benefit from an increase in encoder depth at the beginning. Remarkably, when the encoder depth reaches 20, both of the two deep models can achieve comparable performance to Transformer-Big, and even exceed it when the encoder depth is further increased in DLCL. Note that pre-norm Transformer degenerates earlier and is less robust than DLCL when the depth is beyond 20. However, a deeper network (>>30 layers) does not bring more benefits. Worse still, deeper networks consume a lot of memory, making it impossible to train efficiently.

We also report the inference speed on GPU in Figure 4. As expected, the speed decreases linearly with the number of encoder layers. Nevertheless, our system with a 30-layer encoder is still faster than Transformer-Big, because the encoding process is independent of beam size, and runs only once. In contrast, the decoder suffers from severe autoregressive problems.

2 Effect of Decoder Depth

Table 5 shows the effects of decoder depth on BLEU and inference speed on GPU. Different from encoder, increasing the depth of decoder only yields a slight BLEU improvement, but the cost is high: for every two layers added, the translation speed drops by approximate 500 tokens evenly. It indicates that exploring deep encoders may be more promising than deep decoders for NMT.

3 Ablation Study

We report the ablation study results in Table 6. We first observe a modest decrease when removing the introduced layer normalization in Eq. (8). Then we try two methods to replace learnable weights with constant weights: All-One (Wji=1W^{i}_{j}=1) and Average (Wji=1/(i+1)W^{i}_{j}=1/(i+1)). We can see that these two methods consistently hurt performance, in particular in the case of All-One. It indicates that making the weights learnable is important for our model. Moreover, removing the added layer normalization in the Average model makes BLEU score drop by 0.28, which suggests that adding layer normalization helps more if we use the constant weights. In addition, we did two interesting experiments on big models. The first one is to replace the base encoder with a big encoder in pre-norm Transformer-Base. The other one is to use DLCL to train a deep-and-wide Transformer (12 layers). Although both of them benefit from the increased network capacity, the gain is less than the “thin” counterpart in terms of BLEU, parameter size, and training efficiency.

4 Visualization on Learned Weights

We visually present the learned weights matrices of the 30-layer encoder (Figure 5(a)) and its 6-layer decoder (Figure 5(b)) in our pre-norm DLCL-30L model on En-De task. For a clearer contrast, we mask out the points with an absolute value of less than 0.1 or 5% of the maximum per row. We can see that the connections in the early layers are dense, but become sparse as the depth increases. It indicates that making full use of earlier layers is necessary due to insufficient information at the beginning of the network. Also, we find that most of the large weight values concentrate on the right of the matrix, which indicates that the impact of the incoming layer is usually related to the distance between the outgoing layer. Moreover, for a fixed layer’s output yiy_{i}, it is obvious that its contribution to successive layers changes dynamically (one column). To be clear, we extract the weights of y10y_{10} in Figure 5(c). In contrast, in most previous paradigms of dense residual connection, the output of each layer remains fixed for subsequent layers.

Related Work

Deep Models. Deep models have been explored in the context of neural machine translation since the emergence of RNN-based models. To ease optimization, researchers tried to reduce the number of non-linear transitions Zhou et al. (2016); Wang et al. (2017). But these attempts are limited to the RNN architecture and may not be straightforwardly applicable to the current Transformer model. Perhaps, the most relevant work to what is doing here is Bapna et al. (2018)’s work. They pointed out that vanilla Transformer was hard to train if the depth of the encoder was beyond 12. They successfully trained a 16-layer Transformer encoder by attending the combination of all encoder layers to the decoder. In their approach, the encoder layers are combined just after the encoding is completed, but not during the encoding process. In contrast, our approach allows the encoder layers to interact earlier, which has been proven to be effective in machine translation He et al. (2018) and text match Lu and Li (2013). In addition to machine translation, deep Transformer encoders are also used for language modeling Devlin et al. (2018); Al-Rfou et al. (2018). For example, Al-Rfou et al. (2018) trained a character language model with a 64-layer Transformer encoder by resorting to auxiliary losses in intermediate layers. This method is orthogonal to our DLCL method, though it is used for language modeling, which is not a very heavy task.

Densely Residual Connections. Densely residual connections are not new in NMT. They have been studied for different architectures, e.g., RNN Britz et al. (2017) and Transformer Dou et al. (2018). Some of the previous studies fix the weight of each layer to a constant, while others learn a weight distribution by using either the self-attention model Wang et al. (2018b) or a softmax-normalized learnable vector Peters et al. (2018). They focus more on learning connections from lower-level layers to the topmost layer. Instead, we introduce additional connectivity into the network and learn more densely connections for each layer in an end-to-end fashion.

Conclusion

We have studied deep encoders in Transformer. We have shown that the deep Transformer models can be easily optimized by proper use of layer normalization, and have explained the reason behind it. Moreover, we proposed an approach based on a dynamic linear combination of layers and successfully trained a 30-layer Transformer system. It is the deepest encoder used in NMT so far. Experimental results show that our thin-but-deep encoder can match or surpass the performance of Transformer-Big. Also, its model size is 1.6X smaller. In addition, it requires 3X fewer training epochs and is 10% faster for inference.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61876035, 61732005, 61432013 and 61672555), the Fundamental Research Funds for the Central Universities (Grant No. N181602013), the Joint Project of FDCT-NSFC (Grant No. 045/2017/AFJ), the MYRG from the University of Macau (Grant No. MYRG2017-00087-FST).

References

Appendix A Derivations of Post-Norm Transformer and Pre-Norm Transformer

A general residual unit can be expressed by:

where xlx_{l} and xl+1x_{l+1} are the input and output of the ll-th sub-layer, and yly_{l} is the intermediate output followed by the post-processing function f()f(\cdot).

We have known that the post-norm Transformer incorporates layer normalization (LN()\textrm{LN}(\cdot)) by:

where Fpost()=F()\mathcal{F}_{post}(\cdot)=\mathcal{F}(\cdot). Note that we omit the parameter in LN for clarity. Similarly, the pre-norm Transformer can be described by:

where Fpre()=F(LN())\mathcal{F}_{pre}(\cdot)=\mathcal{F}(\textrm{LN}(\cdot)). In this way, we can see that both post-norm and pre-norm are special cases of the general residual unit. Specifically, the post-norm Transformer is the special case when:

Here we take a stack of LL sub-layers as an example. Let E\mathcal{E} be the loss used to measure how many errors occur in system prediction, and xLx_{L} be the output of the top-most sub-layer. Then from the chain rule of back propagation we obtain:

To analyze it, we can directly decompose xLxl\frac{\partial{x_{L}}}{\partial{x_{l}}} layer by layer:

Consider two adjacent layers as Eq.10 and Eq. 11, we have:

For post-norm Transformer, it is easy to know fpost(yl)yl=LN(yl)yl\frac{\partial{f_{post}(y_{l})}}{\partial{y_{l}}}=\frac{\partial{\textrm{LN}(y_{l})}}{\partial{y_{l}}} according to Eq.(14). Then put Eq.(17) and (18) into Eq.(16) and we can obtain the differential L\mathcal{L} w.r.t. xlx_{l}:

Eq.(19) indicates that the number of product terms grows linearly with LL, resulting in prone to gradient vanishing or explosion.

However, for pre-norm Transformer, instead of decomposing the gradient layer by layer in Eq. (17), we can use the good nature that xL=xl+k=lL1Fpre(xk;θk)x_{L}=x_{l}+\sum_{k=l}^{L-1}\mathcal{F}_{pre}(x_{k};\theta_{k}) by recursively using Eq. (13):

Due to fpre(yl)yl=1\frac{\partial{f_{pre}(y_{l})}}{\partial{y_{l}}}=1, we can put Eq. (21) into Eq. (16) and obtain: