On the Sub-Layer Functionalities of Transformer Decoder

Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, Zhaopeng Tu

Introduction

Transformer models have advanced the state of the art on a variety of natural language processing (NLP) tasks, including machine translation Vaswani et al. (2017), natural language inference Shen et al. (2018), semantic role labeling Strubell et al. (2018), and language representation Devlin et al. (2019). However, little is known so far about the internal properties and functionalities they learn in order to achieve their superior performance, which poses significant challenges for human understanding of the models and for designing better architectures.

Recent efforts on interpreting Transformer models mainly focus on assessing the encoder representations Raganato et al. (2018); Yang et al. (2019); Tang et al. (2019a) or interpreting the multi-head self-attentions Li et al. (2018); Voita et al. (2019); Michel et al. (2019). At the same time, there have been few attempts to interpret the decoder side, which we believe is also of great interest, and should be taken into account while explaining the encoder-decoder networks. The reasons are threefold: (a) the decoder takes both source and target as input, and implicitly performs the functionalities of both alignment and language modeling, which are at the core of machine translation; (b) the encoder and decoder are tightly coupled in that the output of the encoder is fed to the decoder and the training signals for the encoder are back-propagated from the decoder; and (c) recent studies have shown that the boundary between the encoder and decoder is blurry, since some of the encoder functionalities can be substituted by the decoder cross-attention modules Tang et al. (2019b).

In this study, we interpret the Transformer decoder by investigating when and where the decoder utilizes source or target information across its stacking modules and layers. Without loss of generality, we focus on the representation evolution within a Transformer decoder (by “evolution”, we denote the progressive trend from the first layer till the last). To this end, we introduce a novel sub-layer split with respect to their functionalities: Target Exploitation Module (TEM) for exploiting the representation from translation history, Source Exploitation Module (SEM) for exploiting the source-side representation, and Information Fusion Module (IFM) to combine representations from the other two (§2.2). Throughout this paper, we use the terms “sub-layer” and “module” interchangeably.

Further, we design a universal probing scheme to quantify the amount of specific information embedded in network representations. By probing both source and target information from decoder sub-layers, and by analyzing the alignment error rate (AER) and source coverage rate, we arrive at the following findings:

SEM guides the representation evolution within the NMT decoder (§3.1).

Higher-layer SEMs accomplish the functionality of word alignment, while lower-layer ones construct the necessary contexts (§3.2).

TEMs are critical to helping SEM build word alignments, while their stacking order is not essential (§3.2).

Last but not least, we conduct a fine-grained analysis on the information fusion process within IFM. Our key contributions in this work are:

We introduce a novel sub-layer split of the Transformer decoder with respect to their functionalities.

We introduce a universal probing scheme from which we derive the aforementioned conclusions about the Transformer decoder.

Surprisingly, we find that the de-facto usage of residual FeedForward operations is not efficient: they can be removed entirely with minimal loss of performance, while significantly boosting the training and inference speeds.

Preliminaries

NMT models employ an encoder-decoder architecture to accomplish the translation process in an end-to-end manner. The encoder transforms the source sentence into a sequence of representations, and the decoder generates target words by dynamically attending to the source representations. Typically, this framework can be implemented with a recurrent neural network (RNN) Bahdanau et al. (2015), a convolutional neural network (CNN) Gehring et al. (2017), or a Transformer Vaswani et al. (2017). We focus on the Transformer architecture, since it has become the state-of-the-art model on machine translation tasks, as well as various text understanding Devlin et al. (2019) and generation Radford et al. (2019) tasks.

Specifically, the decoder is composed of a stack of $N$ identical layers, each of which has three sub-layers, as illustrated in Figure 1. A residual connection He et al. (2016) is employed around each of the three sub-layers, followed by layer normalization Ba et al. (2016) (“Add & Norm”). The first sub-layer is a self-attention module that performs self-attention over the previous decoder layer:

$${\bf C}_{d}^{n}=\textsc{Ln}\big(\textsc{Att}({\bf Q}_{d}^{n},{\bf K}_{d}^{n},{\bf V}_{d}^{n})+{\bf L}_{d}^{n-1}\big)$$

where $\textsc{Att}(\cdot)$ and $\textsc{Ln}(\cdot)$ denote the self-attention mechanism and layer normalization, and ${\bf C}_{d}^{n}$ is the output of this sub-layer. ${\bf Q}_{d}^{n}$ , ${\bf K}_{d}^{n}$ , and ${\bf V}_{d}^{n}$ are query, key and value vectors that are transformed from the $(n-1)$-th layer representation ${\bf L}_{d}^{n-1}$ . The second sub-layer performs attention over the output of the encoder representation:

$${\bf D}_{d}^{n}=\textsc{Ln}\big(\textsc{Att}({\bf C}_{d}^{n},{\bf K}_{e}^{N},{\bf V}_{e}^{N})+{\bf C}_{d}^{n}\big)$$

where ${\bf K}_{e}^{N}$ and ${\bf V}_{e}^{N}$ are transformed from the top encoder representation ${\bf L}^{N}_{e}$ . The final sub-layer is a position-wise fully connected feed-forward network $\textsc{Ffn}(\cdot)$ with ReLU activations:

$${\bf L}_{d}^{n}=\textsc{Ln}\big(\textsc{Ffn}({\bf D}_{d}^{n})+{\bf D}_{d}^{n}\big)$$

The top decoder representation ${\bf L}_{d}^{N}$ is then used to generate the final prediction.

2.2 Sub-Layer Partition

In this work, we aim to reveal how a Transformer decoder accomplishes the translation process utilizing both source and target inputs. To this end, we split each decoder layer into three modules with respect to their different functionalities over the source or target inputs, as illustrated in Figure 1:

Target Exploitation Module (TEM) consists of the self-attention operation and a residual connection, which exploits the target-side translation history from previous layer representations.

Source Exploitation Module (SEM) consists only of the encoder attention, which dynamically selects relevant source-side information for generation.

Information Fusion Module (IFM) consists of the rest of the operations, which fuse source and target information into the final layer representation.

Compared with the standard splits Vaswani et al. (2017), we associate the “Add&Norm” operation after encoder attention with the IFM, since it starts the process of information fusion by a simple additive operation. Consequently, the functionalities of the three modules are well-separated.
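As an illustration of this partition, the following minimal Python sketch computes the TEM, SEM and IFM outputs of one decoder layer. It is not the FairSeq implementation: it assumes a single attention head without learned projections and a layer normalization without gain or bias, and the names `decoder_layer`, `attention`, `add_norm` and `ffn` are hypothetical helpers introduced only for this example.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention (assumption: no projections)."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[j] for e, v in zip(exps, values))
                        for j in range(len(values[0]))])
    return outputs

def add_norm(xs, ys):
    """Residual sum followed by layer normalization, applied per position."""
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(xs, ys)]

def decoder_layer(prev, enc, ffn):
    """Return the (TEM, SEM, IFM) representations of one decoder layer."""
    # TEM: causal self-attention over the previous layer, plus Add&Norm.
    self_att = [attention([prev[t]], prev[:t + 1], prev[:t + 1])[0]
                for t in range(len(prev))]
    tem = add_norm(self_att, prev)
    # SEM: encoder attention only, queried by the TEM representation.
    sem = attention(tem, enc, enc)
    # IFM: Add&Norm after encoder attention, feed-forward, second Add&Norm.
    fused = add_norm(sem, tem)
    ifm = add_norm([ffn(h) for h in fused], fused)
    return tem, sem, ifm
```

In this sketch, `prev` and `enc` are lists of equal-width vectors (previous decoder layer and top encoder layer), and `ffn` is any position-wise function, for example a ReLU.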

2.3 Research Questions

The modern Transformer decoder is implemented as multiple identical layers, in which the source and target information is exploited and evolves layer by layer. One research question arises naturally:

RQ1. How do source and target information evolve within the decoder layer-by-layer and module-by-module?

In Section 3.1, we introduce a universal probing scheme to quantify the amount of information embedded in decoder modules and explore their evolutionary trends. The general trend we find is that higher layers contain more source and target information, while the sub-layers behave differently. Specifically, the amount of information contained in SEMs first increases and then decreases. In addition, we establish that SEM guides both source and target information evolution within the decoder.

Since SEMs are critical to the decoder representation evolution, we conduct a more detailed study into the internal behaviors of the SEMs. The exploitation of source information is also closely related to the inadequate translation problem – a key weakness of NMT models Tu et al. (2016). We try to answer the following research question:

RQ2. How does SEM exploit the source information in different layers?

In Section 3.2, we investigate how the SEMs transform the source information to the target side in terms of alignment accuracy and coverage ratio Tu et al. (2016). Experimental results show that higher-layer SEMs accomplish word alignment, while lower-layer ones exploit the necessary contexts. This also explains the representation evolution of source information: lower layers collect more source information to obtain a global view of the source input, and higher layers extract a smaller amount of aligned source input for accurate translation.

Of the three sub-layers, IFMs conceptually appear to play a key role in merging source and target information, which raises our final question:

RQ3. How does IFM fuse source and target information on the operation level?

In Section 3.3, we first conduct a fine-grained analysis of the IFM on the operation level, and find that a simple “Add&Norm” operation performs just as well at fusing information. Thus, we simplify the IFM to only one Add&Norm operation. Surprisingly, this performs similarly to the full model while significantly reducing the number of parameters and consequently boosting both training and inference speed.

Experiments

To make our conclusions compelling, all experiments and analyses are conducted on three representative language pairs. For English $\Rightarrow$ German (En $\Rightarrow$ De), we use the WMT14 dataset, which consists of 4.5M sentence pairs. The English $\Rightarrow$ Chinese (En $\Rightarrow$ Zh) task is conducted on the WMT17 corpus, consisting of 20.6M sentence pairs. For the English $\Rightarrow$ French (En $\Rightarrow$ Fr) task, we use the WMT14 dataset, which comprises 35.5M sentence pairs. English and French have many aspects in common, while English and German differ in word order, requiring a significant amount of reordering in translation. In addition, Chinese belongs to a different language family from the others.

Models

We conducted the experiments on the state-of-the-art Transformer Vaswani et al. (2017), and implemented our approach with the open-source toolkit FairSeq Ott et al. (2019). We follow the setting of Transformer-Base in Vaswani et al. (2017), which consists of 6 stacked encoder/decoder layers with a model size of 512. We train our models on 8 NVIDIA P40 GPUs, each allocated a batch size of 4,096 tokens. We use the Adam optimizer Kingma and Ba (2015) with 4,000 warm-up steps. More implementation details are in Sec A.1.

3.1 Representation Evolution Across Layers

In order to quantify and visualize the representation evolution, we design a universal probing scheme to quantify the source (or target) information stored in network representations.

Intuitively, the more source (or target) information is stored in a network representation, the more likely a trained reconstructor is to recover the source (or target) sequence. Since the lengths of the source sequence and the decoder representations are not necessarily the same, the widely-used classification-based probing approaches Belinkov et al. (2017); Tenney et al. (2019b) cannot be applied to this task. Accordingly, we cast this task as a generation problem: evaluating the likelihood of generating the word sequence conditioned on the input representation.

Figure 2 illustrates the architecture of our probing scheme. Given a representation sequence from the decoder ${\bf H}=\{{\bf h}_{1},\dots,{\bf h}_{M}\}$ and the source (or target) word sequence to be recovered ${\bf x}=\{x_{1},\dots,x_{N}\}$ , the recovery likelihood is calculated as the perplexity (i.e. negative log-likelihood) of forced-decoding the word sequence:

$$PPL({\bf x}|{\bf H})=\sum_{n=1}^{N}-\log P(x_{n}|x_{<n},{\bf H})\qquad(1)$$

The lower the recovery perplexity, the more source (or target) information is stored in the representation. The probing model can be implemented with any architecture. For simplicity, we use a one-layer Transformer decoder. We train the probing model to recover both source and target sequences from all decoder sub-layer representations. During training, we fix the NMT model parameters and train the probing model on the MT training set to minimize the recovery perplexity in Equation 1.
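The recovery score described above can be sketched in a few lines of Python. The sketch assumes the score is the summed negative log-probability of the reference tokens under forced decoding; `token_probs` is a hypothetical list holding the probability that the probing model assigns to each reference word given the preceding reference words and the probed representation.

```python
import math

def recovery_ppl(token_probs):
    """Summed negative log-likelihood of the reference tokens (forced decoding).

    token_probs[n] is assumed to be P(x_n | x_<n, H) from the probing model.
    """
    return sum(-math.log(p) for p in token_probs)

# A representation from which the sequence is easier to recover scores lower.
informative = recovery_ppl([0.9, 0.8, 0.95])
uninformative = recovery_ppl([0.2, 0.1, 0.3])
```

Dividing by the sequence length would give a per-token variant; that normalization is not assumed here.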

Task Discussion

The above probing scheme is a general framework applicable to probing any given sequence from a network representation. When we probe for the source sequence, the probing model is analogous to an auto-encoder Bourlard and Kamp (1988); Vincent et al. (2010), which reconstructs the original input from the network representations. When we probe for the target sequence, we apply an attention mask to the probing decoder to avoid direct copying from the input of translation histories. Contrary to source probing, the target sequence is never seen by the model.

In addition, our proposed scheme can also be applied to probe linguistic properties that can be represented in a sequential format. For instance, we could probe source constituency parsing information, by training a probing model to recover the linearized parsing sequence Vinyals et al. (2015). Due to space limitations, we leave the linguistic probing to future work.

Probing Results

Figure 3 shows the results of our information probing conducted on the heldout set. We have a few observations:

The evolution trends of TEM and IFM are largely the same. Specifically, the curve of TEM is very close to that of IFM shifted up by one layer. Since TEM representations are two operations (self-attn. and Add&Norm) away from the previous layer IFM, this observation indicates TEMs do not significantly affect the amount of source/target information. TEM may change the order or distribution of source/target information, which is not captured by our probing experiments.

SEM guides both source and target information evolution. On close inspection of the curves, the trend of the layer representations (i.e. IFM) is always led by that of SEM. For example, as the PPL of SEM turns from decreasing to increasing, the PPL of IFM first decreases more slowly and then starts increasing as well. This is intuitive: in machine translation, source and target sequences should contain equivalent information, thus the target generation should largely follow the lead of source information (from SEM representations) to guarantee its adequacy.

For IFM, the amount of target information consistently increases in higher layers – a consistent decrease of PPL in Figures 3(d-f). While source information goes up in the lower layers, it drops in the highest layer (Figures 3(a-c)).

Since SEM representations are critical to decoder evolution, we turn to investigate how SEMs exploit source information, in the hope of explaining the decoder information evolution.

3.2 Exploitation of Source Information

Ideally, SEM should accurately and fully incorporate the source information for the decoder. Accordingly, we evaluate how well SEMs accomplish the expected functionality from two perspectives.

Previous studies generally interpret the attention weights of SEM as word alignments between source and target words, which can measure whether SEMs select the most relevant part of source information for each target token Tu et al. (2016); Li et al. (2019); Tang et al. (2019b). We follow previous practice to merge attention weights from the SEM attention heads, and to extract word alignments by selecting the source word with the highest attention weight for each target word. We calculate the alignment error rate (AER) scores Och and Ney (2003) for word alignments extracted from SEM of each decoder layer.
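The alignment extraction and scoring described above can be sketched as follows. The AER formula is the one of Och and Ney (2003), with sure links $S$ and possible links $P \supseteq S$. Merging heads by averaging is an assumption of this sketch, and the function names are hypothetical.

```python
def merge_heads(heads):
    """Average per-head attention matrices (averaging is an assumption here)."""
    n = len(heads)
    return [[sum(h[t][s] for h in heads) / n for s in range(len(heads[0][0]))]
            for t in range(len(heads[0]))]

def extract_alignment(attn):
    """attn[t][s] is the weight from target position t to source position s.

    Each target word is linked to its highest-weighted source word;
    links are returned as (source, target) index pairs.
    """
    return {(max(range(len(row)), key=row.__getitem__), t)
            for t, row in enumerate(attn)}

def aer(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S a subset of P."""
    possible = possible | sure
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))
```

A lower AER means that the extracted links agree better with the manual annotation.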

Cumulative Coverage.

Coverage is commonly used to evaluate whether the source words are fully translated Tu et al. (2016); Kong et al. (2019). We use the above extracted word alignments to identify the set of source words $A_{i}$ , which are covered (i.e., aligned to at least one target word) at each layer. We then propose a new metric cumulative coverage ratio $C_{\leq i}$ to indicate how many source words are covered by the layers $\leq i$ :

$$C_{\leq i}=\frac{|A_{1}\cup\dots\cup A_{i}|}{N}$$

where $N$ is the number of total source words. This metric indicates the completeness of source information coverage till layer $i$ .
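A minimal sketch of the cumulative coverage ratio: for each layer, it takes the union of the covered source positions of all layers up to that one and divides its size by the number of source words. The argument names are hypothetical.

```python
def cumulative_coverage(covered_per_layer, num_source_words):
    """Return [C_<=1, C_<=2, ...] given the covered source positions A_i per layer."""
    ratios, seen = [], set()
    for covered in covered_per_layer:
        seen |= set(covered)          # union of A_1 ... A_i
        ratios.append(len(seen) / num_source_words)
    return ratios

# Three layers over a four-word source sentence.
ratios = cumulative_coverage([{0, 1}, {1, 2}, {2}], 4)
```

By construction the ratio never decreases with depth, so a flat segment indicates layers that attend only to already covered words.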

Dataset

We conducted experiments on two manually-labeled alignment datasets: RWTH En-De (https://www-i6.informatik.rwth-aachen.de/goldAlignment) and En-Zh Liu and Sun (2015). The alignments are extracted from NMT models trained on the WMT En-De and En-Zh datasets.

Results

Figure 4 demonstrates our results on word alignment and cumulative coverage. We find that the lower-layer SEMs focus on gathering source contexts (rapid increase of cumulative coverage with poor word alignment), while higher-layer ones play the role of word alignment, with the lowest AER score of less than 0.4 at the 5th layer. The $4^{th}$ layer and the $3^{rd}$ layer separate the two roles for En-De and En-Zh respectively. Correspondingly, they are also the turning points (where PPL turns from decreasing to increasing) of source information evolution in Figure 3 (a,b). Together with the conclusions from Sec. 3.1, we demonstrate the general pattern of SEM: SEM tends to cover more source content and gain an increasing amount of source information up to a turning point at the $3^{rd}$ or $4^{th}$ layer, after which it starts attending only to the most relevant source tokens and contains a decreasing amount of total source information.

TEM Modules

Since TEM representations serve as the query vector for encoder attention operations (shown in Figure 1), we naturally hypothesize that TEM helps SEM build alignments.

To verify this, we remove TEM from the decoder (“SEM $\Rightarrow$ IFM”), which significantly increases the alignment error from 0.37 to 0.54 (Figure 5) and leads to a serious decrease in translation performance (BLEU: 27.45 $\Rightarrow$ 22.76, Table 1) on En-De; results on En-Zh confirm this as well (Figure 6). This indicates that TEM is essential for building word alignment.

However, reordering the stacking of TEM and SEM (“SEM $\Rightarrow$ TEM $\Rightarrow$ IFM”) does not affect the alignment or translation qualities (BLEU: 27.45 vs. 27.61). These results provide empirical support for recent work on merging TEM and SEM modules Zhang et al. (2019).

Robustness to Decoder Depth

To verify the robustness of our conclusions, we vary the depth of NMT decoder and train it from scratch. Table 2 demonstrates the results on translation quality, which generally show that more decoder layers bring better performance. Figure 7 shows that SEM behaves similarly regardless of depth. These results demonstrate the robustness of our conclusions.

3.3 Information Fusion in Decoder

We now turn to the analysis of IFM. Within the Transformer decoder, IFM plays the critical role of fusing the source and target information by merging representations from SEM and TEM. To study the information fusion process, we conduct a more fine-grained analysis on IFM at the operation level.

As shown in Figure 8(a), IFM contains three operations:

Add-Norm ${}^{I}$ linearly sums and normalizes the representations from SEM and TEM;

Feed-Forward non-linearly transforms the fused source and target representations;

Add-Norm ${}^{I\!I}$ again linearly sums and normalizes the representations from the above two.

IFM Analysis Results

Figures 8 (b) and (c) respectively illustrate the source and target information evolution within IFM.

Surprisingly, Add-Norm ${}^{I}$ contains as much source (and target) information as Add-Norm ${}^{I\!I}$ , if not more, while the Feed-Forward curve deviates significantly from both. This indicates that the residual Feed-Forward operation may not affect the source (and target) information evolution, and one Add&Norm operation may be sufficient for information fusion.

Simplified Decoder

To empirically demonstrate whether one Add&Norm operation is already sufficient, we remove all other operations, leaving just one Add&Norm operation for the IFM. The architectural change is illustrated in Figure 9(b), and we dub it the “simplified decoder”.
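The difference between the two IFM variants can be sketched as below. This is an illustration under simplifying assumptions (layer normalization without gain or bias, and a caller-supplied `ffn`), not the FairSeq code. The function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_norm(xs, ys):
    """Residual sum followed by layer normalization, applied per position."""
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(xs, ys)]

def ifm_standard(tem, sem, ffn):
    """Standard IFM: first Add&Norm, feed-forward, second Add&Norm."""
    fused = add_norm(sem, tem)
    return add_norm([ffn(h) for h in fused], fused)

def ifm_simplified(tem, sem):
    """Simplified IFM: a single Add&Norm over the SEM and TEM outputs."""
    return add_norm(sem, tem)
```

The simplified variant has no feed-forward weights at all, which is where the parameter and speed savings come from.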

Simplified Decoder Results

Table 4 reports the translation performance of both architectures on all three major datasets, while Figure 10 illustrates the information evolution of both on WMT En-De. We find the simplified model reaches comparable performance with only a minimal drop of 0.1-0.3 BLEU on En-De and En-Fr, while observing 0.9 BLEU gains on En-Zh. (Simplified models are trained with the same hyper-parameters as standard ones, which may be suboptimal as the number of parameters is significantly reduced.) To further assess the translation performance, we manually evaluate 100 translations sampled from the En-Zh test set. On a scale of 1 to 5, we find that the simplified decoder obtains a fluency score of 4.01 and an adequacy score of 3.87, which is approximately equivalent to that of the standard decoder, i.e. 4.00 for fluency and 3.86 for adequacy (in Table 5).

On the other hand, since the simplified decoder drops the operations (FeedForward) with the most parameters (shown in Table 7), we also expect a significant increase in training and inference speeds. From Table 4, we confirm a consistent boost of both training and inference speeds by approximately 11-14%. To demonstrate the robustness, we also confirm our findings under the Transformer Big setting Vaswani et al. (2017), whose results are shown in Section A.2. The lower PPL in Figure 10 suggests that the simplified model also contains consistently more source and target information across its stacking layers.
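As a rough back-of-the-envelope check of the parameter reduction, the sketch below counts the weights of one position-wise feed-forward block. It assumes the inner dimension of 2,048 used by Transformer-Base in Vaswani et al. (2017) and bias terms on both linear maps. The exact counts for our models are those in Table 7, and `ffn_parameters` is a hypothetical helper.

```python
def ffn_parameters(d_model, d_ff, bias=True):
    """Parameters of a two-layer position-wise feed-forward block."""
    weights = 2 * d_model * d_ff                 # d_model -> d_ff -> d_model
    biases = (d_ff + d_model) if bias else 0
    return weights + biases

per_layer = ffn_parameters(512, 2048)            # assumed Transformer-Base sizes
six_layers = 6 * per_layer                       # six decoder layers
```

Under these assumptions, each decoder layer loses about 2.1M feed-forward parameters, or about 12.6M over the six layers.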

Our results demonstrate that a single Add&Norm is indeed sufficient for IFM, and the simplified model reaches comparable performance with a significant parameter reduction and a noticeable 11-14% boost in training and inference speed.

Related Work

Previous studies generally focus on interpreting the encoder representations by evaluating how informative they are for various linguistic tasks Conneau et al. (2018); Tenney et al. (2019b), for both RNN models Shi et al. (2016); Belinkov et al. (2017); Bisazza and Tump (2018); Blevins et al. (2018) and Transformer models Raganato et al. (2018); Tang et al. (2019a); Tenney et al. (2019a); Yang et al. (2019). Although they found that a certain amount of linguistic information is captured by encoder representations, it is still unclear how much encoded information is used by the decoder. Our work bridges this gap by interpreting how the Transformer decoder exploits the encoded information.

Interpreting Encoder Self-Attention

In recent years, there has been a growing interest in interpreting the behaviors of attention modules. Previous studies generally focus on the self-attention in the encoder, which is implemented as multi-head attention. For example, Li et al. (2018) showed that different attention heads in the encoder-side self-attention generally attend to the same position. Voita et al. (2019) and Michel et al. (2019) found that only a few attention heads play consistent and often linguistically-interpretable roles, and others can be pruned. Geng et al. (2020) empirically validated that a selective mechanism can mitigate the problem of word order encoding and structure modeling of encoder-side self-attention. In this work, we investigated the functionalities of decoder-side attention modules for exploiting both source and target information.

Interpreting Encoder Attention

The encoder-attention weights are generally employed to interpret the output predictions of NMT models. Recently, Jain and Wallace (2019) showed that attention weights are weakly correlated with the contribution of source words to the prediction. He et al. (2019) used the integrated gradients to better estimate the contribution of source words. Related to our work, Li et al. (2019) and Tang et al. (2019b) also conducted word alignment analysis on the same De-En and Zh-En datasets with Transformer models. (We find our results are more similar to those of Tang et al. (2019b). Also, our results are reported on the En $\Rightarrow$ De and En $\Rightarrow$ Zh directions, while they report results in the inverse directions.) We use similar techniques to examine word alignment in our context; however, we also introduce a forced-decoding-based probing task to closely examine the information flow.

Understanding and Improving NMT

Recent work has started to improve NMT based on the findings of interpretation. For instance, Belinkov et al. (2017, 2018) pointed out that different layers prioritize different linguistic types, based on which Dou et al. (2018) and Yang et al. (2019) simultaneously exposed all of these signals to the subsequent process. Dalvi et al. (2017) explained why the decoder learns considerably less morphology than the encoder, and then explored explicitly injecting morphology into the decoder. Emelin et al. (2019) argued that the need to represent and propagate lexical features in each layer limits the model’s capacity, and introduced gated shortcut connections between the embedding layer and each subsequent layer. Wang et al. (2020) revealed that miscalibration remains a severe challenge for NMT during inference, and proposed a graduated label smoothing that can improve the inference calibration. In this work, based on our information probing analysis, we simplified the decoder by removing the residual feedforward module entirely, with minimal loss of translation quality and a significant boost of both training and inference speeds.

Conclusions

In this paper, we interpreted the NMT Transformer decoder by assessing the evolution of both source and target information across layers and modules. To this end, we investigated the information functionalities of decoder components in the translation process. Experimental results on three major datasets revealed several findings that help understand the behaviors of the Transformer decoder from different perspectives. We hope that our analysis and findings could inspire architectural changes for further improvements, such as 1) improving the word alignment of higher SEMs by incorporating external alignment signals; 2) exploring the stacking order of SEM, TEM and IFM sub-layers, which may provide a more effective way to transform information; 3) further pruning redundant sub-layers for efficiency.

Since our analysis approaches are not limited to the Transformer model, it is also interesting to explore other architectures such as RNMT Chen et al. (2018) and ConvS2S Gehring et al. (2017), as well as document-level NMT Wang et al. (2017, 2019). In addition, our analysis methods can be applied to other sequence-to-sequence tasks such as summarization and grammar error correction, whose source and target sides are in the same language. We leave those tasks for future work.

Acknowledgments

Tadepalli acknowledges the support of DARPA under grant number N66001-17-2-4030. The authors thank the anonymous reviewers for their insightful and helpful comments.

References

Appendix A Additional Results

All Transformer models are selected based on their loss on the validation set, and are evaluated and reported on the test set. For En-De and En-Fr models, we used newstest2013 as the validation set and newstest2014 as the test set. For En-Zh models, we used newsdev2016 as the validation set and newstest2017 as the test set.

All three datasets follow the preprocessing steps from FairSeq (https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-wmt14en2de.sh), which uses the Moses tokenizer (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/mosestokenizer/tokenizer.py), with a joint BPE of 40000 steps, and includes neither lower-casing nor true-casing.

All models are evaluated with a beam size of 10. Before evaluating the BLEU score, we apply a postprocessing step, where En-De and En-Fr generations apply compound word splitting (https://gist.github.com/myleott/da0ea3ce8ee7582b034b9711698d5c16), and En-Zh generations apply Chinese word splitting (into Chinese characters). All generations are then evaluated with the Moses multi-bleu.perl script (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) against the gold references.

A.2 Transformer Big Results

We also compare the performance of the standard and simplified decoders under the Transformer Big setting. Big models are trained on 4 NVIDIA V100 GPUs, each allocated a batch size of 8,192 tokens. Other training schedules and hyper-parameters are the same as in the standard setting Vaswani et al. (2017). Also, our Transformer Base models are all trained with full precision (FP32), while Big models are all trained with half precision (FP16) for faster training.

Transformer Big results are shown in Table 6. We observe a larger BLEU score drop together with a larger speed boost under the Big setting. This is intuitive: compared to the Base setting, the simplified decoder drops more parameters while still being trained under the same schedule as the standard model, thus widening the training discrepancy. Unfortunately, due to resource limitations, we could not afford hyper-parameter tuning for Transformer.

A.3 Additional En-Zh and En-Fr Plots

All experiments are conducted on three datasets (En-De, En-Zh and En-Fr), where we have similar findings. Due to space limits, we mainly demonstrate results on the En-De task in our paper. In this section, we provide additional results on En-Zh and En-Fr where applicable.