Adding Interpretable Attention to Neural Translation Models Improves Word Alignment
Thomas Zenkel, Joern Wuebker, John DeNero
Introduction
Self-attention-based sequence-to-sequence models (Vaswani et al., 2017) have recently emerged as the state of the art in neural machine translation. The decoder of the network typically consists of multiple layers, each with several attention heads. This makes it hard to interpret the attention activations and extract meaningful word alignments. As a result, the most widely used tools to obtain word alignments are still based on the statistical IBM word alignment models, which were introduced more than two decades ago. In this work we describe a simple modeling extension as well as a novel inference procedure that are capable of extracting alignments with a quality comparable to Giza++ from the self-attentional Transformer architecture.
Word alignments extracted as a by-product of machine translation have a number of applications. For example, they can be used to inject an external lexicon into the inference process to improve the translations of low-frequency content words (Arthur et al., 2016). Another use of word alignments is to project annotations from the source sentence to the target sentence. For example, if part of the source sentence is highlighted in the source document, the corresponding part of the target should be highlighted as well. In localization, all formatting and annotation is stored as tags over spans of the source, and word alignments can serve to place those tags automatically into the target.
The most widely used tools for word alignment are Giza++ (Och and Ney, 2003), which uses the IBM Model 4 in its default setting, and FastAlign (Dyer et al., 2013), which relies on a re-parameterized version of the IBM Model 2. Previous work on neural models for word alignment either requires complicated training approaches (Tamura et al., 2014) or relies on supervised training, the training data for which is generated with one of the tools mentioned above (Alkhouli et al., 2018; Peter et al., 2017).
This work extends the Transformer architecture with a separate alignment layer on top of the decoder sub-network. This layer does not contain skip connections around the encoder-attention module, so that it is trained to predict the next target word based on a linear combination of the encoder information. This encourages the alignment layer to focus its attention activations on relevant source words for a given target word. As a result we have two separate output layers, each of which defines a probability distribution over the next target word. During inference the add-on output layer is ignored and only the attention activations are computed, which are interpreted as a distribution over word alignments.
However, the attention mechanism still ignores all future target information, including the target token for which it computes the attention activations. In order to query the model w.r.t. the aligned target word to further improve the resulting alignments, we directly optimize the attention activations to maximize the likelihood of the target word using stochastic gradient descent (SGD).
Our approach has a number of desirable properties:
The model can be trained on the same data as the translation model in an unsupervised fashion.
Both the alignment and the translation model are incorporated into a single network.
Training is done by fine-tuning an existing translation model, significantly reducing overall training cost.
The extension is straightforward and easy to implement.
We validate our approach on three hand-aligned, publicly available data sets and compare the alignment error rate (AER) to a naïve baseline, FastAlign and Giza++. On the French-English task our method improves the alignment quality by a factor of three over the naïve baseline. With the application of the grow-diagonal heuristic for bidirectional alignment merging (Koehn et al., 2005), we achieve results that are comparable to Giza++ on two of the three data sets.
Machine Translation Model
All neural machine translation (NMT) models in this work are based on the Transformer architecture introduced by Vaswani et al. (2017). It follows the encoder-decoder paradigm (Sutskever et al., 2014) and is composed of two sub-networks. The encoder network transforms the source sentence into a high-dimensional continuous space representation of the sentence. The decoder network uses the output of the encoder to compute a probability distribution over target language sentences. Different from the previously most widely used recurrent networks, it relies entirely on the attention mechanism to incorporate context. The attention function (Bahdanau et al., 2015) is a major factor in the recent success of NMT and provides a mechanism to create a fixed-length context vector as a weighted sum of a variable number of input vectors.
In the Transformer architecture, each layer of the encoder network consists of two sub-layers, namely a self-attention module and a feed-forward module. Decoder layers are made up of a self-attention module, an encoder-attention module and a feed-forward module. In the decoder the self-attention sub-network is masked so that only left-hand context is incorporated. In contrast to Vaswani et al. (2017), we mask the attention to not attend to the end-of-sentence token. The encoder-attention always uses the output of the final encoder layer as input.
With the resulting attention activations we calculate the weighted sum of the values:
That is, for each head the query and the set of key-value pairs are linearly projected before being fed to the individual attention heads. The outputs of the individual heads are concatenated, linearly projected and fed into the next layer.
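The weighted sum described above can be illustrated with a small sketch for a single query and a single head. It assumes the scaled dot-product scoring of Vaswani et al. (2017); the function names `softmax` and `attention` and the toy vectors are illustrative and not taken from the paper's implementation.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    # Assumed scoring function: dot product scaled by sqrt(d_k).
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)  # attention activations, one per key
    dim = len(values[0])
    # Context vector: weighted sum of the value vectors.
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


weights, context = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

In a multi-head setting the same computation would be run once per head on separately projected queries, keys and values, with the per-head context vectors concatenated afterwards.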
We use an encoder with 6 and a decoder with 3 layers. The decoder sublayers are simplified versions of those described by Vaswani et al. (2017): The filter sub-layers perform only a single linear transformation, and layer normalization is only applied once per decoder layer after the filter sub-layer.
Related Work
The most commonly used statistical alignment models directly build on the lexical translation models of Brown et al. (1993), which are also referred to as IBM models. A popular tool is FastAlign (Dyer et al., 2013), a reparameterization of IBM Model 2, which is known for its usability and speed. Giza++ (Och and Ney, 2003), which is based on IBM Model 4, provides a solid benchmark in terms of AER results. In our experiments we run MGIZA++ (Gao and Vogel, 2008), a parallel implementation of Giza++, and FastAlign with default parameters as baselines.
The IBM models are composed of an alignment and a lexical model component. The alignment component is unlexicalized. The lexical component models the likelihood for each source word based on a single target word, i.e. it is conditionally independent of the source and target context. We argue that this assumption is a disadvantage compared to neural approaches, which commonly encode the content and the context of each word in a continuous representation. In this work we make use of both the source and the target context to infer word alignments.
Neural Models
The neural approaches that affect the attention of the model and therefore influence the alignments can be categorized into two groups depending on their goal: Improving translation quality of the machine translation system or solely focusing on the generation of alignments.
Nguyen and Chiang (2018) add a linear combination of source embeddings to the decoder output to improve the prediction of the next target word in an attention-based translation model. This encourages the model to attend to a useful source word and prevents the resulting alignments from being shifted by one word compared to human judgement. Alkhouli et al. (2018) train a single alignment head of the Transformer with supervised data generated with Giza++, so that its attention directly corresponds to alignments. This improved attention is then used during inference to separate the translation objective into a lexical and an alignment component and improve dictionary-guided translations. Arthur et al. (2016) add a lexical probability vector, a vector generated based on the information of a discrete lexicon, and use the attention vector to decide which source word’s lexical probabilities the model should focus on.
Tamura et al. (2014) directly predict alignments based on a recurrent neural network which is conditioned on both the source and the target sequences. By using noise contrastive estimation and tying weights of a forward and a backward model during training they are able to train this network while only relying on IBM Model 1 to generate negative examples. Peter et al. (2017) build on an attention-based neural network to extract alignments. They achieve their best results by bootstrapping the attention with Giza++ alignments and by using target foresight, a technique that uses the target word during training to improve its attention.
Alignment Layer
In order to train our alignment component in an unsupervised fashion, i.e. without word-aligned training data, we want to design it with the following property: A source token should be aligned to a target token if we can predict the target token based on a continuous representation of the corresponding source token.
We achieve this by adding an alignment layer to the top of the decoder. As depicted in Figure 1 the complete model predicts the next target word twice: once with the original decoder, once based on the alignment layer.
The alignment layer uses a single multi-head attention submodule as in Equation 2 and attends to the encoder. As the query we use the decoder output; the keys and the values are both taken from the same encoder representation.
For now we leave open the exact choice of the vectors based on the hidden representation of the encoder, which we use as the keys and values. We calculate the probability vector of the next word by applying the output projection matrix to the output of the multi-head attention of Equation 2 and normalizing the result with a softmax.
In contrast to a decoder layer of the Transformer, the alignment layer does not use a self-attention sub-layer and we do not apply any skip connections. Thus the target word prediction of the alignment layer is forced to rely solely on the context vector, a linear combination of the encoder-side representations.
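A minimal sketch of this prediction path may help: the only input to the output projection is the context vector, so a target word can only be predicted well if the attention selects useful source positions. The names `alignment_layer_predict`, `values` and `out_proj`, as well as the toy numbers, are assumptions made for illustration; the softmax normalization is the usual choice for an output layer and is assumed here.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_layer_predict(attn_weights, values, out_proj):
    """Predict the next target word from the context vector alone.

    attn_weights: one activation per source position
    values:       one (already projected) vector per source position
    out_proj:     hypothetical output projection, one row per vocabulary entry
    """
    dim = len(values[0])
    # Context vector: linear combination of encoder-side representations.
    # No skip connection adds decoder state back in at this point.
    context = [
        sum(a * v[d] for a, v in zip(attn_weights, values)) for d in range(dim)
    ]
    logits = [sum(w * c for w, c in zip(row, context)) for row in out_proj]
    return softmax(logits)


values = [[1.0, 0.0], [0.0, 1.0]]  # two source positions
out_proj = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # three vocabulary entries
probs_first = alignment_layer_predict([0.9, 0.1], values, out_proj)
probs_second = alignment_layer_predict([0.1, 0.9], values, out_proj)
```

Shifting the attention from the first to the second source position moves the probability mass from the first to the second vocabulary entry, which is the behavior the training objective rewards.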
Figure 1 summarizes the whole architecture. Note that the decoder output is masked, i.e. it only encodes the left-hand context.
For the encoder representation we want to encode both the content of the source words and the context in which they appear. Therefore, we experiment with the following options: directly using the word embeddings, using the encoder output, and using the average of the word embeddings and the encoder output as the encoder representation. Table 1 summarizes these options.
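The three options can be written down compactly. The sketch below assumes that the averaged variant combines the word embeddings with the encoder output position by position, which is what the results discussion suggests; the function name and the mode labels are illustrative.

```python
def encoder_representation(embeddings, encoder_output, mode):
    """Return the vectors used as keys and values of the alignment layer.

    embeddings:     one word-embedding vector per source position
    encoder_output: one final-layer encoder vector per source position
    mode:           "embedding", "encoder" or "average" (illustrative labels)
    """
    if mode == "embedding":
        return [list(e) for e in embeddings]
    if mode == "encoder":
        return [list(h) for h in encoder_output]
    if mode == "average":
        # Element-wise mean keeps a strong link to the word identity
        # while still carrying sentence context.
        return [
            [(e + h) / 2.0 for e, h in zip(emb, enc)]
            for emb, enc in zip(embeddings, encoder_output)
        ]
    raise ValueError(f"unknown mode: {mode}")


averaged = encoder_representation([[1.0, 3.0]], [[3.0, 5.0]], "average")
```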
We train the alignment layer by fine-tuning a fully trained translation model, keeping the parameters of the underlying Transformer network fixed. For the alignment layer we use multi-head attention with a single head throughout this paper. We verified that this leads to slightly better results than two attention heads, while four heads performed considerably worse.
Attention Optimization based on Target Word
Using the attention activations of the forward pass means that the word alignment for a given target word does not depend on the identity of that word. However, it can be argued that this word is the most relevant information needed to select the correct alignment. When performing the task of word alignment given both source and target sentence we already know the target word. If we produce the alignment as a by-product of translation inference, it is likely that the prediction of the alignment layer and the actual target sentence are different.
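We hypothesize that we can improve the alignment if we can find attention activations that lead to a correct prediction of the target sentence. Given the attention activations and the linearly transformed values, we can rephrase the equations of Section 4 so that the context vector is the attention-weighted sum of the transformed values and the output distribution is computed from it as before.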
Therefore, the probability distribution only depends on the linearly transformed values, which we extract from the forward pass, and the attention activations. We can optimize the attention activations while evaluating the attention optimization sub-network of Figure 1 with the transformed values as its only input. We treat the attention activations as a weight matrix for the remainder of this section and will refer to them as the attention weights.
The entry of the probability vector at the index of the target word denotes the probability of that word. Therefore, we can formulate our objective as maximizing the probability of the target word with respect to the attention weights.
We optimize the attention weights for each word of the target sequence while keeping all other parameters of the alignment layer fixed. By applying gradient descent, we iteratively update the attention weights towards the goal of maximizing the probability of the correct target word. This can be done in parallel for each word of the target sentence using the cross entropy loss.
During optimization we relax the constraint that the attention weights form a valid probability distribution, i.e. sum up to 1. While we experimented with using the softmax function during optimization, we found that only applying the rectified linear unit (ReLU, max(0, x)) to guarantee non-negative activations before passing them on is easier to optimize.
This optimization procedure is related to feature visualization applied in image recognition (Erhan et al., 2009; Olah et al., 2017), which maximizes the output of a neuron with respect to the input image.
The question remains how we initialize the attention weights. A straightforward option is to initialize them randomly. While it is probably useful to initialize them with valid probability vectors, we used a uniform distribution between 0.0 and 1.0 as our first option.
However, intuitively it seems more reasonable to start with attention weights that correspond to a good alignment and that might be closer to a good local minimum. It is possible to convert alignments between subwords directly to attention weights and therefore to use the hypothesis of an arbitrary alignment model as a starting point (this can be done by setting weights that represent alignments between a source and a target word to 1.0 and all other weights to 0.0). However, we restrict our experiments to improving the attention weights of the forward pass. Therefore, we run a forward pass of the complete Transformer network, extract the attention weights of the alignment layer and start the optimization process with these weights.
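Converting a hard alignment into initial attention weights is a one-line mapping, sketched here for completeness. The function name and the (source, target) ordering of the index pairs are assumptions for this illustration; the paper itself only uses the forward-pass initialization.

```python
def alignment_to_attention(links, num_target, num_source):
    """Build a target-by-source weight matrix from hard alignment links.

    links: iterable of (source_index, target_index) pairs over subword units
    """
    weights = [[0.0] * num_source for _ in range(num_target)]
    for src, tgt in links:
        weights[tgt][src] = 1.0  # aligned pairs get weight 1.0, all others 0.0
    return weights


init = alignment_to_attention({(0, 0), (2, 1)}, num_target=2, num_source=3)
```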
While tuning the parameters on the validation set, we found that applying three gradient descent steps with a learning rate of 1 leads to surprisingly good results, both in terms of predicting the correct word and the quality of the resulting alignments.
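The optimization of the attention weights for a single target word can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: it uses a single head, plain Python lists, a hand-derived gradient of the cross-entropy loss, and toy values for the transformed values (`values`) and the output projection (`out_proj`), all of which are hypothetical names. The three steps and the learning rate of 1 follow the settings reported above.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(attn, values, out_proj):
    dim = len(values[0])
    # ReLU keeps the weights non-negative; they need not sum to 1.
    context = [
        sum(max(a, 0.0) * v[d] for a, v in zip(attn, values)) for d in range(dim)
    ]
    logits = [sum(w * c for w, c in zip(row, context)) for row in out_proj]
    return softmax(logits)


def optimize_attention(attn, values, out_proj, target, steps=3, lr=1.0):
    attn = list(attn)  # do not modify the caller's weights
    dim = len(values[0])
    for _ in range(steps):
        probs = predict(attn, values, out_proj)
        # Gradient of the cross-entropy loss with respect to the logits.
        d_logits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Back through the output projection to the context vector.
        d_context = [
            sum(g * row[d] for g, row in zip(d_logits, out_proj)) for d in range(dim)
        ]
        # Back through the weighted sum and the ReLU to each attention weight.
        grad = [
            (1.0 if a > 0.0 else 0.0) * sum(g * x for g, x in zip(d_context, v))
            for a, v in zip(attn, values)
        ]
        attn = [a - lr * g for a, g in zip(attn, grad)]
    return attn


values = [[1.0, 0.0], [0.0, 1.0]]  # transformed values of two source positions
out_proj = [[2.0, 0.0], [0.0, 2.0]]  # two vocabulary entries
start = [0.6, 0.4]  # attention weights from the forward pass
tuned = optimize_attention(start, values, out_proj, target=1)
```

In this toy case the forward pass favors the first source position, while the known target word is better explained by the second; the updates move the attention accordingly. All other parameters stay fixed, and the same update can be applied independently to every target position.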
Experimental Setup
Our goal for the evaluation is to compare statistical alignment methods, namely FastAlign and Giza++, with the neural approach introduced in this paper. We attempt a fair comparison by using the same training data for all methods and standardizing the preprocessing. We use publicly available training and test data for the following language pairs: German-English, Romanian-English and French-English. All approaches are evaluated for both forward and reverse direction as well as by combining these with the grow-diagonal heuristic (Koehn et al., 2005), i.e. without using the finalize step. All hyper-parameters are tuned on the German-English task. We open-source our preprocessing pipeline and the baseline experiments using FastAlign and MGIZA++ (https://github.com/lilt/alignment-scripts).
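For readers unfamiliar with the symmetrization step, the following sketch shows one common formulation of the grow-diagonal heuristic without the final step. It is not taken from the paper or from any specific toolkit: it starts from the intersection of the two directional alignments and repeatedly adds neighboring links from the union as long as one of the two words involved is still unaligned. Both inputs are assumed to be sets of (source, target) index pairs in the same orientation, and the iteration order may differ from existing implementations.

```python
def grow_diag(forward, backward):
    """Symmetrize two directional alignments (sets of (source, target) pairs)."""
    union = forward | backward
    alignment = set(forward & backward)
    # Horizontal, vertical and diagonal neighbors of an alignment point.
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for src, tgt in sorted(alignment):  # iterate over a snapshot
            for d_src, d_tgt in neighbors:
                cand = (src + d_src, tgt + d_tgt)
                if cand in alignment or cand not in union:
                    continue
                src_aligned = any(a[0] == cand[0] for a in alignment)
                tgt_aligned = any(a[1] == cand[1] for a in alignment)
                # Only grow towards words that are not yet covered.
                if not src_aligned or not tgt_aligned:
                    alignment.add(cand)
                    added = True
    return alignment


forward = {(0, 0), (1, 1), (2, 1)}
backward = {(0, 0), (1, 1), (1, 2)}
merged = grow_diag(forward, backward)
```

Here the two links that appear in only one direction are adjacent to an intersection point and each covers a previously unaligned word, so both are added.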
The training data we use has between 0.4 and 1.9 million parallel sentences. This makes it possible to train both statistical and neural methods in a reasonable amount of time. For German-English, we use the Europarl v8 corpus. For Romanian-English and English-French we follow Mihalcea and Pedersen (2003). As the only exception we additionally use Europarl data for Romanian-English to increase the training data from 49k to 0.4M parallel sentences.
We preprocess the training data with the tokenizer from the Moses toolkit (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) and consistently lowercase all training data. For all neural approaches we apply byte pair encoding (Sennrich et al., 2016) with 10k merges; see Table 2 for an example. For FastAlign and Giza++ we concatenate the training and test data. Table 3 summarizes the training data we use.
Test Data
As test sets we use hand-aligned parallel sentences. For some of these data sets annotators were instructed to align target words that do not have a corresponding source word to a special null token. We always use the version without null tokens, i.e. a target word may not be aligned at all. The data sets are publicly available for German-English (https://www-i6.informatik.rwth-aachen.de/goldAlignment/) and for both Romanian-English and English-French (http://web.eecs.umich.edu/~mihalcea/wpt/index.html). Additionally, annotators made a distinction between probable and sure alignments. We evaluate the outputs of various systems based on the alignment error rate (AER) introduced by Och and Ney (2000).
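As a reminder of how sure and probable links enter the metric, AER can be computed as 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the sure links and P the probable links including all sure ones. The sketch below implements this definition; the function name and the toy link sets are illustrative.

```python
def alignment_error_rate(hypothesis, sure, probable):
    """AER for sets of (source, target) links; lower is better."""
    possible = sure | probable  # sure links always count as probable
    if not hypothesis and not sure:
        raise ValueError("AER is undefined when both sets are empty")
    matched_sure = len(hypothesis & sure)
    matched_possible = len(hypothesis & possible)
    return 1.0 - (matched_sure + matched_possible) / (len(hypothesis) + len(sure))


sure = {(0, 0), (1, 1)}
probable = {(2, 1)}
hypothesis = {(0, 0), (2, 1), (2, 2)}
aer = alignment_error_rate(hypothesis, sure, probable)
```

In this toy example one of two sure links is found and two of three predicted links are at least probable, giving 1 - (1 + 2) / (3 + 2) = 0.4. Predicting a probable link is never penalized, while missing one costs nothing, which matters when a reference contains many probable links.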
Results
We tune the hyperparameters of the approach presented in this paper on the German-English data set, which we choose as our development data (see Table 4). The naïve approach of averaging all the attention activations of the Transformer network yields sub-optimal results. For the unidirectional models it produces alignment error rates well above 50%; combining the two directions leads to 50.9%. The additional alignment layer always improves results. While using the word embeddings (31.4%) and the encoder output (28.6%) as keys and values yields good improvements, using a combination of both works best (27.1%). We speculate that the vectors representing the source tokens should both contain context and have a strong relation to the original word embedding. This procedure leads to results roughly as good as FastAlign (27.0%).
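The naïve baseline can be pictured with the following sketch, which averages the encoder-attention matrices of all layers and heads. How the averaged activations are turned into links is not spelled out in this section; the sketch assumes that each target token is aligned to the source position with the highest averaged activation. The function names and toy matrices are illustrative.

```python
def average_attention(attention_maps):
    """Average a list of target-by-source attention matrices element-wise."""
    count = len(attention_maps)
    rows = len(attention_maps[0])
    cols = len(attention_maps[0][0])
    return [
        [sum(m[t][s] for m in attention_maps) / count for s in range(cols)]
        for t in range(rows)
    ]


def argmax_alignment(attention):
    """Assumed extraction rule: link each target token to its best source token."""
    return {
        (max(range(len(row)), key=row.__getitem__), tgt)
        for tgt, row in enumerate(attention)
    }


maps = [
    [[0.8, 0.2], [0.4, 0.6]],  # one matrix per layer and head
    [[0.6, 0.4], [0.2, 0.8]],
]
averaged_map = average_attention(maps)
links = argmax_alignment(averaged_map)
```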
Optimizing the attention matrix from a random initialization to predict the reference translation with the SGD settings described in Section 5.1 proves unsuccessful in terms of alignment quality. However, using the attention activations of the forward pass noticeably improves the quality of the symmetrized alignments, yielding AER results similar to Giza++.
Qualitative Analysis
We will analyse the resulting attention activations and alignments based on the example presented in Table 2. This parallel sentence is the first sentence of our development set and highlights the following challenges: In both source (“wir”) and target (“we”) a word appears twice. Some words are not very common (“cherry-pick”) and get split into multiple subword units. The translation is non-literal, as “Rosinen herauspicken” (“pick raisins”) is translated as “cherry-pick”.
We plot multiple average attention activations for this example in Figure 2 with the visualization tool provided by Rikters et al. (2017). The Transformer mainly focuses its attention on the punctuation mark at the end of the sentence (we never attend to the end-of-sentence symbol of the source sequence, because we mask the attention to it consistently during training and inference). In contrast, the alignment layer attends to more meaningful source words.
Figure 3 shows alignments before and after applying SGD. When starting with a random initialization, no meaningful alignments can be generated. Note that different random initializations do not converge to similar alignments. However, when initializing with the attention activations of the forward pass, the resulting alignments improve in most of the cases.
English-French and Romanian-English
We now test our approach on the English-French and Romanian-English test sets of Mihalcea and Pedersen (2003). Similar to the German-English experiments, AER is consistently improved by adding the alignment layer and with optimization of the attention activations.
Interestingly, the neural approaches in this paper seem to profit more from symmetrizing both directions compared to the statistical approaches. The neural alignment models always use the full source context, but not the full target context, i.e. when generating an alignment we do not look at future target words. We speculate that this might be a contributing factor to the strong reduction in error rates by combining two unidirectional models.
We argue that the superior results of Giza++ on the English-French test set are mainly due to the large portion of probable alignment links (13,400 out of 17,438). This is favourable for Giza++, as it predicts the smallest number of alignments (20,200), while Add+SGD predicts considerably more (26,430). In contrast to the English-French test set, the Romanian-English set does not contain any probable alignments in its reference.
Conclusion
This paper addresses the problem of extracting meaningful word alignments from the self-attentive Transformer neural machine translation model. We extend the network with an alignment layer that contains no skip connections around the encoder-attention sub-layer and thus is encouraged to learn to attend to source words that correspond to the current target word. We further introduce a novel inference procedure to query the model with a given target word. By symmetrizing alignments extracted from models for both translation directions we achieve an alignment quality that is comparable to IBM Model 4 as implemented in Giza++ on two of the three tasks. Different from previous work our model is trained in an unsupervised fashion and does not require injecting external knowledge from the IBM models into the training pipeline.