Learning to Remember Translation History with a Continuous Cache

Zhaopeng Tu, Yang Liu, Shuming Shi, Tong Zhang

Introduction

Neural machine translation (NMT) has advanced the state of the art in recent years [Kalchbrenner et al. (2014), Cho et al. (2014), Sutskever et al. (2014), Bahdanau et al. (2015)]. However, existing models generally treat documents as lists of independent sentence pairs and ignore cross-sentence information, which leads to translation inconsistency and to ambiguity that cannot be resolved from a single source sentence.

There have been a few recent attempts to model cross-sentence context for NMT: ?) use a hierarchical RNN to summarize the previous $K$ source sentences, while ?) use an additional encoder and attention model to dynamically select part of the previous source sentence. While these approaches have proven their ability to represent cross-sentence context, they generate the context from discrete lexicons, and would therefore propagate errors from generated translations. Accordingly, they only take into account source sentences and fail to make use of target-side information. (Footnote 1: ?) indicate that “considering target-side history inversely harms translation performance, since it suffers from serious error propagation problems.”) Another potential limitation is that they are computationally expensive, which limits the scale of cross-sentence context.

In this work, we propose a very light-weight alternative that can both cover large-scale cross-sentence context and exploit bilingual translation history. Our work is inspired by recent successes of memory-augmented neural networks on multiple NLP tasks [Weston et al. (2015), Sukhbaatar et al. (2015), Miller et al. (2016), Gu et al. (2017)], especially the efficient cache-like memory networks for language modeling [Grave et al. (2017), Daniluk et al. (2017)]. Specifically, the proposed approach augments NMT models with a continuous cache (Cache), which stores recent hidden representations as history context. By minimizing the computation burden of the cache-like memory, we are able to use larger memory and scale to longer translation history. Since we leverage internal representations instead of output words, our approach is more robust to the error propagation problem, and thus can incorporate useful target-side context.

Experimental results show that the proposed approach significantly and consistently improves translation performance over a strong NMT baseline on multiple domains with different topics and styles. We found the introduced cache is able to remember translation patterns at different levels of matching and granularity, ranging from exactly matched lexical patterns to fuzzily matched patterns, from word-level patterns to phrase-level patterns.

Neural Machine Translation

Suppose that ${\bf x}=x_{1},\dots,x_{j},\dots,x_{J}$ represents a source sentence and ${\bf y}=y_{1},\dots,y_{t},\dots,y_{T}$ a target sentence. NMT directly models the probability of translation from the source sentence to the target sentence word by word:

$\displaystyle P({\bf y}|{\bf x})=\prod_{t=1}^{T}P(y_{t}|y_{<t},{\bf x})$ (1)

As shown in Figure 2(a), the probability of generating the $t$-th word $y_{t}$ is computed by

$\displaystyle P(y_{t}|y_{<t},{\bf x})=g(y_{t-1},{\bf c}_{t},{\bf s}_{t})$ (2)

where $g(\cdot)$ first linearly transforms its input and then applies a softmax function, $y_{t-1}$ is the previously generated word, ${\bf s}_{t}$ is the $t$-th decoding hidden state, and ${\bf c}_{t}$ is the $t$-th source representation. The decoder state ${\bf s}_{t}$ is computed as follows:

$\displaystyle {\bf s}_{t}=f(y_{t-1},{\bf s}_{t-1},{\bf c}_{t})$ (3)

where $f(\cdot)$ is an activation function, which is implemented as a GRU [Cho et al. (2014)] in this work. ${\bf c}_{t}$ is a dynamic vector that selectively summarizes certain parts of the source sentence at each decoding step:

$\displaystyle {\bf c}_{t}=\sum_{j=1}^{J}\alpha_{t,j}{\bf h}_{j}$ (4)

where $\alpha_{t,j}$ is the alignment probability calculated by an attention model [Bahdanau et al. (2015), Luong et al. (2015a)], and ${\bf h}_{j}$ is the encoder hidden state of the $j$-th source word $x_{j}$.

Since the continuous representation of a symbol (e.g., ${\bf h}_{j}$ and ${\bf s}_{t}$) encodes multiple meanings of a word, NMT models need to spend a substantial amount of their capacity on disambiguating source and target words based on the context defined by a source sentence [Choi et al. (2016)]. Consistency is another critical issue in document-level translation, where a repeated term should keep the same translation throughout the whole document [Xiao et al. (2011)]. Nevertheless, current NMT models still process a document by translating each sentence in isolation, and thus suffer from inconsistency and from ambiguity that cannot be resolved from a single source sentence, as shown in Table 1. These problems can be alleviated by the proposed approach via modeling translation history, as described below.

Approach

The proposed approach augments neural machine translation models with a cache-like memory, which has proven useful for capturing longer history for the language modeling task [Grave et al. (2017), Daniluk et al. (2017)]. The cache-like memory is essentially a key-value memory [Miller et al. (2016)], which is an array of slots in the form of (key, value) pairs. The matching stage is based on the key records while the reading stage uses the value records. From here on, we use cache to denote the cache-like memory.

Since modern NMT models generate translation in a word-by-word manner, translation information is generally stored at word level, including source-side context that embeds content being translated and target-side context that corresponds to the generated word. With the goal of remembering translation history in mind, the key should be designed with features to help match it to the source-side context, while the value should be designed with features to help match it to the target-side context. To this end, we define the cache slots as pairs of vectors $\{({\bf c}_{1},{\bf s}_{1}),\dots,({\bf c}_{i},{\bf s}_{i}),\dots,({\bf c}_{I},{\bf s}_{I})\}$ where ${\bf c}_{i}$ and ${\bf s}_{i}$ are the attention context vector and its corresponding decoder state at time step $i$ from the previous translations. The two types of representation vectors correspond well to the source- and target-side contexts [Tu et al. (2017a)].

Figure 2(b) illustrates the model architecture. At each decoding step $t$ , the current attention context ${\bf c}_{t}$ serves as a query, which is used to match and read from the cache looking for relevant information to generate the target word. The retrieved vector ${\bf m}_{t}$ , which embeds target-side contexts of generating similar words in the translation history, is combined with the current decoder state ${\bf s}_{t}$ to subsequently produce the target word $y_{t}$ (Section 3.2). When the full translation is generated, the decoding contexts are stored in the cache as a history for future translations (Section 3.3).

3.2 Reading from Cache

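Cache reading involves the following three steps: key matching, value reading, and representation combining.

Key Matching

The goal of key matching is to retrieve similar records in the cache. To this end, we exploit the attention context representations ${\bf c}_{t}$ to define a probability distribution over the records in the cache. Using context representations as keys in the cache, the cache lookup operator can be implemented with simple dot products between the stored representations and the current one:

$\displaystyle P_{m}({\bf c}_{i}|{\bf c}_{t})=\frac{\exp({\bf c}_{t}^{\top}{\bf c}_{i})}{\sum_{i^{\prime}=1}^{I}\exp({\bf c}_{t}^{\top}{\bf c}_{i^{\prime}})}$ (5)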

where ${\bf c}_{t}$ is the attention context representation at the current step $t$, ${\bf c}_{i}$ is the stored representation at the $i$-th slot of the cache, and $I$ is the number of slots in the cache. In contrast to existing memory-augmented neural networks, the proposed cache avoids the need to learn the memory matching parameters, such as those related to parametric attention models [Sukhbaatar et al. (2015), Daniluk et al. (2017)], transformations between the query and keys [Miller et al. (2016), Gu et al. (2017)], or human-defined scalars to control the flatness of the distribution [Grave et al. (2017)]. (Footnote 2: We tried these matching implementations in our preliminary experiments, but found no improvements for this task.)
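The parameter-free matching described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `match_probabilities`, the toy vectors, and the max-subtraction trick for numerical stability are not from the paper.

```python
import math

def match_probabilities(query, keys):
    """Return P_m(c_i | c_t): a softmax over the dot products c_t . c_i."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtracting the maximum score leaves the softmax unchanged
    # but avoids overflow in exp() (an implementation detail assumed here).
    top = max(scores)
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Toy cache with I = 3 stored attention contexts of dimension 2 (made-up values).
cached_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_t = [2.0, 0.0]  # current attention context, used as the query
p_m = match_probabilities(c_t, cached_keys)
```

In this toy case the first and third slots have the same dot product with the query, so they receive equal probability, and both outrank the second slot. No learned parameters are involved in the lookup.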

Value Reading

The values of the cache are read by taking a sum over the stored values ${\bf s}_{i}$, weighted by the matching probabilities from the keys, and the retrieved vector ${\bf m}_{t}$ is returned:

$\displaystyle {\bf m}_{t}=\sum_{i=1}^{I}P_{m}({\bf c}_{i}|{\bf c}_{t})\,{\bf s}_{i}$

From the view of memory-augmented neural networks, the matching probability $P_{m}({\bf c}_{i}\ |{\bf c}_{t})$ can be interpreted as the probability to retrieve similar target-side information ${\bf m}_{t}$ from the cache given the source-side context ${\bf c}_{t}$ , where the desired answer is the contexts related to similar target words generated in past translations.
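As a small illustration of value reading, the sketch below computes the retrieved vector as a probability-weighted sum of cached decoder states. The function name `read_values` and the toy numbers are assumptions made for this example.

```python
def read_values(match_probs, values):
    """Return m_t: the sum of cached values s_i weighted by P_m(c_i | c_t)."""
    dim = len(values[0])
    m_t = [0.0] * dim
    for prob, s_i in zip(match_probs, values):
        for k in range(dim):
            m_t[k] += prob * s_i[k]
    return m_t

# Made-up matching probabilities and cached decoder states (I = 3, d = 2).
p_m = [0.5, 0.25, 0.25]
cached_values = [[2.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
m_t = read_values(p_m, cached_values)
```

Because the weights form a probability distribution, the retrieved vector is a convex combination of the stored decoder states and stays in the same space as the current decoder state.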

Representation Combining

The final decoder state that is used to generate the next-word distribution is computed from a linear combination of the original decoder state ${\bf s}_{t}$ and the output vector ${\bf m}_{t}$ retrieved from the cache (Footnote 3: We tried the strategy of “Gating Auxiliary Context” used in [Wang et al. (2017a)] in our preliminary experiments, and found similar performance):

$\displaystyle \tilde{\bf s}_{t}=(\mathbf{1}-{\bm{\lambda}}_{t})\otimes{\bf s}_{t}+{\bm{\lambda}}_{t}\otimes{\bf m}_{t}$ (6)

$\displaystyle P(y_{t}|y_{<t},{\bf x})=g(y_{t-1},{\bf c}_{t},\tilde{\bf s}_{t})$ (7)

where $\otimes$ is an element-wise multiplication, and ${\bm{\lambda}}_{t}\in\mathbb{R}^{d}$ is a dynamic weight vector calculated at each decoding step. This strategy is inspired by the concept of the update gate in GRU [Cho et al. (2014)], which takes a linear sum between the previous hidden state and the candidate new hidden state. The starting point for this strategy is an observation: generating target words at different steps requires different amounts of translation history. For example, the translation history representation is more useful if a similar slot is retrieved from the cache, and less useful otherwise. To this end, we calculate the dynamic weight vector by

$\displaystyle {\bm{\lambda}}_{t}=\sigma({\bf U}{\bf s}_{t}+{\bf V}{\bf c}_{t}+{\bf W}{\bf m}_{t})$ (8)

Here $\sigma(\cdot)$ is a logistic sigmoid function, and $\{{\bf U}\in\mathbb{R}^{d\times d},{\bf V}\in\mathbb{R}^{d\times l},{\bf W}\in\mathbb{R}^{d\times d}\}$ are the newly introduced parameter matrices, with $d$ and $l$ being the number of units of the decoder state and the attention context vector, respectively. Note that ${\bm{\lambda}}_{t}$ has the same dimensionality as ${\bf s}_{t}$ and ${\bf m}_{t}$, and thus each element in the two vectors has a distinct interpolation weight. In this way, we offer more precise control over how the representations are combined, since different elements retain different information.
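The gated combination of Equations 6 and 8 can be sketched as follows. This is a minimal plain-Python illustration, not the authors' implementation: the helper names (`sigmoid`, `matvec`, `combine_with_cache`) and all numeric values are assumptions, and the matrices are written as nested lists.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def combine_with_cache(s_t, c_t, m_t, U, V, W):
    """Return (s_tilde, lam) following Equations 6 and 8.

    U is d x d, V is d x l, W is d x d, where d = len(s_t) = len(m_t)
    and l = len(c_t).
    """
    pre = [u + v + w for u, v, w in
           zip(matvec(U, s_t), matvec(V, c_t), matvec(W, m_t))]
    lam = [sigmoid(a) for a in pre]  # one interpolation weight per element
    s_tilde = [(1.0 - l) * s + l * m for l, s, m in zip(lam, s_t, m_t)]
    return s_tilde, lam

# Toy example with d = 2 and l = 3; all-zero matrices give lambda = 0.5,
# so the result is the plain average of s_t and m_t.
zeros_dd = [[0.0, 0.0], [0.0, 0.0]]
zeros_dl = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
s_tilde, lam = combine_with_cache(
    [1.0, 3.0], [0.5, 0.5, 0.5], [3.0, 1.0], zeros_dd, zeros_dl, zeros_dd
)
```

With trained matrices, elements of the gate close to 1 let the cache output dominate the corresponding element of the final state, while elements close to 0 fall back to the original decoder state.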

The addition of the continuous cache to an NMT model inherits the advantages of cache-like memories: the probability distribution over generated words is updated online depending on the translation history, and consistent translations can be generated when they have been seen in the history. The neural cache also inherits the ability of the decoder hidden states to model longer-term cross-sentence contexts than intra-sentence context, and thus allows for a finer modeling of the document-level context.

3.3 Writing to Cache

The cache component is an external key-value memory structure which stores $I$ elements of recent histories, where the key at position $i\in[1,I]$ is ${\bf k}_{i}$ and its value is ${\bf v}_{i}$. For each key-value pair, we also store the corresponding target word $y_{t}$ as an indicator for the following updating operator. (Footnote 4: In the writing phase, the cache component works like a standard cache, in which the target word $y_{t}$ serves as the “key” to address the “value” (${\bf k}_{t}$, ${\bf v}_{t}$) for updating the cache.)

In this work, we focus on learning to remember and exploit cross-sentence translation history. Accordingly, different from [Grave et al. (2017), Kawakami et al. (2017)], where the cache is updated after each generation of a target word, we write to the cache after a translation sentence is fully generated. Given a generated translation sentence ${\bf y}=\{y_{1},\dots,y_{t},\dots,y_{T}\}$, its corresponding attention vector sequence is $\{{\bf c}_{1},\dots,{\bf c}_{t},\dots,{\bf c}_{T}\}$ and the decoder state sequence is $\{{\bf s}_{1},\dots,{\bf s}_{t},\dots,{\bf s}_{T}\}$. Each triple $\langle{\bf c}_{t},{\bf s}_{t},y_{t}\rangle$ is written to the cache as follows:

If $y_{t}$ does not exist in the cache, an empty slot is chosen or the least recently used slot is overwritten, where the key slot is ${\bf c}_{t}$, the value slot is ${\bf s}_{t}$ and the indicator is $y_{t}$.

If $y_{t}$ already exists in the cache at some position $i$ , the key and value are updated: ${\bf k}_{i}=({\bf k}_{i}+{\bf c}_{t})/2$ and ${\bf v}_{i}=({\bf v}_{i}+{\bf s}_{t})/2$ .

From the perspective of a “general cache policy”, this update can be regarded as a sort of exponential decay, since at each update the previous keys and values are halved. From the perspective of the continuous cache, on the other hand, the intuition is to model temporal order for the same word: more recent histories play more important roles.
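The two writing cases and the resulting decay can be sketched with a small class. This is an illustrative sketch under assumptions: the class name `TranslationCache` is hypothetical, and "least recently used" is interpreted here as "least recently written", which the text does not spell out.

```python
from collections import OrderedDict

class TranslationCache:
    """Toy cache keyed by target word; slots are kept oldest-written first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # y_t -> (key vector k_i, value vector v_i)

    def write(self, c_t, s_t, y_t):
        if y_t in self.slots:
            # Word already cached: average old and new representations,
            # which halves the contribution of everything stored before.
            k_i, v_i = self.slots[y_t]
            k_i = [(old + new) / 2.0 for old, new in zip(k_i, c_t)]
            v_i = [(old + new) / 2.0 for old, new in zip(v_i, s_t)]
            self.slots[y_t] = (k_i, v_i)
            self.slots.move_to_end(y_t)  # mark as most recently written
        else:
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # evict the oldest slot
            self.slots[y_t] = (list(c_t), list(s_t))

# Writing the same word three times: the first key's weight decays 1 -> 1/2 -> 1/4.
cache = TranslationCache(capacity=25)
cache.write([4.0], [8.0], "felt")
cache.write([0.0], [0.0], "felt")
cache.write([0.0], [0.0], "felt")
key, value = cache.slots["felt"]
```

After the three writes the stored key is a quarter of the first attention vector, which makes the exponential decay concrete: a representation written $n$ updates ago carries a weight of $2^{-n}$.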

One may worry that the key ${\bf k}_{i}$ and the attention vector ${\bf c}_{t}$ could be fully unrelated, since they “align” the same word $y_{t}$ to the source words of different source sentences. We believe that such a case would rarely happen. When ${\bf c}_{t}$ is aligned to a target word $y_{t}$ (we assume that the alignments are always correct; alignment errors are beyond the focus of this work), we expect that a certain portion of ${\bf c}_{t}$ and the embedding of $y_{t}$ are semantically equivalent (that is how the information of the source side is transformed to the target side). Therefore, there should always be a certain relation among attention vectors that are aligned to the same target word. Averaging the attention vectors from different source sentences is expected to highlight the shared portion (i.e., the part that corresponds to $y_{t}$) and dilute the unshared parts (i.e., the parts that correspond to the contexts of different source sentences).

3.4 Training and Inference

Training

Two-pass strategies have proven useful to ease training difficulty when the model is relatively complicated [Shen et al. (2016), Wang et al. (2017b), Tu et al. (2017b)]. Inspired by this, we add the cache to a pre-trained NMT model and fine-tune only the new parameters related to the cache.

First, we pre-train a standard NMT model which is able to generate reasonable representations (i.e., ${\bf c}_{t}$ and ${\bf s}_{t}$) to interact with the cache. Formally, the parameters ${\bm{\theta}}$ of the standard NMT model are trained to maximize the likelihood of a set of training examples $\{\left[{\bf x}^{n},{\bf y}^{n}\right]\}_{n=1}^{N}$:

$\displaystyle \hat{\bm{\theta}}=\operatorname*{arg\,max}_{\bm{\theta}}\sum_{n=1}^{N}\log P({\bf y}^{n}|{\bf x}^{n};{\bm{\theta}})$ (9)

where the probabilities of generating target words are computed by Equation 2.

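Second, we fix the trained parameters $\hat{\bm{\theta}}$ and only fine-tune the new parameters $\bm{\gamma}=\{{\bf U},{\bf V},{\bf W}\}$ related to the cache (i.e., Equation 8):

$\displaystyle \hat{\bm{\gamma}}=\operatorname*{arg\,max}_{\bm{\gamma}}\sum_{n=1}^{N}\log P({\bf y}^{n}|{\bf x}^{n};\hat{\bm{\theta}},{\bm{\gamma}})$ (10)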

where the probabilities of generating target words are computed by Equation 7, and $\hat{\bm{\theta}}$ denotes the parameters trained via Equation 9. During training, the representations ${\bf c}_{t}$ and ${\bf s}_{t}$ remain the same for a given sentence pair because the NMT parameters are fixed, so the cache can be explicitly trained to learn when to exploit translation history to maximize the overall translation performance.

Inference

Once a model is trained, we use beam search to find a translation that approximately maximizes the likelihood, as in standard NMT models. After the beam search procedure is finished, we write to the cache the representations that correspond to the $1$-best output. The reason why we do not use $k$-best outputs or all hypotheses in the beam search is two-fold: (1) we want to improve the translation consistency for the final outputs; and (2) continuous representations suffer less from the data sparsity problem, which is the scenario in which $k$-best outputs generally work better. Our preliminary experiments validate our assumption: $k$-best outputs or hypotheses do not show improvement over their 1-best counterpart.
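The inference procedure can be summarized by the loop below. It is a schematic sketch only: `beam_search` and `write_to_cache` are hypothetical callables supplied by the caller, and the hypothesis format (a dict with `score`, `words`, `contexts` and `states`) is an assumption made for illustration. In a real system the beam search itself would also read from the cache at each decoding step.

```python
def translate_document(sentences, beam_search, write_to_cache):
    """Translate sentences in order, writing only the 1-best output to the cache."""
    outputs = []
    for source in sentences:
        # Assumed interface: a list of hypotheses, each a dict with
        # "score", "words", "contexts" (c_t) and "states" (s_t).
        hypotheses = beam_search(source)
        best = max(hypotheses, key=lambda hyp: hyp["score"])
        # Write only after the sentence is fully generated, never mid-sentence.
        for c_t, s_t, y_t in zip(best["contexts"], best["states"], best["words"]):
            write_to_cache(c_t, s_t, y_t)
        outputs.append(best["words"])
    return outputs

# Toy stand-ins so the sketch runs end to end.
def toy_beam_search(source):
    return [
        {"score": -1.0, "words": ["felt"], "contexts": [[0.1]], "states": [[0.2]]},
        {"score": -2.0, "words": ["feel"], "contexts": [[0.3]], "states": [[0.4]]},
    ]

written = []
result = translate_document(
    ["a source sentence"],
    toy_beam_search,
    lambda c_t, s_t, y_t: written.append(y_t),
)
```

Only the representations of the highest-scoring hypothesis reach the cache, so later sentences are conditioned on what was actually output rather than on discarded alternatives.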

Experiment

We carried out Chinese-English translation experiments on multiple domains, each of which differs from others in topic, genre, style, level of formality, etc.

News: The News domain is extracted from LDC corpora (Footnote 5: LDC2002E18, LDC2003E07, LDC2003E14, LDC2004T07, LDC2004T08 and LDC2005T06). Most sentences in these corpora come from formal articles with syntactic structures such as complicated conjoined phrases, which make textual translation very difficult. We choose the NIST 2002 (MT02) dataset as the tuning set, and the NIST 2003-2008 (MT03-08) datasets as test sets.

Subtitle: The subtitles are extracted from TV episodes, which are usually simple and short [Wang et al. (2018)]. Most of the translations of subtitles do not preserve the syntactic structures of their original sentences at all. We randomly select two episodes as the tuning set, and two other episodes as the test set. (Footnote 6: The corpora are available at https://github.com/longyuewangdcu/tvsub.)

TED: The corpora are from the MT track on TED Talks of IWSLT2015 [Cettolo et al. (2012)] (Footnote 7: https://wit3.fbk.eu/mt.php?release=2015-01). ?) point out that NMT systems have a steeper learning curve with respect to the amount of training data, resulting in worse quality in low-resource settings. The TED talks are difficult to translate because they cover a variety of topics while the training data is small in scale. We choose the “dev2010” dataset as the tuning set, and the combination of “tst2010-2013” datasets as the test set.

The statistics of the corpora are listed in Table 1. As seen, the average lengths of the source sentences in the News, Subtitle, and TED domains are 22.3, 5.6, and 19.5 words, respectively. We use the case-insensitive 4-gram NIST BLEU score [Papineni et al. (2002)] as the evaluation metric, and the sign-test [Collins et al. (2005)] for statistical significance testing.

Models

The baseline is a re-implemented attention-based NMT system RNNSearch, which incorporates dropout [Hinton et al. (2012)] on the output layer and improves the attention model by feeding the last generated word. For training RNNSearch, we limited the source and target vocabularies to the most frequent 30K words in Chinese and English, and employed an unknown-word replacement post-processing technique [Jean et al. (2015), Luong et al. (2015b)]. We trained each model on the sentences of length up to 80 words in the training data. We shuffled mini-batches as we proceeded, and the mini-batch size was 80. The word embedding dimension was 620 and the hidden layer dimension was 1000. We trained for 15 epochs using Adadelta [Zeiler (2012)], and selected the model that yielded the best performance on the validation set.

For our model, we used the same setting as RNNSearch if applicable. The parameters of our model that are related to the standard encoder and decoder were initialized by the baseline RNNSearch model and were fixed in the following step. We further trained the new parameters related to the cache for another 5 epochs. Again, the model that performs best on the tuning set was selected as the final model.

4.2 Effect of Cache Size

Inspired by the recent success of the continuous cache on language modeling [Grave et al. (2017)], we thought it likely that a large cache would benefit from the long-range context, and thus outperform a small one. This turned out to be false. Table 2 lists the translation performances of different cache sizes on the tuning set. As seen, small caches (e.g., size=25) generally achieve performance similar to that of larger caches (e.g., size=500). At the very start, we attributed this to the strength of the cache overwrite mechanism for slots that correspond to the same target word, which implicitly models long-range contexts by combining different context representations of the same target word in the translation history. However, as shown in Table 3, the overwrite mechanism contributes little to the good performance of the smaller cache.

There are several more possible reasons. First, a larger cache is able to remember a longer translation history (i.e., cache capacity) but makes it harder to match related records in the cache (i.e., matching accuracy). Second, the current caching mechanism fails to model long-range context well, which suggests better modeling of long-term dependency as future work. Finally, neighbouring sentences are more correlated than long-distance sentences, and thus modeling short-range context properly works well [Daniluk et al. (2017)]. In the following experiment, we try to validate the last hypothesis by visualizing which positions in the cache are attended most by the proposed model.

Following ?), we plot in Figure 3 the average matching probability the proposed model assigns to specific positions in the history. As seen, the proposed approach indeed pays more attention to the most recent history (e.g., the leftmost positions) in all domains. Specifically, the larger the cache, the more attention the model pays to the most recent history.

Notably, there are still considerable differences among different domains. For example, the proposed model attends over records further in the past more often in the Subtitle and TED domains than in the News domain. This may be because a talk in the TED test set contains many more words than an article in the News test set (1.9K vs. 0.6K words). Though a scene in the Subtitle test set contains the fewest words (i.e., 0.3K words), repetitive words and phrases are observed in neighbouring scenes of the same episode, which is generally related to a specific topic. Given that larger caches do not lead to any performance improvement, it is hard to judge whether long-range contexts are not modelled well or are simply less useful than short-range contexts. We leave the validation for future work.

For the following experiments, the cache size is set to 25 unless otherwise stated.

4.3 Main Results

Table 4 shows the translation performances on multiple domains with different textual styles. As seen, the proposed approach significantly outperforms the baseline system (i.e., Base) in all cases, demonstrating the effectiveness and universality of our model. We reimplemented the models in [Wang et al. (2017a)] and [Jean et al. (2017)] on top of the baseline system, which also exploit cross-sentence context in terms of source-side sentences. Both approaches achieve significant improvements in the News and TED domains, but achieve marginal or no improvement in the Subtitle domain. Compared with these two approaches, the proposed model consistently outperforms the baseline system in all domains, which confirms the robustness of our approach. We attribute the superior translation quality of our approach in the Subtitle domain to the exploitation of target-side information, since most of the translations of dialogues in this domain do not preserve the syntactic structure of their original sentences at all. They are completely paraphrased in the target language and seem very hard to improve with only source-side cross-sentence contexts.

Table 5 shows the model complexity. The cache model only introduces 4M additional parameters (i.e., related to Equation 8), which is small compared to both the number of parameters in the existing model (i.e., 84.2M) and the numbers newly introduced by ?) (i.e., 18.8M) and ?) (i.e., 20M). Our model is more efficient in training, which benefits from training cache-related parameters only. To minimize the waste of computation, the other models sort 20 mini-batches by their lengths before parameter updating [Bahdanau et al. (2015)], while our model cannot enjoy this benefit since it depends on the hidden states of preceding sentences. (Footnote 8: To make a fair comparison, which means our model is required to train all the parameters and the other models cannot use mini-batch sorting, the training speeds for the models listed in Table 5 are 728.8, 159.8, 572.4, and 627.3, respectively.) Concerning decoding with additional attention models, our approach does not slow down the decoding speed, while ?) decreases decoding speed by 8.1%. We attribute this to the efficient strategy for cache key matching, which requires no additional parameters.

4.4 Deep Fusion vs. Shallow Fusion

One might expect that storing the words themselves would be a better way to encourage lexical consistency, as done in [Grave et al. (2017)]. Following ?), we call this shallow fusion at the word level, in contrast to deep fusion at the representation level (i.e., our approach). We follow ?) to calculate the probability of generating $y_{t}$ in shallow fusion as

$\displaystyle P(y_{t})=(1-\lambda_{t})P_{vocab}(y_{t})+\lambda_{t}P_{cache}(y_{t})$

$\displaystyle P_{cache}(y_{t})=\mathbbm{1}_{\left\{y_{t}=y_{i}\right\}}P_{m}({\bf c}_{i}|{\bf c}_{t})$

in which $P_{vocab}(y_{t})$ is the probability of the NMT model (Equation 2) and $P_{m}({\bf c}_{i}|{\bf c}_{t})$ is the cache probability (Equation 5). We compute the interpolation weight $\lambda_{t}$ in the same way as Equation 8 except that $\lambda_{t}$ is a scalar instead of a vector.
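Shallow fusion can be illustrated with a small word-level interpolation. This is a sketch under assumptions: the function name `shallow_fusion`, the dictionary representation of distributions, and the toy numbers are hypothetical, and the interpolation weight is passed in as a fixed scalar rather than computed by a gate. Matching probabilities of slots that share an indicator word are summed, which reduces to a single term when each word occupies one slot.

```python
def shallow_fusion(p_vocab, match_probs, slot_words, lam):
    """Interpolate the NMT word distribution with a word-level cache distribution.

    p_vocab: dict mapping word -> P_vocab(word)
    match_probs: P_m(c_i | c_t) for each cache slot
    slot_words: indicator word y_i stored with each cache slot
    lam: scalar interpolation weight in [0, 1]
    """
    p_cache = {}
    for prob, word in zip(match_probs, slot_words):
        p_cache[word] = p_cache.get(word, 0.0) + prob
    words = set(p_vocab) | set(p_cache)
    return {
        w: (1.0 - lam) * p_vocab.get(w, 0.0) + lam * p_cache.get(w, 0.0)
        for w in words
    }

# Made-up distributions: the cache favours the past-tense form "felt".
p_vocab = {"feel": 0.6, "felt": 0.4}
fused = shallow_fusion(p_vocab, [0.75, 0.25], ["felt", "should"], lam=0.2)
```

Note that the cache can only move probability mass toward words that are literally stored in it ("felt" and "should" here), whereas deep fusion mixes continuous vectors before the softmax.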

Table 6 lists the results of comparing shallow fusion and deep fusion on the widely evaluated News domain [Tu et al. (2016), Li et al. (2017), Zhou et al. (2017), Wang et al. (2017c)]. As seen, deep fusion significantly outperforms its shallow counterpart, which is consistent with the results in [Gu et al. (2017)]. Different from ?), shallow fusion does not achieve improvement over the baseline system. One possible reason is that generated words overlap less with the words in the translation history than with the words in similar sentences retrieved from the training corpus. Accordingly, storing words in the cache encourages lexical consistency at the cost of introducing noise, while storing continuous vectors is able to alleviate this problem by doing fusion in a soft way. In addition, the continuous vectors can store useful information beyond a single word, which we will show later.

4.5 Translation Patterns Stored in the Cache

In this experiment, we present analysis to gain insight about what kinds of translation patterns are captured by the cache to potentially improve translation performance, as shown in Figure 4.

Consistency is a critical issue in document-level translation, where a repeated term should keep the same translation throughout the whole document [Xiao et al. (2011), Carpuat and Simard (2012)]. Among all consistency cases, we are interested in verb tense consistency. We found that our model works well on improving tense consistency. For example, the baseline model translated the word “觉得” into “feel” in present tense (Figure 4(a)), while from the translation history (Table 1) we can learn it should be translated into “felt” in past tense. The cache model can improve tense consistency by exploring document-level context. As shown in the left panel of Figure 4(b), the proposed model generates the correct word “felt” by attending to the desired slot in the cache. It should be emphasized that our approach is still likely to generate the correct word even without the cache slot “felt”, since the previously generated word “everyone” already attended to a slot “should”, which also contains information of past tense. The improvement of tense consistency may not lead to a significant increase of BLEU score, but is very important for user experience.

Fuzzily Matched Patterns

Besides exactly matched lexical patterns (e.g., the slot “felt”), we found that the cache also stores useful “fuzzy match” patterns, which can improve translation performance by acting as some kind of “indicator” context. Take the generation of “opportunity” in the left panel of Figure 4(b) as an example: although the attended slots “courses”, “training”, “pressure”, and “tasks” do not match “opportunity” at the lexical level, they are still helpful for generating the correct word “opportunity” when working together with the attended source vector centering at “机遇”.

Patterns Beyond Word Level

By visualizing the cache during the translation process, we found that the proposed cache is able to remember not only word-level translation patterns, but also phrase-level translation patterns, as shown in the right panel of Figure 4(b). The latter is especially encouraging to us, since phrases play an important role in machine translation while it is difficult to integrate them into current NMT models [Zhou et al. (2017), Wang et al. (2017c), Huang et al. (2017)]. We attribute this to the fact that decoder states, which serve as cache values, store phrasal information due to the strength of the decoder RNN in memorizing short-term history (e.g., the previous few words).

Related Work

Our research builds on previous work in the field of memory-augmented neural networks, exploitation of cross-sentence contexts and cache in NLP.

Neural Turing Machines [Graves et al. (2014)] and Memory Networks [Weston et al. (2015), Sukhbaatar et al. (2015)] are early models that augment neural networks with a possibly large external memory. Our work is based on Memory Networks, which have proven useful for question answering and document reading tasks [Weston et al. (2016), Hill et al. (2016)]. Specifically, we use the Key-Value Memory Network [Miller et al. (2016)], which is a simplified version of Memory Networks with better interpretability and has yielded encouraging results in document reading [Miller et al. (2016)], question answering [Pritzel et al. (2017)] and language modeling [Tran et al. (2016), Grave et al. (2017), Daniluk et al. (2017)]. We use the memory to store information specific to the translation history so that this information is available to influence future translations. Closely related to our approach, ?) use a continuous cache to improve language modeling by capturing longer history. We generalize from the original model and adapt it to machine translation: we use the cache to store bilingual information rather than monolingual information, and remove the hand-tuned parameters for cache matching.

In the context of neural machine translation, ?) use an external key-value memory to remember rare training events at test time, and ?) use a memory to store a set of sentence pairs retrieved from the training corpus given the source sentence. This is similar to our approach in exploiting more information than the current source sentence with a key-value memory. Unlike their approaches, ours aims to learn to remember translation history rather than to incorporate arbitrary meta-data, which results in different sources of the auxiliary information (e.g., previous translations vs. similar training examples). Accordingly, due to the different availability of target symbols in the two scenarios, different strategies of incorporating the retrieved values from the key-value memory are adopted: hidden state interpolation [Gulcehre et al. (2016)] performs better in our task while word probability interpolation [Gu et al. (2016)] works better in [Gu et al. (2017)].

Exploitation of Cross-Sentence Context

Cross-sentence context, which is generally encoded into a continuous space using a neural network, has a noticeable effect in various deep learning based NLP tasks, such as language modeling [Ji et al. (2015), Wang and Cho (2016)], query suggestion [Sordoni et al. (2015)], dialogue modeling [Vinyals and Le (2015), Serban et al. (2016)], and machine translation [Wang et al. (2017a), Jean et al. (2017)].

In statistical machine translation, cross-sentence context has proven useful for alleviating inconsistency and ambiguity arising from a single source sentence. Wide-range context was first exploited to improve statistical machine translation models [Gong et al. (2011), Xiao et al. (2012), Hardmeier et al. (2012), Hasler et al. (2014)]. Closely related to our approach, ?) deploy a discrete cache to store bilingual phrases from the best translation hypotheses of previous sentences. In contrast, we use a continuous cache to store bilingual representations, which are more suitable for neural machine translation models.

Concerning neural machine translation, ?) and ?) are two early attempts to model cross-sentence context. ?) use a hierarchical RNN to summarize the previous $K$ (e.g., $K=3$) source sentences, while ?) use an additional encoder and attention model to encode and select part of the previous source sentence for generating each target word. While their approaches only exploit source-side cross-sentence contexts, the proposed approach is able to take advantage of bilingual contexts by directly leveraging continuous vectors to represent translation history. As shown in Tables 4 and 5, compared with their approaches, the proposed approach is more robust in improving translation performance across different domains, and is more efficient in both training and testing.

Cache in NLP

In the NLP community, the concept of “cache” was first introduced by [Kuhn and Mori (1990)], who augment a statistical language model with a cache component and assign relatively high probabilities to words that occur elsewhere in a given text. The success of the cache language model in improving word prediction rests on capturing the “burstiness” of word usage in a local context. It has been shown that caching is by far the most useful technique for perplexity reduction over the standard $n$-gram approach [Goodman (2001)], and it has become a standard component in most LM toolkits, such as IRSTLM [Federico et al. (2008)]. Inspired by the great success of caching on language modeling, ?) propose to use a cache model to adapt language and translation models for SMT systems, and ?) apply an exponentially decaying cache for the domain adaptation task. In this work, we generalize and adapt the original discrete cache model, and integrate a “continuous” variant into NMT models.

Conclusion

We propose to augment NMT models with a cache-like memory network, which stores translation history in terms of bilingual hidden representations at decoding steps of previous sentences. The cache component is an external key-value memory structure with the keys being attention vectors and values being decoder states collected from translation history. At each decoding step, the probability distribution over generated words is updated online depending on the history information retrieved from the cache with a query of the current attention vector. Using simply a dot-product for key matching, this history information is quite cheap to store and can be accessed efficiently.

In our future work, we expect several developments that will shed more light on utilizing long-range contexts, e.g., designing novel architectures, and employing discourse relations instead of directly using decoder states as cache values.

Acknowledgments

Yang Liu is supported by the National Key R&D Program of China (No. 2017YFB0202204) and National Natural Science Foundation of China (No. 61432013, No. 61522204).