Improving Neural Machine Translation Models with Monolingual Data

Rico Sennrich, Barry Haddow, Alexandra Birch

Introduction

Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training. Target-side monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for NMT.

Language models trained on monolingual data have played a central role in statistical machine translation since the first IBM models [Brown et al., 1990]. There are two major reasons for their importance. Firstly, word-based and phrase-based translation models make strong independence assumptions, with the probability of translation units estimated independently from context, and language models, by making different independence assumptions, can model how well these translation units fit together. Secondly, the amount of available monolingual data in the target language typically far exceeds the amount of parallel data, and models typically improve when trained on more data, or data more similar to the translation task.

In (attentional) encoder-decoder architectures for neural machine translation [Sutskever et al., 2014, Bahdanau et al., 2015], the decoder is essentially an RNN language model that is also conditioned on source context, so the first rationale, adding a language model to compensate for the independence assumptions of the translation model, does not apply. However, the data argument is still valid in NMT, and we expect monolingual data to be especially helpful if parallel data is sparse, or a poor fit for the translation task, for instance because of a domain mismatch.

In contrast to previous work, which integrates a separately trained RNN language model into the NMT model [Gülçehre et al., 2015], we explore strategies to include monolingual training data in the training process without changing the neural network architecture. This makes our approach applicable to different NMT architectures.

The main contributions of this paper are as follows:

we show that we can improve the machine translation quality of NMT systems by mixing monolingual target sentences into the training set.

we investigate two different methods to fill the source side of monolingual training instances: using a dummy source sentence, and using a source sentence obtained via back-translation, which we call synthetic. We find that the latter is more effective.

we successfully adapt NMT models to a new domain by fine-tuning with either monolingual or parallel in-domain data.

Neural Machine Translation

We follow the neural machine translation architecture by [Bahdanau et al., 2015], which we will briefly summarize here. However, we note that our approach is not specific to this architecture.

The neural machine translation system is implemented as an encoder-decoder network with recurrent neural networks.

The encoder is a bidirectional neural network with gated recurrent units [Cho et al., 2014] that reads an input sequence $x=(x_{1},...,x_{m})$ and calculates a forward sequence of hidden states $(\overrightarrow{h}_{1},...,\overrightarrow{h}_{m})$ , and a backward sequence $(\overleftarrow{h}_{1},...,\overleftarrow{h}_{m})$ . The hidden states $\overrightarrow{h}_{j}$ and $\overleftarrow{h}_{j}$ are concatenated to obtain the annotation vector $h_{j}$ .

The decoder is a recurrent neural network that predicts a target sequence $y=(y_{1},...,y_{n})$ . Each word $y_{i}$ is predicted based on a recurrent hidden state $s_{i}$ , the previously predicted word $y_{i-1}$ , and a context vector $c_{i}$ . $c_{i}$ is computed as a weighted sum of the annotations $h_{j}$ . The weight of each annotation $h_{j}$ is computed through an alignment model $\alpha_{ij}$ , which models the probability that $y_{i}$ is aligned to $x_{j}$ . The alignment model is a single-layer feedforward neural network that is learned jointly with the rest of the network through backpropagation.
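The weighted sum that produces the context vector can be illustrated with a minimal sketch. It assumes that the raw alignment scores (the output of the feedforward alignment model, which is not implemented here) are normalized with a softmax, as in [Bahdanau et al., 2015]; the function names and the toy values are illustrative only.

```python
import math

def softmax(scores):
    """Normalize raw alignment scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, annotations):
    """Return c_i as the alpha-weighted sum of the annotations h_j."""
    alphas = softmax(scores)
    dim = len(annotations[0])
    return [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]

# Toy example: three source positions with two-dimensional annotations.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform attention weights
c = context_vector(scores, annotations)
```

With equal scores, every annotation receives the same weight, so the context vector is simply the mean of the annotations.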

A detailed description can be found in [Bahdanau et al., 2015]. Training is performed on a parallel corpus with stochastic gradient descent. For translation, a beam search with small beam size is employed.
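As an illustration of decoding with a beam, the following sketch keeps the `beam_size` best partial hypotheses at every step. It is a simplified, assumed variant (no length normalization, a hypothetical `step_fn` that stands in for the decoder, and a hypothetical end-of-sentence symbol `</s>`), not the Groundhog implementation.

```python
import math

def beam_search(step_fn, beam_size, max_len, eos="</s>"):
    """step_fn(prefix) returns a dict mapping each next token to its probability."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in step_fn(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the best beam_size extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])

def toy_step(prefix):
    """Hypothetical toy model in which the greedy first choice is not optimal."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"c": 0.5, "d": 0.5}
    if prefix[-1] == "b":
        return {"c": 0.9, "d": 0.1}
    return {"</s>": 1.0}

greedy = beam_search(toy_step, beam_size=1, max_len=5)
wider = beam_search(toy_step, beam_size=2, max_len=5)
```

In this toy model, a beam of size 1 commits to the locally best first token, while a beam of size 2 recovers the hypothesis with the higher overall probability.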

NMT Training with Monolingual Training Data

In machine translation, more monolingual data (or monolingual data more similar to the test set) serves to improve the estimate of the prior probability $p(T)$ of the target sentence $T$ , before taking the source sentence $S$ into account. In contrast to [Gülçehre et al., 2015], who train separate language models on monolingual training data and incorporate them into the neural network through shallow or deep fusion, we propose techniques to train the main NMT model with monolingual data, exploiting the fact that encoder-decoder neural networks already condition the probability distribution of the next target word on the previous target words. We describe two strategies to do this: providing monolingual training examples with an empty (or dummy) source sentence, or providing monolingual training data with a synthetic source sentence that is obtained from automatically translating the target sentence into the source language, which we will refer to as back-translation.

The first technique we employ is to treat monolingual training examples as parallel examples with empty source side, essentially adding training examples whose context vector $c_{i}$ is uninformative, and for which the network has to fully rely on the previous target words for its prediction. This could be conceived as a form of dropout [Hinton et al., 2012], with the difference that the training instances that have the context vector dropped out constitute novel training data. We can also conceive of this setup as multi-task learning, with the two tasks being translation when the source is known, and language modelling when it is unknown.

During training, we use both parallel and monolingual training examples in the ratio 1-to-1, and randomly shuffle them. We define an epoch as one iteration through the parallel data set, and resample from the monolingual data set for every epoch. We pair monolingual sentences with a single-word dummy source side to allow processing of both parallel and monolingual training examples with the same network graph. (Footnote 1: One could force the context vector $c_{i}$ to be 0 for monolingual training instances, but we found that this does not solve the main problem with this approach, discussed below.) For monolingual minibatches, we freeze the network parameters of the encoder and the attention model. (Footnote 2: For efficiency, ?) sort sets of 20 minibatches according to length. This also groups monolingual training instances together.)
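The data mixing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the dummy token `<null>`, the function name `build_epoch`, and the `"mono"`/`"parallel"` tags are hypothetical, and the tag only marks which examples would have the encoder and attention parameters frozen in an actual training loop.

```python
import random

DUMMY_SOURCE = ["<null>"]  # hypothetical single-word dummy source side

def build_epoch(parallel, monolingual, rng):
    """Return one shuffled epoch with a 1-to-1 mix of parallel and monolingual examples."""
    # Resample the monolingual data for every epoch, matching the parallel set size.
    sampled = rng.sample(monolingual, min(len(parallel), len(monolingual)))
    mono = [(DUMMY_SOURCE, target, "mono") for target in sampled]
    para = [(source, target, "parallel") for source, target in parallel]
    examples = para + mono
    rng.shuffle(examples)
    return examples

parallel = [(["a", "house"], ["ein", "Haus"]), (["a", "dog"], ["ein", "Hund"])]
monolingual = [["das", "ist", "gut"], ["guten", "Tag"], ["danke"], ["bitte"]]
epoch = build_epoch(parallel, monolingual, random.Random(0))
```

Calling `build_epoch` once per epoch with the same random generator yields a fresh monolingual sample each time while the parallel data is traversed exactly once.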

One problem with this integration of monolingual data is that we cannot arbitrarily increase the ratio of monolingual training instances, or fine-tune a model with only monolingual training data, because different output layer parameters are optimal for the two tasks, and the network ‘unlearns’ its conditioning on the source context if the ratio of monolingual training instances is too high.

2 Synthetic Source Sentences

To ensure that the output layer remains sensitive to the source context, and that good parameters are not unlearned from monolingual data, we propose to pair monolingual training instances with a synthetic source sentence from which a context vector can be approximated. We obtain these through back-translation, i.e. an automatic translation of the monolingual target text into the source language.

During training, we mix synthetic parallel text into the original (human-translated) parallel text and do not distinguish between the two: no network parameters are frozen. Importantly, only the source side of these additional training examples is synthetic, and the target side comes from the monolingual corpus.
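A minimal sketch of this pipeline is given below. The word-level lexicon is a hypothetical stand-in for a trained target-to-source translation system; only the pairing logic reflects the description above.

```python
def make_synthetic_pairs(monolingual_targets, back_translate):
    """Pair each human-written target sentence with a machine-translated source."""
    return [(back_translate(target), target) for target in monolingual_targets]

# Hypothetical stand-in for a trained target-to-source system (German to English here).
TOY_LEXICON = {"das": "the", "Haus": "house", "ist": "is", "alt": "old"}

def toy_back_translate(target_tokens):
    return [TOY_LEXICON.get(token, token) for token in target_tokens]

parallel = [(["a", "dog"], ["ein", "Hund"])]
monolingual = [["das", "Haus", "ist", "alt"]]
synthetic = make_synthetic_pairs(monolingual, toy_back_translate)
# Synthetic and human-translated pairs are mixed without any distinction.
training_set = parallel + synthetic
```

Note that the target side of every synthetic pair is untouched monolingual text; only the source side is machine-generated.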

Evaluation

We evaluate NMT training on parallel text, and with additional monolingual data, on English $\leftrightarrow$ German and Turkish $\to$ English, using training and test data from WMT 15 for English $\leftrightarrow$ German, IWSLT 15 for English $\to$ German, and IWSLT 14 for Turkish $\to$ English.

We use Groundhog (Footnote 3: github.com/sebastien-j/LV_groundhog) as the implementation of the NMT system for all experiments [Bahdanau et al., 2015, Jean et al., 2015a]. We generally follow the settings and training procedure described by ?).

For English $\leftrightarrow$ German, we report case-sensitive Bleu on detokenized text with mteval-v13a.pl for comparison to official WMT and IWSLT results. For Turkish $\to$ English, we report case-sensitive Bleu on tokenized text with multi-bleu.perl for comparison to results by ?).

?) determine the network vocabulary based on the parallel training data, and replace out-of-vocabulary words with a special UNK symbol. They remove monolingual sentences with more than 10% UNK symbols. In contrast, we represent unseen words as sequences of subword units [Sennrich et al., 2016], and can represent any additional training data with the existing network vocabulary that was learned on the parallel data. In all experiments, the network vocabulary remains fixed.

We use all parallel training data provided by WMT 2015 [Bojar et al., 2015] (Footnote 4: http://www.statmt.org/wmt15/). We use the News Crawl corpora as additional training data for the experiments with monolingual data. The amount of training data is shown in Table 1.

Baseline models are trained for a week. Ensembles are sampled from the last 4 saved models of training (saved at 12h-intervals). Each model is fine-tuned with fixed embeddings for 12 hours.
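How the ensemble members are combined at decoding time is not spelled out here. One common choice, assumed in the sketch below, is to average the next-word probability distributions of the individual models; the function name and the toy distributions are illustrative.

```python
def ensemble_distribution(distributions):
    """Average next-word distributions (dicts token -> probability) over models."""
    vocab = set().union(*distributions)
    n = len(distributions)
    return {tok: sum(d.get(tok, 0.0) for d in distributions) / n for tok in vocab}

# Toy predictions of two saved models for the same decoder step.
model_a = {"Haus": 0.7, "Hund": 0.3}
model_b = {"Haus": 0.5, "Hund": 0.4, "Baum": 0.1}
combined = ensemble_distribution([model_a, model_b])
best = max(combined, key=combined.get)
```

Models sampled from the same training run tend to make similar predictions, which limits how much such an average can gain compared to independently trained models.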

For the experiments with synthetic parallel data, we back-translate a random sample of $3\,600\,000$ sentences from the German monolingual data set into English. The German $\to$ English system used for this is the baseline system (parallel). Translation took about a week on an NVIDIA Titan Black GPU. For experiments in German $\to$ English, we back-translate $4\,200\,000$ monolingual English sentences into German, using the English $\to$ German system +synthetic. Note that we always use single models for back-translation, not ensembles. We leave it to future work to explore how sensitive NMT training with synthetic data is to the quality of the back-translation.

We tokenize and truecase the training data, and represent rare words via BPE [Sennrich et al., 2016]. Specifically, we follow ?) in performing BPE on the joint vocabulary with $89\,500$ merge operations. The network vocabulary size is $90\,000$ .
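For readers unfamiliar with BPE, the following sketch shows how merge operations are learned from a word frequency list, in the spirit of [Sennrich et al., 2016]. It is a compact illustration, not the tool used in the experiments; the end-of-word symbol `</w>` and the first-occurrence tie-breaking are assumptions of this sketch.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a {word: frequency} dict."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

With $89\,500$ merge operations on the joint vocabulary, the number of distinct symbols stays bounded, which is why the network vocabulary can remain fixed when additional training data is added.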

We also perform experiments on the IWSLT 15 test sets to investigate a cross-domain setting (Footnote 5: http://workshop2015.iwslt.org/). The test sets consist of TED talk transcripts. As in-domain training data, IWSLT provides the WIT3 parallel corpus [Cettolo et al., 2012], which also consists of TED talks.

1.2 Turkish $\to$ English

We use data provided for the IWSLT 14 machine translation track [Cettolo et al., 2014], namely the WIT3 parallel corpus [Cettolo et al., 2012], which consists of TED talks, and the SETimes corpus [Tyers and Alperen, 2010] (Footnote 6: http://workshop2014.iwslt.org/). After removal of sentence pairs which contain empty lines or lines with a length ratio above 9, we retain $320\,000$ sentence pairs of training data. For the experiments with monolingual training data, we use the English LDC Gigaword corpus (Fifth Edition). The amount of training data is shown in Table 2. With only $320\,000$ sentences of parallel data available for training, this is a much lower-resourced translation setting than English $\leftrightarrow$ German.
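The corpus cleaning step can be sketched as a simple filter. The sketch assumes that length is measured in whitespace-separated tokens and that the ratio is applied symmetrically (longer side divided by shorter side); neither detail is stated above.

```python
def keep_pair(source, target, max_ratio=9.0):
    """Return True if a sentence pair survives the cleaning filter."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False  # empty line on either side
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio

pairs = [
    ("bu bir ev", "this is a house"),
    ("", "an orphaned line"),
    ("evet", "yes " * 10),
]
cleaned = [(s, t) for s, t in pairs if keep_pair(s, t)]
```

Only the first toy pair is kept: the second has an empty source line and the third exceeds the length ratio.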

?) segment the Turkish text with the morphology tool Zemberek, followed by a disambiguation of the morphological analysis [Sak et al., 2007], and removal of non-surface tokens produced by the analysis. We use the same preprocessing (Footnote 7: github.com/orhanf/zemberekMorphTR). For both Turkish and English, we represent rare words (or morphemes in the case of Turkish) as character bigram sequences [Sennrich et al., 2016]. The $20\,000$ most frequent words (morphemes) are left unsegmented. The networks have a vocabulary size of $23\,000$ symbols.
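The character bigram representation can be illustrated as follows. The continuation marker `@@` and the handling of a trailing single character are assumptions of this sketch; the toy frequency list stands in for the $20\,000$ most frequent words.

```python
def segment(word, frequent_words, marker="@@"):
    """Split a rare word into character bigrams; keep frequent words whole."""
    if word in frequent_words or len(word) <= 2:
        return [word]
    pieces = [word[i:i + 2] for i in range(0, len(word), 2)]
    # Mark every non-final piece so that the word can be restored after translation.
    return [p + marker for p in pieces[:-1]] + [pieces[-1]]

frequent = {"the", "house"}  # hypothetical shortlist of frequent words
tokens = []
for word in ["the", "zemberek", "house"]:
    tokens.extend(segment(word, frequent))
```

Because every rare word decomposes into a closed set of bigrams, the symbol inventory stays small enough for a fixed network vocabulary.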

To obtain a synthetic parallel training set, we back-translate a random sample of $3\,200\,000$ sentences from Gigaword. We use an English $\to$ Turkish NMT system trained with the same settings as the Turkish $\to$ English baseline system.

We found overfitting to be a bigger problem than with the larger English $\leftrightarrow$ German data set, and follow ?) in using Gaussian noise (stddev 0.01) [Graves, 2011], and dropout on the output layer (p=0.5) [Hinton et al., 2012]. We also use early stopping, based on Bleu measured every three hours on tst2010, which we treat as development set. For Turkish $\to$ English, we use gradient clipping with threshold 5, following ?), in contrast to the threshold 1 that we use for English $\leftrightarrow$ German, following ?).
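Gradient clipping with a threshold is commonly implemented by rescaling the gradient whenever its norm exceeds the threshold. The sketch below assumes this norm-based variant over a flat list of gradient values; the exact variant used in the experiments is not specified here.

```python
import math

def clip_by_norm(gradient, threshold):
    """Rescale the gradient so that its L2 norm does not exceed the threshold."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)
    scale = threshold / norm
    return [g * scale for g in gradient]

gradient = [3.0, 4.0]                      # L2 norm 5
clipped_de = clip_by_norm(gradient, 1.0)   # threshold used for English-German
clipped_tr = clip_by_norm(gradient, 5.0)   # threshold used for Turkish-English
```

A larger threshold leaves more gradients untouched; here the same gradient is rescaled under threshold 1 but passes unchanged under threshold 5.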

2 Results

Table 3 shows English $\to$ German results with WMT training and test data. We find that mixing parallel training data with monolingual data with a dummy source side in a ratio of 1-1 improves quality by 0.4–0.5 Bleu for the single system, 1 Bleu for the ensemble. We train the system for twice as long as the baseline to provide the training algorithm with a similar amount of parallel training instances. To ensure that the quality improvement is due to the monolingual training instances, and not just increased training time, we also continued training our baseline system for another week, but saw no improvements in Bleu.

Including synthetic data during training is very effective, and yields an improvement over our baseline by 2.8–3.4 Bleu. Our best ensemble system also outperforms a syntax-based baseline [Sennrich and Haddow, 2015] by 1.2–2.1 Bleu. We also substantially outperform NMT results reported by ?) and ?), who previously reported SOTA results. (Footnote 8: ?) report 20.9 Bleu (tokenized) on newstest2014 with a single model, and 23.0 Bleu with an ensemble of 8 models. Our best single system achieves a tokenized Bleu (as opposed to untokenized scores reported in Table 3) of 23.8, and our ensemble reaches 25.0 Bleu.) We note that the difference is particularly large for single systems, since our ensemble is not as diverse as that of ?), who used 8 independently trained ensemble components, whereas we sampled 4 ensemble components from the same training run.

2.2 English $\to$ German IWSLT 15

Table 4 shows English $\to$ German results on IWSLT test sets. IWSLT test sets consist of TED talks, and are thus very dissimilar from the WMT test sets, which are news texts. We investigate if monolingual training data is especially valuable if it can be used to adapt a model to a new genre or domain, specifically adapting a system trained on WMT data to translating TED talks.

Systems 1 and 2 correspond to systems in Table 3, trained only on WMT data. System 2, trained on parallel and synthetic WMT data, obtains a Bleu score of 25.5 on tst2015. We observe that even a small amount of fine-tuning (Footnote 9: We leave the word embeddings fixed for fine-tuning.), i.e. continued training of an existing model, on WIT data can adapt a system trained on WMT data to the TED domain. By back-translating the monolingual WIT corpus (using a German $\to$ English system trained on WMT data, i.e. without in-domain knowledge), we obtain the synthetic data set WIT ${}_{\text{synth}}$ . A single epoch of fine-tuning on WIT ${}_{\text{synth}}$ (system 4) results in a Bleu score of 26.7 on tst2015, or an improvement of 1.2 Bleu. We observed no improvement from fine-tuning on WIT ${}_{\text{mono}}$ , the monolingual TED corpus with dummy input (system 3).

These adaptation experiments with monolingual data are slightly artificial in that parallel training data is available. System 5, which is fine-tuned with the original WIT training data, obtains a Bleu of 28.4 on tst2015, which is an improvement of 2.9 Bleu. While it is unsurprising that in-domain parallel data is most valuable, we find it encouraging that NMT domain adaptation with monolingual data is also possible, and effective, since there are settings where only monolingual in-domain data is available.

The best results published on this dataset are by ?), obtained with an ensemble of 8 independently trained models. In a comparison of single-model results, we outperform their model on tst2013 by 1 Bleu.

2.3 German $\to$ English WMT 15

Results for German $\to$ English on the WMT 15 data sets are shown in Table 5. As for the reverse translation direction, we see substantial improvements (3.6–3.7 Bleu) from adding monolingual training data with synthetic source sentences, which are substantially bigger than the improvement observed with deep fusion [Gülçehre et al., 2015]; our ensemble outperforms the previous state of the art on newstest2015 by 2.3 Bleu.

2.4 Turkish $\to$ English IWSLT 14

Table 6 shows results for Turkish $\to$ English. On average, we see an improvement of 0.6 Bleu on the test sets from adding monolingual data with a dummy source side in a 1-1 ratio (Footnote 10: We also experimented with higher ratios of monolingual data, but this led to decreased Bleu scores.), although we note a high variance between different test sets.

With synthetic training data (Gigaword ${}_{\text{synth}}$ ), we outperform the baseline by 2.7 Bleu on average, and also outperform results obtained via shallow or deep fusion [Gülçehre et al., 2015] by 0.5 Bleu on average. To test to what extent synthetic data has a regularization effect, even without novel training data, we also back-translate the target side of the parallel training text to obtain the training corpus parallel ${}_{\text{synth}}$ . Mixing the original parallel corpus with parallel ${}_{\text{synth}}$ (ratio 1-1) gives some improvement over the baseline (1.7 Bleu on average), but the novel monolingual training data (Gigaword ${}_{\text{mono}}$ ) gives higher improvements, despite being out-of-domain in relation to the test sets. We speculate that novel in-domain monolingual data would lead to even higher improvements.

2.5 Back-translation Quality for Synthetic Data

One question that our previous experiments leave open is how the quality of the automatic back-translation affects training with synthetic data. To investigate this question, we back-translate the same German monolingual corpus with three different German $\to$ English systems:

with our baseline system and greedy decoding

with our baseline system and beam search (beam size 12). This is the same system used for the experiments in Table 3.

with the German $\to$ English system that was itself trained with synthetic data (beam size 12).

Bleu scores of the German $\to$ English systems, and of the resulting English $\to$ German systems that are trained on the different back-translations, are shown in Table 7. The quality of the German $\to$ English back-translation differs substantially, with a difference of 6 Bleu on newstest2015. Regarding the English $\to$ German systems trained on the different synthetic corpora, we find that the 6 Bleu difference in back-translation quality leads to a 0.6–0.7 Bleu difference in translation quality. This is balanced by the fact that we can increase the speed of back-translation by trading off some quality, for instance by reducing beam size, and we leave it to future research to explore how much the amount of synthetic data affects translation quality.

We also show results for an ensemble of 3 models (the best single model of each training run), and 12 models (all 4 models of each training run). Thanks to the increased diversity of the ensemble components, these ensembles outperform the ensembles of 4 models that were all sampled from the same training run, and we obtain another improvement of 0.8–1.0 Bleu.

3 Contrast to Phrase-based SMT

The back-translation of monolingual target data into the source language to produce synthetic parallel text has been previously explored for phrase-based SMT [Bertoldi and Federico, 2009, Lambert et al., 2011]. While our approach is technically similar, synthetic parallel data fulfills novel roles in NMT.

To explore the relative effectiveness of back-translated data for phrase-based SMT and NMT, we train two phrase-based SMT systems with Moses [Koehn et al., 2007], using only WMT ${}_{\text{parallel}}$ , or both WMT ${}_{\text{parallel}}$ and WMT ${}_{\text{synth\_de}}$ for training the translation and reordering model. Both systems contain the same language model, a 5-gram Kneser-Ney model trained on all available WMT data. We use the baseline features described by ?).

Results are shown in Table 8. In phrase-based SMT, we find that the use of back-translated training data has a moderate positive effect on the WMT test sets (+0.7 Bleu), but not on the IWSLT test sets. This is in line with the expectation that the main effect of back-translated data for phrase-based SMT is domain adaptation [Bertoldi and Federico, 2009]. Both the WMT test sets and the News Crawl corpora which we used as monolingual data come from the same source, a web crawl of newspaper articles. (Footnote 11: The WMT test sets are held-out from News Crawl.) In contrast, News Crawl is out-of-domain for the IWSLT test sets.

In contrast to phrase-based SMT, which can make use of monolingual data via the language model, NMT has so far not been able to use monolingual data to great effect without requiring architectural changes. We find that the effect of synthetic parallel data is not limited to domain adaptation, and that even out-of-domain synthetic data improves NMT quality, as in our evaluation on IWSLT. The fact that the synthetic data is more effective on the WMT test sets (+2.9 Bleu) than on the IWSLT test sets (+1.2 Bleu) supports the hypothesis that domain adaptation contributes to the effectiveness of adding synthetic data to NMT training.

It is an important finding that back-translated data, which is mainly effective for domain adaptation in phrase-based SMT, is more generally useful in NMT, and has positive effects that go beyond domain adaptation. In the next section, we will investigate further reasons for its effectiveness.

4 Analysis

[Figure 2 (plot omitted): English $\to$ German training and development set (newstest2013) cross-entropy as a function of training time (number of training instances) for different systems. x-axis: training time (training instances $\cdot 10^{6}$ ); y-axis: cross-entropy; curves: WMT ${}_{\text{parallel}}$ (dev), WMT ${}_{\text{parallel}}$ (train), WMT ${}_{\text{synth}}$ (dev), WMT ${}_{\text{synth}}$ (train).]

We previously indicated that overfitting is a concern with our baseline system, especially on small data sets of several hundred thousand training sentences, despite the regularization employed. This overfitting is illustrated in Figure 1, which plots training and development set cross-entropy by training time for Turkish $\to$ English models. For comparability, we measure training set cross-entropy for all models on the same random sample of the parallel training set. We can see that the model trained on only parallel training data quickly overfits, while all three monolingual data sets (parallel ${}_{\text{synth}}$ , Gigaword ${}_{\text{mono}}$ , or Gigaword ${}_{\text{synth}}$ ) delay overfitting, and give better perplexity on the development set. The best development set cross-entropy is reached by Gigaword ${}_{\text{synth}}$ .

Figure 2 shows cross-entropy for English $\to$ German, comparing the system trained on only parallel data and the system that includes synthetic training data. Since more training data is available for English $\to$ German, there is no indication that overfitting happens during the first 40 million training instances (or 7 days of training); while both systems obtain comparable training set cross-entropies, the system with synthetic data reaches a lower cross-entropy on the development set. One explanation for this is the domain effect discussed in the previous section.
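For reference, the cross-entropy tracked in these comparisons can be computed from the probabilities that a model assigns to the reference words. The sketch below assumes a per-word average in nats; the unit and normalization used for the figures are not stated, and the toy probabilities are invented for illustration.

```python
import math

def cross_entropy(reference_probs):
    """Average negative log-probability of the reference words (in nats)."""
    return -sum(math.log(p) for p in reference_probs) / len(reference_probs)

def perplexity(reference_probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(reference_probs))

# Hypothetical probabilities assigned to the same development-set words.
train_like = [0.9, 0.8, 0.9]   # confident predictions: low cross-entropy
dev_like = [0.5, 0.25, 0.5]    # less confident predictions: higher cross-entropy
gap = cross_entropy(dev_like) - cross_entropy(train_like)
```

A growing gap between training and development cross-entropy over training time is the overfitting signature discussed above.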

A central theoretical expectation is that monolingual target-side data improves the model’s fluency, its ability to produce natural target-language sentences. As a proxy to sentence-level fluency, we investigate word-level fluency, specifically words produced as sequences of subword units, and whether NMT systems trained with additional monolingual data produce more natural words. For instance, the English $\to$ German systems translate the English phrase civil rights protections as a single compound, composed of three subword units: Bürger|rechts|schutzes (Footnote 12: Subword boundaries are marked with ‘|’.), and we analyze how many of these multi-unit words that the translation systems produce are well-formed German words.

We count the words in the system output for the newstest2015 test set that are produced via subword units and that do not occur in the parallel training corpus. We also count how many of them are attested in the full monolingual corpus or the reference translation, all of which we consider ‘natural’. Additionally, the main author, a native speaker of German, annotated a random subset ($n=100$) of unattested words of each system according to their naturalness (Footnote 13: For the annotation, the words were blinded regarding the system that produced them.), distinguishing between natural German words (or names) such as Literatur|klassen ‘literature classes’, and nonsensical ones such as *As|best|atten (a misspelling of Asbestmatten ‘asbestos mats’).

In the results (Table 9), we see that the systems trained with additional monolingual or synthetic data have a higher proportion of novel words attested in the non-parallel data, and a higher proportion that is deemed natural by our annotator. This supports our expectation that additional monolingual data improves the (word-level) fluency of the NMT system.
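The counting part of this analysis can be sketched as follows. The sketch assumes that system output marks subword boundaries with `|`, and that distinct word forms (types) are counted; whether the reported numbers count types or tokens is not stated here. All vocabularies are toy data.

```python
def analyse_output(output_words, parallel_vocab, monolingual_vocab, reference_vocab):
    """Return (novel, attested) sets of multi-unit words produced by a system."""
    novel = set()
    for word in output_words:
        if "|" not in word:
            continue  # produced as a single unit, not via subword units
        surface = word.replace("|", "")
        if surface not in parallel_vocab:
            novel.add(surface)
    attested = {w for w in novel if w in monolingual_vocab or w in reference_vocab}
    return novel, attested

output = ["das", "Literatur|klassen", "As|best|atten", "Haus|tür"]
novel, attested = analyse_output(
    output,
    parallel_vocab={"das", "Haustür"},
    monolingual_vocab={"Literaturklassen"},
    reference_vocab=set(),
)
unattested = novel - attested  # candidates for manual annotation
```

The unattested remainder is what would be sampled for the manual naturalness annotation.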

Related Work

To our knowledge, the integration of monolingual data for pure neural machine translation architectures was first investigated by [Gülçehre et al., 2015], who train monolingual language models independently, and then integrate them during decoding through rescoring of the beam (shallow fusion), or by adding the recurrent hidden state of the language model to the decoder state of the encoder-decoder network, with an additional controller mechanism that controls the magnitude of the LM signal (deep fusion). In deep fusion, the controller parameters and output parameters are tuned on further parallel training data, but the language model parameters are fixed during the finetuning stage. ?) also report on experiments with reranking of NMT output with a 5-gram language model, but improvements are small (between 0.1–0.5 Bleu).

The production of synthetic parallel texts bears resemblance to data augmentation techniques used in computer vision, where datasets are often augmented with rotated, scaled, or otherwise distorted variants of the (limited) training set [Rowley et al., 1996].

Another similar avenue of research is self-training [McClosky et al., 2006, Schwenk, 2008]. The main difference is that self-training typically refers to a scenario where the training set is enhanced with training instances with artificially produced output labels, whereas we start with human-produced output (i.e. the translation), and artificially produce an input. We expect that this is more robust towards noise in the automatic translation. Improving NMT with monolingual source data, following similar work on phrase-based SMT [Schwenk, 2008], remains possible future work.

Domain adaptation of neural networks via continued training has been shown to be effective for neural language models by [Ter-Sarkisov et al., 2015], and in work parallel to ours, for neural translation models [Luong and Manning, 2015]. We are the first to show that we can effectively adapt neural translation models with monolingual data.

Conclusion

In this paper, we propose two simple methods to use monolingual training data during training of NMT systems, with no changes to the network architecture. Providing training examples with dummy source context was successful to some extent, but we achieve substantial gains in all tasks, and new SOTA results, via back-translation of monolingual target data into the source language, and treating this synthetic data as additional training data. We also show that small amounts of in-domain monolingual data, back-translated into the source language, can be effectively used for domain adaptation. In our analysis, we identified domain adaptation effects, a reduction of overfitting, and improved fluency as reasons for the effectiveness of using monolingual data for training.

While our experiments did make use of monolingual training data, we only used a small random sample of the available data, especially for the experiments with synthetic parallel data. It is conceivable that larger synthetic data sets, or data sets obtained via data selection, will provide bigger performance benefits.

Because we do not change the neural network architecture to integrate monolingual training data, our approach can be easily applied to other NMT systems. We expect that the effectiveness of our approach not only varies with the quality of the MT system used for back-translation, but also depends on the amount (and similarity to the test set) of available parallel and monolingual data, and the extent of overfitting of the baseline model. Future work will explore the effectiveness of our approach in more settings.

Acknowledgments

The research presented in this publication was conducted in cooperation with Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland. This project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).