Edinburgh Neural Machine Translation Systems for WMT 16

Rico Sennrich, Barry Haddow, Alexandra Birch

Introduction

We participated in the WMT 2016 shared news translation task by building neural translation systems for four language pairs: English $\leftrightarrow$ Czech, English $\leftrightarrow$ German, English $\leftrightarrow$ Romanian and English $\leftrightarrow$ Russian. Our systems are based on an attentional encoder-decoder [Bahdanau et al., 2015], using BPE subword segmentation for open-vocabulary translation with a fixed vocabulary [Sennrich et al., 2016b]. We experimented with using automatic back-translations of the monolingual News corpus as additional training data [Sennrich et al., 2016a], pervasive dropout [Gal, 2015], and target-bidirectional models.

Baseline System

Our systems are attentional encoder-decoder networks [Bahdanau et al., 2015]. We base our implementation on the dl4mt-tutorial (https://github.com/nyu-dl/dl4mt-tutorial), which we enhanced with new features such as ensemble decoding and pervasive dropout.

We use minibatches of size 80, a maximum sentence length of 50, word embeddings of size 500, and hidden layers of size 1024. We clip the gradient norm to 1.0 [Pascanu et al., 2013]. We train the models with Adadelta [Zeiler, 2012], reshuffling the training corpus between epochs. We validate the model every $10\,000$ minibatches via Bleu on a validation set (newstest2013, newstest2014, or half of newsdev2016 for EN $\leftrightarrow$ RO). We perform early stopping for single models, and use the 4 last saved models (with models saved every $30\,000$ minibatches) for the ensemble results. Note that ensemble scores are the result of a single training run. Due to resource limitations, we did not train ensemble components independently, which could result in more diverse models and better ensembles.
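The gradient-norm clipping mentioned above can be illustrated with a short sketch. It assumes the common formulation in which all gradients are rescaled jointly when their combined L2 norm exceeds the threshold. The function name `clip_gradient_norm` and the list-of-lists representation are illustrative choices, not taken from the dl4mt code.

```python
import math

def clip_gradient_norm(grads, max_norm=1.0):
    """Rescale gradient vectors so that their joint L2 norm is at most max_norm.

    grads is a list of gradient vectors (lists of floats), one per parameter.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        # Norm is already within the threshold: return an unchanged copy.
        return [list(vec) for vec in grads]
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads]

# Two parameter gradients with a joint norm of 5 are scaled down to norm 1.
clipped = clip_gradient_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=1.0)
```

Rescaling preserves the direction of the update and only limits its magnitude.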

Decoding is performed with beam search with a beam size of 12. For some language pairs, we used the AmuNMT C++ decoder (https://github.com/emjotde/amunmt) as a more efficient alternative to the Theano implementation of the dl4mt tutorial.

To enable open-vocabulary translation, we segment words via byte-pair encoding (BPE) [Sennrich et al., 2016b], using the implementation at https://github.com/rsennrich/subword-nmt. BPE, originally devised as a compression algorithm [Gage, 1994], is adapted to word segmentation as follows:

First, each word in the training vocabulary is represented as a sequence of characters, plus an end-of-word symbol. All characters are added to the symbol vocabulary. Then, the most frequent symbol pair is identified, and all its occurrences are merged, producing a new symbol that is added to the vocabulary. The previous step is repeated until a set number of merge operations have been learned.
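The learning procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the subword-nmt implementation: the function names, the `</w>` end-of-word marker, and the toy word frequencies are chosen for the example, and ties between equally frequent pairs are broken by first occurrence.

```python
import collections
import re

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair by the merged symbol."""
    # Lookarounds ensure that only whole, space-delimited symbols match.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Return the ordered list of merge operations learned from word_freqs."""
    # Each word starts as a sequence of characters plus an end-of-word symbol.
    vocab = {' '.join(word) + ' </w>': freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges

toy_freqs = {'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}
merges = learn_bpe(toy_freqs, num_merges=5)
```

On this toy vocabulary the first merges build up the frequent suffix `est</w>` before the stem `low`, which shows how frequent character sequences become single symbols.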

BPE starts from a character-level segmentation, but as we increase the number of merge operations, it diverges more and more from a pure character-level model in that frequent character sequences, and even full words, are encoded as a single symbol. This allows for a trade-off between the size of the model vocabulary and the length of training sequences. The ordered list of merge operations, learned on the training set, can be applied to any text to segment words into subword units that are in-vocabulary with respect to the training set (except for unseen characters).
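Applying a learned merge list to new text can be sketched as follows. The merge list below is a hypothetical example of what could be learned on a small training vocabulary, and replaying the merges in learned order is a simplification for illustration. The subword-nmt tool may organise this step differently.

```python
def apply_bpe(word, merges):
    """Segment a word by replaying the ordered merge operations."""
    symbols = list(word) + ['</w>']
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if (i + 1 < len(symbols)
                    and symbols[i] == left and symbols[i + 1] == right):
                merged.append(left + right)  # apply the merge operation
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order in which it was learned.
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
segments = apply_bpe('lowest', merges)
```

Here a word that never occurred in training ("lowest") is split into the two known units `low` and `est</w>`. A word made of characters with no applicable merges simply stays at character level.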

To increase consistency in the segmentation of the source and target text, we combine the source and target side of the training set for learning BPE. For each language pair, we learn $89\,500$ merge operations.

Experimental Features

WMT provides task participants with large amounts of monolingual data, both in-domain and out-of-domain. We exploit this monolingual data for training as described in [Sennrich et al., 2016a]. Specifically, we sample a subset of the available target-side monolingual corpora, translate it automatically into the source side of the respective language pair, and then use this synthetic parallel data for training. For example, for EN $\to$ RO, the back-translation is performed with a RO $\to$ EN system, and vice-versa.

Sennrich et al. (2016a) motivate the use of monolingual data with domain adaptation, reducing overfitting, and better modelling of fluency. We sample monolingual data from the News Crawl corpora, which are in-domain with respect to the test set. (Due to recency effects, we expect last year’s corpus to be most relevant, and sampled from News Crawl 2015 for EN-RO, EN-RU and EN-CS; for EN-DE, we re-used data from [Sennrich et al., 2016a], which was randomly sampled from News Crawl 2007–2014.)

The amount of monolingual data back-translated for each translation direction ranges from 2 million to 10 million sentences. Statistics about the amount of parallel and synthetic training data are shown in Table 1. With dl4mt, we observed a translation speed of about $200\,000$ sentences per day (on a single Titan X GPU).
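The construction of synthetic parallel data described in this section can be sketched as below. The function names and the stand-in `toy_back_translate` are hypothetical. In practice, `back_translate` would call a trained target-to-source NMT system (e.g. RO→EN when building data for EN→RO). How the parallel and synthetic portions are weighted is not specified here, so the sketch simply concatenates them.

```python
import random

def build_synthetic_corpus(target_monolingual, back_translate, sample_size, seed=0):
    """Sample target-side sentences and pair each with its back-translation."""
    rng = random.Random(seed)
    size = min(sample_size, len(target_monolingual))
    sample = rng.sample(target_monolingual, size)
    # Synthetic source side, genuine target side.
    return [(back_translate(tgt), tgt) for tgt in sample]

def mix_training_data(parallel, synthetic):
    """Combine genuine and synthetic sentence pairs (assumed: concatenation)."""
    return list(parallel) + list(synthetic)

def toy_back_translate(sentence):
    """Placeholder for a target-to-source translation system."""
    return 'SRC(' + sentence + ')'

monolingual = ['propozitie unu', 'propozitie doi', 'propozitie trei']
synthetic = build_synthetic_corpus(monolingual, toy_back_translate, sample_size=2)
training_data = mix_training_data([('a sentence', 'o propozitie')], synthetic)
```

The target side of every synthetic pair is human-written text. Only the source side is machine-generated.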

Pervasive Dropout

For English $\leftrightarrow$ Romanian, we observed poor performance because of overfitting. To mitigate this, we apply dropout to all layers in the network, including recurrent ones.

Previous work dropped out different units at each time step. When applied to recurrent connections, this has the downside that it impedes the information flow over long distances, and it has therefore been proposed to apply dropout only to non-recurrent connections.

Instead, we follow the approach suggested by Gal (2015), and use the same dropout mask at each time step. Our implementation differs from the recommendations by Gal (2015) in one respect: we also drop words at random, but we do so on a token level, not on a type level. In other words, if a word occurs multiple times in a sentence, we may drop out any number of its occurrences, and not just none or all.

In our English $\leftrightarrow$ Romanian experiments, we drop out full words (both on the source and target side) with a probability of 0.1. For all other layers, the dropout probability is set to 0.2.
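The two ingredients of this scheme, a dropout mask shared across time steps and token-level word dropout, can be sketched in plain Python. Several details are assumptions made for illustration: dropping a word is modelled as zeroing its whole embedding vector, kept values are rescaled by 1/(1 - p), and all function names are hypothetical.

```python
import random

def sample_mask(size, drop_prob, rng):
    """Sample one dropout mask; kept units are scaled by 1 / (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]

def apply_shared_mask(states, mask):
    """Apply the same mask to the state vector of every time step."""
    return [[h * m for h, m in zip(state, mask)] for state in states]

def drop_word_tokens(embeddings, drop_prob, rng):
    """Drop each token independently by zeroing its full embedding vector."""
    keep = 1.0 - drop_prob
    dropped = []
    for emb in embeddings:
        kept = rng.random() < keep  # one decision per token, not per word type
        dropped.append([e / keep if kept else 0.0 for e in emb])
    return dropped

rng = random.Random(0)
mask = sample_mask(size=8, drop_prob=0.2, rng=rng)        # layer dropout
states = [[1.0] * 8 for _ in range(5)]                    # 5 time steps
masked_states = apply_shared_mask(states, mask)
masked_words = drop_word_tokens([[1.0, 1.0]] * 6, drop_prob=0.1, rng=rng)
```

Because the mask is sampled once per sequence, the same units are dropped at every time step. In contrast, each occurrence of a repeated word receives its own keep-or-drop decision.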

Target-bidirectional Translation

We found that during decoding, the model would occasionally assign a high probability to words based on the target context alone, ignoring the source sentence. We speculate that this is an instance of the label bias problem [Lafferty et al., 2001].

To mitigate this problem, we experiment with training separate models that produce the target text from right to left (r2l), and re-scoring the n-best lists that are produced by the main (left-to-right) models with these r2l models. Since the right-to-left model will see a complementary target context at each time step, we expect that the averaged probabilities will be more robust. In parallel to our experiments, this idea was published by Liu et al. (2016).

We increase the size of the n-best-list to 50 for the reranking experiments.
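The re-scoring step can be sketched as follows. The sketch assumes that each n-best entry carries the log-probability assigned by the left-to-right model, that the right-to-left model scores the reversed token sequence, and that the two scores are averaged in log space with uniform weights. The function names and the toy scorer are hypothetical.

```python
def rerank_nbest(nbest, score_r2l, weight_l2r=0.5, weight_r2l=0.5):
    """Rerank hypotheses by a weighted sum of l2r and r2l log-probabilities.

    nbest: list of (tokens, l2r_logprob) pairs from the left-to-right model.
    score_r2l: callable mapping a reversed token list to a log-probability.
    """
    rescored = []
    for tokens, l2r_logprob in nbest:
        r2l_logprob = score_r2l(list(reversed(tokens)))
        combined = weight_l2r * l2r_logprob + weight_r2l * r2l_logprob
        rescored.append((combined, tokens))
    rescored.sort(key=lambda item: item[0], reverse=True)
    return [tokens for _, tokens in rescored]

# Toy stand-in for a right-to-left model: a lookup table of log-probabilities.
toy_r2l_scores = {('b', 'a'): -3.0, ('c', 'a'): -0.5}

def toy_score_r2l(reversed_tokens):
    return toy_r2l_scores[tuple(reversed_tokens)]

nbest = [(['a', 'b'], -1.0), (['a', 'c'], -1.2)]
reranked = rerank_nbest(nbest, toy_score_r2l)
```

In this toy example the left-to-right model prefers the first hypothesis, but the right-to-left score moves the second hypothesis to the top after reranking.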

A possible criticism of the l-r/r-l reranking approach is that the gains actually come from adding diversity to the ensemble, since we are now using two independent runs. However, experiments in [Liu et al., 2016] show that an l-r/r-l reranking system is stronger than an ensemble created from two independent l-r runs.

Results

Table 2 shows results for English $\leftrightarrow$ German. We observe improvements of 3.4–5.7 Bleu from training with a mix of parallel and synthetic data, compared to the baseline that is only trained on parallel data. Using an ensemble of the last 4 checkpoints gives further improvements (1.3–1.7 Bleu). Our submitted system includes reranking of the 50-best output of the left-to-right model with a right-to-left model (again an ensemble of the last 4 checkpoints), with uniform weights. This yields an improvement of 0.6–1.1 Bleu.

English $\leftrightarrow$ Czech

For English $\rightarrow$ Czech, we trained our baseline model on the complete WMT16 parallel training set (including CzEng 1.6pre [Bojar et al., 2016]), until we observed convergence on our heldout set (newstest2014). This took approximately 1M minibatches, or 3 weeks. Then we continued training the model on a new parallel corpus, comprising 8.2M sentences back-translated from the Czech monolingual news2015, 5 copies of news-commentary v11, and 9M sentences sampled from CzEng 1.6pre. The model used for back-translation was a neural MT model from earlier experiments, trained on WMT15 data. The training on this synthetic mix continued for a further $400\,000$ minibatches.

The right-left model was trained using a similar process, but with the target side of the parallel corpus reversed prior to training. The resulting model had a slightly lower Bleu score on the dev data than the standard left-right model. We can see in Table 3 that back-translation improves performance by 2.2–2.8 Bleu, and that the final system (+r2l reranking) improves by 0.7–1.0 Bleu on the ensemble of 4, and 4.3–4.9 on the baseline.

For Czech $\rightarrow$ English, the training process was similar to the above, except that we created the synthetic training data (back-translated from samples of news2015 monolingual English) in batches of 2.5M, and so were able to observe the effect of increasing the amount of synthetic data. After training a baseline model on the full WMT16 parallel set, we continued training with a parallel corpus consisting of 2 copies of the 2.5M sentences of back-translated data, 5 copies of news-commentary v11, and a matching quantity of data sampled from CzEng 1.6pre. After training this to convergence, we restarted training from the baseline model using 5M sentences of back-translated data, 5 copies of news-commentary v11, and a matching quantity of data sampled from CzEng 1.6pre. We repeated this with 7.5M sentences from news2015 monolingual, and then with 10M sentences of news2015. The back-translations were, as for English $\rightarrow$ Czech, created with an earlier NMT model trained on WMT15 data. Our final Czech $\rightarrow$ English system was an ensemble of 8 systems: the last 4 save-points of the 10M synthetic data run, and the last 4 save-points of the 7.5M run. We show this as ensemble8 in Table 3, and the +synthetic results are on the last (i.e. 10M) synthetic data run.

We also show in Table 4 how increasing the amount of back-translated data affects the results. We see that most of the gain from back-translation comes with the first batch, but increasing the amount of back-translated data does gradually improve performance.

English $\leftrightarrow$ Romanian

The results of our English $\leftrightarrow$ Romanian experiments are shown in Table 5. This language pair has the smallest amount of parallel training data, and we found dropout to be very effective, yielding improvements of 4–5 Bleu. (We also tested dropout for EN $\to$ DE with 8 million sentence pairs of training data, but found no improvement after 10 days of training. We speculate that dropout could still be helpful for datasets of this size with longer training times and/or larger networks.)

We found that the use of diacritics was inconsistent in the Romanian training (and development) data, so for Romanian $\to$ English we removed diacritics from the Romanian source side, obtaining improvements of 1.3–1.4 Bleu.
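One way to remove diacritics is via Unicode decomposition, sketched below. This is an assumption about how such a preprocessing step could be implemented and not a description of the script used for the submission. The function name `strip_diacritics` is illustrative.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks so that e.g. both 'ș' and 'ş' become 's'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize('NFC', stripped)

normalized = strip_diacritics('știință')
```

This also removes the distinction between the comma-below and cedilla variants of ș and ț, since both decompose to the same base letter.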

Synthetic training data gives improvements of 4.1–5.1 Bleu. For English $\to$ Romanian, we found that the best single system outperformed the ensemble of the last 4 checkpoints on dev, and we thus submitted the best single system as primary system.

English $\leftrightarrow$ Russian

For English $\leftrightarrow$ Russian, we cannot effectively learn BPE on the joint vocabulary because the alphabets differ. We thus follow the approach described in [Sennrich et al., 2016b], first mapping the Russian text into Latin characters via ISO-9 transliteration, then learning the BPE operations on the concatenation of the English and latinized Russian training data, then mapping the BPE operations back into the Cyrillic alphabet. We apply the Latin BPE operations to the English data (training data and input), and both the Cyrillic and Latin BPE operations to the Russian data.
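The transliteration round trip can be illustrated with a small sketch. The table below covers only seven letters and is a deliberately partial, illustrative subset of a Cyrillic-to-Latin mapping. The helper names and the handling of the `</w>` end-of-word marker are assumptions, not the procedure used in subword-nmt.

```python
# Partial, illustrative transliteration table (Cyrillic -> Latin).
CYR_TO_LAT = {
    '\u043c': 'm',  # Cyrillic em
    '\u043e': 'o',  # Cyrillic o
    '\u0441': 's',  # Cyrillic es
    '\u043a': 'k',  # Cyrillic ka
    '\u0432': 'v',  # Cyrillic ve
    '\u0430': 'a',  # Cyrillic a
    '\u0442': 't',  # Cyrillic te
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}
EOW = '</w>'  # end-of-word marker, left untouched by the mapping

def latinize(text):
    """Map Cyrillic characters to Latin ones before learning joint BPE."""
    return ''.join(CYR_TO_LAT.get(ch, ch) for ch in text)

def symbol_to_cyrillic(symbol):
    """Map one BPE symbol back to Cyrillic, preserving the end-of-word marker."""
    suffix = EOW if symbol.endswith(EOW) else ''
    core = symbol[:len(symbol) - len(suffix)]
    return ''.join(LAT_TO_CYR.get(ch, ch) for ch in core) + suffix

def merge_to_cyrillic(merge):
    """Map a Latin merge operation (pair of symbols) back to Cyrillic."""
    return tuple(symbol_to_cyrillic(symbol) for symbol in merge)

word = '\u043c\u043e\u0441\u043a\u0432\u0430'  # "moskva" written in Cyrillic
latin_word = latinize(word)
cyrillic_merge = merge_to_cyrillic(('mo', 's'))
```

A merge learned on the latinized text, such as `('mo', 's')`, then has a Cyrillic counterpart that can be applied directly to untransliterated Russian text.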

Translation results are shown in Table 6. As for the other language pairs, we observe strong improvements from synthetic training data (4–4.4 Bleu). Ensembles yield another 1.1–1.7 Bleu.

Shared Task Results

Table 7 shows the ranking of our submitted systems at the WMT16 shared news translation task. Our submissions are ranked (tied) first for 5 out of 8 translation directions in which we participated: EN $\leftrightarrow$ CS, EN $\leftrightarrow$ DE, and EN $\to$ RO. They are also the (tied) best constrained system for EN $\to$ RU and RO $\to$ EN, or 7 out of 8 translation directions in total.

Our models are also used in QT21-HimL-SysComb [Peter et al., 2016], ranked 1–2 for EN $\to$ RO, and in AMU-UEDIN [Junczys-Dowmunt et al., 2016], ranked 2–3 for EN $\to$ RU, and 1–2 for RU $\to$ EN.

Conclusion

We describe Edinburgh’s neural machine translation systems for the WMT16 shared news translation task. For all translation directions, we observe large improvements in translation quality from using synthetic parallel training data, obtained by back-translating in-domain monolingual target-side data. Pervasive dropout on all layers was used for English $\leftrightarrow$ Romanian, and gave substantial improvements. For English $\leftrightarrow$ German and English $\to$ Czech, we trained a right-to-left model with reversed target side, and we found reranking the system output with these reversed models helpful.

Acknowledgments

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC) and 644402 (HimL).