Facebook FAIR's WMT19 News Translation Task Submission

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov

Introduction

We participate in the WMT19 shared news translation task in two language pairs and four language directions: English $\rightarrow$ German (En $\rightarrow$ De), German $\rightarrow$ English (De $\rightarrow$ En), English $\rightarrow$ Russian (En $\rightarrow$ Ru), and Russian $\rightarrow$ English (Ru $\rightarrow$ En). Our methods build on the techniques used in our submission from last year (Edunov et al., 2018), including subword models (Sennrich et al., 2016), large-scale back-translation, and model ensembling. We train all models using the fairseq sequence modeling toolkit (Ott et al., 2019). Although document level context for En $\rightarrow$ De is now available, all our systems are pure sentence level systems. In the future, we expect better results from leveraging this additional context information.

Compared to our WMT18 submission, we additionally compete in the En $\leftrightarrow$ Ru and De $\rightarrow$ En translation directions. Although all four directions are considered high resource settings where large amounts of bitext data are available, we demonstrate that leveraging high quality monolingual data through back-translation is still very important. For all language directions, we back-translate the Newscrawl dataset using a reverse direction bitext system. In addition to back-translating the relatively clean Newscrawl dataset, we also experiment with back-translating portions of the much larger and noisier Commoncrawl dataset. For our final models, we apply a domain-specific fine-tuning process and decode using noisy channel model reranking (Anonymous, 2019).

Compared to our WMT18 submission in the En $\rightarrow$ De direction, we observe substantial improvements of 4.5 BLEU. Some of these gains can be attributed to differences in dataset quality, but we believe most of the improvement comes from larger models, larger scale back-translation, and noisy channel model reranking with strong channel and language models.

Data

For the En $\leftrightarrow$ De language pair we use all available bitext data including the bicleaner version of Paracrawl. For our monolingual data we use English and German Newscrawl. Although our language models were trained on document level data, we did not use document level boundaries in our final decoding step, so all our systems are purely sentence level systems.

For the En $\leftrightarrow$ Ru language pair we also use all available bitext data. For our monolingual data we use English and Russian Newscrawl as well as a filtered portion of Russian Commoncrawl. We choose to use Russian Commoncrawl to augment our monolingual data due to the relatively small size of Russian Newscrawl compared to English and German.

Similar to last year’s submission for En $\rightarrow$ De, we normalize punctuation and tokenize all data with the Moses tokenizer (Koehn et al., 2007). For En $\leftrightarrow$ De we use joint byte pair encodings (BPE) with 32K split operations for subword segmentation (Sennrich et al., 2016). For En $\leftrightarrow$ Ru, we learn separate BPE encodings with 24K split operations for each language. Systems trained with this separate BPE encoding performed significantly better than those trained with joint BPE.

2 Data Filtering

Large datasets crawled from the internet are naturally very noisy and can potentially decrease the performance of a system if they are used in their raw form. Cleaning these datasets is an important step to achieving good performance on any downstream tasks.

We apply language identification filtering (langid; Lui et al., 2012), keeping only sentence pairs with correct languages on both sides. Although langid is not the most accurate method of language identification (Joulin et al., 2016), a useful side effect of using it is the removal of very noisy sentences consisting mostly of garbage tokens, which are classified incorrectly and filtered out.

We also remove sentences longer than 250 tokens as well as sentence pairs with a source/target length ratio exceeding 1.5. In total, we filter out about 30% of the original bitext data. See Table 1 for details on the bitext dataset sizes.
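The three bitext filters above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: `toy_identify` is a hypothetical stand-in for a language identifier (a real one, such as `langid.classify` from langid.py, which returns a language code and a score, could be wrapped to fit the same signature), whitespace splitting stands in for proper tokenization, and the symmetric treatment of the length ratio is an assumption.

```python
from typing import Callable, Iterable, Iterator, Tuple

MAX_TOKENS = 250      # length limit stated in the text
MAX_LEN_RATIO = 1.5   # source/target length ratio limit stated in the text


def toy_identify(text: str) -> str:
    """Hypothetical stand-in for a language identifier: returns "de" or "en"."""
    german_markers = {"der", "die", "das", "und", "ist"}
    return "de" if german_markers & set(text.lower().split()) else "en"


def keep_pair(src: str, tgt: str, src_lang: str, tgt_lang: str,
              identify: Callable[[str], str] = toy_identify) -> bool:
    """Return True if a sentence pair passes the length, ratio and language filters."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if n_src > MAX_TOKENS or n_tgt > MAX_TOKENS:
        return False
    # Assumption: the ratio is checked symmetrically (longer side / shorter side).
    if max(n_src, n_tgt) / min(n_src, n_tgt) > MAX_LEN_RATIO:
        return False
    return identify(src) == src_lang and identify(tgt) == tgt_lang


def filter_bitext(pairs: Iterable[Tuple[str, str]], src_lang: str, tgt_lang: str,
                  identify: Callable[[str], str] = toy_identify
                  ) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs that pass every filter."""
    for src, tgt in pairs:
        if keep_pair(src, tgt, src_lang, tgt_lang, identify):
            yield src, tgt


pairs = [
    ("the house is small", "das Haus ist klein"),                # passes all filters
    ("hello", "das ist ein sehr langer Satz und er ist lang"),   # length ratio too high
    ("the house", "the house"),                                  # wrong target language
]
kept = list(filter_bitext(pairs, "en", "de"))
print(kept)
```

Passing the identifier as a callable keeps the length and ratio checks independent of any particular language identification library.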

2.2 Monolingual

For monolingual Newscrawl data we also apply langid filtering. Since the monolingual Newscrawl corpus for Russian is significantly smaller than that of German or English, we augment our monolingual Russian data with data from the Commoncrawl corpus. Commoncrawl is the largest monolingual corpus available for training but is also very noisy. In order to select a limited amount of high quality, in-domain sentences from the larger corpus, we adopt the method of Moore and Lewis (2010) for selecting in-domain data (§3.2.1).

System Overview

Our base system is based on the big Transformer architecture Vaswani et al. (2017) as implemented in fairseq. We experiment with increasing network capacity by increasing embed dimension, FFN size, number of heads, and number of layers. We find that using a larger FFN size (8192) gives a reasonable improvement in performance while maintaining a manageable network size. All subsequent models, including ensembles, use this larger FFN Transformer architecture.
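To give a rough sense of what the larger FFN size costs in parameters, the following sketch counts the weights and biases of the feed-forward blocks only. It assumes the standard big Transformer dimensions of Vaswani et al. (2017) (embedding dimension 1024, FFN size 4096, six encoder and six decoder layers); the layer count of the final models is not stated here, so the total is illustrative.

```python
def ffn_params(embed_dim: int, ffn_dim: int) -> int:
    """Weights and biases of the two linear layers in one feed-forward block."""
    first_layer = embed_dim * ffn_dim + ffn_dim      # embed_dim -> ffn_dim
    second_layer = ffn_dim * embed_dim + embed_dim   # ffn_dim -> embed_dim
    return first_layer + second_layer


EMBED_DIM = 1024   # big Transformer embedding dimension
NUM_LAYERS = 12    # assumption: 6 encoder + 6 decoder layers, one FFN block each

base_total = NUM_LAYERS * ffn_params(EMBED_DIM, 4096)
large_total = NUM_LAYERS * ffn_params(EMBED_DIM, 8192)
extra = large_total - base_total

print(f"FFN parameters with FFN size 4096: {base_total:,}")
print(f"FFN parameters with FFN size 8192: {large_total:,}")
print(f"additional parameters:             {extra:,}")
```

Under these assumptions, doubling the FFN size adds roughly 100M parameters, all of them in the feed-forward blocks; attention and embedding parameters are unchanged.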

We trained all our models using fairseq (Ott et al., 2019) on 128 Volta GPUs, following the setup described in Ott et al. (2018).

2 Large-scale Back-translation

Back-translation is an effective and commonly used data augmentation technique to incorporate monolingual data into a translation system. Back-translation first trains an intermediate target-to-source system that is used to translate monolingual target data into additional synthetic parallel data. This data is used in conjunction with human translated bitext data to train the desired source-to-target system.

In this work we used back-translations obtained by sampling (Edunov et al., 2018) from an ensemble of three target-to-source models. We found that models trained on data back-translated using an ensemble instead of a single model performed better (Table 2). Previous work also found that upsampling the bitext data can improve back-translation (Edunov et al., 2018). We adopt this method to tune the amount of bitext and synthetic data the model is trained on. We find a ratio of 1:1 synthetic to bitext data to perform the best.
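One simple way to reach a target synthetic-to-bitext ratio is to repeat the bitext an integer number of times. The sketch below illustrates this idea; the function names and corpus sizes are hypothetical, and the exact mechanism used to realize the 1:1 ratio in training is not specified above, so this is an assumption.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def bitext_upsample_factor(n_bitext: int, n_synthetic: int,
                           synthetic_per_bitext: float = 1.0) -> int:
    """Number of times to repeat the bitext so that synthetic : bitext
    is as close as possible to synthetic_per_bitext : 1."""
    if n_bitext <= 0 or n_synthetic < 0 or synthetic_per_bitext <= 0:
        raise ValueError("corpus sizes and ratio must be positive")
    target_bitext = n_synthetic / synthetic_per_bitext
    return max(1, round(target_bitext / n_bitext))


def build_training_mix(bitext: List[Pair], synthetic: List[Pair],
                       synthetic_per_bitext: float = 1.0) -> List[Pair]:
    """Concatenate the upsampled bitext with the back-translated data."""
    factor = bitext_upsample_factor(len(bitext), len(synthetic), synthetic_per_bitext)
    return bitext * factor + synthetic


# Hypothetical sizes: 2 bitext pairs and 6 back-translated pairs.
bitext = [("ein Haus", "a house"), ("ein Hund", "a dog")]
synthetic = [(f"synthetische Quelle {i}", f"real target {i}") for i in range(6)]
mix = build_training_mix(bitext, synthetic)
print(len(mix))
```

In practice the same effect could be obtained by sampling from the two corpora with fixed probabilities instead of materializing repeated copies.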

The amount of monolingual Russian data available in the Newscrawl dataset is significantly smaller than that of English and German (Table 3). In order to increase the amount of monolingual Russian data for back-translation, we experiment with incorporating Commoncrawl data. Commoncrawl is a much larger and noisier dataset compared to Newscrawl, and is also non-domain specific. We experiment with methods to identify a subset of Commoncrawl that is most similar to Newscrawl. Specifically, we use the in-domain filtering method described in Moore and Lewis (2010).

Given an in-domain corpus $I$, in this case Newscrawl, and a non-domain-specific corpus $N$, in this case Commoncrawl, we would like to find the subcorpus $N_{I}$ that is drawn from the same distribution as $I$. Using Bayes’ rule, we can calculate the probability that a sentence $s$ in $N$ is drawn from $N_{I}$:

$$P(N_{I}|s,N)=\frac{P(s|N_{I})\,P(N_{I}|N)}{P(s|N)}$$

We ignore the $P(N_{I}|N)$ term, since it will be constant for any given $I$ and $N$ , and use $P(s|I)$ instead of $P(s|N_{I})$ , since $I$ and $N_{I}$ are drawn from the same distribution. Moving into the log domain, we can calculate the probability score for a sentence $s$ by $\log P(N_{I}|s,N)=\log P(s|I)-\log P(s|N)$ , or after normalizing for length, $H_{I}(s)-H_{N}(s)$ , where $H_{I}(s)$ and $H_{N}(s)$ are the word-normalized cross entropy scores for a sentence $s$ according to language models $L_{I}$ and $L_{N}$ trained on $I$ and $N$ respectively.

Our corpora are very large and we therefore use an $n$ -gram model (Heafield, 2011) rather than a neural language model which would be much slower to train and evaluate. We train two language models $L_{I}$ and $L_{N}$ on Newscrawl and Commoncrawl respectively, then score every sentence $s$ in Commoncrawl by $H_{I}(s)-H_{N}(s)$ . We select a cutoff of $0.01$ , and use all sentences that score higher than this value for back-translation, or about 5% of the entire dataset.
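The selection step can be illustrated with a toy example. The sketch below replaces the $n$-gram models with a hypothetical add-one smoothed unigram model (`UnigramLM`) so that it stays self-contained, and the tiny corpora are invented for illustration. As an assumption about sign convention, the score is computed as the length-normalized difference $\log P(s|I)-\log P(s|N)$, so that higher scores indicate sentences that look more like the in-domain corpus.

```python
import math
from collections import Counter
from typing import Iterable, List, Set


class UnigramLM:
    """Add-one smoothed unigram model; a toy stand-in for an n-gram LM."""

    def __init__(self, sentences: Iterable[str], vocab: Set[str]):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def avg_logprob(self, sentence: str) -> float:
        """Word-normalized log-probability of a sentence."""
        toks = sentence.split()
        if not toks:
            return float("-inf")
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[t] + 1) / denom) for t in toks) / len(toks)


def select_in_domain(candidates: Iterable[str], lm_in: UnigramLM,
                     lm_general: UnigramLM, cutoff: float) -> List[str]:
    """Keep sentences whose normalized log P(s|I) - log P(s|N) exceeds the cutoff."""
    return [s for s in candidates
            if lm_in.avg_logprob(s) - lm_general.avg_logprob(s) > cutoff]


# Invented toy corpora standing in for Newscrawl (I) and Commoncrawl (N).
in_domain = ["the minister said the election", "the government said"]
general = ["buy cheap shoes online", "click here to buy shoes", "the weather"]
vocab = {tok for s in in_domain + general for tok in s.split()}

lm_i = UnigramLM(in_domain, vocab)
lm_n = UnigramLM(general, vocab)
selected = select_in_domain(["the minister said", "buy cheap shoes"],
                            lm_i, lm_n, cutoff=0.01)
print(selected)
```

Only the scoring and thresholding logic carries over to the real setting; the language models themselves would be trained with an $n$-gram toolkit on the full corpora.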

3 Fine-tuning

Fine-tuning with domain-specific data is a common and effective method to improve translation quality for a downstream task. After completing training on the bitext and back-translated data, we train for an additional epoch on a smaller in-domain corpus. For De $\rightarrow$ En, we fine-tune on test sets from previous years, including newstest2012, newstest2013, newstest2015, and newstest2017. For En $\rightarrow$ De, we fine-tune on previous test sets as well as the News-Commentary dataset. For En $\leftrightarrow$ Ru we fine-tune on a combination of News-Commentary, newstest2013, newstest2015, and newstest2017. The other test sets are held out for other tuning procedures and evaluation metrics.

4 Noisy Channel Model Reranking

$N$ -best reranking is a method of improving translation quality by scoring and selecting a candidate hypothesis from a list of $n$ -best hypotheses generated by a source-to-target, or forward model. For our submissions, we rerank using a noisy channel model approach.

Given a target sequence $y$ and a source sequence $x$, the noisy channel approach applies Bayes’ rule to model

$$P(y|x)=\frac{P(x|y)\,P(y)}{P(x)}$$

Since $P(x)$ is constant for a given source sequence $x$, we can ignore it. We refer to the remaining terms $P(y|x)$, $P(x|y)$, and $P(y)$ as the forward model, channel model, and language model respectively. In order to combine these scores for reranking, we calculate for every one of our $n$-best hypotheses:

$$\log P(y|x)+\lambda_{1}\log P(x|y)+\lambda_{2}\log P(y)$$

The weights $\lambda_{1}$ and $\lambda_{2}$ are determined by tuning them with a random search on a validation set and selecting the weights that give the best performance. In addition, we also tune a length penalty.

For all translation directions, our forward models are ensembles of fine-tuned and back-translated models. Since we compete in both directions for both language pairs, for any given translation direction we can use the forward model for the reverse direction as the channel model. Our language models for each of the target languages English, German, and Russian, are big Transformer decoder models with FFN 8192. We train the language models on the monolingual Newscrawl dataset, and use document level context for the English and German models. Perplexity scores for the language models on the bolded target language of each translation direction are shown in Table 4. With a smaller amount of monolingual Russian data available, we observe that our Russian language model performs worse than the German and English language models.

To select the length penalty and weights, $\lambda_{1}$ and $\lambda_{2}$ , for decoding, we use random search, choosing values in the range $[0,2)$ for the weights and values in the range $[0,1)$ for the length penalty. For all language directions, we choose the weights that give the highest BLEU score on a combined dataset of newstest2014 and newstest2016.
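The random search over weights and length penalty can be sketched as below. The `evaluate` callable is an assumption: in a real setting it would rerank the validation $n$-best lists with the given parameters and return BLEU. Here a hypothetical `toy_bleu` objective is used so the example runs on its own.

```python
import random
from typing import Callable, Tuple

Params = Tuple[float, float, float]  # (lambda1, lambda2, length_penalty)


def random_search(evaluate: Callable[[float, float, float], float],
                  trials: int = 100, seed: int = 0) -> Tuple[float, Params]:
    """Sample parameters uniformly and keep the best-scoring combination."""
    if trials < 1:
        raise ValueError("trials must be at least 1")
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), (0.0, 0.0, 0.0)
    for _ in range(trials):
        lam1 = 2.0 * rng.random()       # weight in [0, 2)
        lam2 = 2.0 * rng.random()       # weight in [0, 2)
        length_penalty = rng.random()   # length penalty in [0, 1)
        score = evaluate(lam1, lam2, length_penalty)
        if score > best_score:
            best_score, best_params = score, (lam1, lam2, length_penalty)
    return best_score, best_params


def toy_bleu(lam1: float, lam2: float, length_penalty: float) -> float:
    """Hypothetical objective peaking at (1.0, 0.5, 0.3); not a real BLEU score."""
    return -((lam1 - 1.0) ** 2 + (lam2 - 0.5) ** 2 + (length_penalty - 0.3) ** 2)


best_score, best_params = random_search(toy_bleu, trials=200, seed=0)
print(best_score, best_params)
```

Fixing the seed makes the search reproducible, which is convenient when comparing tuning runs across translation directions.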

To run our final decoding step, we first use the forward model with beam size $50$ to generate an $n$ -best list. We then use the channel and language models to score each of these hypotheses, using the weights and length penalty tuned previously. Finally, we select the hypothesis with the highest score as our output.
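The final reranking step can be sketched as follows, assuming the log-probabilities from the forward, channel, and language models have already been computed for each hypothesis. Several details are assumptions made for illustration: that $\lambda_{1}$ weights the channel model and $\lambda_{2}$ the language model, and that the length penalty is applied by dividing the combined score by the target length raised to the penalty. The `Hypothesis` type and the example scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    text: str
    forward_lp: float   # log P(y|x) from the forward model
    channel_lp: float   # log P(x|y) from the channel model
    lm_lp: float        # log P(y) from the language model


def noisy_channel_score(h: Hypothesis, lam1: float, lam2: float,
                        length_penalty: float) -> float:
    """Weighted combination of the three model scores for one hypothesis."""
    combined = h.forward_lp + lam1 * h.channel_lp + lam2 * h.lm_lp
    # Assumption: length normalization by len(y) ** length_penalty.
    length = max(1, len(h.text.split()))
    return combined / (length ** length_penalty)


def rerank(nbest: List[Hypothesis], lam1: float, lam2: float,
           length_penalty: float) -> Hypothesis:
    """Return the hypothesis with the highest combined score."""
    if not nbest:
        raise ValueError("n-best list is empty")
    return max(nbest, key=lambda h: noisy_channel_score(h, lam1, lam2, length_penalty))


# Hypothetical 2-best list; the forward model alone prefers the first entry.
nbest = [
    Hypothesis("a b c", forward_lp=-1.0, channel_lp=-5.0, lm_lp=-6.0),
    Hypothesis("a b d", forward_lp=-1.2, channel_lp=-2.0, lm_lp=-3.0),
]
forward_only = rerank(nbest, lam1=0.0, lam2=0.0, length_penalty=0.0)
reranked = rerank(nbest, lam1=1.0, lam2=1.0, length_penalty=0.0)
print(forward_only.text, "->", reranked.text)
```

The example shows the intended effect: a hypothesis that the forward model ranks second can be selected once the channel and language model scores are taken into account.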

5 Postprocessing

For En $\rightarrow$ De and En $\rightarrow$ Ru, we also change the standard English quotation marks (“ … ”) to German-style quotation marks („ … “).
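A minimal sketch of this quotation mark replacement is shown below. It assumes the input uses the curly English marks U+201C and U+201D and that the German-style marks are the low opening mark U+201E and the high closing mark U+201C; the function name is hypothetical.

```python
# Map English curly quotes to German-style quotes in a single pass, so that
# the newly produced closing mark (U+201C) is not rewritten a second time.
QUOTE_MAP = str.maketrans({
    "\u201c": "\u201e",  # English opening quote -> German opening quote
    "\u201d": "\u201c",  # English closing quote -> German closing quote
})


def germanize_quotes(text: str) -> str:
    """Replace English-style curly quotation marks with German-style ones."""
    return text.translate(QUOTE_MAP)


example = germanize_quotes("Er sagte: \u201cHallo\u201d")
print(example)
```

Using `str.translate` rather than two chained `replace` calls avoids the ordering problem that arises because the German closing mark is the same character as the English opening mark.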

Results

Results and ablations for En $\rightarrow$ De are shown in Table 5, De $\rightarrow$ En in Table 6, En $\rightarrow$ Ru in Table 7 and Ru $\rightarrow$ En in Table 8. We report case-sensitive BLEU scores computed with SacreBLEU (Post, 2018), using international tokenization for En $\rightarrow$ Ru. In the final row of each table we also report the case-sensitive BLEU score of our submitted system on this year’s test set. All single models and individual models within ensembles are averages of the last $10$ checkpoints of training. Our baseline systems are big Transformers as described in Vaswani et al. (2017). The baselines were trained with minimally filtered data, removing only sentences longer than 250 words and sentence pairs exceeding a source/target length ratio of $1.5$. This setup gave us a reasonable baseline to evaluate data filtering.

SacreBLEU signatures: BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt{17/18}+tok.13a+version.1.2.11, BLEU+case.mixed+lang.de-en+numrefs.1+smooth.exp+test.wmt{17/18}+tok.13a+version.1.2.11, BLEU+case.mixed+lang.ru-en+numrefs.1+smooth.exp+test.wmt{17/18}+tok.13a+version.1.2.11, BLEU+case.mixed+lang.en-ru+numrefs.1+smooth.exp+test.wmt{17/18}+tok.intl+version.1.2.11.
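The checkpoint averaging mentioned above amounts to an element-wise mean over the parameters of the most recent checkpoints. The sketch below illustrates this with plain Python lists standing in for parameter tensors; the checkpoint representation and function names are assumptions and do not reflect the fairseq checkpoint format.

```python
from typing import Dict, List, Sequence

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened values


def average_checkpoints(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Element-wise mean of every parameter across the given checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return averaged


def average_last(checkpoints: Sequence[Checkpoint], n: int = 10) -> Checkpoint:
    """Average only the last n checkpoints of a training run."""
    return average_checkpoints(checkpoints[-n:])


# Twelve hypothetical checkpoints with a single two-element parameter.
history = [{"w": [float(i), 2.0 * i]} for i in range(12)]
final = average_last(history, n=10)
print(final)
```

Only checkpoints 2 through 11 contribute in this example, mirroring the use of the last 10 checkpoints of training.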

For En $\rightarrow$ De, langid filtering, larger FFN, and ensembling improve our baseline performance on newstest2018 by about 1.5 BLEU. Note that our best bitext-only system already outperforms our system from last year by 1 BLEU point. This is perhaps due to the addition of higher quality bitext data and improved data filtering techniques. The addition of back-translated (BT) data improves single model performance by only 0.3 BLEU, but combining this with fine-tuning and ensembling gives us a total of 3 BLEU. Finally, applying reranking on top of these strong ensembled systems gives another 1.4 BLEU.

2 German $\rightarrow$ English

For De $\rightarrow$ En, as with En $\rightarrow$ De, we see similar improvements with langid filtering, larger FFN, and ensembling on the order of 1.4 BLEU. Compared to En $\rightarrow$ De however, we also observe that the addition of back-translated data is much more significant, improving single model performance by over 2.5 BLEU. Fine-tuning, ensembling, and reranking add an additional 2.4 BLEU, with reranking contributing 1.5 BLEU, a majority of the improvement.

3 English $\rightarrow$ Russian

For En $\rightarrow$ Ru, we observe large improvements of 2.4 BLEU over a bitext-only model after applying langid filtering, larger FFN, and ensembling. Since we start with a lower quality initial En $\leftrightarrow$ Ru bitext dataset, we observe a large improvement of 3.5 BLEU by adding back-translated data. Augmenting this back-translated data with Commoncrawl adds an additional 0.2 BLEU. Finally, applying fine-tuning, ensembling, and reranking adds 2.2 BLEU, with reranking contributing 1 BLEU.

4 Russian $\rightarrow$ English

For Ru $\rightarrow$ En, we observe similar trends to En $\leftrightarrow$ De, with langid filtering, larger FFN, and ensembling improving performance of a bitext-only system by 1.6 BLEU. Back-translation adds 3 BLEU, again most likely due to the lower quality bitext data available. Fine-tuning, ensembling, and reranking add almost 4 BLEU, with reranking contributing 1.2 BLEU.

5 Reranking

For every language direction, reranking gives a significant improvement, even when applied on top of an ensemble of very strong back-translated models. We also observe that the biggest improvement of 1.5 BLEU comes in the De $\rightarrow$ En language direction, and the smallest improvement of 1 BLEU in the En $\rightarrow$ Ru direction. This is perhaps due to the relatively weak Russian language model, which is trained on significantly less data compared to English and German. Improving our language models may lead to even greater improvements with reranking.

6 Human Evaluations

All our systems participated in the human evaluation campaign of WMT’19. For different systems, different styles of evaluations were used. All our systems except Ru $\rightarrow$ En were evaluated with document level context and had a document level rating collected. Source based direct assessment was used for systems translating from English, and target based direct assessment was used for systems translating to English. See Table 9 for more details.

Facebook-FAIR was ranked first in all four language directions we compete in. Table 10 shows that our En $\rightarrow$ De submission significantly outperforms other systems as well as human translations. Our submissions for De $\rightarrow$ En, En $\rightarrow$ Ru and Ru $\rightarrow$ En also achieve the highest score.

Although our systems are pure sentence-level models, they performed well irrespective of whether the evaluation method used document context or not. For document level rankings, our En $\rightarrow$ De system also ranked first and significantly outperformed human translations. Our En $\rightarrow$ Ru submission achieved the highest score among all submissions and is tied for first place with human translations. The De $\rightarrow$ En system achieved the second highest score among constrained systems. See Bojar et al. (2019) for details.

Conclusions

This paper describes Facebook FAIR’s submission to the WMT19 news translation task. For all four translation directions, En $\leftrightarrow$ De and En $\leftrightarrow$ Ru, we use the same strategy of filtering bitext data, performing sampling-based back-translation on monolingual data, then training strong individual models on a combination of this data. Each of these models is fine-tuned and ensembled into a final system that is used for decoding with noisy channel model reranking. We demonstrate the effectiveness of our noisy channel-based reranking approach even when applied on top of very strong systems, and rank first in all four directions of the human evaluation campaign.