Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation

Nikolay Bogoychev, Rico Sennrich

Introduction

The quality of neural machine translation can be improved by leveraging additional monolingual resources in various ways (Sennrich et al., 2016b; Zhang and Zong, 2016; Gulcehre et al., 2017; Ramachandran et al., 2017; Freitag et al., 2019). Among these, back-translation is the most widely used technique in shared translation tasks (Barrault et al., 2019, p. 15), and it has been reported to outperform self-training with forward translation (Burlot and Yvon, 2018). However, in the past year, attention was drawn to the fact that standard test sets are often shared between translation directions and thus contain both portions where the original text is on the source side (original) and portions where the original text is used as the reference translation, with the source text being a human translation (reverse) (see Figure 1). This mix of “original” and “reverse” portions in test sets heavily affects empirical results for back-translation. When augmenting the model with back-translation, improvements in BLEU (Papineni et al., 2002) are much more evident if the sentence was translated in the reverse direction, that is, with a naturally produced reference and a human translation on the source side (Edunov et al., 2019). Freitag et al. (2019) explore automatic post-editing (APE), which heavily relies on synthetic training data, and find that there is a loss in BLEU score on the original portion, despite humans perceiving an improvement in translation quality. Zhang and Toral (2019) show that the ranking of submissions to the news translation task changes when evaluating only the portion with original sources, or only that with translationese sources. Interestingly, systems that rely heavily on large-scale back-translation, such as that by Edunov et al. (2019), are more dominant on the reverse portion.

We focus on three factors that we hypothesise play a large role in explaining the observed differences in effectiveness between forward and back-translation, and between performance on the original and reverse portion of standard test sets: differences in language style between naturally produced text and translationese text; differences in the domains between source-side and target-side monolingual texts (here, we use domain in a broad sense to refer to various textual attributes such as subject, genre, and topics); and differences in how noise in the synthetic data, specifically translation errors due to the varying quality of the MT systems producing it, will affect the final system, depending on whether it is on the source side (back-translation) or the target side (forward translation). We perform the following experiments to verify our claims:

We show that when the test sets are split according to original language, forward translation is generally better than back-translation in terms of BLEU on the original portion, complementing the findings of Edunov et al. (2019), who find that back-translation is better at improving BLEU on the reverse portion.

We perform human evaluation on a subset of our translations, comparing a baseline system, one augmented with back-translation and one augmented with forward translation. We see that despite the huge discrepancies in BLEU, humans judge adequacy to be very similar across all systems, especially on the original portion of the dataset. Humans, however, tend to rate the fluency of the back-translation system much higher than that of the other two systems.

We perform language model experiments where we contrast language style and language domain and evaluate on the test sets.

We show that original and translationese French are sufficiently different to be reliably distinguished by a neural network at the document level.

We explore the effectiveness of forward and back-translation in a scenario where the quality of the synthetic data produced is poor, and find that forward translation is more sensitive to the quality of the initial translation system than back-translation.

Background

Statistical machine translation relies on the noisy channel model, which makes large-scale language models, and hence extensive monolingual target-language data, very valuable (e.g. Brants et al., 2007). In neural machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), however, it is not immediately clear how to make use of monolingual target-language resources. This led to the development of different methods such as language model fusion (Gulcehre et al., 2017), language model pretraining (Ramachandran et al., 2017), and back-translation (Sennrich et al., 2016b), but also to the exploration of methods to incorporate source-language data via forward translation (Zhang and Zong, 2016). Out of these, back-translation is the most widely used (see Barrault et al., 2019), and has been reported to work better than forward translation in particular (Burlot and Yvon, 2018).

Given a translation task $L_{1}\rightarrow L_{2}$ , where large-scale monolingual $L_{2}$ data is available, back-translation refers to training a translation model $L_{2}\rightarrow L_{1}$ and using it to translate the $L_{2}$ data into $L_{1}$ , creating a synthetic parallel corpus that can be added to the true bilingual data for the purpose of training an $L_{1}\rightarrow L_{2}$ model.

While this technique was first explored for statistical machine translation (Bertoldi and Federico, 2009; Lambert et al., 2011; Bojar and Tamchyna, 2011), it has a different effect on training in neural machine translation, where it was found to be much more effective, particularly in low-resource scenarios (Sennrich et al., 2016b, a). However, it is not entirely clear what causes the large improvement in translation quality. Previous work has analysed increases in fluency when training on back-translated data (e.g. Sennrich et al., 2016b; Edunov et al., 2019), and domain adaptation effects (e.g. Sennrich et al., 2016b; Chinea-Ríos et al., 2017a), which can be attributed to the target-side data, but the properties of synthetic source sentences have also been investigated. Burlot and Yvon (2018) have found that automatic translations tend to be more monotonic and simpler than natural parallel data, which could make learning easier, but these biases also make the training distribution less similar to natural input. While there is some evidence that the quality of the back-translation system matters (Burlot and Yvon, 2018), models are relatively robust to noise, and Edunov et al. (2018) even find that they obtain better models when using sampling rather than standard beam search for back-translation, or when explicitly adding noise, even if this reduces the quality of the back-translations. Caswell et al. (2019) argue that if the model is given means to distinguish real from synthetic parallel data, either via noise or more simply via a special tag, it can avoid learning detrimental biases from synthetic training data.

Forward translation

Given a translation task $L_{1}\rightarrow L_{2}$ , where large-scale monolingual $L_{1}$ data is present, forward translation refers to training a translation model $L_{1}\rightarrow L_{2}$ and using it to translate the $L_{1}$ data into $L_{2}$ , creating a synthetic parallel corpus that can be added to the true bilingual data for the purpose of training an improved $L_{1}\rightarrow L_{2}$ model.
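The two augmentation schemes defined above differ only in which side of each synthetic pair is machine-generated. The following is a minimal sketch of that difference, not the pipeline used in the experiments: the function names are illustrative, and the two `toy_*` functions are hypothetical stand-ins for trained translation systems.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence in L1, target sentence in L2)


def back_translate(mono_l2: List[str], l2_to_l1: Callable[[str], str]) -> List[Pair]:
    """Back-translation: synthetic source, naturally produced target."""
    return [(l2_to_l1(target), target) for target in mono_l2]


def forward_translate(mono_l1: List[str], l1_to_l2: Callable[[str], str]) -> List[Pair]:
    """Forward translation: naturally produced source, synthetic target."""
    return [(source, l1_to_l2(source)) for source in mono_l1]


def augment(parallel: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Add synthetic pairs to the true bilingual data."""
    return list(parallel) + list(synthetic)


# Hypothetical stand-ins for trained MT systems (assumption, for illustration only).
def toy_fr_to_en(sentence: str) -> str:
    return f"<en:{sentence}>"


def toy_en_to_fr(sentence: str) -> str:
    return f"<fr:{sentence}>"


parallel = [("bonjour", "hello")]
bt_pairs = back_translate(["good evening"], toy_en_to_fr)
fwd_pairs = forward_translate(["bonsoir"], toy_fr_to_en)
training_data = augment(parallel, bt_pairs + fwd_pairs)
```

In the back-translated pair the reference side is untouched natural text, whereas in the forward-translated pair any error made by the initial system ends up in the training target.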

Self-training with forward translation was also pioneered in statistical machine translation (Ueffing et al., 2007), but attracted new interest in neural machine translation, where improvements in BLEU were demonstrated (Zhang and Zong, 2016; Chinea-Ríos et al., 2017b). Compared to back-translation, biases and errors in synthetic data are intuitively more problematic in forward translation since they directly affect the gold labels. Also, there is no clear theoretical link between forward-translated synthetic training data and a model’s fluency, but other effects, such as domain adaptation and improved learnability of translation from synthetic data, remain plausible. (Also consider the effectiveness of sequence-level knowledge distillation (Kim and Rush, 2016), which is similar to forward translation, except that the source side of the parallel training data is re-translated, while we focus on integrating additional monolingual data.)

Burlot and Yvon (2018) perform a systematic study which shows that forward translation leads to some improvements in translation quality, but not nearly as much as back-translation. In very recent work, Wu et al. (2019) show large-scale experiments where a combination of synthetic data produced by both forward and backward translation delivers superior results to just using one or the other. The amount of research on forward translation is however significantly smaller than that on back-translation.

Domains and Translationese

Based on these studies, we consider how the original and reverse portion of standard test sets differ, and how this can partially explain the observed differences between forward and back-translation.

It has previously been shown that back-translation can be used for domain adaptation (Sennrich et al., 2016b; Chinea-Ríos et al., 2017a), and the effectiveness of back-translation and forward translation heavily depends on the availability of relevant, in-domain monolingual data. Even if we have both source-side and target-side data from the same general domain, we believe that there can be subtle differences between them. Even in restricted domain tasks, such as WMT news translation (Barrault et al., 2019), newspaper articles in different languages talk about different topics. (Obviously, there will also be differences between newspapers in the same language, but we expect that a large-scale corpus from the same language will better match topics at test time than one from another language.) For example, French news articles cover subjects of local interest, such as the Quebec local elections. On the other hand, English-language news in WMT test sets mostly covers American or international topics. Therefore, back-translation, which is based on target-side data, implicitly adapts systems to this target-side news domain, while forward translation would adapt systems to the source-side news domain.

Translationese

A second important distinction between the original and reverse portion of test sets comes from their creation, i.e. the process of translation. Human translations show systematic differences to natural text, and this dialect has been termed translationese. Translationese has been extensively studied in the context of natural language processing (Baroni and Bernardini, 2005; He et al., 2016). Translationese texts tend to have a different word distribution than naturally produced text due to interference from the source language (Koppel and Ordan, 2011) and to translation strategies such as simplification and explicitation. While translationese is hard to spot for humans, machine learning methods can reliably identify it (Ilisei et al., 2010; Koppel and Ordan, 2011; Rabinovich and Wintner, 2015).

Translationese and its effect have been studied in the context of statistical machine translation: Kurokawa et al. (2009); Lembersky et al. (2012) observe that systems reach higher BLEU on test sets if the direction of the test set is the same as the direction of the training set, Stymne (2017) show how systems can be tuned specifically to translationese and Riley et al. (2020) even show how BLEU can be gamed by specifically producing translationese. Due to the directional nature of the WMT19 test sets (Barrault et al., 2019), research on translationese in the context of neural machine translation has been revitalized (Freitag et al., 2019; Edunov et al., 2019; Zhang and Toral, 2019; Graham et al., 2019; Bizzoni et al., 2020).

One of our goals in this paper is to isolate domain effects and translationese effects in the analysis of synthetic training corpora.

Experimental setup

We used the WMT 15 English-French news translation task dataset (Bojar et al., 2015), consisting of 35.8M parallel sentences. For back and forward translation we used 49.8M English monolingual sentences and 46.1M French monolingual sentences from the respective News Crawl corpora. For training the back-translation and forward translation systems we used both a shallow RNN (Bahdanau et al., 2015), equivalent to the one used by Sennrich et al. (2016a), and a transformer base system (Vaswani et al., 2017). Our shallow RNN was about 1 BLEU better than the transformer on fr-en (used for forward translation), and about 3 BLEU worse than the transformer on en-fr (used for back-translation). For producing the synthetic data we used sampling from the softmax distribution (Edunov et al., 2018). Byte pair encoding (BPE) (Sennrich et al., 2016c) was used to produce a shared vocabulary of 88k tokens.
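To make the decoding choice concrete, the sketch below contrasts drawing the next token from the softmax distribution with always taking the most probable token. It illustrates a single decoding step only; the logits are made-up values and the helper names are illustrative, not part of any toolkit used in the experiments.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn unnormalised scores into a probability distribution."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], rng: random.Random) -> int:
    """Draw one token id in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(logits: List[float]) -> int:
    """Pick the single most probable token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


rng = random.Random(0)
step_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
samples = [sample_token(step_logits, rng) for _ in range(1000)]
```

Sampling occasionally selects lower-probability tokens, so the synthetic text is noisier but more varied than output that always follows the highest-scoring choice.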

We used all available datasets from the news translation task and split them by direction, based on the source language, equivalent to the way done by Post (2018), and we evaluated each dataset with all of our models.
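Splitting a test set by direction amounts to grouping documents by the language in which they were originally written. The sketch below assumes each document record carries an `origlang` field and a list of (source, reference) segments; this record layout and the extra "other" bucket for documents originating in a third language are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (source sentence, reference translation)


def split_by_origlang(
    docs: List[dict], source_lang: str, target_lang: str
) -> Dict[str, List[Segment]]:
    """Partition test segments by the original language of their document."""
    portions: Dict[str, List[Segment]] = {"original": [], "reverse": [], "other": []}
    for doc in docs:
        if doc["origlang"] == source_lang:
            key = "original"  # natural source, human-translated reference
        elif doc["origlang"] == target_lang:
            key = "reverse"   # human-translated source, natural reference
        else:
            key = "other"     # written in a third language (assumed bucket)
        portions[key].extend(doc["segments"])
    return portions


# Made-up records for a French-to-English test set.
toy_docs = [
    {"origlang": "fr", "segments": [("fr 1", "en 1"), ("fr 2", "en 2")]},
    {"origlang": "en", "segments": [("fr 3", "en 3")]},
    {"origlang": "de", "segments": [("fr 4", "en 4")]},
]
toy_portions = split_by_origlang(toy_docs, source_lang="fr", target_lang="en")
```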

Translation experiments

We present our experimental results in Table 1. On the original portion, the system augmented with FWD ${}_{\text{rnn}}$ translated data performs best on most test sets. The back-translation system is worse than the baseline on all test sets. Furthermore, it is interesting to observe that the system trained on transformer-produced synthetic data is worse than that trained on RNN-produced synthetic data.

We observe the opposite on the reverse portion: a back-translation system (either BT ${}_{\text{rnn}}$ or BT ${}_{\text{transformer}}$ , with no clear winner between the two, despite a 3 BLEU difference between the quality of the baseline RNN and Transformer) is always the best, and the forward translation systems show no improvement over the baseline.

On the full datasets, the overall trend is that forward translation does not improve the overall translation quality, which is not consistent with previous work (Burlot and Yvon, 2018). We note that systems trained on RNN-produced synthetic data mostly outperform their transformer counterparts. Overall, the back-translation augmented system produces the best BLEU, which is consistent with Burlot and Yvon (2018). It is tempting to conclude that forward translation works better for texts in the original translation direction, but we cannot do so without conducting human evaluation, as BLEU is known to not correspond directly to translation quality, especially for high-quality systems (Ma et al., 2019; Freitag et al., 2020; Mathur et al., 2020). It does seem that forward translation is more sensitive to the quality of the system used to produce the synthetic data.

Human Evaluation

Table 1 shows big discrepancies in the BLEU scores based on the type of synthetic data and directionality of the datasets, but BLEU does not tell the full story. In order to get further insight on the effects of forward and backward translated data, we sampled uniformly 1008 sentences from all the newstest datasets, 504 in the forward direction and 504 in the reverse direction. We recruited 4 native English speakers to evaluate the translations of those sentences with three distinct systems (the baseline, BT ${}_{\text{rnn}}$ , and FWD ${}_{\text{rnn}}$ ). We followed the evaluation scheme of Callison-Burch et al. (2007) where we request our annotators to rate translations in terms of fluency and adequacy on a scale from 1 to 5. Annotators are only shown the three translations for the fluency evaluation; for the adequacy evaluation, they are additionally provided with a reference translation. Rating scales and instructions are shown in the Appendix.

Translations are blinded and given in random order to prevent biases. Each annotator was asked to annotate 377 sentences each for fluency and for adequacy, and the sets for fluency and adequacy are distinct. Among those, 50 sentences appear twice in order to measure intra-annotator agreement, and 100 sentences are common across all annotators in order to measure inter-annotator agreement. We report Cohen’s Kappa scores (Landis and Koch, 1977) for annotator agreement in Table 3. We test statistical significance with three-way $p$ -values computed using the ANOVA test (Heiberger and Neuwirth, 2009). We also report results of the $t$ -test, comparing the FWD and BT systems.
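For readers unfamiliar with the agreement statistic, the sketch below computes unweighted Cohen's Kappa for two sets of ratings over the same items. It assumes the plain two-rater, unweighted form; the text does not specify which variant was used, and the ratings shown are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(ratings_a: Sequence[Hashable], ratings_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's Kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(ratings_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up ratings on a 1-5 scale for five sentences.
toy_kappa = cohens_kappa([1, 2, 3, 3, 2], [1, 2, 3, 2, 2])
```

The same function covers intra-annotator agreement when the two sequences are one annotator's first and second ratings of the repeated sentences.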

Our human evaluation results are presented in Table 2. In terms of adequacy on the original portion of the dataset, we see that all systems perform very similarly, with no significant differences between systems. On the reverse portion of the dataset, back-translation has a slight edge over the baseline, and a more notable edge over the forward translation system, which is consistent with related work. In terms of fluency the results are clearer: the back-translation system clearly produces more fluent output than its competitors, regardless of the translation direction. This finding is consistent with the findings of Edunov et al. (2020), who also show that humans have a preference for back-translation augmented systems due to their more fluent output.

Language model experiments

BLEU scores are insufficient to draw conclusions about the nature of the improvements both data augmentation methods bring. We previously touched upon two hypotheses:

translationese effects: the references in the reverse portion are native-produced text, while those in the original portion may contain translationese artifacts. Training on back-translations may improve language modelling and favour the production of more native-like text, while training on forward translations may bias the MT system towards producing more translationese text.

domain effects: there may be subtle domain differences in the synthetic data sets, mirroring differences between the two portions of the test set.

We designed a language modelling experiment in order to distinguish between the two explanations. Specifically, we measure the similarity between training and test sets by training language models on our training data, and measuring perplexity to variants of the test sets.
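As a reminder of the measure being compared, perplexity is the exponential of the average negative log-probability per token, so a lower value means the language model finds the text more predictable. The sketch below assumes natural-log token probabilities are already available from some language model; the numbers are made up.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def corpus_perplexity(sentences: List[List[float]]) -> float:
    """Token-level perplexity over a whole test set (sentences are pooled)."""
    return perplexity([lp for sentence in sentences for lp in sentence])


# Made-up scores: the same test text under two hypothetical language models.
scores_native_lm = [[math.log(0.5), math.log(0.25)], [math.log(0.5)]]
scores_mt_lm = [[math.log(0.25), math.log(0.25)], [math.log(0.25)]]
closer_lm = (
    "native"
    if corpus_perplexity(scores_native_lm) < corpus_perplexity(scores_mt_lm)
    else "MT"
)
```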

We used a transformer language model architecture with 8 layers and 8 heads, similar to the transformer-base machine translation systems. We used the same preprocessing and BPE settings as our translation experiments. We trained four language models using the data that we had prepared for forward and back-translation: two native English and French language models and two English and French translationese models (we denote the latter two EN ${}_{\text{MT}}$ and FR ${}_{\text{MT}}$ , respectively). The language models computed on the machine translated data exhibit specific features: They are trained on sampled data so we expect below average fluency, but good adaptation to the domain (source-side news or target-side news). Therefore we expect that the native French language model will perform better (i.e. have lower perplexity) on native French text compared to a translationese French language model, as the style and the domain of the native text match with those of the native language model. We expect that we will observe the same effect when evaluating native vs translationese English language model on native English text. When considering translated test sets, we will expect them to be closer to the translationese language models – this is both compatible with the interpretation that the two types of texts are similar because they are both translationese, as well as the interpretation that they are similar because they are from the same source-language domain.

But what if we have native English data that has been human translated into French and then automatically translated into English? In this case it will share the domain with native English, but after the intermediate human translation, we expect the style to be closer to the language model trained on the translationese text. This variant of the test set gives us the most direct answer as to what extent translationese or domain effects affect the similarity between training and test data.

Table 4 shows the language model performance of the native French language model and the translationese French language model. We observe that, unsurprisingly, the language model trained on original French data shows lower perplexity on the original French data than the one trained on machine-translated French. Somewhat surprisingly, the trend is maintained on the translationese French dataset, even if the two perplexity scores are closer to each other. This is unlike the results for the English language models in Table 5, where the language model that performs better is always the one trained on the same original language as the original language of the dataset.

Of most interest are the results for HT ${}_{\text{EN}\to\text{FR}}$ , MT ${}_{\text{FR}\to\text{EN}}$ , i.e. the roundtrip translation of native English text. Based on our hypothesis that source-language and target-language domains are slightly different, we expect the EN ${}_{\text{native}}$ LM to perform better than EN ${}_{\text{MT}}$ . Based on the more established explanation that the main distinguishing feature of translated text is translationese artefacts, we would expect EN ${}_{\text{MT}}$ to perform better than EN ${}_{\text{native}}$ . In fact, perplexities are very close to each other, suggesting that domain effects and translationese effects both come into play, and roughly balance each other out.

Domain identification experiments

Inspired by the work of Caswell et al. (2019) and Marie et al. (2020), who tag back-translated data on the source side to distinguish it from parallel data, we explore whether translation models can learn if training instances come from the source-language or target-language “domain”. To this end, we train a French $\to$ English translation model using only synthetic training data (both forward translations and back-translations), and we add a tag at the beginning of the target sentence indicating the original language. The resulting model correctly identifies the original language in 83% of training set sentences. When evaluating it on test sets, the model has a marked preference to identify the original language as French. On the originally French portion, the model predicts 89.4% of the sentences to be native French, whereas on the human-translated French portion, the model predicts 51% of the sentences to be native French.
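The tagging scheme described above can be sketched as follows. The tag format `<orig_xx>` and all function names are hypothetical; the text only states that a tag indicating the original language is placed at the beginning of the target sentence.

```python
from typing import List, Optional, Tuple

Pair = Tuple[str, str]  # (French source, English target)


def tag_target(target: str, original_lang: str) -> str:
    """Prepend a tag naming the language the natural side was written in."""
    return f"<orig_{original_lang}> {target}"


def build_tagged_corpus(bt_pairs: List[Pair], fwd_pairs: List[Pair]) -> List[Pair]:
    """Combine synthetic data, tagging each target with its original language."""
    tagged = []
    for source, target in bt_pairs:   # back-translation: the English side is natural
        tagged.append((source, tag_target(target, "en")))
    for source, target in fwd_pairs:  # forward translation: the French side is natural
        tagged.append((source, tag_target(target, "fr")))
    return tagged


def predicted_origin(hypothesis: str) -> Optional[str]:
    """Read the original language back from the first output token, if present."""
    tokens = hypothesis.split()
    if tokens and tokens[0].startswith("<orig_") and tokens[0].endswith(">"):
        return tokens[0][len("<orig_"):-1]
    return None


toy_corpus = build_tagged_corpus(
    bt_pairs=[("bonsoir synthétique", "good evening")],
    fwd_pairs=[("bonjour", "synthetic hello")],
)
```

Because the tag is part of the target sequence, the trained model has to predict it from the source sentence alone, which turns translation into an implicit provenance classifier.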

Caswell et al. (2019) motivate source-side tags as a way to help the system distinguish back-translations from parallel text in lieu of noise. While we did not test the effectiveness of source-side tags, our experiment shows that even a model without them can predict the provenance of source sentences relatively well. The fact that prediction accuracy on human translations (69%) remains far above chance level shows that the high classification accuracy cannot be simply explained by the model learning to identify MT noise; the signal the model uses to correctly classify the test sets is either domain effects, or translationese effects shared between human translation and MT.
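One way to relate the 69% figure to the per-portion numbers reported above is to average the accuracy on the two portions. This is only a reading that happens to be consistent with the reported figures, under the assumption that both portions are weighted equally; the text does not state how the overall accuracy was computed.

```python
# Figures reported in the text.
predicted_french_on_original = 0.894  # originally French portion
predicted_french_on_reverse = 0.51    # human-translated French portion

# On the reverse portion the correct answer is English, so accuracy there
# is the share of sentences NOT predicted to be native French.
accuracy_original = predicted_french_on_original
accuracy_reverse = 1.0 - predicted_french_on_reverse

# Assumption: both portions weighted equally.
mean_accuracy = (accuracy_original + accuracy_reverse) / 2
mean_accuracy_percent = round(mean_accuracy * 100)
```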

Other Language Pairs

To see if our findings generalise to other language pairs, we trained Estonian $\to$ English and Finnish $\to$ English translation models, following the procedure described in Section 4. In order to better control for domain and style, we only use the parallel news crawl data from the WMT18 (Bojar et al., 2018) translation task, which resulted in 3.1M sentence pairs for Finnish–English and 0.9M sentence pairs for Estonian–English.

For data augmentation, we use all the available News Crawl data on the Estonian/Finnish side for forward translation and the equivalent amount of English News Crawl data for back-translation. This resulted in 14.5M monolingual sentences for Finnish-English back/forward translation and 2.9M sentences for Estonian-English back/forward translation. We again produced an RNN and a transformer variant of the synthetic data.

We present our results in Tables 6 and 7. In the case of Estonian (Table 6), we have a scenario which produced particularly poor synthetic data: the English–Estonian RNN reaches just 12 BLEU on the dev set, while the transformer reaches 18 BLEU. On Estonian–English the RNN reaches 15 BLEU, while the transformer reaches 17. We see that when BLEU is low, the quality of the synthetic data is much more important: the systems augmented with transformer back-translation gained 4.7 BLEU points on average over those augmented with RNN back-translation. In relative terms, the forward translation system improved significantly more: a difference of just 2 BLEU points between the RNN and transformer models used to create the synthetic data resulted in a 3.2 point increase in BLEU. This suggests that data augmentation via forward translation is substantially more sensitive to the translation quality of the initial translation system than back-translation.

Our observations are confirmed in the slightly higher-resource experiment on Finnish $\to$ English (Table 7). The quality of the translation model used for back-translation was improved by 9 BLEU (from 17 to 26) when using a transformer instead of an RNN, but on the final system, this yielded just a 1.1 BLEU increase on average. In contrast, the quality of the translation system used for forward translation was improved from 17 to 23 BLEU, improving the final system by 2 BLEU on average.
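The sensitivity argument can be made explicit by dividing the average gain of the final system by the BLEU improvement of the system that produced the synthetic data. The sketch below uses only the figures quoted in the text; the ratio itself is an illustrative summary rather than a measure reported in the paper, and it mixes dev-set BLEU of the initial systems with average test-set gains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    name: str
    synth_bleu_rnn: float          # BLEU of the RNN that produced synthetic data
    synth_bleu_transformer: float  # BLEU of the transformer that produced it
    final_gain: float              # average BLEU gain of the final system

    @property
    def gain_per_bleu(self) -> float:
        """Final-system gain per BLEU point gained by the initial system."""
        return self.final_gain / (self.synth_bleu_transformer - self.synth_bleu_rnn)


settings = [
    Setting("et-en back-translation", 12, 18, 4.7),
    Setting("et-en forward translation", 15, 17, 3.2),
    Setting("fi-en back-translation", 17, 26, 1.1),
    Setting("fi-en forward translation", 17, 23, 2.0),
]
ratios = {s.name: round(s.gain_per_bleu, 2) for s in settings}
```

For both language pairs the ratio is higher for forward translation, which is the pattern the text describes as greater sensitivity to the quality of the initial system.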

Conclusions

In this paper we reviewed the effect of directionality on machine translation results, considering both the direction of data augmentation (forward and back-translation) and the original language of test sets, with French $\to$ English as a case study and additional experiments on Estonian $\to$ English and Finnish $\to$ English. We confirm that the original language of parallel test sets affects BLEU scores, particularly when data augmentation approaches are compared. We find that back-translation is more effective than forward translation in the artificial setting where the input to the translation system is itself a human translation, and the original text is used as reference. In the natural setting where the input is native text, and the reference a human translation, forward translation can perform better in terms of BLEU, although it still trails behind back-translation if the forward translations in the synthetic data sets are very poor, indicating that forward translation is more sensitive than back-translation to the quality of the system that produced the synthetic data.

However, manual evaluation shows that better BLEU scores do not necessarily correspond to better translation quality according to human judgements. Despite wildly differing BLEU results depending on the original language of the test sets, human evaluators prefer our back-translation systems over our other systems in terms of fluency. Despite achieving higher BLEU on the original portion of the test set, our forward-translation system was rated worst in the human evaluation.

To better understand the differences between forward and back-translation, we consider both translationese effects and subtle domain differences between source-language and target-language monolingual data. Language model experiments indicate that both of these play a role, and partially explain why back-translation is so suitable for reverse test sets. Experiments with translation systems trained on only synthetic data (forward and back-translation) also show that the provenance of test set sentences is predictable with 69% accuracy.

Our findings agree with concurrent and independent work by Shen et al. (2019), who perform low-resource translation experiments with back-translation and self-learning, an iterative form of forward translation. They also find that the original language of parallel test data determines whether back-translation or forward translation is a more effective strategy for data augmentation.

Based on our findings, we can make several recommendations for the use of forward translation and back-translation to augment neural machine translation. Firstly, while BLEU is very sensitive to the choice of data augmentation, with up to 6 BLEU difference between the two choices in our French $\to$ English experiments, depending on the make-up of the test set, human annotators were less sensitive to test set directionality. Human annotators favoured back-translation over forward translation, mostly in terms of fluency, while adequacy was largely the same across all systems, especially on the original translation direction. Our results should serve as a warning to not over-rely on automatic evaluation when data augmentation is involved: the results of the WMT19 news translation task (Barrault et al., 2019) also show negative correlation between BLEU and human evaluation.

Secondly, we observe subtle domain differences between corpora in different languages, even if they cover the same general domain (news) and were collected with the same methods. Following the general heuristic to use training data that matches the test domain as closely as possible, this may be an argument for using forward translations in settings where (only) in-domain source data is available, but further study of this setting is necessary. Of course, the use of forward and back-translation is not mutually exclusive, and in settings with access to suitable monolingual corpora in both the source and target language, combining the two is another viable strategy (Wu et al., 2019).

References

Appendix A Human Evaluation Protocol

The annotators were given the following instructions:

Fluency is simply how natural a sentence sounds. You will have to rate the sentences produced by three different systems in terms of how good the English is. Use the following scale:

Adequacy on the other hand, tries to judge how good the meaning is conveyed, compared to a reference translation. For this task, you will be given a reference translation, and the translations produced by three different systems. You are required to rate each of them using the following scale:

Please pay special attention to sentences that are "almost" correct but a crucial word is missed or one with reverse meaning is used (eg "convicted" vs "acquitted").

Note that it is possible to have all systems produce equally good (or equally bad), yet different results. For this task, you should ignore how fluent (or disfluent) the English is and focus just on the meaning. You should not penalise a system for having a bad language, as long as the meaning is conveyed.

Appendix B Manual analysis

In this section we present a manual analysis of sentences produced by all of our systems on Original French source (Figure 2) and on Translationese French source (Figure 3). We noticed that in the case of original French source, the back-translation system is sometimes unfairly penalised for producing a correct translation that is more fluent than the reference (Figure 2, example 1). When the back-translation system is presented with an input that is confusing to it, it tends to just copy the input instead of producing a translation (Figure 2, examples 2 and 3), whereas the forward translation system never suffers from this issue on original French source.

On Translationese French source, we can see that the forward translation system struggles with undertranslation on some out-of-domain sequences (Figure 3, example 7). It is interesting to note that rare named entities such as "Xindu" in example 5 get translated as "Xinhua" by the back-translation system, because this token is much more common there than in the training set (and completely absent in the forward translated data). Strangely enough, the back-translation system tends to hallucinate "he said" when it translates quotes (Figure 3, example 6).