Consistency by Agreement in Zero-shot Neural Machine Translation

Maruan Al-Shedivat, Ankur P. Parikh

Introduction

Machine translation (MT) has made remarkable advances with the advent of deep learning approaches (Bojar et al., 2016; Wu et al., 2016; Crego et al., 2016; Junczys-Dowmunt et al., 2016). The progress was largely driven by the encoder-decoder framework (Sutskever et al., 2014; Cho et al., 2014) and typically supplemented with an attention mechanism (Bahdanau et al., 2014; Luong et al., 2015b).

Compared to traditional phrase-based systems (Koehn, 2009), neural machine translation (NMT) requires large amounts of data to reach high performance (Koehn and Knowles, 2017). Using NMT in a multilingual setting exacerbates the problem: given $k$ languages, translating between all pairs would require $O(k^{2})$ parallel training corpora (and $O(k^{2})$ models).

In an effort to address the problem, different multilingual NMT approaches have been proposed recently. Luong et al. (2015a) and Firat et al. (2016a) proposed to use $O(k)$ encoders/decoders that are then intermixed to translate between language pairs. Johnson et al. (2016) proposed to use a single model and prepend special symbols to the source text to indicate the target language, which was later extended to other text preprocessing approaches (Ha et al., 2017) as well as language-conditional parameter generation for encoders and decoders of a single model (Platanios et al., 2018).

Johnson et al. (2016) also show that a single multilingual system could potentially enable zero-shot translation, i.e., it can translate between language pairs not seen in training. For example, given 3 languages—German (De), English (En), and French (Fr)—and training parallel data only for (De, En) and (En, Fr), at test time, the system could additionally translate between (De, Fr).
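The tagging scheme and the De/En/Fr example above can be illustrated with a minimal Python sketch. The `<2xx>` token format and the helper names `tag_source` and `zero_shot_directions` are assumptions made for illustration, not the exact preprocessing used in the experiments.

```python
from itertools import permutations


def tag_source(source_tokens, target_lang):
    """Prepend a target-language symbol to a tokenized source sentence."""
    # The "<2xx>" token format is an assumption made for this sketch.
    return [f"<2{target_lang.lower()}>"] + list(source_tokens)


def zero_shot_directions(languages, supervised_pairs):
    """List ordered (source, target) pairs that have no parallel training data."""
    supervised = set()
    for a, b in supervised_pairs:
        supervised.add((a, b))
        supervised.add((b, a))  # each parallel corpus serves both directions
    return [d for d in permutations(languages, 2) if d not in supervised]


languages = ["De", "En", "Fr"]
supervised_pairs = [("De", "En"), ("En", "Fr")]

tagged = tag_source(["Guten", "Morgen"], "Fr")
missing = zero_shot_directions(languages, supervised_pairs)
```

Here `tagged` carries the request "translate into French" inside the input itself, and `missing` contains the two directions between De and Fr that the model never sees during training.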

Zero-shot translation is an important problem. Solving it could significantly improve data efficiency: a single multilingual model would be able to generalize and translate between any of the $O(k^{2})$ language pairs after being trained only on $O(k)$ parallel corpora. However, performance on zero-shot tasks is often unstable and significantly lags behind the supervised directions. Moreover, attempts to improve zero-shot performance by fine-tuning (Firat et al., 2016b; Sestorain et al., 2018) may negatively impact other directions.

In this work, we take a different approach and aim to improve the training procedure of Johnson et al. (2016). First, we analyze the multilingual translation problem from a probabilistic perspective and define the notion of zero-shot consistency, which gives insights as to why the vanilla training method may not yield models with good zero-shot performance. Next, we propose a novel training objective and a modified learning algorithm that achieves consistency via agreement-based learning (Liang et al., 2006, 2008) and improves zero-shot translation. Our training procedure encourages the model to produce equivalent translations of parallel training sentences into an auxiliary language (Figure 1) and is provably zero-shot consistent. In addition, we make a simple change to the neural decoder to make the agreement losses fully differentiable.

We conduct experiments on IWSLT17 (Mauro et al., 2017), UN corpus (Ziemski et al., 2016), and Europarl (Koehn, 2017), carefully removing complete pivots from the training corpora. Agreement-based learning results in up to +3 BLEU zero-shot improvement over the baseline, compares favorably (up to +2.4 BLEU) to other approaches in the literature (Cheng et al., 2017; Sestorain et al., 2018), is competitive with pivoting, and does not lose in performance on supervised directions.

Related work

A simple (and yet effective) baseline for zero-shot translation is pivoting that chain-translates, first to a pivot language, then to a target (Cohn and Lapata, 2007; Wu and Wang, 2007; Utiyama and Isahara, 2007). Despite being a pipeline, pivoting gets better as the supervised models improve, which makes it a strong baseline in the zero-shot setting. Cheng et al. (2017) proposed a joint pivoting learning strategy that leads to further improvements.

Lu et al. (2018) and Arivazhagan et al. (2018) proposed different techniques to obtain “neural interlingual” representations that are passed to the decoder. Sestorain et al. (2018) proposed another fine-tuning technique that uses dual learning (He et al., 2016), where a language model is used to provide a signal for fine-tuning zero-shot directions.

Another family of approaches is based on distillation (Hinton et al., 2014; Kim and Rush, 2016). Along these lines, Firat et al. (2016b) proposed to fine-tune a multilingual model to a specified zero-shot direction with pseudo-parallel data and Chen et al. (2017) proposed a teacher-student framework. While this can yield solid performance improvements, it also adds multi-staging overhead and often does not preserve performance of a single model on the supervised directions. We note that our approach (and agreement-based learning in general) is somewhat similar to distillation at training time, which has been explored for large-scale single-task prediction problems (Anil et al., 2018).

A setting harder than zero-shot is that of fully unsupervised translation (Ravi and Knight, 2011; Artetxe et al., 2017; Lample et al., 2017, 2018) in which no parallel data is available for training. The ideas proposed in these works (e.g., bilingual dictionaries (Conneau et al., 2017), backtranslation (Sennrich et al., 2015a) and language models (He et al., 2016)) are complementary to our approach, which encourages agreement among different translation directions in the zero-shot multilingual setting.

Background

We start by establishing more formal notation and briefly reviewing some background on encoder-decoder multilingual machine translation from a probabilistic perspective.

Corpora.

Translation.

2 Encoder-decoder framework

First, consider a purely bilingual setting, where we learn to translate from a source language, $L_{s}$ , to a target language, $L_{t}$ . We can train a translation model by optimizing the conditional log-likelihood of the bilingual data under the model:

where $\hat{\theta}$ are the estimated parameters of the model.
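As a sketch of the objective just described, and assuming that $\mathcal{C}_{st}$ denotes the bilingual training corpus of sentence pairs $(\mathbf{x}_{s},\mathbf{x}_{t})$ (this symbol is introduced here for illustration), the maximum conditional log-likelihood estimate can be written as:

$$\hat{\theta}=\arg\max_{\theta}\sum_{(\mathbf{x}_{s},\mathbf{x}_{t})\in\mathcal{C}_{st}}\log P_{\theta}(\mathbf{x}_{t}\mid\mathbf{x}_{s})$$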

The encoder-decoder framework introduces a latent sequence, $\mathbf{u}$ , and represents the model as:

3 Multilingual neural machine translation

This approach has numerous advantages including: (a) simplicity of training and the architecture (by slightly changing the training data, we convert a bilingual NMT into a multilingual one), (b) sharing parameters of the model between different translation tasks that may lead to better and more robust representations. Johnson et al. (2016) also show that resulting models seem to exhibit some degree of zero-shot generalization enabled by parameter sharing. However, since we lack data for zero-shot directions, composite likelihood (3) misses the terms that correspond to the zero-shot models, and hence has no statistical guarantees for performance on zero-shot tasks. In fact, since the objective (3) assumes that the models are independent, plausible zero-shot performance would be more indicative of the limited capacity of the model or artifacts in the data (e.g., presence of multi-parallel sentences) rather than zero-shot generalization.

Zero-shot generalization & consistency

Multilingual MT systems can be evaluated in terms of zero-shot performance, or quality of translation along the directions they have not been optimized for (e.g., due to lack of data). We formally define zero-shot generalization via consistency.

where $\kappa(\varepsilon)\rightarrow 0$ as $\varepsilon\rightarrow 0$ .

In other words, we say that a machine translation system is zero-shot consistent if low error on supervised tasks implies a low error on zero-shot tasks in expectation (i.e., the system generalizes). We also note that our notion of consistency somewhat resembles error bounds in the domain adaptation literature (Ben-David et al., 2010).

In practice, it is attractive to have MT systems that are guaranteed to exhibit zero-shot generalization since the access to parallel data is always limited and training is computationally expensive. While the training method of Johnson et al. (2016) does not have guarantees, we show that our proposed approach is provably zero-shot consistent.

Approach

We propose a new training objective for multilingual NMT architectures with shared encoders and decoders that avoids the limitations of pure composite likelihoods. Our method is based on the idea of agreement-based learning initially proposed for learning consistent alignments in phrase-based statistical machine translation (SMT) systems (Liang et al., 2006, 2008). In terms of the final objective function, the method ends up being reminiscent of distillation (Kim and Rush, 2016), but suitable for joint multilingual training.

2 Consistency by agreement

where $\kappa(\varepsilon)\rightarrow 0$ as $\varepsilon\rightarrow 0$ .

For discussion of the assumptions and details on the proof of the bound, see Appendix A.2. Note that Theorem 2 is straightforward to extend from triplets of languages to arbitrary connected graphs, as given in the following corollary.

Agreement-based learning yields zero-shot consistent MT models (with respect to the cross-entropy loss) for arbitrary translation graphs as long as supervised directions span the graph.

Note that there are other ways to ensure zero-shot consistency, e.g., by fine-tuning or post-processing a trained multilingual model. For instance, pivoting through an intermediate language is also zero-shot consistent, but the proof requires stronger assumptions about the quality of the supervised source-pivot model. Intuitively, we have to assume that the source-pivot model does not assign high probabilities to unlikely translations, as the pivot-target model may react to those unpredictably. Similarly, using model distillation (Kim and Rush, 2016; Chen et al., 2017) would also be provably consistent under the same assumptions as given in Theorem 2, but for a single, pre-selected zero-shot direction. Note that our proposed agreement-based learning framework is provably consistent for all zero-shot directions and does not require any post-processing. For discussion of the alternative approaches and consistency proof for pivoting, see Appendix A.3.

3 Agreement-based learning algorithm

Having derived a new objective function (7), we can now learn consistent multilingual NMT models using a stochastic gradient method with a couple of extra tricks (Algorithm 1). The computation graph for the agreement loss is given in Figure 3.

Computing agreement over all languages for each pair of sentences at training time would be quite computationally expensive (to agree on $k$ translations, we would need to encode-decode the source and target sequences $k$ times each). However, since the agreement lower bound is a sum over expectations (5.1), we can approximate it by subsampling: at each training step (and for each sample in the mini-batch), we pick an auxiliary language uniformly at random and compute a stochastic approximation of the agreement lower bound (5.1) for that language only. This stochastic approximation is simple, unbiased, and reduces the per-step computational overhead for the agreement term from $O(k)$ to $O(1)$. In practice, note that there is still a constant-factor overhead due to extra encoding-decoding steps to/from auxiliary languages, which is about $\times 4$ when training on a single GPU. Parallelizing the model across multiple GPUs would easily compensate for this overhead.
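The subsampling idea can be sketched in a few lines of Python. The per-language agreement values below are placeholder numbers rather than model outputs, and the rescaling by the number of candidate languages is an assumption of this sketch: it makes the one-sample estimate unbiased for the full sum, whereas the unscaled term would be unbiased for the mean.

```python
import random


def sample_auxiliary(languages, source, target, rng):
    """Pick one auxiliary language uniformly at random, excluding the pair itself."""
    candidates = [lang for lang in languages if lang not in (source, target)]
    return rng.choice(candidates)


def exact_agreement(per_language_terms):
    """Sum agreement terms over all auxiliary languages (cost grows with k)."""
    return sum(per_language_terms.values())


def sampled_agreement(per_language_terms, rng):
    """One-sample estimate: a single term, rescaled by the number of candidates."""
    lang = rng.choice(sorted(per_language_terms))
    return len(per_language_terms) * per_language_terms[lang]


rng = random.Random(0)
terms = {"Es": -1.2, "Fr": -0.7, "Ru": -2.1}  # placeholder agreement values

exact = exact_agreement(terms)
estimates = [sampled_agreement(terms, rng) for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
```

Averaged over many steps, `mean_estimate` approaches `exact`, while each individual step evaluates only one auxiliary language.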

Overview of the agreement loss computation.

Finally, using these pairs, we can compute two log-probability terms (Figure 3B):

Greedy continuous decoding.

Protecting supervised directions.

Experiments

We evaluate agreement-based training against baselines from the literature on three public datasets that have multi-parallel evaluation data that allows assessing zero-shot performance. We report results in terms of the BLEU score (Papineni et al., 2002) that was computed using mteval-v13a.perl.

Following the setup introduced in Sestorain et al. (2018), we use two datasets, UNcorpus-1 and UNcorpus-2, derived from the United Nations Parallel Corpus (Ziemski et al., 2016). UNcorpus-1 consists of data in 3 languages, En, Es, Fr, whereas UNcorpus-2 adds Ru as the 4th language. For training, we use parallel corpora between En and the rest of the languages, each about 1M sentences, sub-sampled from the official training data in a way that ensures no multi-parallel training data. The dev and test sets contain 4,000 sentences and are all multi-parallel.

Europarl v7 (http://www.statmt.org/europarl/).

We consider the following languages: De, En, Es, Fr. For training, we use parallel data between En and the rest of the languages (about 1M sentences per corpus), preprocessed to avoid multi-parallel sentences, as was also done by Cheng et al. (2017) and Chen et al. (2017) and described below. The dev and test sets contain 2,000 multi-parallel sentences.

IWSLT17 (https://sites.google.com/site/iwsltevaluation2017/TED-tasks).

Preprocessing.

To properly evaluate systems in terms of zero-shot generalization, we preprocess Europarl and IWSLT17⋆ to avoid multi-lingual parallel sentences of the form source-pivot-target, where source-target is a zero-shot direction. To do so, we follow Cheng et al. (2017) and Chen et al. (2017) and randomly split the overlapping pivot sentences of the original source-pivot and pivot-target corpora into two parts and merge them separately with the non-overlapping parts for each pair. Along with each parallel training sentence, we save information about source and target tags, after which all the data is combined and shuffled. Finally, we use a shared multilingual subword vocabulary (Sennrich et al., 2015b) on the training data (with 32K merge ops), separately for each dataset. Data statistics are provided in Appendix A.5.
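The pivot-splitting step described above can be sketched as follows. This is a simplified illustration that assumes exact string matching on pivot sentences and an even random split; the function name `remove_complete_pivots` and the toy data are hypothetical.

```python
import random


def remove_complete_pivots(src_pivot, pivot_tgt, seed=0):
    """Split overlapping pivot sentences so that none appears in both corpora.

    src_pivot: list of (source, pivot) pairs.
    pivot_tgt: list of (pivot, target) pairs.
    """
    pivots_a = {p for _, p in src_pivot}
    pivots_b = {p for p, _ in pivot_tgt}
    overlap = sorted(pivots_a & pivots_b)
    random.Random(seed).shuffle(overlap)
    half = len(overlap) // 2
    keep_in_a = set(overlap[:half])  # these stay only in source-pivot
    keep_in_b = set(overlap[half:])  # these stay only in pivot-target
    new_a = [(s, p) for s, p in src_pivot if p not in keep_in_b]
    new_b = [(p, t) for p, t in pivot_tgt if p not in keep_in_a]
    return new_a, new_b


# Toy corpora: "en1".."en3" are shared pivots, "en4" and "en5" are not.
de_en = [("de1", "en1"), ("de2", "en2"), ("de3", "en3"), ("de4", "en4")]
en_fr = [("en1", "fr1"), ("en2", "fr2"), ("en3", "fr3"), ("en5", "fr5")]

new_de_en, new_en_fr = remove_complete_pivots(de_en, en_fr)
shared = {p for _, p in new_de_en} & {p for p, _ in new_en_fr}
```

After the split, `shared` is empty, so no De-En-Fr triple can be reassembled from the training data, while every pivot sentence still appears in exactly one of the two corpora.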

2 Training and evaluation

Additional details on the hyperparameters can be found in Appendix A.4.

We use a smaller version of the GNMT architecture (Wu et al., 2016) in all our experiments: 512-dimensional embeddings (separate for source and target sides), 2 bidirectional LSTM layers of 512 units each for encoding, and a GNMT-style, 4-layer, 512-unit LSTM decoder with residual connections from the 2nd layer onward.

Training.

We trained the above model using the standard method of Johnson et al. (2016) and using our proposed agreement-based training (Algorithm 1). In both cases, the model was optimized using Adafactor (Shazeer and Stern, 2018) on a machine with 4 P100 GPUs for up to 500K steps, with early stopping on the dev set.

Evaluation.

We focus our evaluation mainly on zero-shot performance of the following methods: (a) Basic, which stands for directly evaluating a multilingual GNMT model after standard training (Johnson et al., 2016). (b) Pivot, which performs pivoting-based inference using a multilingual GNMT model (after standard training); often regarded as the gold standard. (c) Agree, which applies a multilingual GNMT model trained with agreement losses directly to zero-shot directions.
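The difference between direct zero-shot inference (Basic, Agree) and Pivot inference can be sketched as below. The `translate(sentence, source, target)` callable is a hypothetical stand-in for a trained multilingual model, and the lookup table is toy data used only to make the example runnable.

```python
def direct_translate(translate, sentence, source, target):
    """Single pass: ask the multilingual model for the target language directly."""
    return translate(sentence, source, target)


def pivot_translate(translate, sentence, source, pivot, target):
    """Two passes: source -> pivot -> target, using supervised directions only."""
    intermediate = translate(sentence, source, pivot)
    return translate(intermediate, pivot, target)


# Toy stand-in for a model that only knows the supervised directions.
TOY_TABLE = {
    ("De", "En", "Hallo"): "Hello",
    ("En", "Fr", "Hello"): "Bonjour",
}


def toy_translate(sentence, source, target):
    return TOY_TABLE[(source, target, sentence)]


result = pivot_translate(toy_translate, "Hallo", "De", "En", "Fr")
```

Pivot needs two decoding passes and can propagate errors from the first pass, whereas Basic and Agree decode once in the zero-shot direction.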

To ensure a fair comparison in terms of model capacity, all the techniques above use the same multilingual GNMT architecture described in the previous section. All other results provided in the tables are as reported in the literature.

Implementation.

All our methods were implemented using TensorFlow (Abadi et al., 2016) on top of the tensor2tensor library (Vaswani et al., 2018). Our code will be made publicly available at www.cs.cmu.edu/~mshediva/code/.

3 Results on UN Corpus and Europarl

Tables 1 and 2 show results on the UNCorpus datasets. Our approach consistently outperforms Basic and Dual-0, despite the latter being trained with additional monolingual data (Sestorain et al., 2018). We see that models trained with agreement perform comparably to Pivot, outperforming it in some cases, e.g., when the target is Russian, perhaps because it is quite different linguistically from the English pivot.

Furthermore, unlike Dual-0, Agree maintains high performance in the supervised directions (within 1 BLEU point compared to Basic), indicating that our agreement-based approach is effective as a part of a single multilingual system.

Europarl.

4 Analysis of IWSLT17 zero-shot tasks

Table 5 presents results on the original IWSLT17 task. We note that because of the large amount of data overlap and the presence of many supervised translation pairs (16), the vanilla training method (Johnson et al., 2016) achieves very high zero-shot performance, even outperforming Pivot. While our approach gives small gains over these baselines, we believe the dataset's peculiarities make it unreliable for evaluating zero-shot generalization.

On the other hand, on our proposed preprocessed IWSLT17⋆ that eliminates the overlap and reduces the number of supervised directions (8), there is a considerable gap between the supervised and zero-shot performance of Basic. Agree performs better than Basic and is slightly worse than Pivot.

5 Small data regime

Conclusion

In this work, we studied zero-shot generalization in the context of multilingual neural machine translation. First, we introduced the concept of zero-shot consistency that implies generalization. Next, we proposed a provably consistent agreement-based learning approach for zero-shot translation. Empirical results on three datasets showed that agreement-based learning results in up to +3 BLEU zero-shot improvement over the Johnson et al. (2016) baseline, compares favorably to other approaches in the literature (Cheng et al., 2017; Sestorain et al., 2018), is competitive with pivoting, and does not lose in performance on supervised directions.

We believe that the theory and methodology behind agreement-based learning could be useful beyond translation, especially in multi-modal settings. For instance, it could be applied to tasks such as cross-lingual natural language inference (Conneau et al., 2018), style transfer (Shen et al., 2017; Fu et al., 2017; Prabhumoye et al., 2018), or multilingual image or video captioning. Another interesting future direction would be to explore different hand-engineered or learned data representations on which models could be encouraged to agree during training (e.g., make translation models agree on latent semantic parses, summaries, or potentially other data representations available at training time).

Acknowledgments

We thank Ian Tenney and Anthony Platanios for many insightful discussions, Emily Pitler for the helpful comments on the early draft of the paper, and anonymous reviewers for careful reading and useful feedback.

References

Appendix A Appendices

Here, the outer sum iterates over available corpora. The middle sum iterates over parallel sentences in a corpus. The innermost sum marginalizes out unobservable sequences, denoted $\mathbf{z}:=\{\mathbf{x}_{l}\}_{l\neq i,j}$, which are sentences equivalent under this model to $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ in languages other than $L_{i}$ and $L_{j}$. Note that due to the innermost summation, computing the log-likelihood is intractable.

Maximizing the full log-likelihood yields zero-shot consistent models (Definition 1).

To better understand why this is the case, let us consider the example in Figure 5 and compute the log-likelihood of $(\mathbf{x}_{1},\mathbf{x}_{2})$:

Note that the terms that encourage agreement on the translation into $L_{3}$ are colored in green (similarly, terms that encourage agreement on the translation into $L_{4}$ are colored in blue). Since all other terms are probabilities and bounded by 1, we have:

In other words, the full log-likelihood lower-bounds the agreement objective (up to a constant $\log Z$). Since optimizing for agreement leads to consistency (Theorem 2), and maximizing the full likelihood would necessarily improve the agreement, the claim follows. ∎

A.2 Proof of agreement consistency

The statement of Theorem 2 mentions an assumption on the true distribution of the equivalent translations. The assumption is as follows.

This assumption means that, even though there might be multiple equivalent translations, there must not be too many of them (implied by the $\delta$ lower bound) and none of them may be much more preferable than the rest (implied by the $\xi$ upper bound). Given this assumption, we can prove the following simple lemma.

First, using Jensen’s inequality, we have:

The bound on the supervised direction implies that

To bound the second term, we use Assumption 6:

Putting these together yields the bound. ∎

Now, using Lemma 7, we can prove Theorem 2.

By assumption, the agreement-based loss is bounded by $\varepsilon$ . Therefore, expected cross-entropy on all supervised terms, $L_{1}\leftrightarrow L_{2}$ , is bounded by $\varepsilon$ . Moreover, the agreement term (which is part of the objective) is also bounded:

Since by Assumption 6, $\delta$ and $\xi$ are some constants, $\kappa(\varepsilon)\rightarrow 0$ as $\varepsilon\rightarrow 0$ . ∎

A.3 Consistency of distillation and pivoting

As we mentioned in the main text of the paper, distillation (Chen et al., 2017) and pivoting yield zero-shot consistent models. Let us understand why this is the case.

To prove consistency of pivoting, we need an additional assumption on the quality of the source-pivot model.

Given the conditions of Theorem 2 and Assumption 8, pivoting is zero-shot consistent.

We can bound the expected error on pivoting as follows (using Jensen’s inequality and the conditions from our assumptions):

A.4 Details on the models and training

All our NMT models used the GNMT (Wu et al., 2016) architecture with Luong attention (Luong et al., 2015b), a 2-layer bidirectional encoder, and a 4-layer decoder with residual connections. All hidden layers (including embeddings) had 512 units. Additionally, we used separate embeddings on the encoder and decoder sides and tied the weights of the softmax that produced logits with the decoder-side (i.e., target) embeddings. Standard dropout of 0.2 was used on all hidden layers. Most of the other hyperparameters were set to their defaults in the T2T (Vaswani et al., 2018) library for the text2text type of problems.

Training and hyperparameters.

We scaled agreement terms in the loss by $\gamma=0.01$. Training was done using the Adafactor (Shazeer and Stern, 2018) optimizer with 10,000 burn-in steps at a 0.01 learning rate followed by standard square-root decay (with the default settings for the decay from the T2T library). Additionally, we implemented the agreement loss as a subgraph of the loss that was not computed if $\gamma$ was set to 0. This allowed us to start training multilingual NMT models in the burn-in mode using the composite likelihood objective and then switch on agreement starting at some point during optimization (typically, after the first 100K iterations; we also experimented with 0, 50K, 200K, but did not notice any difference in terms of final performance). Since the agreement subgraph was not computed during the initial training phase, this tended to accelerate training of agreement models.
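The gating of the agreement term can be sketched as follows. This is a framework-free illustration under stated assumptions: the function names are hypothetical, the agreement term is treated as a loss to be minimized, and `agreement_loss_fn` stands in for the (expensive) agreement subgraph, which is only evaluated once the weight is non-zero.

```python
def agreement_weight(step, switch_step=100_000, gamma=0.01):
    """Weight on the agreement term: 0 before switch_step, gamma afterwards."""
    return gamma if step >= switch_step else 0.0


def total_loss(supervised_loss, agreement_loss_fn, step, **kwargs):
    """Combine the supervised loss with a lazily computed agreement loss."""
    weight = agreement_weight(step, **kwargs)
    if weight == 0.0:
        return supervised_loss  # the agreement subgraph is skipped entirely
    return supervised_loss + weight * agreement_loss_fn()


calls = []


def fake_agreement_loss():
    # Placeholder for the extra encode-decode passes through an auxiliary language.
    calls.append(1)
    return 2.0


early = total_loss(1.5, fake_agreement_loss, step=50_000)
late = total_loss(1.5, fake_agreement_loss, step=150_000)
```

Because `fake_agreement_loss` is never called while the weight is zero, the early phase costs the same as standard composite-likelihood training.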

A.5 Details on the datasets

Statistics of the IWSLT17 and IWSLT17⋆ datasets are summarized in Table 6. UNCorpus and Europarl datasets were exactly as described by Sestorain et al. (2018) and Chen et al. (2017); Cheng et al. (2017), respectively.