Multilingual Translation with Extensible Multilingual Pretraining and Finetuning

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan

Introduction

A multitude of datasets and models have been developed in natural language processing for a wide variety of tasks and applications. However, a large proportion of these have focused on English. While many works have contributed resources for other languages, developing specialized models for each language of interest is not scalable, and it is especially difficult for low resource languages where labeled data is exceptionally scarce.

Recent work in multilingual NLP shows promise for incorporating many languages into one architecture. For example, the mBART Liu et al. (2020) model trains on twenty five different languages and can be finetuned for various different tasks. For translation, mBART was finetuned on bitext (bilingual finetuning). However, while mBART was trained on a variety of languages, the multilingual nature of the pretraining is not used during finetuning. Finetuning on bitext to translate from one language to another does not leverage the full capacity of the multilingual pretraining. Instead, we propose multilingual finetuning of pretrained models, and we demonstrate large improvements compared to bilingual finetuning.

Previous work Aharoni et al. (2019); Arivazhagan et al. (2019b); Zhang et al. (2020) has explored multilingual translation by training multiple directions within the same model from scratch, but this approach faces challenges for mid to low resource languages. In lower resource scenarios, bitext data is usually unavailable in large quantities, making it challenging to train from scratch. In contrast, monolingual data exists even for low resource languages, particularly in resources such as Wikipedia or Commoncrawl, a version of the web. Thus, leveraging this monolingual data through pretraining can provide a much stronger starting point for low resource machine translation tasks.

However, unlike training a multilingual model from scratch, pretrained models are limited to the choices made during pretraining. For example, mBART was only trained on 25 languages, so finetuning it to translate a language that is not among these 25 is not possible. Thus, people are restricted to the languages selected to train the initial model, as it is incredibly computationally intensive to retrain from scratch. In this work, we show that existing pretrained models, such as mBART Liu et al. (2020), can be extended to additional languages. We demonstrate this by doubling the number of languages supported by mBART to 50, without loss of performance on the original 25 languages and without starting from scratch. This allows languages to be added flexibly while preserving the broader utility of the pretrained model, as it can be used for tasks beyond translation.

Further, working in a multilingual setting remains challenging, as various different datasets, evaluation settings, and preprocessing such as tokenization are used. Benchmarks for sentence embeddings Hu et al. (2020), natural language inference Conneau et al. (2018), and question answering Lewis et al. (2019b) exist, but there is not yet a setting for machine translation. To this end, we contribute the ML50 benchmark, a dataset of 50 languages with publicly available training and evaluation sets, including high, mid, and extremely low resource directions. We will open source this benchmark for the community.

We propose an effective and novel approach for multilingual translation models: multilingual pretraining (with monolingual data) followed by multilingual finetuning (with parallel data). In the Many-to-English setting, multilingual finetuning achieves a 3.6 BLEU improvement over bilingual finetuning, and a 2.6 BLEU improvement compared to multilingual models trained from scratch. On average, combining Many-to-English and English-to-Many, multilingual finetuning improves by $1$ BLEU point over the strongest baseline.

We show that existing pretrained models, such as mBART, can be extended to incorporate additional languages without training from scratch and without performance loss on the original languages. We release mBART50 for the community to use, which has double the number of languages of the original mBART.

To facilitate reproducible research on multilingual translation with representative challenges of the real world, we create the ML50 benchmark covering high, mid, and low resource languages and consisting of 230M bitext.

Related work

This work is related to recent progress in pretraining techniques for NLP applications Peters et al. (2018); Radford et al. (2018); Devlin et al. (2019); Liu et al. (2019); Song et al. (2019); Lewis et al. (2019a). In particular, recent works explored pre-training on multilingual unlabeled corpora Lample and Conneau (2019); Conneau et al. (2019); Liu et al. (2020); Tran et al. (2020), and significantly improved the performance of fine-tuning on machine translation between two languages. We extend Liu et al. (2020) by allowing fine-tuning in multilingual settings.

Multilingual Neural Machine Translation

Training a universal translation system between multiple languages Firat et al. (2016); Johnson et al. (2017) has shown enormous improvement for translating low-resource languages Gu et al. (2018), and even enables zero-shot translation Gu et al. (2019); Arivazhagan et al. (2019a). Arivazhagan et al. (2019b) indicate that it is essential to train gigantic models with enough capacity to fully leverage massive multilingual corpora.

A closely related concurrent work, Siddhant et al. (2020) shows it is possible to train a multilingual system jointly with monolingual datasets based on Song et al. (2019). It naturally enables translation for languages without parallel data. In contrast, this work focuses on fine-tuning multilingual translation systems given a pre-trained model.

Multilingual Translation from Denoising Pretraining

We briefly describe the pretrained multilingual BART model and present multilingual finetuning, a technique to convert pretrained models into multilingual machine translation systems.

Multilingual BART (mBART) Liu et al. (2020) is a sequence-to-sequence generative pretraining scheme. The model incorporates $N$ languages by concatenating data: $\mathcal{D}=\{\mathcal{D}_{1},...,\mathcal{D}_{N}\}$ where each $\mathcal{D}_{i}$ is a collection of monolingual documents in language $i$. mBART is trained as a denoising autoencoder, training to predict the original text $x$ given $g(x)$, where $g$ is a noising function that corrupts text. We maximize $\mathcal{L}_{\theta}$:

$$\mathcal{L}_{\theta}=\sum_{\mathcal{D}_{i}\in\mathcal{D}}\sum_{x\in\mathcal{D}_{i}}\log P(x\mid g(x);\theta),$$

where $x$ is an instance in language $i$ and the distribution $P$ is defined by the seq-to-seq model. This model is pretrained using two types of noise in $g$ — random span masking and order permutation — as described in Liu et al. (2020).
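The two noise types in $g$ can be illustrated with a small sketch. The Python below is only an approximation under stated assumptions: the `<mask>` symbol, the mask ratio, and the fixed span length are hypothetical choices made for illustration, and the settings actually used are those described in Liu et al. (2020).

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, for illustration only


def permute_sentences(sentences, rng):
    """Order permutation: return the sentences of a document in shuffled order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def mask_spans(tokens, rng, mask_ratio=0.3, span_len=3):
    """Span masking: replace random spans of tokens with a single mask token.

    mask_ratio and span_len are assumed values, not the paper's settings.
    """
    budget = int(len(tokens) * mask_ratio)  # maximum number of tokens to hide
    out = []
    i = 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            width = min(span_len, budget, len(tokens) - i)
            out.append(MASK)  # the whole span collapses to one mask token
            i += width
            budget -= width
        else:
            out.append(tokens[i])
            i += 1
    return out


def g(document, rng):
    """Apply both noise types to a document given as a list of token lists."""
    noised = permute_sentences(document, rng)
    return [mask_spans(sentence, rng) for sentence in noised]
```

The model is then trained to reconstruct the original document from the output of `g`, which is what the log-likelihood objective above rewards.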

Multilingual Finetuning

To leverage multilingual pretraining to create translation systems, previous work Liu et al. (2020) used mBART as a starting point and then performed bilingual finetuning. Concretely, the seq-to-seq model was finetuned on language $i$ to language $j$ translation. However, bilingual finetuning does not leverage the full capacity of multilingual pretraining. Recent work on multilingual translation Aharoni et al. (2019); Arivazhagan et al. (2019b) shows that strong translation models can be created by doing multilingual training rather than bilingual training. Instead of training a model from language $i$ to language $j$, a model is trained to translate $N$ languages to $N$ other languages.

Thus, we propose multilingual finetuning (ML-FT) to adapt pretrained models into multilingual models. This procedure creates one model capable of translating many languages to many other languages, which has efficiency and storage maintenance benefits. Further, multilingual finetuning retains several benefits of multilingual translation models in general, for example allowing languages from similar families to benefit each other.

To perform multilingual finetuning, we collect bitexts of different language pairs $(i,j)$ into a collection $\mathcal{B}_{i,j}=\{(x_{i},y_{j})\}$ for each direction $(i,j)$. Following mBART Liu et al. (2020), we augment each bitext pair $(x_{i},y_{j})$ by adding a source language token and a target language token at the beginning of $x$ and $y$ respectively to form a language-token-augmented pair $(x^{\prime},y^{\prime})$. We then initialize a transformer based seq-to-seq model with the pretrained mBART, and provide the multilingual bitexts $\mathcal{B}=\bigcup_{i,j}\mathcal{B}_{i,j}$ to finetune the pretrained model.
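The data preparation just described can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function names and the `<lang>` token format are assumptions made for the example, and real systems would add the tokens at the vocabulary level rather than as plain strings.

```python
def augment_pair(src_text, tgt_text, src_lang, tgt_lang):
    """Prefix the source and target with their language tokens.

    The "<lang>" string format is a hypothetical choice for illustration.
    """
    return f"<{src_lang}> {src_text}", f"<{tgt_lang}> {tgt_text}"


def build_multilingual_bitext(bitexts):
    """Union the per-direction bitexts B_{i,j} into one augmented collection B.

    bitexts maps a direction (src_lang, tgt_lang) to a list of (x, y) pairs.
    """
    combined = []
    for (src_lang, tgt_lang), pairs in bitexts.items():
        for x, y in pairs:
            combined.append(augment_pair(x, y, src_lang, tgt_lang))
    return combined


if __name__ == "__main__":
    toy = {
        ("de_DE", "en_XX"): [("Hallo", "Hello")],
        ("en_XX", "fr_XX"): [("Hello", "Bonjour")],
    }
    for x_aug, y_aug in build_multilingual_bitext(toy):
        print(x_aug, "=>", y_aug)
```

Because every example carries its own language tokens, a single model can be finetuned on the combined collection without any per-direction parameters.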

We explore $3$ configurations to create different versions of multilingual translation models: Many-to-one ( $N\rightarrow 1$ ), one-to-Many ( $1\rightarrow N$ ), and Many-to-Many ( $N\leftrightarrow N$ ) via a pivot language. Concretely, the Many-to-one model encodes $N$ languages and decodes to English, while the one-to-Many model encodes English and decodes into $N$ languages. Finally, the Many-to-Many model encodes and decodes $N$ languages. We follow Arivazhagan et al. (2019b) and use pivot data through English to create Many-to-Many models.
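As an illustration, the three configurations differ only in which translation directions are included in the finetuning data. The sketch below assumes that "pivot data through English" means using only directions with English on one side; the function name and the string labels are hypothetical.

```python
def directions(languages, config, pivot="en"):
    """List the (source, target) directions trained in each configuration."""
    others = [lang for lang in languages if lang != pivot]
    many_to_one = [(lang, pivot) for lang in others]   # N -> 1
    one_to_many = [(pivot, lang) for lang in others]   # 1 -> N
    if config == "many-to-one":
        return many_to_one
    if config == "one-to-many":
        return one_to_many
    if config == "many-to-many":
        # Assumption: only English-centric data is used, in both directions.
        return many_to_one + one_to_many
    raise ValueError(f"unknown configuration: {config}")
```

For example, `directions(["en", "de", "fr"], "many-to-many")` yields the four English-centric directions de-en, fr-en, en-de, and en-fr.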

Temperature Sampling

When training multilingual models with many languages, the training dataset sizes are imbalanced as different languages have different quantities of bitext. Thus, we train with temperature upsampling, which upsamples lower resource pairs so that the high resource languages do not dominate the training data. We follow Arivazhagan et al. (2019b) and use the following temperature based sampling function with temperature $T$ to sample data for each direction:

$$p_{i,j}\propto\left(\frac{|\mathcal{B}_{i,j}|}{\sum_{k,l}|\mathcal{B}_{k,l}|}\right)^{1/T}.$$
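A short sketch may make the effect of the temperature concrete. It assumes the common form of temperature based sampling, in which the probability of sampling a direction is proportional to its share of the data raised to the power $1/T$; the function name and the toy dataset sizes are illustrative only.

```python
def temperature_sampling_probs(sizes, temperature):
    """Return sampling probabilities per direction.

    sizes maps a direction to its number of bitext pairs. With temperature 1
    the probabilities equal the data shares; larger temperatures flatten the
    distribution and therefore upsample low resource directions.
    """
    total = sum(sizes.values())
    weights = {d: (n / total) ** (1.0 / temperature) for d, n in sizes.items()}
    norm = sum(weights.values())
    return {d: w / norm for d, w in weights.items()}


if __name__ == "__main__":
    toy_sizes = {("de", "en"): 900, ("gu", "en"): 100}  # made-up sizes
    for t in (1.0, 5.0):
        print(t, temperature_sampling_probs(toy_sizes, t))
```

In this toy case the low resource direction holds 10% of the data, so it is sampled 10% of the time at $T=1$ and considerably more often at $T=5$.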

Results from Multilingual Finetuning on 25 Languages

We first examine the impact of multilingual finetuning directly on existing pretrained models. We present results on the 25 languages included in mBART, using the existing mBART model. First, we describe three strong baselines: bilingual finetuning, bilingual translation models from scratch, and multilingual translation models from scratch. Then, we describe our experimental setting. Finally, we present results on 25 languages, showing that on average, multilingual finetuning improves $0.2$ BLEU over the strongest baseline: a 1.0 BLEU improvement over the strongest to-English baseline and a $-0.63$ BLEU difference relative to the strongest from-English baseline.

We compare our proposed multilingual finetuning to three strong baselines: bilingual training from scratch, bilingual finetuning, and multilingual models trained from scratch.

We train bilingual translation models with standard Transformer models Vaswani et al. (2017) for translation into and from English for $49$ languages. These models have 5 layers, an embedding dimension of 512, an FFN embedding dimension of 2048, and 8 heads for both encoder and decoder. For directions with more than 1 million bitext training pairs (de, cs, fr, ja, es, ru, pl, zh, fi, lv, lt, and hi), we train Transformer Big models (6 layers, embedding dimension 1024, FFN embedding dimension 4096, and 16 heads for both encoder and decoder), as there is more data to benefit from additional model capacity. For directions with more than 10 million bitext training pairs (de, cs, fr, ja, es, ru, pl, and zh), we train Transformer Large models (12 layers, embedding dimension 1024, FFN embedding dimension 4096, and 16 heads for both encoder and decoder), as there is even more data to benefit from additional model capacity. The best performing bilingual model is selected as the Bilingual Train from Scratch baseline.
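The three architectures and the size thresholds can be summarized as a small configuration sketch. The class and function names below are hypothetical, and the sketch assumes that larger architectures are trained in addition to the smaller ones, with the winner chosen by validation BLEU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    name: str
    layers: int      # same for encoder and decoder
    embed_dim: int
    ffn_dim: int
    heads: int


BASE = TransformerConfig("base", layers=5, embed_dim=512, ffn_dim=2048, heads=8)
BIG = TransformerConfig("big", layers=6, embed_dim=1024, ffn_dim=4096, heads=16)
LARGE = TransformerConfig("large", layers=12, embed_dim=1024, ffn_dim=4096, heads=16)


def candidate_configs(num_pairs):
    """Architectures to try for a direction with num_pairs training pairs."""
    candidates = [BASE]
    if num_pairs > 1_000_000:
        candidates.append(BIG)
    if num_pairs > 10_000_000:
        candidates.append(LARGE)
    return candidates


def select_baseline(valid_bleu):
    """Pick the architecture name with the highest validation BLEU."""
    return max(valid_bleu, key=valid_bleu.get)
```

A direction with 5 million pairs would therefore be trained with the base and big architectures, and whichever scores higher on validation becomes the baseline for that direction.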

Bilingual Finetuning (BL-FT)

Bilingual finetuning adapts the mBART model into bilingual machine translation models by training for longer on translation bitext. For each language direction, we follow Liu et al. (2020) and finetune for $40$K updates to obtain the Bilingual Finetuning baseline.

Multilingual Trained from Scratch (ML-SC)

We train $3$ different multilingual models from scratch: Many-to-one (N $\rightarrow$ 1), one-to-Many (1 $\rightarrow$ N), and Many-to-Many (N $\leftrightarrow$ N) with English as pivot. We train for $500$K updates and sweep through different batch sizes, learning rates, and upsampling temperatures to find the best performing multilingual model on validation, using $32$ GPUs for each training instance. Following Arivazhagan et al. (2019b), we train with temperature upsampling.

Evaluation and Generation

We evaluate performance with tokenized BLEU, following the tokenization in mBART Liu et al. (2020). To generate, we decode using beam search with beam size $N=5$ with length penalty $=1.0$ on the validation set. We do not perform checkpoint averaging. To select the best performing model in a sweep, we compare BLEU on the validation set.

Performance on 25 Languages

We first evaluate our proposed multilingual finetuning technique on $25$ languages using the existing mBART model. We compare bilingual finetuning from mBART (BL-FT), multilingual training from scratch (ML-SC), and multilingual finetuning (ML-FT) by quantifying the BLEU improvement over the bilingual training from scratch baseline. Results are displayed in Table 1, separated into three settings: Many-to-one (N $\rightarrow$ 1), one-to-Many (1 $\rightarrow$ N), and Many-to-Many (N $\leftrightarrow$ N).

Compared to the BL-FT and ML-SC baselines, multilingual finetuning has consistently stronger results in the Many-to-one setting, translating from 25 different languages into English. The improvement is 7.9 BLEU points over the bilingual from scratch baseline, and 1.0 BLEU points over the strongest baseline, ML-SC.

However, in the one-to-Many setting, the improvement of all multilingual methods over bilingual baselines is lower across the board. We hypothesize this is due to the challenge of needing to decode into many different languages (additional analysis is presented in Section 6.1). The multilingual finetuning method is $3$ BLEU points stronger than the bilingual from scratch baseline; it is also comparable to the strongest baseline, bilingual finetuning, with a $-0.6$ BLEU difference on average.

Finally, in the Many-to-Many setting, the improvement of all many-to-many multilingual methods over bilingual baselines is lower across the board. Again we hypothesize this is due to the challenge of decoding into many different languages including English (additional analysis is presented in Section 6.1). The multilingual finetuning method is $3.98$ BLEU points stronger than the bilingual from scratch baseline for translation from and into English combined. Overall, it is lower than the strongest from-English and into-English baselines combined, with a $-1.3$ BLEU difference on average.

Performance by Resource Level

Comparing the languages by resource level, we see that the improvement from multilingual training is more significant as the quantity of training bitext decreases. For example, in the multilingual finetuning (ML-FT) Many-to-one setting, improvement over bilingual from scratch is 4.4 BLEU points for languages with more than 10M bitext, but is 18.0 BLEU points for languages with 7K-30K available bitext. The trend is less consistent in the one-to-Many setting, but low resource languages still see improvements. For example, with multilingual finetuning (ML-FT), improvement over bilingual from scratch is 2.2 BLEU for languages with more than 10M bitext, but 7.6 BLEU for languages with 7K-30K available bitext.

Results from Multilingual Finetuning on 50 Languages

Multilingual finetuning showed strong improvements on $25$ languages in the Many-to-one setting, and we subsequently extend it to incorporate a greater number of languages: 50 instead of 25. However, the number of languages possible is limited by the initial selection of languages in mBART. To remedy this, we first show that the number of languages in mBART can be easily extended with additional pretraining. Second, we build the ML50 benchmark to standardize training data, evaluation data, and evaluation procedure across 50 different languages. Finally, we present results of multilingual finetuning from mBART on 50 languages and show strong improvements over the baselines.

We describe how we extend existing pretrained models to incorporate a greater number of languages. This technique allows existing models to be used on new languages, rather than needing to restart a computationally intensive pretraining method from scratch.

While multilingual pretrained models have shown strong performance in a variety of tasks Liu et al. (2020); Conneau et al. (2019), they remain limited as they are trained on a fixed number of languages. For example, mBART was trained on 25 languages, all fairly high resource. Pretraining fully from scratch is computationally intensive: mBART trained for 2.5 weeks on 256 Nvidia V100 GPUs Liu et al. (2020). However, there are hundreds of different languages in the world, so restarting pretraining from scratch to add any of them to mBART would be difficult. Instead, we take the existing mBART model, trained on $25$ languages, and show that it can be extended to more than $50$ languages. We take the publicly available pretrained mBART model (https://github.com/pytorch/fairseq/tree/master/examples/mbart), which was pretrained on $25$ languages, and extend its embedding layers with randomly initialized vectors for an extra set of $25$ language tokens. We then combine the monolingual data of the original $25$ languages and the new $25$ languages to continue pretraining this extended mBART model. We will release the mBART50 model as a general purpose multilingual pretrained model, which will be useful for a variety of generation tasks beyond machine translation.
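The embedding extension step can be sketched as follows. This is a simplified illustration in plain Python: nested lists stand in for the embedding matrix, which in practice would be a tensor in the training framework, and the function name and the initialization scale `std` are assumptions rather than values from this work.

```python
import random


def extend_embeddings(embeddings, num_new_tokens, rng, std=0.02):
    """Append randomly initialized rows for new language tokens.

    The pretrained rows are copied unchanged; only the new rows are random.
    std is an assumed initialization scale.
    """
    dim = len(embeddings[0])
    new_rows = [
        [rng.gauss(0.0, std) for _ in range(dim)]
        for _ in range(num_new_tokens)
    ]
    return [list(row) for row in embeddings] + new_rows


if __name__ == "__main__":
    rng = random.Random(0)
    pretrained = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy 2 x 3 matrix
    extended = extend_embeddings(pretrained, num_new_tokens=25, rng=rng)
    print(len(pretrained), "->", len(extended), "rows")
```

Since the pretrained rows are preserved, the extended model starts from the same representations for the original languages, and continued pretraining only has to learn the new rows and adapt the shared parameters.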

Data and Training Details

We use the mBART.cc25 checkpoint Liu et al. (2020) available in the fairseq library Ott et al. (2019) to continue the pretraining process. We use the monolingual data from XLMR Conneau et al. (2019) to extend the pretraining to a set of $25$ languages in addition to the $25$ languages of the mBART model. To be consistent with mBART, we reuse its $250$K sentencepiece Kudo and Richardson (2018) model, which was trained using monolingual data for $100$ languages from XLMR and thus already supports languages beyond the original 25 that mBART was trained on. For pretraining, we train mBART50 for an additional $300$K updates with a batch size of $1700$ tokens. The sizes of the monolingual data for the additional 50 languages are provided in the appendix.

ML50 Benchmark

To demonstrate the impact of multilingual finetuning on additional languages, we create the ML50 Benchmark. ML50 standardizes the training and evaluation schemes across 50 different languages, from extremely low resource languages like Xhosa and Gujarati to high resource languages like French and German. The full list of languages is shown in Table 3. We group the languages into five categories based on the amount of available training data: more than 10M pairs (8 languages), 1M to 10M pairs (5 languages), 100K to 1M pairs (17 languages), 10K to 100K pairs (13 languages), and finally, fewer than 10K pairs of training data (5 languages). ML50 includes languages in N language families, from Germanic and Romance languages to Indic and African ones. Many of the additional languages we contribute are lower resource compared to the languages in the original mBART.
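The five resource categories amount to a simple bucketing of each language by its number of training pairs. In the sketch below, the function name and the handling of exact boundary values are assumptions, since the text does not say which side a boundary falls on.

```python
def resource_category(num_pairs):
    """Map a training set size (in sentence pairs) to its resource category."""
    if num_pairs > 10_000_000:
        return "more than 10M"
    if num_pairs >= 1_000_000:
        return "1M to 10M"
    if num_pairs >= 100_000:
        return "100K to 1M"
    if num_pairs >= 10_000:
        return "10K to 100K"
    return "fewer than 10K"
```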

We gather parallel data between English and 49 other languages to form ML50, to enable the training of machine translation models. We select these 49 languages based on the amount of parallel and monolingual data, to cover languages with different amounts of resources and from different language families. The quantity of available monolingual data is relevant for pretraining, so we want to ensure there is a sufficient amount. All of the data is publicly available, from sources such as WMT, IWSLT, WAT, TED, and other published research works. For training data, each language pair can include multiple sources. We simply concatenate them together and remove duplicated source-target sentence pairs for each language pair. We use fasttext Joulin et al. (2017) to perform language identification on both source and target sentences, and we remove sentence pairs if either the source or the target sentence is not predicted as the expected language. We further filter out training data that matches any source or target side sentence in the evaluation datasets. Compared to other datasets such as opus100, the ML50 benchmark contains around 4 times more training data. The full list of languages, data sources, and amount of resulting data can be found in Table 6 in the Appendix.
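The cleaning steps for the training data (deduplication, language identification filtering, and removal of overlap with evaluation sets) can be sketched as a single pass over the concatenated sources. This is a minimal illustration: `identify_language` is a hypothetical callable standing in for a fastText language identification model, and exact string matching is an assumed notion of "match".

```python
def clean_bitext(pairs, src_lang, tgt_lang, identify_language, eval_sentences):
    """Deduplicate, language-filter, and remove evaluation overlap.

    pairs: iterable of (source, target) sentence pairs from all sources.
    identify_language: hypothetical callable mapping text to a language code.
    eval_sentences: set of source and target sentences from evaluation data.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # drop duplicated source-target pairs
        seen.add((src, tgt))
        if identify_language(src) != src_lang:
            continue  # source is not in the expected language
        if identify_language(tgt) != tgt_lang:
            continue  # target is not in the expected language
        if src in eval_sentences or tgt in eval_sentences:
            continue  # avoid leaking evaluation data into training
        cleaned.append((src, tgt))
    return cleaned
```

Passing the language identifier in as a callable keeps the sketch independent of any particular model file, so a real classifier can be swapped in without changing the filtering logic.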

Evaluation Data

To ensure high quality evaluation of languages covered in ML50, we include publicly available, widely used evaluation sets. We source these evaluation datasets from translation workshops such as WMT, IWSLT, WAT, and other published research works. We follow the evaluation protocol, including tokenization, used for each of these evaluation sets, to ensure our results are comparable with existing work. We release these scripts to make it easier for others. Compared to other datasets such as opus100, we choose to use high quality existing evaluation datasets rather than use part of the training data as evaluation. This is because training data, particularly for low resource languages, is often very noisy and unreliable.

Performance on 50 Languages

We evaluate the performance of mBART50 on the ML50 Benchmark. We compare to the same baselines — bilingual finetuning, bilingual training from scratch, and multilingual training from scratch. Results are displayed in Table 4.

In the Many-to-One setting averaged across all languages, multilingual finetuning improves over the strongest baseline, multilingual many-to-many from scratch, by 2.5 BLEU points. For lower resource language pairs, the improvement is much more significant. For example, the improvement for languages with 4K-10K training data is 4.8 BLEU points over the strongest baseline, and the improvement for languages with 10K-100K training data is 4+ BLEU over the strongest baseline.

For One-to-Many, the performance of all methods — bilingual finetuning, multilingual from scratch, and multilingual finetuning — is similar. On average, all models have around 5.7 to 7 BLEU points improvement over bilingual baselines.

Finally, in Many-to-Many, multilingual finetuning achieves 0.8 improvement in the to-English direction over the strongest baseline. In the from-English direction, the performance of Many-to-Many from multilingual finetuning is similar to multilingual from scratch, both around 5.5 to 6 BLEU improvement over bilingual baselines.

Comparison to Bilingual Finetuning

We examine the performance of our proposed multilingual finetuning method compared to bilingual finetuning. Current work shows that strong translation models can be created by finetuning pretrained models to bilingual translation models. However, this means that a separate model would need to be created for each translation direction of interest, which creates a large quantity of models that need to be finetuned. In contrast, multilingual finetuning allows a multitude of directions to be captured within one model.

However, multilingual finetuning would mean that the same model capacity must model many directions rather than just one, which could decrease performance. In Figure 1, we analyze the improvement of multilingual finetuning over the bilingual finetuning. On the left, we compare the Many-to-one setting translating into English, and on the right we compare the one-to-Many setting translating out of English to many different languages.

In the Many-to-one setting, every language pair except one is improved by multilingual finetuning. Some low resource languages see substantial improvements of 10+ BLEU points, with the largest improvement being over 15 BLEU. On average, multilingual finetuning improves $12.3$ BLEU across all directions into English. In the one-to-Many setting, performance is about the same between multilingual finetuning and bilingual finetuning, with the average improvement at $6.3$ BLEU across all directions out of English compared to bilingual baselines.

Discussion

In the Many-to-one setting, where models must encode various different languages and decode into English, large improvements are seen when doing multilingual modeling. Previous work has similarly observed this improvement Arivazhagan et al. (2019b) in multilingual training from scratch, as multilingual modeling increases the quantity of target-side English data seen by the model. For example, compared to bilingual finetuning, our multilingual finetuning model is exposed to English target side data from 50 different language pairs.

However, in the one-to-Many setting and the Many-to-Many setting, models must decode into 50 different languages. This is a difficult decoding challenge, as a strong conditional language model must be learned for each language. While pretraining exposes the model to monolingual data, the quantity of monolingual data varies for each language. For lower resource languages, such as Gujarati or Xhosa, the quantity of monolingual data available, even through online resources such as Commoncrawl, remains limited. Other work Arivazhagan et al. (2019b) observes similar trends in the performance of one-to-Many models.

Overall, we find that multilingual finetuning performs better than any of our assessed baselines — bilingual training from scratch, bilingual finetuning, and multilingual training from scratch — when averaged across the Many-to-one and one-to-Many directions. It is important to note that this effect mainly comes from the strong improvement of the Many-to-one setting, and all approaches have similar performance in the one-to-Many setting.

Comparison of mBART50 on 25 Languages

We show that the mBART model can be extended from 25 languages to 50 languages without starting from scratch. In this section, we evaluate whether adding languages is harmful for performance on the original 25 languages. As the model remains the same size but has more to model, it could have reduced capacity for the original 25 languages, but we do not see any reduction in performance. Results are shown in Figure 2. For each language, we plot the performance when doing bilingual finetuning with mBART25 and mBART50. We show that performance is almost exactly the same with both models, indicating that the number of languages can be doubled without loss of performance.

Conclusion

We demonstrate that multilingual neural machine translation models can be created from pretrained models such as mBART. Previous work using pretrained models focused only on bilingual finetuning, and work in multilingual translation trained only from scratch. While using pretrained models could limit the number of languages possible, we show that mBART can be extended to double the number of original languages, without loss of performance on the original languages. We release mBART50 for the community as a strong generative denoising pretrained model in 50 different languages. Further, to train and evaluate on 50 languages, we develop and release the ML50 benchmark. In conclusion, we show that by performing multilingual finetuning, strong improvements of over 2 BLEU points can be achieved in the Many-to-one setting. Overall, averaging across the Many-to-one and one-to-Many directions, our proposed multilingual finetuning strategy outperforms all baselines.