Beyond English-Centric Multilingual Machine Translation

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin

cs.CL cs.LG

Introduction

Multilingual Machine Translation (MMT) aims to build a single model to translate between any pair of languages. Neural network models have been very successful for bilingual machine translation (Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017) and more recently, neural MMT models have shown promising results (Firat et al., 2016; Zhang et al., 2020). Multilingual translation models factorize computation when translating to many languages and share information between similar languages, which benefits low resource directions (Arivazhagan et al., 2019) and enables zero-shot translation (Gu et al., 2019).

However, in the past, these systems have not performed as well as bilingual models when trained on the same language pairs (Johnson et al., 2017), as model capacity necessarily must be split between many languages (Arivazhagan et al., 2019). This has been alleviated by increasing model capacity (Aharoni et al., 2019; Zhang et al., 2020), but increased model size also necessitates larger multilingual training datasets which are laborious and difficult to create. To ease this challenge, most prior work has focused on English-Centric datasets and models which translate from and to English but not between non-English languages. This English-Centric bias in the data and resulting models is not reflective of how people use translation and empirically leads to lower performance for non-English translation directions.

In this work, we create more diverse multilingual machine translation models by building a large-scale Many-to-Many dataset for 100 languages. We considerably reduce the complexity of this task through the automatic construction of parallel corpora (Artetxe and Schwenk, 2018b; Schwenk et al., 2019) with a novel data mining strategy that exploits language similarity to avoid mining all directions. We also leverage backtranslation to improve the quality of our model on zero-shot and low resource language pairs. Overall, we build the first true Many-to-Many dataset comprising 7.5B training sentences for $100$ languages, providing direct training data for thousands of translation directions.

The quantity of data in a Many-to-Many dataset increases quadratically with the number of languages, making neural networks with standard capacity underfit rapidly. To that effect, we leverage progress in scaling (Kaplan et al., 2020; Arora et al., 2018) to train models that are over $50$ times larger than current bilingual models with model parallelism (Huang et al., 2019; Shoeybi et al., 2019). Even with these tools, scaling the number of parameters hardly follows the quadratic increase in data induced by the Many-to-Many setting, and we propose several scaling strategies tailored to the specificities of our problem. In particular, we consider a deterministic mixture-of-experts strategy to split the model parameters into non-overlapping groups of languages which we train with a novel re-routing strategy. Language specific mixture-of-experts also reduce the need to densely update parameters and are more parallelizable in a multi-machine setting. Overall, combining these strategies allows us to scale the capacity of the models to a size of $15.4$ B parameters and still train them efficiently on hundreds of GPUs.

The resulting method allows us to scale Transformers and translate directly between 100 languages, without pivoting through English, at a level of performance that is competitive with bilingual models on many benchmarks, including WMT. Figure 1 illustrates our data mining strategy as well as our model architecture. This paper is organized as follows: first, we introduce several standard components of modern machine translation and explain how they apply in the multilingual setting (§ 2), then describe our strategy to scale the number of language pairs to create a Many-to-Many dataset (§ 3). We then systematically compare this Many-to-Many dataset to an English-Centric approach (§ 4). Next, we incorporate increased model scale through both dense scaling and sparse mixture-of-experts (§ 5). Finally, we end with a thorough analysis, including human evaluation, of the quality of our 100x100 Many-to-Many translation system (§ 6).

Preliminaries

In this work, we investigate how we can best translate from 100 languages to 100 languages, or 9900 directions, using a single model. We describe our starting point in this section, and provide preliminary context on Transformer-based neural machine translation models.

Modern neural machine translation systems are based on several standard components, namely a subword segmentation method and an encoder-decoder architecture called a Transformer. We describe these components in the context of multilingual translation.

The input and output of translation systems are sequences of tokens. These tokens are units from a dictionary built with the goal of reconstructing any sentence in any language. Using words as base units is challenging, as it leads either to vocabularies with poor coverage or to large vocabularies. This is especially true in the multilingual setting. Another limitation of word-based systems is that some languages, like Thai, are not naturally split into words. An alternative approach is to use subword units, which are learned directly from data (Sennrich et al., 2015; Kudo and Richardson, 2018). We use SentencePiece (https://github.com/google/sentencepiece) as it was designed to work with languages with no segmentation, making it particularly suited to our setting. We train a model with 0.9995 character coverage to have sufficient representation of character-based languages.
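One way to read the 0.9995 character coverage setting is as keeping the smallest set of most frequent characters whose occurrences account for at least that fraction of all characters in the training text. The sketch below illustrates that reading in plain Python. `chars_for_coverage` is a hypothetical helper written for this illustration and is not part of SentencePiece, which handles coverage internally during training.

```python
from collections import Counter

def chars_for_coverage(lines, coverage=0.9995):
    """Return the most frequent characters needed to cover `coverage` of all character occurrences."""
    # Whitespace is ignored here; this is a simplification made for the sketch.
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    total = sum(counts.values())
    kept, covered = [], 0
    for ch, n in counts.most_common():
        # Stop as soon as the characters kept so far reach the requested coverage.
        if total == 0 or covered / total >= coverage:
            break
        kept.append(ch)
        covered += n
    return kept

corpus = ["aaaa", "aaab", "aabc"]
# 'a' accounts for 9 of the 12 characters, 'b' for 2, and 'c' for 1.
print(chars_for_coverage(corpus, coverage=0.75))
print(chars_for_coverage(corpus, coverage=0.9))
```

A higher coverage value keeps more rare characters in the vocabulary, which matters for languages with large character inventories.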

1 Transformers

Both the encoder and decoder are composed of the same type of layers, called Transformer layers. Each Transformer layer takes a sequence of vectors as input and outputs a sequence of vectors. In the encoder, Transformer layers are composed of two sublayers, a self-attention layer and a feed-forward layer. These are applied sequentially and are both followed by a residual connection (He et al., 2015) and layer normalization (Ba et al., 2016):

$$Z = \mathrm{norm}\left(X + \text{self-attention}(X)\right), \qquad Y = \mathrm{norm}\left(Z + \mathrm{FFN}(Z)\right),$$

where $X$ is the input sequence of vectors and $Y$ is the output of the layer.

The self-attention layer is an attention layer that updates each element of the sequence by looking at the other elements, while the feed-forward layer (FFN) passes each element of the sequence independently through a 2-layer MLP. In the decoder, there is an additional third sublayer, between the self-attention and the feed-forward, which computes attention over the output of the encoder. We refer the reader to Vaswani et al. (2017) for details of these layers.

The Transformer architecture has been designed for the bilingual case, where the target language is fixed. In the case of multilingual machine translation, the target language is not fixed, and several strategies can be applied to condition the network to produce a sentence in the desired target language. Similarly to Ha et al. (2016) and Johnson et al. (2017), we add a special token in the encoder indicating the source language and a special token in the decoder indicating the target language.
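The language-token conditioning described above can be sketched in a few lines. The `__xx__` token format and the function name are assumptions made for this illustration; the text only states that a source-language token is added in the encoder and a target-language token in the decoder.

```python
def add_language_tokens(src_tokens, tgt_tokens, src_lang, tgt_lang):
    """Prefix the encoder input with a source-language token and the decoder input with a target-language token."""
    # The "__xx__" spelling of the special tokens is a hypothetical choice for this sketch.
    encoder_input = [f"__{src_lang}__"] + list(src_tokens)
    decoder_input = [f"__{tgt_lang}__"] + list(tgt_tokens)
    return encoder_input, decoder_input

enc, dec = add_language_tokens(["▁Bonjour", "▁le", "▁monde"], ["▁Hallo", "▁Welt"], "fr", "de")
print(enc)
print(dec)
```

At inference time, changing only the decoder-side token is what lets a single model produce any of the target languages from the same source sentence.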

Our starting point for improving massively multilingual translation models is a large Transformer model, with 12 Encoder and 12 Decoder layers, with 8192 FFN size and 1024 embedding dimension. We share the weight matrices of the input and output embeddings. The total parameter count is 1.2B. We train with the Adam optimizer (Kingma and Ba, 2015) and warmup first for 4000 updates, with label smoothing $0.1$ (Szegedy et al., 2015; Pereyra et al., 2017). For regularization, we tune the dropout parameter between $\{0.1,0.2,0.3\}$ . To stabilize the training of deeper Transformers, we train with LayerDrop (Fan et al., 2019) 0.05 and pre-normalization (Nguyen and Salazar, 2019).

To train with billions of sentences, we split the training data into 256 different shards to manage memory consumption. However, directly dividing mid and low resource languages into shards would reduce the variability of each shard’s data for mid or low resource languages. Imagine the case where there are only $100$ sentences of a language direction per shard — the model would easily overfit. Thus, each language is divided into a different number of shards based on resource level, such that high resource languages have more shards and the lowest resource languages only have one shard. Subsequently, lower resource shards are replicated until the full number of shards is reached.
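The sharding scheme can be sketched as follows. The rule that decides how many distinct shards a direction receives (`min_per_shard`) is an assumption made for this illustration, as the text only states that the count depends on resource level; the cyclic replication mirrors the description that lower resource shards are repeated until all 256 shards are filled.

```python
def assign_shards(num_sentences, total_shards=256, min_per_shard=100_000):
    """Map each direction to the local shard index used by every one of the global shards."""
    plan = {}
    for direction, n in num_sentences.items():
        # More data means more distinct shards; the lowest-resource directions get a single shard.
        k = max(1, min(total_shards, n // min_per_shard))
        # Replicate the k local shards cyclically until every global shard is covered.
        plan[direction] = [g % k for g in range(total_shards)]
    return plan

plan = assign_shards({"en-fr": 100_000_000, "en-si": 250_000, "is-zh": 40_000})
print(len(set(plan["en-fr"])), len(set(plan["en-si"])), len(set(plan["is-zh"])))
```

With this layout, a low resource direction contributes its full data to every global shard instead of a thin slice of it, which avoids the situation where a shard contains only a handful of sentences for that direction.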

Unless otherwise specified: for all results, we report single models with no checkpoint averaging, use beam search with beam 5, and do not tune length penalty.

Building a Many-to-Many Parallel Dataset for 100 Languages

In this section, we provide an overview of our Many-to-Many setting: the selection of the $100$ languages, the evaluation benchmarks, and the construction of a large-scale training set through data mining (Artetxe and Schwenk, 2018b) and backtranslation (Sennrich et al., 2016a) that provides training data for thousands of directions.

The first step of establishing a Many-to-Many dataset is to select $100$ languages for which there already exist high-quality, annotated datasets that can be used for model evaluation.

We consider several factors to select which languages to focus on. First, we include widely-spoken languages from geographically diverse language families. We cover a diversity of scripts and resource levels (as shown in Table 1) to have high coverage of languages worldwide. Second, we use languages for which public evaluation data exists, as we must be able to quantify model performance. Lastly, we only use languages for which monolingual data is available, as monolingual data is a critical resource for large-scale mining. Combining these three criteria yields our full list of 100 languages, summarized in Table 1.

1.2 Evaluation Benchmarks

We use publicly available evaluation datasets to evaluate the performance of all of our models. To cover our set of $100$ languages and $2200$ directions, we bring together data from a variety of sources. We describe each evaluation dataset below.

WMT — The majority of language pairs from WMT go through English and the data is from the news domain. We consider data for $13$ languages (Ondrej et al., 2017; Bojar et al., 2018; Barrault et al., 2019).

WAT — The WAT competition covers Asian languages paired with English. We consider data for Burmese-English (Riza et al., 2016), which contains news articles. WAT contains many other evaluation directions, but many of those are covered by WMT or in a specific domain, so we focus on Burmese-English for WAT only.

IWSLT — The IWSLT translation competition contains data from TED talks paired with English translations. We use data for $4$ languages (Cettolo et al., 2017).

FLORES — FLORES (Guzmán et al., 2019; https://github.com/facebookresearch/flores) pairs two low resource languages, Sinhala and Nepali, with English in the Wikipedia domain.

TED — The TED Talks dataset (Ye et al., 2018; https://github.com/neulab/word-embeddings-for-nmt) contains translations between more than $50$ languages; most of the pairs do not include English. The evaluation data is n-way parallel and contains thousands of directions.

Autshumato — Autshumato (https://repo.sadilar.org/handle/20.500.12185/506; CTexT® (Centre for Text Technology, North-West University), South Africa; Department of Arts and Culture, South Africa) is an $11$-way parallel dataset comprising $10$ African languages and English from the government domain. There is no standard valid/test split, so we use the first half of the dataset for validation and the second half for test.

Tatoeba — Tatoeba Challenge (https://tatoeba.org/eng/) covers $692$ test pairs from mixed domains where sentences are contributed and translated by volunteers online. The evaluation pairs we use from Tatoeba cover 85 different languages.

We evaluate the quality of translations with BLEU (Papineni et al., 2002). We first detokenize all data, then apply standard tokenizers for each language before computing BLEU. For most languages, we use the Moses tokenizer (Koehn et al., 2007; https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl). For Chinese, we use the SacreBLEU tokenizer (tok zh) and convert all traditional characters generated by the model to simplified characters using HanziConv (https://github.com/berniey/hanziconv) (Post, 2018). The evaluation datasets for Chinese usually contained simplified characters. However, our training data contains a mix of simplified and traditional characters, and thus the model could generate either; we convert the generated traditional Chinese characters to simplified for consistency. For Indian languages, we use the Indic NLP library (Kunchukuttan, 2020; https://github.com/anoopkunchukuttan/indic_nlp_library); for Japanese, Kytea (https://github.com/neubig/kytea); for Thai, PyThaiNLP (Phatthiyaphaibun et al., 2016; https://github.com/PyThaiNLP/pythainlp); for Arabic, the QCRI Arabic Normalizer (http://alt.qcri.org/tools/arabic-normalizer/); for Korean, Mecab (https://pypi.org/project/python-mecab-ko/); and for Burmese, the official segmentation tool provided by Ding et al. (2019). For Romanian, we follow Sennrich et al. (2016b) and apply Moses tokenization and special normalization, and remove diacritics (https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/). Finally, for Serbian, we transliterate the output to Latin characters before computing BLEU. In Serbian, both Latin script and Cyrillic script are used, and they are often intermixed within a sentence in the evaluation data. As the target sentence could be in either script and it is not possible to predict the target script from the input, we transliterate before computing BLEU. We release the tokenization and evaluation scripts for reproducibility (https://github.com/pytorch/fairseq/tree/master/examples/m2m_100). We remove all data from all evaluation sets from our training sets.

2 Covering the Language Matrix by Mining Relevant Parallel Data

Supervised translation systems rely on large quantities of parallel sentences, which we refer to as bitext data, which are traditionally derived from human translations. Most existing bitext datasets go through English, with a few domain specific exceptions such as proceedings from international organizations (Koehn, 2005; Ziemski et al., 2016). These corpora are limited in size and domain, and an alternative is to mine parallel data (Resnik, 1999; Utiyama and Isahara, 2003) in large collections of monolingual data (Conneau et al., 2019; Wenzek et al., 2019). In this work, we leverage and extend the corpus provided by two of these mining projects: CCMatrix (Schwenk et al., 2019) and CCAligned (El-Kishky et al., 2020; http://www.statmt.org/cc-aligned). In the following, we describe our mining strategy and summarize the main ideas of CCMatrix and CCAligned. We refer the reader to the references for a detailed description of the approaches.

Mining parallel data consists of searching for sentences that could be potential translations in large monolingual corpora. This search requires a measure that captures the semantic similarity between sentences in different languages. Most recent methods build this similarity by comparing the embeddings from a neural network trained on multilingual data (Artetxe and Schwenk, 2018b; Chen et al., 2020; Kvapilíková et al., 2020). We focus on the embeddings generated by the LASER encoder, which enables the comparison of sentences in $94$ different languages (Artetxe and Schwenk, 2018a). We then retrieve parallel corpora efficiently using FAISS indexing (Johnson et al., 2019). LASER embeddings generalize to unseen languages, like Asturian, allowing us to mine bitexts for $100$ languages. The generic data mining pipeline consists of several steps: (1) a large corpus of text is preprocessed and divided into different languages, (2) candidate pairs of aligned sentences are embedded and stored in an index, (3) indexed sentences are compared to form potential pairs, (4) the resulting candidate pairs are filtered in post-processing.
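The comparison step of this pipeline can be illustrated with a toy example. The sketch below uses hand-written 2-dimensional vectors in place of LASER embeddings, a brute-force search in place of a FAISS index, and a plain cosine-similarity threshold as the acceptance rule; all three are simplifying assumptions for illustration and do not reproduce the scoring used by the actual mining projects.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src_emb, tgt_emb, threshold=0.9):
    """For each source sentence, keep its most similar target sentence if it clears the threshold."""
    pairs = []
    for s_id, s_vec in src_emb.items():
        best_id, best_sim = None, -1.0
        for t_id, t_vec in tgt_emb.items():
            sim = cosine(s_vec, t_vec)
            if sim > best_sim:
                best_id, best_sim = t_id, sim
        if best_id is not None and best_sim >= threshold:
            pairs.append((s_id, best_id, best_sim))
    return pairs

src = {"s0": [1.0, 0.0], "s1": [0.0, 1.0], "s2": [1.0, 1.0]}
tgt = {"t0": [0.9, 0.1], "t1": [0.1, 0.9]}
mined = mine_pairs(src, tgt, threshold=0.95)
print([(s, t) for s, t, _ in mined])
```

Here `s2` has no sufficiently close counterpart and is dropped, which corresponds to a sentence with no translation in the other corpus.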

CCMatrix takes a global approach: all unique sentences in one language are compared with all unique sentences in another language. This global mining approach has the advantage of considering all possible documents when searching for the translation of a sentence. CCMatrix works on the large monolingual corpora in the $91$ languages of CCNet (Wenzek et al., 2019), but at this scale, the global search is computationally demanding even with fast indexing from FAISS (Johnson et al., 2019). Thus, we apply it to a selected subset of relevant pairs, as detailed in § 3.2.1.

CCAligned avoids the scaling challenges of global mining by pre-selecting documents to compare. This local mining follows a hierarchical approach: first, document-level language identification along with various rules is applied to find whole documents that are likely to contain mutual translations (El-Kishky et al., 2020). Parallel sentences are then mined using LASER-based alignment within the paired documents only. Filtering (Chaudhary et al., 2019) is performed to remove unaligned data that exists because the original webpage did not have any parallel data, only partial parallel data, or other processing failures. One advantage of this approach is that it is very fast, scalable, and retrieves parallel sentences with high precision. Another is that each English document is aligned to many non-English documents — thus, mining non-English pairs can be quickly performed by joining non-English documents paired to the same source.
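The last point, obtaining non-English document pairs by joining on a shared English document, can be sketched as a simple grouping operation. The tuple layout and document identifiers below are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

def join_through_english(aligned_docs):
    """Given (english_doc, language, doc) alignments, return non-English document pairs sharing an English source."""
    by_english = defaultdict(list)
    for en_doc, lang, doc in aligned_docs:
        by_english[en_doc].append((lang, doc))
    pairs = set()
    for docs in by_english.values():
        for i, (lang_a, doc_a) in enumerate(docs):
            for lang_b, doc_b in docs[i + 1:]:
                # Only pair documents written in two different languages.
                if lang_a != lang_b:
                    pairs.add(tuple(sorted([(lang_a, doc_a), (lang_b, doc_b)])))
    return sorted(pairs)

alignments = [
    ("en/1", "fr", "fr/1"),
    ("en/1", "de", "de/1"),
    ("en/2", "fr", "fr/2"),
]
print(join_through_english(alignments))
```

Sentence-level mining would then run only inside each joined document pair, as it does for the pairs involving English.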

We apply a filtering step to remove sentences that consist of more than 50% punctuation. The data is then deduplicated, and we remove any sentence that appears in any validation or test dataset, even if it is associated with another language pair. Finally, we apply length and language-specific filtering. The length filtering removes sentences that are too long (more than $250$ subwords after segmentation with SPM) or that have a length mismatch with their translation (a length ratio greater than $3\times$). The language-specific filtering removes sentences in which more than $50\%$ of the characters have not been marked as core to the identified language, where core characters are those commonly used in the identified language, with the exception of white space, numbers, punctuation, and Latin characters for non-Latin script languages.
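The punctuation and length filters can be written as a small predicate. This is a minimal sketch under stated assumptions: punctuation is approximated by ASCII punctuation from Python's `string` module, subword lengths are passed in by the caller rather than computed with a real segmentation model, and the language-specific character filter is omitted.

```python
import string

def punctuation_fraction(text):
    """Fraction of non-whitespace characters that are ASCII punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(c in string.punctuation for c in chars) / len(chars)

def keep_pair(src, tgt, src_len, tgt_len, max_len=250, max_ratio=3.0, max_punct=0.5):
    """Return True if a sentence pair passes the punctuation, length, and length-ratio filters."""
    if punctuation_fraction(src) > max_punct or punctuation_fraction(tgt) > max_punct:
        return False
    if src_len > max_len or tgt_len > max_len:
        return False
    if src_len == 0 or tgt_len == 0:
        return False
    # Length mismatch: the longer side may be at most max_ratio times the shorter side.
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio

print(keep_pair("Hello world.", "Bonjour le monde.", 3, 4))
print(keep_pair("!!! ??? ...", "Bonjour", 3, 1))
```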

2.1 Bridge Language Group Mining Strategy

Mining data for each and every language pair is prohibitive — previous work circumvents this issue by focusing only on the $99$ pairs that go through English (Zhang et al., 2020). One alternative to the extensive computation required to mine all possible combinations of pairs is sparse mining, or mining only a select subset of pairs. A straightforward strategy is to randomly select pairs to mine, but this does not use any linguistic information on how languages are related and spoken around the world.

In this work, we propose an alternative based on language families and bridge languages that avoids exhaustively mining every possible pair. Our goal is to reduce the number of bitext pairs while preserving translation directions of practical interest. We first group all the $100$ languages into $14$ language groupings. All languages within a grouping are mined against each other. For instance, within the Indic language grouping, we mine all pairs of Bengali, Hindi, Marathi, Tamil, Urdu, and so on. The motivation for this strategy is two-fold. First, people living in areas that speak multiple languages in the same grouping tend to communicate a lot with each other and would benefit from high quality direct translation. Second, systematically mining languages of the same grouping is helpful for training language-specific parameter models (see § 5.2).

For the most part, languages are grouped by linguistic similarity, e.g. Germanic, Slavic, or Malayo-Polynesian languages. However, the size of the resulting groupings varies greatly, resulting in less mined data for the languages in the smallest groupings. We further group languages by geographic and cultural proximity to reduce this discrepancy. For example, Uralic and Baltic languages are gathered into a single group to increase the quantity of mined data. The resulting groupings are shown in Table 1.

To connect languages across groupings, we define 1–3 bridge languages in each grouping, usually those with the most resources, such as Bengali, Hindi, and Tamil for the $12$ languages in the Indo-Aryan family. All $26$ bridge languages are highlighted in Table 1. These bridge languages are mined against all other bridge languages. Finally, all $100$ languages are mined against English. We illustrate this mining strategy in Figure 2. On the left, we depict what many current approaches model: data only through English. On the right, we depict our Many-to-Many language matrix for several example languages. Compared to English-Centric, our dataset has far greater coverage of non-English, direct translation directions.
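The three mining rules (all pairs within a grouping, all bridge languages against each other, and every language against English) can be sketched as a set construction. The groupings and bridge languages below are a toy example invented for this illustration and do not reproduce Table 1.

```python
from itertools import combinations

def pairs_to_mine(groupings, bridges, english="en"):
    """Return the set of unordered language pairs selected by the bridge mining strategy."""
    pairs = set()
    all_langs = set()
    for langs in groupings.values():
        all_langs.update(langs)
        # Rule 1: mine every pair of languages within a grouping.
        pairs.update(frozenset(p) for p in combinations(sorted(langs), 2))
    # Rule 2: mine every bridge language against every other bridge language.
    pairs.update(frozenset(p) for p in combinations(sorted(bridges), 2))
    # Rule 3: mine every language against English.
    pairs.update(frozenset((lang, english)) for lang in all_langs if lang != english)
    return pairs

toy_groupings = {
    "germanic": ["en", "de", "nl", "sv"],
    "indic": ["hi", "bn", "mr"],
    "romance": ["fr", "es", "pt"],
}
toy_bridges = {"de", "hi", "fr", "es"}
mined = pairs_to_mine(toy_groupings, toy_bridges)
n_langs = sum(len(v) for v in toy_groupings.values())
print(len(mined), "of", n_langs * (n_langs - 1) // 2, "possible pairs")
```

In this toy setting, a pair such as Marathi and Portuguese is never mined directly, while Hindi and French are connected because both are bridge languages.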

In total, our final training dataset contains 7.5B parallel sentences, corresponding to $2200$ directions. In Figure 3, we show all bridge languages and demonstrate how their associated training data is divided between translations with English, within a language grouping, or with bridge languages across language groupings. Of particular interest is the comparison between the additional Many-to-Many data and the data through English. We observe that 5–10 times more parallel data can be mined if using a Many-to-Many strategy, compared to an English-Centric one. This is particularly beneficial for mid- and low-resource languages.

2.2 Results

We validate the impact of several decisions made during data construction. First, we study the impact of our bridge language strategy compared to English-Centric mining augmented by other random pairs, as well as fully random mining. Second, we investigate the impact of the level of sparsity chosen in our bridge strategy, focusing on a subset of 50 languages.

We experimentally evaluate the impact of our bridge language mining strategy on the performance of our baseline model in Table 2 (left). We consider two additional baselines, a fully random mining strategy (Random 20%) and an English-Centric + Random strategy (Random 20% w/ En). In the Random strategy, mined pairs are randomly chosen, while in the English-Centric + Random strategy, we retain all pairs through English and only select the remaining pairs randomly. We show that fully random mining has a substantial negative impact on performance, as a lot of high quality data is aligned through English, so sampling fully randomly eliminates a large portion of the potential training set. Random 20% w/ En is worse as well. Through examination, we find that randomly sampling pairs to mine often selects pairs that do not produce as much data, as the pairs may not include high resource languages. However, the bridge language strategy ensures that high resource languages are mined, and then focuses on mining languages in related families. This produces a large amount of bitext, and at the same time, covers many language directions.

We control the sparsity of our language matrix using the number of bridge languages. In Table 2 (right), we show the impact of sparsity on the performance of our baseline model compared to a fully mined language matrix (0% sparse). We observe that increasing the amount of mined data to make the matrix less sparse is helpful, but fully mining the matrix is not substantially better. The main reason is that our mining strategy prioritizes frequently used pairs which are often associated with the largest bitext, while the discarded pairs are often associated with small bitext. For example, fully mining the matrix would mine a pair such as Icelandic to Chinese, but the amount of data produced by mining this pair is quite low. This case is representative of what occurs as the full matrix is mined: as increasingly more data is mined, the additional pairs begin to add less data, which in turn leads to diminishing quality improvements.

3 Augmenting Bitext Data with Backtranslation

Backtranslation (BT) creates synthetic bitexts from unaligned monolingual data (Schwenk, 2008; Bojar and Tamchyna, 2011; Sennrich et al., 2016a; Edunov et al., 2018; Hoang et al., 2018). The core idea is to translate monolingual sentences in the backward direction, and add the obtained synthetic translations to the training set. More precisely, when training a model to translate from a source language to a target language, backtranslation generates additional data by translating monolingual target sentences into the source language. Using backtranslation thus requires the ability to translate in both directions, which fits well into the setting of multilingual machine translation (Zhang et al., 2020; Siddhant et al., 2020). However, generating these backtranslations is time consuming even for a single direction, which is compounded in the Many-to-Many case. We thus focus on applying backtranslation on specific pairs to supplement mining data where needed.

Our goal is to translate between 100 languages and to provide good translation quality for as many translation directions as possible. To this end, we use BT to improve directions which have initially lower translation quality. We identify these language directions by measuring the quality of our 1.2B parameter multilingual model before applying BT. Since backtranslation is computationally intensive, we focus on $100$ directions with a BLEU score of between $2$ and $10$. For $50$ of these directions, we do not have any bitext at all as we did not mine all 4,950 possible language pairs.

For the selected pairs, we first generate synthetic translations that are added to the training set without upsampling. Following Caswell et al. (2019), we add a special encoder-side BT token to these translations to indicate to the model that they are synthetic. For each of the $100$ target languages, we randomly sample $50$ million unique monolingual sentences from the cleaned CommonCrawl corpus of Wenzek et al. (2019). The synthetic translations are then generated with our $1.2$ B MMT model. We use a beam search with beam of size $5$ and fix all the hyper-parameters, including the length penalty, to the same values for all the directions. We apply the same filtering to the backtranslations as the original mined training data, which substantially reduces the size of the resulting synthetic bitexts.
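The selection and tagging steps can be sketched as follows. The token spelling `__bt__`, the inclusive BLEU bounds, and the `toy_translate` placeholder standing in for the multilingual model are all assumptions made for this illustration; the BLEU values are invented toy numbers, not results.

```python
BT_TOKEN = "__bt__"  # hypothetical spelling of the special encoder-side BT token

def select_bt_directions(bleu_by_direction, low=2.0, high=10.0):
    """Pick the directions whose BLEU falls in the band targeted for backtranslation."""
    return sorted(d for d, bleu in bleu_by_direction.items() if low <= bleu <= high)

def make_bt_example(target_sentence, src_lang, tgt_lang, translate):
    """Translate a monolingual target sentence into the source language and tag the synthetic source."""
    synthetic_source = translate(target_sentence, tgt_lang, src_lang)
    return f"{BT_TOKEN} {synthetic_source}", target_sentence

def toy_translate(text, from_lang, to_lang):
    # Placeholder for the translation model: it only labels the text with the direction.
    return f"[{from_lang}->{to_lang}] {text}"

toy_scores = {("is", "zh"): 4.1, ("fr", "de"): 28.3, ("ne", "si"): 0.7, ("bn", "ta"): 9.5}
selected = select_bt_directions(toy_scores)
print(selected)
print(make_bt_example("Hallo Welt", "fr", "de", toy_translate))
```

The synthetic sentence always sits on the source side, so the model is trained to produce genuine target-language text from possibly noisy input.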

Results are shown in Figure 4, where we compare the original Many-to-Many model used to create the backtranslations (blue line) with the improvements after training a multilingual model with the backtranslation added (orange scatter). Backtranslation almost always improves performance for any direction, regardless of the original BLEU score. As the amount of data generated with BT correlates with the length of training time, we decide to focus on applying BT on directions with low performance (BLEU between 2 and 10) to improve our MMT system where it underperforms.

4 Balancing Languages in a Many-to-Many Setting

Our goal is to design a sampling technique such that the distribution of languages on the source and target sides is equal to a given target distribution. Unfortunately, sequentially sampling the source language and then the target would not work, as some languages are only paired with a subset of languages, making it impossible to sample the target language according to a given distribution. Moreover, the sizes and distributions of bitexts vary greatly from one language to another. Instead, we propose directly sampling a pair of languages from a matrix of pair probabilities such that the marginal distributions of languages correspond to our target distribution. In practice, this means that each row and column of the matrix should sum to the probability of the corresponding language. More precisely, we estimate a square matrix $\mathbf{P}^{*}$ such that:

$$\max_{\mathbf{P}} \ \mathrm{tr}(\mathbf{P}\mathbf{Q}) \quad \text{s.t.} \quad \mathbf{P}\mathbf{1}_{L} = \mathbf{p}^{\frac{1}{T}}, \quad \mathbf{P}^{\top}\mathbf{1}_{L} = \mathbf{p}^{\frac{1}{T}},$$

where $\mathbf{p}$ is the vector stacking the probabilities of the $L$ languages and $\mathbf{Q}$ is the matrix of pair probabilities. This problem can be solved exactly with the Sinkhorn-Knopp algorithm. The matrix $\mathbf{Q}$ has entries equal to $0$ for pairs with no bitext and this algorithm preserves them in the solution $\mathbf{P}^{*}$, hence adding no probability mass to missing bitexts. We calculate this once before training and set the temperature $T$ to $5$. In Table 3, we show the benefits of this strategy over temperature sampling with a constant improvement of $0.5$ in BLEU.
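The row and column constraints described above can be illustrated with a Sinkhorn-Knopp-style iteration that alternately rescales the rows and columns of the pair matrix. This is a minimal sketch, not the authors' implementation: the 4-language matrix and probabilities are toy values, and normalizing the tempered vector so that the matrix sums to one is an assumption of the sketch.

```python
def tempered(p, T=5.0):
    """Raise each probability to the power 1/T and renormalize so the result sums to one."""
    scaled = [x ** (1.0 / T) for x in p]
    z = sum(scaled)
    return [x / z for x in scaled]

def sinkhorn_knopp(Q, target, iters=2000):
    """Alternately rescale rows and columns of Q until both marginals match `target`."""
    n = len(Q)
    P = [row[:] for row in Q]
    for _ in range(iters):
        for i in range(n):  # rescale each row to its target sum
            s = sum(P[i])
            if s > 0:
                P[i] = [x * target[i] / s for x in P[i]]
        for j in range(n):  # rescale each column to its target sum
            s = sum(P[i][j] for i in range(n))
            if s > 0:
                for i in range(n):
                    P[i][j] *= target[j] / s
    return P

# Toy pair matrix: languages 2 and 3 share no bitext, so those entries are zero.
Q = [
    [0.0, 5.0, 3.0, 2.0],
    [5.0, 0.0, 2.0, 1.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
]
target = tempered([0.5, 0.3, 0.15, 0.05], T=5.0)
P = sinkhorn_knopp(Q, target)
print([round(sum(row), 4) for row in P])
print(P[2][3], P[3][2])
```

Because the updates are purely multiplicative, entries that start at zero stay at zero, which is the property that keeps probability mass away from pairs without bitext. Note that not every sparsity pattern admits a solution for arbitrary marginals; the toy values were chosen so that one exists.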

Many-to-Many Compared to English Centric

In this section, we first present an experiment to better understand the performance improvements of English-Centric systems and to compare them to our Many-to-Many setting.

We train our $1.2$ B model on the full $100$ language Many-to-Many dataset and compare it to the same model trained only on data through English. We use the same vocabulary built with SentencePiece on the full dataset in both cases. Each model has a different dataset size and we train for 500K updates. This number of updates corresponds to one pass over the entire Many-to-Many dataset and $3.5$ passes on the English-centric data. We tune the dropout rate for each model over the values $\{0.1,0.2,0.3\}$ .

1 Main Result

In Table 4, we compare the performance of both models on different types of directions, namely, any language to English (To English), English to any language (From English), and all the directions not involving English (Non-English). Performance is aggregated over $150$ directions for To English and From English, and over 2500 directions for Non-English. On the pairs including English, both models achieve similar performance, suggesting that a $1.2$ B model does not underfit even though the additional non-English data represents $98\%$ of the directions and 74% of the data. For the non-English pairs, we consider two translation strategies for the English-Centric model: directly translating as if the model was trained on the pair – by using the corresponding language tokens – or by pivoting through English. Our model outperforms direct translation with the English-Centric model by $10.2$ BLEU and when the English-Centric model uses pivoting by $5.5$ BLEU. While this result is not surprising, it confirms that a purely English-Centric model has limited potential on non-English pairs, and there is a fundamental need for training on Many-to-Many data.
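The two translation strategies compared for non-English pairs differ only in how the model is called. The sketch below makes the pivoting strategy explicit; `translate` is a hypothetical callable with the signature `translate(text, src_lang, tgt_lang)`, and `toy_translate` is a placeholder that records calls instead of running a model.

```python
def translate_via_pivot(translate, text, src, tgt, pivot="en"):
    """Translate src -> pivot -> tgt, falling back to a single call when the pivot is already involved."""
    if src == pivot or tgt == pivot:
        return translate(text, src, tgt)
    intermediate = translate(text, src, pivot)
    return translate(intermediate, pivot, tgt)

calls = []

def toy_translate(text, src, tgt):
    # Placeholder for a model call: records the direction and appends the target language code.
    calls.append((src, tgt))
    return f"{text}|{tgt}"

result = translate_via_pivot(toy_translate, "Bonjour", "fr", "de")
print(result, calls)
```

Pivoting costs two decoding passes and can compound errors from the first pass, whereas direct translation uses a single pass with the target-language token set to the desired language.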

2 Understanding the Source of Improvement

The main impact of adding Many-to-Many data is on the directions that do not include English. In this section, we provide a detailed study of where we observe the largest improvements with the additional data.

Many non-English pairs are not covered by our Many-to-Many model, and we can thus study if the improvements we observe originate primarily from directions associated with bitext data or if we observe the same improvement on directions where the Many-to-Many model generates translations in a zero-shot fashion. In Table 5, we show the performance if the evaluation is split between the Non-English pairs with and without bitext. On directions with bitext, the Many-to-Many model outperforms the English-Centric model by $7$ BLEU for direct translation, and by $3.5$ BLEU for English-Centric with pivoting. This shows the importance of diverse data. Not surprisingly, this gap is even bigger on pairs without bitext. Many-to-Many performs nearly $11$ BLEU better than the English-Centric model for direct translation, and with pivoting the gain remains over $6$ BLEU.

A hypothesis to explain the gain between English-Centric and Many-to-Many models is the effect of additional source and target side training data. Even if the Many-to-Many system has never seen a direction at training time, it benefits from additional source and target side data available through other training pairs. As mining non-English language pairs creates more training data compared to English-centric datasets, the Many-to-Many model benefits from a larger training set. In Table 6, we compare both models after seeing the same quantity of data. We train both models for one epoch. The English-Centric model performs better on To English directions, likely because it only has one output language to learn, but the Many-to-Many model outperforms on From English directions and Non-English directions.

The main factor for improvement is the quantity of data associated with either a pair or a language. Pairs that have a large quantity of mined data, such as Spanish-Portuguese, greatly benefit from our Many-to-Many dataset. We show this effect in the left panel of Figure 5. A second source of improvement is observed on languages for which the Many-to-Many dataset contains a large amount of data across many pairs. This data benefits the decoder-side language model in a way that is comparable with BT. In the right panel of Figure 5, we show the impact of this cumulative monolingual data on the average performance per language. Finally, we also observe a third type of improvement from the similarity in vocabulary and syntax between related languages. A striking example is the quality of translation between English and Belarusian, where the Many-to-Many model achieves 12.7 BLEU on the TED evaluation set, compared to 3.2 BLEU for a bilingual model. The number of bitexts for Belarusian is small, but Belarusian is related to Russian, and the Many-to-Many model transfers its knowledge from Russian to Belarusian.

3 Understanding the Performance of English-Centric Systems

In Table 4, we confirm an observation made in Arivazhagan et al. (2019) that an English-Centric model improves the most over bilingual models on the directions into English, while improvement in the other directions (From English) remains more modest. A hypothesis to explain this discrepancy between directions from and to English is that the decoder of an English-Centric model learns a better English language model by leveraging the aggregated English data across all through-English directions.

We test this hypothesis with a controlled experiment where we compare a Many-to-English model with bilingual models using backtranslated English data (§ 3.3). The experiment is based on 11 Slavic languages and we backtranslate the exact same English data as was used to train the Many-to-English model so that both models are trained on the same English data. Figure 6 shows that backtranslation performs comparably to the Many-to-English approach. While this improves our understanding of Many-to-English translation, a multilingual approach nevertheless retains the advantage of combining many directions into a single model which greatly simplifies modeling.

Components for Scaling Multilingual Translation Models

Our goal is to build a single model capable of translating $9,900$ language directions covering $100$ languages. This creates several challenges for models with insufficient capacity to capture that many languages and scripts adequately. To this end, previous MMT work has considered different types of large capacity models (Arivazhagan et al., 2019; Lepikhin et al., 2020). In this section, we investigate different ways to add capacity to an MMT model: we first investigate dense scaling, where we increase the depth and width of standard Transformer architectures. Then, we identify disadvantages of dense scaling, and propose an alternative to effectively add language-specific parameters and exploit the nature of language similarities within the task of multilingual machine translation.
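The figure of $9,900$ directions follows from counting ordered (source, target) pairs among $100$ languages. The short sketch below makes that arithmetic explicit; the function name and the returned fields are illustrative choices, and the English share it prints describes the full set of possible directions, not the subset with mined training data.

```python
def count_directions(num_languages: int) -> dict:
    """Count ordered translation directions among a set of languages."""
    total = num_languages * (num_languages - 1)   # every ordered (source, target) pair
    with_english = 2 * (num_languages - 1)        # X -> En plus En -> X
    non_english = total - with_english
    return {
        "total": total,
        "with_english": with_english,
        "non_english": non_english,
        "non_english_share": non_english / total,
    }


stats = count_directions(100)
print(stats)
```

Under this count, directions involving English are a small minority of all possible directions, which is why an English-Centric training set leaves most directions without direct supervision.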

During the training of a neural network, we need to fit its weights, activations, gradients, and optimizer state in memory. This restricts the maximum capacity of a network that we can train on a single accelerated device such as a GPU. In this section, we describe two directions to circumvent this limitation. The first direction focuses on fitting a larger model on a single device by reducing the memory required by activations and optimizer states during the training process. The second direction focuses on efficient training of even larger models through model parallelism, i.e., splitting a model across multiple devices. In this work, we pursue both techniques to densely scale the capacity of Transformers.

To reduce the amount of memory, we consider optimizer state sharding and gradient checkpointing. Optimizer state sharding (Rajbhandari et al., 2019) divides the optimizer state across distributed data parallel workers so that each worker only needs to store a fraction of the optimizer state. We also apply gradient checkpointing, which saves memory by discarding intermediate activations before the forward pass finishes (Chen et al., 2016). During the backward pass, these activations are recomputed as required. This trades time for memory. In the case of a Transformer based architecture, applying gradient checkpointing at pipeline parallel model partition boundaries reduces the memory used by activations by almost 50%.
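The time-for-memory trade of gradient checkpointing can be illustrated on a toy chain of scalar functions. The sketch below is not the implementation used in this work: the layer list, the segment size, and all function names are assumptions made for illustration. It stores only the input of each segment during the forward pass and recomputes the remaining activations segment by segment during the backward pass.

```python
import math

# Each toy "layer" is a pair (forward function, derivative function) on scalars.
LAYERS = [
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: 2.0 * x + 1.0, lambda x: 2.0),
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2.0 * x),
]


def forward_with_checkpoints(x, layers, segment_size):
    """Run the forward pass, keeping only the input of each segment."""
    checkpoints = {}
    for i, (f, _) in enumerate(layers):
        if i % segment_size == 0:
            checkpoints[i] = x      # stored activation (segment input)
        x = f(x)                    # other activations are not kept
    return x, checkpoints


def backward_with_recompute(checkpoints, layers, segment_size):
    """Compute d(output)/d(input), recomputing activations per segment."""
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment_size, len(layers))
        # Recompute the input of every layer in this segment from its checkpoint.
        inputs = []
        x = checkpoints[start]
        for f, _ in layers[start:end]:
            inputs.append(x)
            x = f(x)
        # Chain rule through the segment, last layer first.
        for (_, df), inp in zip(reversed(layers[start:end]), reversed(inputs)):
            grad *= df(inp)
    return grad


def full_backward(x, layers):
    """Reference gradient that stores every activation."""
    inputs = []
    for f, _ in layers:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), inp in zip(reversed(layers), reversed(inputs)):
        grad *= df(inp)
    return grad


output, ckpts = forward_with_checkpoints(0.3, LAYERS, segment_size=2)
grad_ckpt = backward_with_recompute(ckpts, LAYERS, segment_size=2)
grad_ref = full_backward(0.3, LAYERS)
```

With four layers and a segment size of two, only two activations are stored instead of four, and the resulting gradient matches the reference that keeps every activation.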

Reducing the memory consumption enables fitting greater model capacity on a single GPU, but the physical limitations of a single device still apply. A solution is to split the model into separate components that are dispatched across different GPUs and trained in parallel. This type of solution scales model capacity with the number of GPUs. There are two broad paradigms to split a model: along the width or along the depth. Tensor parallelism (Shoeybi et al., 2019; Shazeer et al., 2018) splits by width, while pipeline parallelism (Huang et al., 2019; Kim et al., 2020) splits by depth, placing different layers on different GPUs. We use pipeline parallelism, but both methods work equally well with Transformers. We use the implementation from fairscale (https://github.com/facebookresearch/fairscale).
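Splitting by depth amounts to assigning contiguous blocks of layers to devices. The following sketch shows one simple way to compute such an assignment; it does not use or reproduce the fairscale API, and the function name and the near-equal balancing rule are assumptions made for illustration.

```python
from typing import List


def partition_layers(num_layers: int, num_devices: int) -> List[range]:
    """Split layer indices into contiguous, near-equal chunks, one per device."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    base, extra = divmod(num_layers, num_devices)
    chunks = []
    start = 0
    for device in range(num_devices):
        # The first `extra` devices receive one additional layer.
        size = base + (1 if device < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


placement = partition_layers(num_layers=10, num_devices=4)
for device, layers in enumerate(placement):
    print(f"device {device}: layers {list(layers)}")
```

In a real pipeline the balance would typically account for per-layer memory and compute cost, not only the number of layers.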

1.2 Training Large Dense Models

We investigate several strategies to increase the capacity of a sequence-to-sequence Transformer model in the context of multilingual machine translation.

We consider increasing the capacity of a Transformer by either increasing the number of layers (depth axis) or the dimensions of each layer, including the feedforward (width axis). On the left panel of Figure 7, we analyze which axis to prioritize by comparing models with different sizes, $1$ B, $2$ B, and $10$ B, obtained by growing their depth or width (see Appendix B for model configurations and dimensions). We report their performance in BLEU and their inference speed measured in words per second (WPS). We train these models on a dataset that covers $80$ languages and evaluate them on $38$ different benchmark directions with more than $1$ k parallel sentences per direction. The main result of this study is that wider models scale better than deeper models in terms of performance and WPS. In the rest of this paper, we thus focus on wider models.

In the right panel of Figure 7, we compare the performance of wide models as we increase their capacity from $418$ M to $12$ B parameters. We train these models on the full set of $100$ languages and evaluate them on all supervised evaluation pairs. We report their performance in BLEU for pairs with either low, mid or high resource training data. First, as we increase the number of parameters, we observe that the performance increases, even on low-resource pairs. This suggests that even a $12$ B parameter model could be underfitting with our many-to-many multilingual dataset. However, improvements increase roughly logarithmically in the number of parameters, and we need to scale model size by an order of magnitude to improve by a few BLEU points, e.g., $+1.5$ BLEU from $1.2$ B to $12$ B. As we scale models densely, their runtime and memory usage become too prohibitive to justify the gain in performance, and so, we consider alternatives to increase the capacity of our models more efficiently.
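As a rough worked reading of this logarithmic trend, one can write the score as a function of the parameter count $N$ and calibrate the slope on the single quoted comparison ($+1.5$ BLEU from $1.2$ B to $12$ B). The functional form and the symbols $N_0$ and $\alpha$ are assumptions introduced here for illustration; a slope estimated from one pair of points is indicative only.

```latex
\mathrm{BLEU}(N) \;\approx\; \mathrm{BLEU}(N_0) + \alpha \,\log_{10}\frac{N}{N_0},
\qquad
\alpha \;\approx\; \frac{1.5}{\log_{10}(12/1.2)} \;=\; 1.5 .
```

Under this assumed form, doubling the number of parameters would yield about $\alpha \log_{10} 2 \approx 0.45$ BLEU, which illustrates why dense scaling alone becomes costly.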

2 Scaling Model Capacity with Language-Specific Parameters

In this section, we introduce a layer whose parameters are split by language or language group based on similarity in vocabulary. Each translation direction only accesses a subset of these parameters, allowing the model capacity to scale without significantly affecting the training and inference time. The layer is trained with a novel re-routing scheme to improve generalization which we detail below. Compared to previous work (Wang et al., 2018; Bapna et al., 2019; Zhang et al., 2020), we focus on allocating entire language-specific layers and using this to scale model size while maintaining training speed.

We follow the sequence-to-sequence Transformer architecture and replace some of its layers by a set of parallel Transformer layers, one for each pre-defined group of languages. More precisely, assuming we have split the languages into $K$ fixed groups, this parallel layer is composed of $K$ parallel Transformer sublayers, one per language group. For each translation, we then select the corresponding sublayer among the $K$ possibilities depending on the language direction. If the parallel layer is in the encoder, we select the sublayer according to the source language, while if it is in the decoder, we select according to the target language. In practice, we only add these layers to either the encoder or decoder, not both. This enables us to split translations along with their sublayers per GPU, leading to faster training and efficient memory usage. Figure 8 shows an example of the resulting trunk-and-branch architecture when the parallel layer is in the decoder.
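The selection rule described above can be sketched as a lookup from language to group index. The group assignment in the example and the function name are hypothetical and only serve to show that the routing language is the source language for an encoder-side layer and the target language for a decoder-side layer.

```python
from typing import Dict


def select_sublayer(
    src_lang: str,
    tgt_lang: str,
    lang_to_group: Dict[str, int],
    layer_in_decoder: bool = True,
) -> int:
    """Return the index of the language-specific sublayer for one translation."""
    # Decoder-side layers route on the target language, encoder-side on the source.
    routing_lang = tgt_lang if layer_in_decoder else src_lang
    return lang_to_group[routing_lang]


# Hypothetical assignment of languages to K = 3 groups.
toy_groups = {"ru": 0, "uk": 1, "be": 1, "fr": 2}

decoder_choice = select_sublayer("fr", "uk", toy_groups, layer_in_decoder=True)
encoder_choice = select_sublayer("fr", "uk", toy_groups, layer_in_decoder=False)
print(decoder_choice, encoder_choice)
```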

We group languages based on two criteria: the amount of training data and their vocabulary. The motivation for these criteria is that we can learn a specific layer for a language with enough data, and for the rest, overlapping vocabulary is a good proxy for similar languages. First, each language with more than $100$ M sentences forms its own group and hence has its own sublayer. We have $28$ languages that fit this criterion: hu, ru, hi, ro, fr, nl, fi, pt, ar, el, vi, en, ms, tr, he, id, pl, cs, sv, fa, zh, bg, de, es, ko, ja, it, da. Second, we group the remaining languages by vocabulary overlap, leading to $18$ additional groups. To create these groupings, we calculate the vocabulary overlap between the training data of different languages and cluster those that have high overlap together. Note that some low resource languages have their own script, such as Kannada, and are not clustered with any similar languages as the script is unique. However, to maintain balance between groups (Wang et al., 2020), we cluster the remaining languages together and roughly balance the amount of training data for each group. In total, we form $46$ groups, each with its own sublayer in a language-specific layer.
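The two grouping criteria can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to obtain the $46$ groups: the Jaccard measure, the overlap threshold of 0.5, the greedy clustering order, and the toy data are all choices made here for the example, and the balancing of training data between groups is not implemented.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Vocabulary overlap as intersection over union (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_languages(
    sentence_counts: Dict[str, int],
    vocabularies: Dict[str, Set[str]],
    own_group_threshold: int = 100_000_000,
    overlap_threshold: float = 0.5,
) -> List[List[str]]:
    """High-resource languages get their own group; others cluster by overlap."""
    own_groups = []
    remaining = []
    for lang in sorted(sentence_counts):
        if sentence_counts[lang] > own_group_threshold:
            own_groups.append([lang])
        else:
            remaining.append(lang)

    clusters: List[List[str]] = []
    for lang in remaining:
        best, best_overlap = None, overlap_threshold
        for cluster in clusters:
            overlap = max(jaccard(vocabularies[lang], vocabularies[m]) for m in cluster)
            if overlap >= best_overlap:
                best, best_overlap = cluster, overlap
        if best is None:
            clusters.append([lang])   # no sufficiently similar cluster yet
        else:
            best.append(lang)
    return own_groups + clusters


# Toy data: "ru" is high resource, "uk" and "be" share tokens, "kn" shares none.
counts = {"ru": 200_000_000, "uk": 50_000_000, "be": 10_000_000, "kn": 5_000_000}
vocabs = {
    "ru": {"a", "b", "c", "d"},
    "uk": {"a", "b", "c", "e"},
    "be": {"a", "b", "c", "f"},
    "kn": {"x", "y"},
}
groups = group_languages(counts, vocabs)
print(groups)
```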

During training and inference, a sublayer is deterministically selected according to its language direction. This guarantees that our model always uses the same memory and time during inference, regardless of the translation pair. However, during training, this deterministic routing does not share information between similar languages if not associated with the same sublayer. For example, the sublayer associated with Ukrainian does not benefit from the large quantity of Russian training data, since Russian has its own isolated sublayer. We mitigate this shortcoming by random re-routing of translations, i.e., randomly picking another sublayer instead of the designated one. This shares information between languages associated with different sublayers, benefiting low resource languages by training on similar high resource languages. The re-routing is completely random, though could be restricted to re-route only to similar languages.
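A minimal sketch of random re-routing is given below, assuming a uniform choice among the other sublayers and a single re-routing probability. The function name, its parameters, and the example values are illustrative. At inference time the designated sublayer is always used, which keeps routing deterministic.

```python
import random


def route(
    assigned_group: int,
    num_groups: int,
    reroute_prob: float,
    rng: random.Random,
    training: bool = True,
) -> int:
    """Pick the sublayer for one training sample, with optional random re-routing."""
    if training and num_groups > 1 and rng.random() < reroute_prob:
        # Draw uniformly among the other sublayers, skipping the assigned one.
        other = rng.randrange(num_groups - 1)
        return other if other < assigned_group else other + 1
    return assigned_group


rng = random.Random(0)
train_choices = [route(3, 46, reroute_prob=0.2, rng=rng) for _ in range(1000)]
rerouted_fraction = sum(choice != 3 for choice in train_choices) / len(train_choices)
print(f"fraction re-routed: {rerouted_fraction:.2f}")
```

Restricting the draw to sublayers of similar languages, as suggested above, would only change the set from which the alternative index is sampled.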

We can integrate a language-specific layer into an already pre-trained Transformer by adding it either at the end of the decoder or at the beginning of the encoder. We can then freeze the parameters of the pre-trained Transformer and learn the language-specific components. These additional language-specific layers train rapidly as the rest of the model already has strong performance. This strategy means it is straightforward to adapt pre-trained networks to a new domain or language by training a small number of dedicated parallel layers, and could easily be extended to various other applications.

2.1 Evaluation of the Language-Specific Layer

We experiment with different scenarios by adding a language-specific layer to the encoder or decoder, or to a pre-trained densely scaled model. We demonstrate the importance of random re-routing. Finally, we validate this strategy by comparing it to scaling models densely.

The trunk-and-branch architecture for language-specific layers is general and can be used to specialize capacity for any neural architecture. We explore adding language-specific capacity in the encoder or decoder using a smaller setting of 10 high-resource languages. Table 7 shows that language-specific parameters are generally more effective when applied to the decoder network. Recent studies show that encoders are more important for bilingual machine translation (Wu et al., 2019; Kasai et al., 2020); however, these studies are based on systems modeling only a single language direction, unlike our setting. In our case, the choice between adding capacity to the encoder or to the decoder does not impact performance significantly, and we focus on the decoder for the rest of this paper.

Figure 9 shows the impact of the re-routing strategy on performance as we increase the number of training samples routed to random parallel layers as opposed to their assigned layers. With a re-routing rate of 20%, an improvement of about 0.8 BLEU can be achieved over no re-routing for low and mid resource languages, without affecting the performance of high resource languages. Too much stochasticity leads to performance similar to no random re-routing for high resource languages, but still improves mid to low resource performance compared to no re-routing.

We compare adding language specific capacity with densely scaling model size in Table 8 on $100$ languages. As language-specific layers add many parameters, we compare to baseline models at various sizes for multiple points of comparison. Our conclusion is that language-specific layers improve results compared to baselines of similar parameter size, particularly for mid and high resource languages where there is sufficient data to train the language-specific sublayers. Further, compared to dense scaling, sparse scaling only uses a fraction of the parameters in each forward pass, which maintains fast training speed despite large total model size.

We demonstrate the impact of adding language-specific layers to the decoder of a pre-trained $12$ B parameter Transformer in Figure 10. We show that adding language-specific layers for five languages improves results on the WMT evaluation datasets. The language-specific layer adds $3.4$ B parameters and we train it for $20$ K updates with the rest of the network frozen. The total size of this model is $15.4$ B parameters. For several directions, we observe gains of more than 1 BLEU, which validates this strategy. On average, we observe gains of 0.9 BLEU.

Bringing it all Together

We have explored the creation of a true many-to-many dataset for the multilingual translation of 100 languages, as well as how to effectively scale Many-to-Many models through a mix of dense and sparse scaling. In this section, we summarize our final results, compare to existing published work (both multilingual benchmarks and competitive directions in WMT), and end with a human evaluation of the overall quality of our translations.

We highlight that there are several real-world use cases of translation directions not involving English. For example, many countries have official and regional languages that are not English, which would be natural candidates for direct translation. For instance, it is intuitive to translate Kazakh directly to Russian in Kazakhstan. In Table 9, we compare English-Centric models to Many-to-Many on a variety of different non-English directions. We see that across the board, our M2M-100 model has drastically better performance and on average improves by over 7 BLEU across these directions.

2 Comparison on Various Translation Benchmarks

Next, we compare our M2M-100 model to various existing work on different benchmarks. While the training data is not the same, we conduct this comparison to provide a reference point for the overall strength of our model. An important note is that for each of these benchmarks, there are various different tokenizers used which affect BLEU — we follow the tokenization and BLEU calculation of each of these benchmarks, rather than the evaluation methodology of our previous results. Thus, the numbers in this subsection are not comparable to the rest of the paper, as they use the tokenization of each benchmark. Further, this comparison was prepared in advance, so all sentences appearing in these evaluation sets were removed from the training data we used.

First, we compare our Many-to-Many model to submissions to WMT, the premier translation competition. We display results on a variety of different language directions, some of which are standard natural language processing machine translation benchmarks, such as English-French, English-German, and English-Russian. Results are shown in Table 10. $\textbf{En}\leftrightarrow{}\textbf{De/En}\leftrightarrow{}\textbf{Ru:}$ we evaluated publicly available single model checkpoints prior to finetuning from Ng et al. (2019) on WMT2019. $\textbf{En}\leftrightarrow{}\textbf{Zh:}$ we report results from Li et al. (2019) which contains single model BLEU results on WMT2019. $\textbf{En}\leftrightarrow{}\textbf{Lt:}$ we report results from Pinnis et al. (2019) on WMT2019; both directions are the best single model systems which use unconstrained training data. $\textbf{En}\rightarrow{}\textbf{Fr:}$ we report results from Edunov et al. (2018). $\textbf{Fr}\rightarrow{}\textbf{En:}$ we report results from Johnson et al. (2017) on WMT2014. $\textbf{En}\leftrightarrow{}\textbf{Lv:}$ we report results from Pinnis et al. (2017) on WMT2017. $\textbf{En}\leftrightarrow{}\textbf{Tr:}$ we report results from Sennrich et al. (2017) on WMT17. $\textbf{En}\leftrightarrow{}\textbf{Et:}$ we report results from Pinnis et al. (2018) on WMT18. $\textbf{En}\leftrightarrow{}\textbf{Fi:}$ we report results from Talman et al. (2019) on WMT17. Many submissions to the WMT shared task use ensembling, in-domain finetuning, or reranking methods, which are standard techniques to improve quality. As these could be added to our system at inference time as well, we focus instead on comparing single model results. To identify comparisons, we examine the WMT Shared Task proceedings as well as the submissions at http://matrix.statmt.org/.

As seen in Table 10, our M2M-100 system can achieve very competitive performance compared to bilingual models tuned especially for individual WMT translation directions. This shows that our model maintains strong translation quality on individual directions.

Next, we compare our models to other multilingual translation work. Table 11 displays several previously published results on different sets of benchmarks. Note that for each comparison, we follow the published setting in tokenization, evaluation, and whether or not the BLEU is tuned on the validation set to maximize comparability.

We first compare to mBART (Liu et al., 2020), which creates bilingual models based on finetuning a pretrained model on individual language directions. After pretraining as a denoising autoencoder, publicly available bitext data is used to create various different bilingual models, one for each evaluation direction. Liu et al. (2020) tune the test set BLEU on the validation set. Following their setting, we tune the generation beam size between {5,10}, and length penalty between {0.5, 1.0, 1.5}, and the number of checkpoints to average between {1, 5, 10}. Our model provides +0.7 BLEU improvement.
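The tuning protocol above is a small grid search over 2 x 3 x 3 = 18 generation settings, selected on the validation set. The sketch below enumerates that grid; the scoring function is a hypothetical stand-in for decoding the validation set with a given setting and computing BLEU, and its toy definition carries no empirical meaning.

```python
from itertools import product

BEAM_SIZES = (5, 10)
LENGTH_PENALTIES = (0.5, 1.0, 1.5)
CHECKPOINTS_TO_AVERAGE = (1, 5, 10)

# Every combination of the three generation hyperparameters.
GRID = list(product(BEAM_SIZES, LENGTH_PENALTIES, CHECKPOINTS_TO_AVERAGE))


def tune_on_validation(score_fn):
    """Return the (beam, length penalty, checkpoints) setting with the best score."""
    return max(GRID, key=lambda config: score_fn(*config))


def toy_validation_score(beam, length_penalty, num_checkpoints):
    """Hypothetical placeholder for validation BLEU; not a real measurement."""
    return -abs(beam - 5) - abs(length_penalty - 1.0) + 0.01 * num_checkpoints


best_config = tune_on_validation(toy_validation_score)
print(len(GRID), best_config)
```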

We then compare to the bilingual baselines provided in CCMatrix (Schwenk et al., 2019), which trained individual models for each direction. As these models generate with no tuning, we generate on all pairs with beam size 5 and length penalty 1, using only the best checkpoint. Our one Many-to-Many multilingual model achieves a 2 BLEU point gain on average compared to training hundreds of individual models.

We next compare the performance of our multilingual system to other published multilingual systems. We compare to the English-Centric multilingual model from Zhang et al. (2020) on the OPUS100 corpus. Their model is trained with noisily aligned through-English data from OPUS (Tiedemann, 2012; Zhang et al., 2020), with online backtranslation to improve the performance of non-English pairs. Note that Zhang et al. (2020) train on 100 directions, but we only overlap a subset of directions. However, we fully cover their full set of non-English evaluation pairs. Finally, the OPUS100 non-English directions come only with a test set, so we generate with beam size 5, length penalty 1, and use the best checkpoint. As shown in Table 11, we improve by more than 4 BLEU.

3 Human Evaluation

We end with a human evaluation study to understand the quality of our model translations. We focus on 20 different directions, none of them involving English. We include languages commonly spoken in the same region, such as Japanese-Chinese, Hindi-Tamil, and Russian-Ukrainian, as well as directions that cross language families, such as Chinese-French, French-Arabic, and Russian-Spanish. We also include several very low resource directions, such as French-Wolof, Hindi-Marathi, and Japanese-Mongolian. All of our evaluators are native speakers in one of the languages and fluent in the other.

Each evaluator rates 50 different translations for semantic accuracy on a scale of 1 to 10. Results are shown in Figure 11. On semantic accuracy, most of our evaluations score between 8.5 and 9.5 (with 10 being the best possible score). For lower resource directions, the scores remain reasonable. Hindi to Tamil and Wolof to French score around 7-8. The most challenging direction based on human evaluation is French into Wolof (fr-wo), likely because there is not sufficient target-side Wolof data.

Next, we compare our model with an English-Centric model on 10 directions in Figure 12. Each evaluator is asked to rate 100 sentences, 50 from each model, in a blind test. Across the board, we find that our Many-to-Many system scores better in translation accuracy, both for related and unrelated languages.

4 Discussion

Creating high quality datasets to train translation models has been a long-standing area of research. For example, previous work has explored how to best filter noisy datasets (Koehn et al., 2018, 2019). Our use of large-scale mined training data presents large quantities of data to train multilingual models on, but brings challenges as well. For example, our mining methods mine both simplified and traditional Chinese text, tokenized and untokenized text, and many examples with code switching. We apply several data filtering methods, but the cleanliness and quality of alignment is critical for training high-quality translation systems. Further, multilingual translation can be affected by domain mismatch, as people in different parts of the world discuss different topics (Shen et al., 2019), which presents additional challenges for curating good training sets. Thus, we see the continued improvement of data quality as an important direction for multilingual translation systems, which require a lot of data to train well.

Strong performance for low-resource languages remains a critical area for future improvement (Gu et al., 2018; Sennrich and Zhang, 2019). For many languages, our system still requires substantial improvements. Examples include African languages such as Xhosa and Zulu, European languages such as Catalan and Basque, and Southeast Asian languages such as Iloko and Cebuano. For many of these, even monolingual resources on the internet are limited, which strongly affects the quantity and quality of mined data. Using curated data, possibly supplemented by mining, may provide a starting point for future improvement. For example, several resources for African languages exist, including JW300 (Agić and Vulić, 2019) used in the masakhane machine translation effort ( $\forall$ et al., 2020) and datasets for Nigerian Pidgin (Ahia and Ogueji, 2020), Wolof (Alla et al., 2020), Fon (Emezue and Dossou, 2020), Igbo (Ezeani et al., 2020), Amharic, Tigrigna, Afan-Oromo, Wolaytta, and Ge’ez (Abate et al., 2018). Other lines of work present resources for low-resource Asian languages, such as the ALT project (Riza et al., 2016; Ding et al., 2016), Mongolian, Uyghur, and Tibetan (Anonymous, 2020), or strategies for improvement on specific directions (Chen et al., 2019). Further research is required to bring together small datasets of higher quality translations, mined data, and monolingual resources to create improved translation systems for very low resource languages.

Conclusion

We introduced M2M-100, a new Many-to-Many multilingual translation model that can translate between the 9,900 directions of 100 languages. The underlying dataset was mined from CommonCrawl using a novel strategy which exploits language groupings to avoid mining every possible direction while maintaining good accuracy. Such a large dataset requires models with increased capacity and to this end we explored densely scaling the number of parameters as well as sparsely, through introducing language-specific parameters trained with a novel random re-routing scheme.

Results show that M2M-100 outperforms English-Centric multilingual models trained on data where either the source or target language is English. The system improves by over 10 BLEU on average compared to an English-Centric baseline when translating directly between non-English directions. M2M-100 is competitive with bilingual models from WMT and improves over existing publicly available multilingual translation systems. Human judges indicate that our model translates fluently with high semantic accuracy.

We thank Yuqing Tang and Peng-Jen Chen for their work on the multilingual translation infrastructure in fairseq. We thank Xian Li, Chau Tran, Yuqing Tang, Peng-Jen Chen, and Marc’Aurelio Ranzato for insightful conversations. We thank our volunteer human evaluators for closely examining the translation quality of our models through various directions.

References

A Additional Information about Data

Figure 13 displays the dictionary coverage for each of our 100 languages.

B Model Architectures

Table 14 shows the various model configurations considered in our experiments when scaling dense models.

C Exploiting Multilinguality at Inference Time with Multi-source Self-Ensembles

Throughout the paper, we explored how to improve the performance of single models, scaling the amount of data as well as the model size, but there remain numerous directions for future investigation of multilinguality. One direction is understanding how to exploit the nature of multilingual translation at inference time as well.

A known, effective strategy to improve accuracy is to ensemble multiple models at inference time. However, this requires training multiple models which substantially increases the training compute requirements. Instead, we suggest exploring self-ensembles, created by applying the multilingual model to the same source sentence in different languages. For example, if we wish to translate Galician to English, then instead of directly translating between the two, we ensemble the translation of Spanish to English with the translation of Galician to English, using the same multilingual model for both directions, and by averaging the predicted token log-probabilities, as for standard multi-model ensembles. The additional source is obtained by translating the input to another intermediary language. After this, we ensemble the translation of both sources to the target. This uses the same multilingual model for all steps.

We evaluate both pivoting and self-ensembling on zero-shot directions as these can benefit from better accuracy. We report results on 100 randomly sampled zero-shot translation directions which have at least 1000 examples in the validation and test set. Next, for each translation direction, we choose the intermediary language that resulted in the highest BLEU on the validation set; the same is done to choose the intermediary language for pivoting. We also tune a weight to balance the two language directions (Garmash and Monz, 2016). Table 12 shows that multi-source self-ensembling improves the single model result by 0.2 BLEU on average. It also performs as well as standard multi-model ensembling but requires training only a single model. This is particularly relevant for large models trained on vast quantities of data, which require a lot of compute to be able to perform standard ensembling.
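The combination step of a multi-source self-ensemble can be sketched for a single decoding step as a weighted average of the next-token log-probabilities obtained from the two sources. The function name, the toy distributions, and the default weight of 0.5 (a plain average) are assumptions made for illustration; in practice both distributions would come from the same multilingual model conditioned on the original and the intermediary source sentence.

```python
import math
from typing import Dict


def ensemble_log_probs(
    log_probs_direct: Dict[str, float],
    log_probs_pivot: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Weighted average of two next-token log-probability tables."""
    shared_vocab = log_probs_direct.keys() & log_probs_pivot.keys()
    return {
        token: weight * log_probs_direct[token] + (1.0 - weight) * log_probs_pivot[token]
        for token in shared_vocab
    }


# Toy next-token distributions for one step (e.g. Galician and Spanish sources).
direct = {"house": math.log(0.5), "home": math.log(0.4), "hose": math.log(0.1)}
pivot = {"house": math.log(0.6), "home": math.log(0.1), "hose": math.log(0.3)}

combined = ensemble_log_probs(direct, pivot, weight=0.5)
best_token = max(combined, key=combined.get)
print(best_token)
```

The averaged scores are not renormalized here, which does not change the ranking of candidate tokens at a given step. The weight plays the role of the tuned balance between the two language directions.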