InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training
Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, Ming Zhou
Introduction
Learning cross-lingual language representations plays an important role in overcoming the language barrier of NLP models. The recent success of cross-lingual language model pre-training Devlin et al. (2019); Conneau and Lample (2019); Conneau et al. (2020a); Chi et al. (2020); Liu et al. (2020) significantly improves the cross-lingual transferability in various downstream tasks, such as cross-lingual classification and question answering.
State-of-the-art cross-lingual pre-trained models are typically built upon multilingual masked language modeling (MMLM; Devlin et al. 2019; Conneau et al. 2020a), and translation language modeling (TLM; Conneau and Lample 2019). The goal of both pretext tasks is to predict masked tokens given input context. The difference is that MMLM uses monolingual text as input, while TLM feeds bilingual parallel sentences into the model. Even without explicit encouragement of learning universal representations across languages, the derived models have shown promising abilities of cross-lingual transfer.
In this work, we formulate cross-lingual pre-training from a unified information-theoretic perspective. Following the mutual information maximization principle Hjelm et al. (2019); Kong et al. (2020), we show that the existing pretext tasks can be viewed as maximizing the lower bounds of mutual information between various multilingual-multi-granularity views.
Specifically, MMLM maximizes mutual information between the masked tokens and the context in the same language, while the anchor points across languages encourage the correlation between cross-lingual contexts. Moreover, we show that TLM can maximize mutual information between the masked tokens and the parallel context, which implicitly aligns encoded representations of different languages. The unified information-theoretic framework also inspires us to propose a new cross-lingual pre-training task, named cross-lingual contrast (XlCo). The model learns to distinguish the translation of an input sentence from a set of negative examples. In comparison to TLM, which maximizes token-sequence mutual information, XlCo maximizes sequence-level mutual information between translation pairs, which are regarded as cross-lingual views of the same meaning. We employ the momentum contrast He et al. (2020) to realize XlCo. We also propose the mixup contrast and conduct the contrast on the universal layer to further facilitate the cross-lingual transferability.
Under the presented framework, we develop a cross-lingual pre-trained model (InfoXLM) to leverage both monolingual and parallel corpora. We jointly train InfoXLM with MMLM, TLM and XlCo. We conduct extensive experiments on several cross-lingual understanding tasks, including cross-lingual natural language inference Conneau et al. (2018), cross-lingual question answering Lewis et al. (2020), and cross-lingual sentence retrieval Artetxe and Schwenk (2019). Experimental results show that InfoXLM outperforms strong baselines on all the benchmarks. Moreover, the analysis indicates that InfoXLM achieves better cross-lingual transferability.
Related Work
Multilingual BERT (mBERT; Devlin et al. 2019) is pre-trained with the multilingual masked language modeling (MMLM) task on monolingual text. mBERT produces cross-lingual representations and performs cross-lingual tasks surprisingly well Wu and Dredze (2019). XLM Conneau and Lample (2019) extends mBERT with the translation language modeling (TLM) task so that the model can learn cross-lingual representations from parallel corpora. Unicoder Huang et al. (2019) tries several pre-training tasks to utilize parallel corpora. ALM Yang et al. (2020) extends TLM to code-switched sequences obtained from translation pairs. XLM-R Conneau et al. (2020a) scales up MMLM pre-training with a larger corpus and longer training. LaBSE Feng et al. (2020) learns cross-lingual sentence embeddings with an additive translation ranking loss.
In addition to learning cross-lingual encoders, several pre-trained models focus on generation. MASS Song et al. (2019) and mBART Liu et al. (2020) pretrain sequence-to-sequence models to improve machine translation. XNLG Chi et al. (2020) focuses on the cross-lingual transfer of language generation, such as cross-lingual question generation and abstractive summarization.
Mutual Information Maximization
Various methods have successfully learned visual or language representations by maximizing mutual information between different views of input. It is difficult to directly maximize mutual information. In practice, the methods resort to a tractable lower bound as the estimator, such as InfoNCE Oord et al. (2018), and the variational form of the KL divergence Nguyen et al. (2010). The estimators are also known as contrastive learning Arora et al. (2019) that measures the representation similarities between the sampled positive and negative pairs. In addition to the estimators, various view pairs are employed in these methods. The view pair can be the local and global features of an image Hjelm et al. (2019); Bachman et al. (2019), the random data augmentations of the same image Tian et al. (2019); He et al. (2020); Chen et al. (2020), or different parts of a sequence Oord et al. (2018); Henaff (2020); Kong et al. (2020). Kong et al. (2020) show that learning word embeddings or contextual embeddings can also be unified under the framework of mutual information maximization.
Information-Theoretic Framework for Cross-Lingual Pre-Training
In representation learning, the learned representations are expected to preserve the information of the original input data. However, it is intractable to directly model the mutual information between the input data and the representations. Alternatively, we can maximize the mutual information between the representations from different views of the input data, e.g., different parts of a sentence, a translation pair of the same meaning.
In this section, we start from a unified information-theoretic perspective, and formulate cross-lingual pre-training with the mutual information maximization principle. Then, under the information-theoretic framework, we propose a new cross-lingual pre-training task, named cross-lingual contrast (XlCo). Finally, we present the pre-training procedure of our InfoXLM.
Multilingual Masked Language Modeling
The goal of multilingual masked language modeling (MMLM; Devlin et al. 2019) is to recover the masked tokens from a randomly masked sequence. For each input sequence of MMLM, we sample a text from the monolingual corpus for pre-training. Let $(c_1, x_1)$ denote a monolingual text sequence, where $x_1$ is the masked token, and $c_1$ is the corresponding context. Intuitively, we need to maximize their dependency (i.e., $I(c_1; x_1)$), so that the context representations are predictive for masked tokens Kong et al. (2020).
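For the example pair $(c_1, x_1)$ of context and masked token, we construct a set $\mathcal{N}$ that contains $x_1$ and $|\mathcal{N}| - 1$ negative samples drawn from a proposal distribution $q$. According to the InfoNCE Oord et al. (2018) lower bound, we have:
$$I(c_1; x_1) \geq \mathbb{E}_{q(\mathcal{N})}\left[\log \frac{f_\theta(c_1, x_1)}{\sum_{x' \in \mathcal{N}} f_\theta(c_1, x')}\right] + \log |\mathcal{N}| \tag{1}$$
where $f_\theta$ is a function that scores whether the inputs $c_1$ and $x_1$ form a positive pair.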
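Given context $c_1$, MMLM learns to minimize the cross-entropy loss of the masked token $x_1$:
$$\mathcal{L}_{\text{MMLM}} = -\log \frac{\exp\left(g_{\theta_T}(c_1)^\top g_{\theta_E}(x_1)\right)}{\sum_{x' \in \mathcal{V}} \exp\left(g_{\theta_T}(c_1)^\top g_{\theta_E}(x')\right)} \tag{2}$$
where $\mathcal{V}$ is the vocabulary, $g_{\theta_E}$ is a look-up function that returns the token embeddings, and $g_{\theta_T}$ is a Transformer that returns the final hidden vector at the position of $x_1$. According to Equation (1) and Equation (2), if $\mathcal{N} = \mathcal{V}$ and $f_\theta(c_1, x_1) = \exp\left(g_{\theta_T}(c_1)^\top g_{\theta_E}(x_1)\right)$, we can find that MMLM maximizes a lower bound of $I(c_1; x_1)$.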
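The correspondence between the MMLM cross-entropy and the InfoNCE bound can be checked numerically. The following is a minimal sketch, assuming a toy vocabulary of four tokens with hand-picked two-dimensional vectors (the function names and all values are illustrative and not taken from the paper). When the candidate set is the whole vocabulary and the score is the exponentiated dot product, the single-sample InfoNCE estimate equals the log of the vocabulary size minus the cross-entropy loss, so lowering the loss raises the bound.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mmlm_cross_entropy(context_vec, token_embeddings, target_id):
    """Softmax cross-entropy of the target token over the whole vocabulary."""
    logits = [dot(context_vec, emb) for emb in token_embeddings]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_z - logits[target_id]

def infonce_estimate(context_vec, token_embeddings, target_id):
    """Single-sample InfoNCE estimate with the vocabulary as candidate set."""
    scores = [math.exp(dot(context_vec, emb)) for emb in token_embeddings]
    return math.log(scores[target_id] / sum(scores)) + math.log(len(scores))

# Hypothetical toy values: four token embeddings and one context vector.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
toy_context = [2.0, 0.1]

loss = mmlm_cross_entropy(toy_context, toy_embeddings, target_id=0)
bound = infonce_estimate(toy_context, toy_embeddings, target_id=0)
print(loss, bound, math.log(len(toy_embeddings)) - loss)
```

The last two printed values agree up to floating-point error, which is the identity used in the argument above.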
Next, we explain why MMLM can implicitly learn cross-lingual representations. Let $(c_2, x_2)$ denote an MMLM instance in a different language from $(c_1, x_1)$. Because the vocabulary, the position embedding, and special tokens are shared across languages, it is common to find anchor points Pires et al. (2019); Dufter and Schütze (2020) where $x_1 = x_2$ (such as subword, punctuation, and digit) or $I(x_1; x_2)$ is positive (i.e., the representations are associated or isomorphic). With the bridge effect of $\{x_1, x_2\}$, MMLM obtains a v-structure dependency “$c_1 \rightarrow \{x_1, x_2\} \leftarrow c_2$”, which leads to a negative co-information (i.e., interaction information) $I(c_1; c_2; \{x_1, x_2\})$ Tsujishita (1995). Specifically, the negative value of $I(c_1; c_2; \{x_1, x_2\})$ indicates that the variable $\{x_1, x_2\}$ enhances the correlation between $c_1$ and $c_2$ Fano (1963).
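To make the sign of the co-information concrete, the sketch below evaluates it on a deliberately extreme toy v-structure that is not from the paper: two independent fair bits play the role of the two contexts, and their XOR plays the role of the shared anchor variable. The helper names are hypothetical, and co-information is taken here as the unconditional mutual information minus the conditional one.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(joint, keep):
    """Entropy in bits of the marginal over the variable indices in `keep`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in keep)] += p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

def mutual_information(joint, a, b, given=()):
    """I(A; B | G) = H(A,G) + H(B,G) - H(A,B,G) - H(G)."""
    a, b, g = list(a), list(b), list(given)
    return (entropy(joint, a + g) + entropy(joint, b + g)
            - entropy(joint, a + b + g) - entropy(joint, g))

# Toy v-structure: index 0 and 1 are independent fair bits, index 2 is their XOR.
joint = {(c1, c2, c1 ^ c2): 0.25 for c1, c2 in product([0, 1], repeat=2)}

i_plain = mutual_information(joint, [0], [1])            # contexts alone
i_cond = mutual_information(joint, [0], [1], given=[2])  # contexts given the anchor
co_information = i_plain - i_cond
print(i_plain, i_cond, co_information)
```

In this toy case the two contexts share no information on their own, yet become fully dependent once the common variable is observed, so the co-information is negative. Real anchor points are far noisier, and the example only illustrates the direction of the effect.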
In summary, although MMLM learns to maximize $I(c_1; x_1)$ and $I(c_2; x_2)$ in each language, we argue that the task encourages the cross-lingual correlation of learned representations. Notice that for the setting without word-piece overlap Artetxe et al. (2020); Conneau et al. (2020b); K et al. (2020), we hypothesize that the information bottleneck principle Tishby and Zaslavsky (2015) tends to transform the cross-lingual structural similarity into isomorphic representations, which have bridge effects similar to those of the anchor points. Then we can explain how the cross-lingual ability is spread out as above. We leave more discussions about the setting without word-piece overlap for future work.
Translation Language Modeling
Similar to MMLM, the goal of translation language modeling (TLM; Conneau and Lample 2019) is also to predict masked tokens, but the prediction is conditioned on the concatenation of a translation pair. We try to explain how TLM pre-training enhances cross-lingual transfer from an information-theoretic perspective.
Let $c_1$ and $c_2$ denote a translation pair of sentences, and $x_1$ a masked token taken from $c_1$. So $c_1$ and $x_1$ are in the same language, while $c_1$ and $c_2$ are in different ones. Following the derivations of MMLM above, the objective of TLM is maximizing the lower bound of mutual information $I(c_1, c_2; x_1)$. By re-writing the above mutual information, we have:
$$I(c_1, c_2; x_1) = I(c_1; x_1) + I(c_2; x_1 \mid c_1) \tag{3}$$
The first term $I(c_1; x_1)$ corresponds to MMLM, which learns to use monolingual context. In contrast, the second term $I(c_2; x_1 \mid c_1)$ indicates cross-lingual mutual information between $c_2$ and $x_1$ that is not included by $c_1$. In other words, $I(c_2; x_1 \mid c_1)$ encourages the model to predict masked tokens by using the context in a different language. In conclusion, TLM learns to utilize the context in both languages, which implicitly improves the cross-lingual transferability of pre-trained models.
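The decomposition into a monolingual and a cross-lingual term is an instance of the chain rule of mutual information. As a short derivation, write $c_1$ and $c_2$ for the two sentences of the translation pair, $x_1$ for the masked token, and $p$ for their joint distribution (the symbol $p$ is introduced here only for illustration):

```latex
\begin{aligned}
I(c_1, c_2; x_1)
&= \mathbb{E}_{p}\left[\log \frac{p(x_1 \mid c_1, c_2)}{p(x_1)}\right] \\
&= \mathbb{E}_{p}\left[\log \frac{p(x_1 \mid c_1)}{p(x_1)}\right]
 + \mathbb{E}_{p}\left[\log \frac{p(x_1 \mid c_1, c_2)}{p(x_1 \mid c_1)}\right] \\
&= I(c_1; x_1) + I(c_2; x_1 \mid c_1).
\end{aligned}
```

Since conditional mutual information is non-negative, the quantity targeted by TLM is never smaller than the monolingual term alone, and the two coincide only when the translation adds nothing about the masked token beyond what the same-language context already provides.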
Cross-Lingual Contrastive Learning
Inspired by the unified information-theoretic framework, we propose a new cross-lingual pre-training task, named cross-lingual contrast (XlCo). The goal of XlCo is to maximize mutual information between the representations of parallel sentences $c_1$ and $c_2$, i.e., $I(c_1; c_2)$. Unlike MMLM and TLM, which maximize token-sequence mutual information, XlCo targets cross-lingual sequence-level mutual information.
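We describe how the task is derived as follows. Using InfoNCE Oord et al. (2018) as the lower bound for a translation pair $(c_1, c_2)$, we have:
$$I(c_1; c_2) \geq \mathbb{E}_{q(\mathcal{N})}\left[\log \frac{f_\theta(c_1, c_2)}{\sum_{c' \in \mathcal{N}} f_\theta(c_1, c')}\right] + \log |\mathcal{N}| \tag{4}$$
where $\mathcal{N}$ is a set that contains the positive example $c_2$ and $|\mathcal{N}| - 1$ negative samples. In order to maximize the lower bound of $I(c_1; c_2)$, we need to design the function $f_\theta$ that measures the similarity between the input sentence and the proposal distribution $q(\mathcal{N})$. Specifically, we use the following similarity function $f_\theta$:
$$f_\theta(c_1, c_2) = \exp\left(g_\theta(c_1)^\top g_\theta(c_2)\right) \tag{5}$$
where $g_\theta$ is the Transformer encoder that we are pre-training. Following Devlin et al. (2019), a special token [CLS] is added to the input, whose hidden vector is used as the sequence representation. Additionally, we use a linear projection head after the encoder in $g_\theta$.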
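As an illustration of how an exponentiated dot-product score enters an InfoNCE-style objective, the sketch below computes the negative log-probability of the translation among a candidate set made of the translation plus some negatives. It is a minimal sketch under stated assumptions: sentence vectors are taken as already encoded and projected, and the function name `xlco_loss` and the toy vectors are hypothetical.

```python
import math

def xlco_loss(query, positive_key, negative_keys):
    """Negative log softmax score of the positive key among all candidates.

    The candidate set contains the positive key and every negative key,
    and the score of a pair is the exponentiated dot product.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(query, positive_key)] + [dot(query, k) for k in negative_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_z - logits[0]

# Hypothetical sentence vectors: the translation points the same way as the query.
query = [1.0, 0.0]
translation = [0.9, 0.1]
negatives = [[-1.0, 0.0], [0.0, 1.0]]

aligned_loss = xlco_loss(query, translation, negatives)
misaligned_loss = xlco_loss(query, negatives[0], [translation, negatives[1]])
print(aligned_loss, misaligned_loss)
```

The loss is lower when the true translation is the candidate most similar to the query, which is the behaviour the contrastive objective rewards.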
Another design choice is how to construct $\mathcal{N}$. As shown in Equation (4), a large $|\mathcal{N}|$ improves the tightness of the lower bound, which has been proven to be critical for contrastive learning Chen et al. (2020).
In our work, we employ the momentum contrast He et al. (2020) to construct the set $\mathcal{N}$, where the previously encoded sentences are progressively reused as negative samples. Specifically, we construct two encoders with the same architecture, which are the query encoder $g_{\theta_Q}$ and the key encoder $g_{\theta_K}$. The loss function of XlCo is:
$$\mathcal{L}_{\text{XlCo}} = -\log \frac{\exp\left(g_{\theta_Q}(c_1)^\top g_{\theta_K}(c_2)\right)}{\sum_{c' \in \mathcal{N}} \exp\left(g_{\theta_Q}(c_1)^\top g_{\theta_K}(c')\right)} \tag{6}$$
During training, the query encoder $g_{\theta_Q}$ encodes $c_1$ and is updated by backpropagation. The key encoder $g_{\theta_K}$ encodes $\mathcal{N}$ and is learned with momentum update He et al. (2020) towards the query encoder. The negative examples in $\mathcal{N}$ are organized as a queue, where a newly encoded example is added while the oldest one is popped from the queue. We initialize the query encoder and the key encoder with the same parameters, and pre-fill the queue with a set of encoded examples until it reaches the desired size $|\mathcal{N}|$. Notice that the size of the queue remains constant during training.
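The bookkeeping described above can be sketched in a few lines. This is a minimal sketch rather than the authors' implementation: parameters are flat lists of floats, the update rule is the standard momentum-contrast form in which the key parameters move a small step towards the query parameters, and the names `momentum_update` and `NegativeQueue` are hypothetical.

```python
from collections import deque

def momentum_update(key_params, query_params, m):
    """Move key-encoder parameters towards the query encoder: k <- m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class NegativeQueue:
    """Fixed-size first-in-first-out store of previously encoded keys."""

    def __init__(self, size, initial_keys):
        if len(initial_keys) != size:
            raise ValueError("queue must be pre-filled to its full size")
        self.keys = deque(initial_keys, maxlen=size)

    def push(self, new_key):
        # With maxlen set, appending drops the oldest key automatically,
        # so the queue length stays constant.
        self.keys.append(new_key)

    def negatives(self):
        return list(self.keys)

# Toy usage with made-up numbers.
key_params = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)
queue = NegativeQueue(3, [[0.0], [1.0], [2.0]])
queue.push([3.0])
print(key_params, queue.negatives())
```

A momentum coefficient close to 1 keeps the key encoder changing slowly, so keys encoded several steps ago remain comparable with freshly encoded ones.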
Mixup Contrast
For each pair, we concatenate it with a randomly sampled translation pair from another parallel corpus. For example, consider the pairs $\langle c_1, c_2 \rangle$ and $\langle d_1, d_2 \rangle$ sampled from two different parallel corpora. The two pairs are concatenated in a random order, such as $\langle c_1 d_1, c_2 d_2 \rangle$ and $\langle c_1 d_1, d_2 c_2 \rangle$. The data augmentation of mixup encourages pre-trained models to learn sentence boundaries and to distinguish the order of multilingual texts.
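A minimal sketch of this augmentation is shown below. It assumes that the order of the two sentences is drawn independently on each side of the pair and that sentences are joined with a plain space; both choices, the function name `mixup_pair`, and the example sentences are assumptions made for illustration.

```python
import random

def mixup_pair(pair_a, pair_b, rng):
    """Concatenate two translation pairs, ordering each side independently."""
    (a_src, a_tgt), (b_src, b_tgt) = pair_a, pair_b
    left = [a_src, b_src]
    right = [a_tgt, b_tgt]
    rng.shuffle(left)   # order of the two sentences on the first side
    rng.shuffle(right)  # order on the second side, drawn separately
    return " ".join(left), " ".join(right)

# Hypothetical pairs taken from two different parallel corpora.
rng = random.Random(0)
mixed_src, mixed_tgt = mixup_pair(("hello", "bonjour"), ("thanks", "danke"), rng)
print(mixed_src, "|", mixed_tgt)
```

Because the two sides may end up in different orders, the model cannot rely on position alone to match the sentences, which is consistent with the stated goal of learning sentence boundaries and order.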
Contrast on Universal Layer
As a pre-training task maximizing the lower bound of sequence-level mutual information, XlCo is usually jointly learned with token-sequence tasks, such as MMLM and TLM. In order to make XlCo more compatible with the other pretext tasks, we propose to conduct contrastive learning on the most universal (or transferable) layer in terms of MMLM and TLM.
In our implementations, we use the hidden vectors of [CLS] at layer 8, instead of the last layer, to perform contrastive learning for base-size (12 layers) models, and at layer 12 for large-size (24 layers) models, because previous analysis Sabet et al. (2020); Dufter and Schütze (2020); Conneau et al. (2020b) shows that specific layers of MMLM learn more universal representations and work better on cross-lingual retrieval tasks than other layers. We choose the layers following the same principle.
The intuition behind the method is that MMLM and TLM encourage the last layer to produce language-distinguishable token representations because of the masked token classification. But XlCo tends to learn similar representations across languages. So we do not directly use the hidden states of the last layer in XlCo.
Cross-Lingual Pre-Training
We pretrain a cross-lingual model InfoXLM by jointly maximizing the lower bounds of three types of mutual information, including monolingual token-sequence mutual information (MMLM), cross-lingual token-sequence mutual information (TLM), and cross-lingual sequence-level mutual information (XlCo). Formally, the loss of cross-lingual pre-training in InfoXLM is defined as:
$$\mathcal{L} = \mathcal{L}_{\text{MMLM}} + \mathcal{L}_{\text{TLM}} + \mathcal{L}_{\text{XlCo}} \tag{7}$$
where we apply the same weight for the loss terms.
Both TLM and XlCo use parallel data. The number of bilingual pairs increases with the square of the number of languages. In our work, we set English as the pivot language following Conneau and Lample (2019), i.e., we only use the parallel corpora that contain English.
In order to balance the data size between high-resource and low-resource languages, we apply a multilingual sampling strategy Conneau and Lample (2019) for both monolingual and parallel data. An example in the language $l$ is sampled with the probability $p_l \propto (n_l / n)^{0.7}$, where $n_l$ is the number of instances in the language $l$, and $n$ refers to the total number of data. Empirically, the sampling algorithm alleviates the bias towards high-resource languages Conneau et al. (2020a).
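The effect of such a sampling strategy can be seen on a two-language toy corpus. The sketch below assumes the exponentiated-share form associated with Conneau and Lample (2019), in which each language's share of the data is raised to a power below one and the results are renormalized. The function name, the `alpha` parameter with its default value, and the instance counts are assumptions made for illustration.

```python
def sampling_probabilities(counts, alpha=0.7):
    """Return per-language sampling probabilities proportional to (n_l / n) ** alpha."""
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Hypothetical instance counts: one high-resource and one low-resource language.
toy_counts = {"en": 900, "sw": 100}
probs = sampling_probabilities(toy_counts)
print(probs)
```

With an exponent below one, the low-resource language is sampled more often than its raw 10% share, while an exponent of zero would sample all languages uniformly.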
Experiments
In this section, we first present the training configuration of InfoXLM. Then we compare the fine-tuning results of InfoXLM with previous work on three cross-lingual understanding tasks. We also conduct ablation studies to understand the major components of InfoXLM.
We use the same pre-training corpora as previous models Conneau et al. (2020a); Conneau and Lample (2019). Specifically, we reconstruct CC-100 Conneau et al. (2020a) for MMLM, retaining the languages whose data size is larger than 0.1GB. Following Conneau and Lample (2019), for the TLM and XlCo tasks, we employ 14 language pairs of parallel data that involve English. We collect translation pairs from MultiUN Ziemski et al. (2016), IIT Bombay Kunchukuttan et al. (2018), OPUS Tiedemann (2012), and WikiMatrix Schwenk et al. (2019). The size of the parallel corpora is about 42GB. More details about the pre-training data are described in the appendix.
Model Size
We follow the model configurations of XLM-R Conneau et al. (2020a). For the Transformer Vaswani et al. (2017) architecture, we use 12 layers and 768 hidden states for InfoXLM (i.e., base size), and 24 layers and 1,024 hidden states for InfoXLM$_{\text{LARGE}}$ (i.e., large size).
Hyperparameters
We initialize the parameters of InfoXLM with XLM-R. We optimize the model with Adam Kingma and Ba (2015) using a batch size of for a total of 150K steps for InfoXLM, and 200K steps for InfoXLM$_{\text{LARGE}}$. The same number of training examples is fed to the three tasks. The learning rate is scheduled with a linear decay with 10K warmup steps, where the peak learning rate is set as for InfoXLM, and for InfoXLM$_{\text{LARGE}}$. The momentum coefficient is set as and for InfoXLM and InfoXLM$_{\text{LARGE}}$, respectively. The length of the queue is set as . The training procedure takes about days Nvidia DGX-2 stations for InfoXLM, and days Nvidia DGX-2 stations for InfoXLM$_{\text{LARGE}}$. Details about the pre-training hyperparameters can be found in the appendix.
Evaluation
We conduct experiments over three cross-lingual understanding tasks, i.e., cross-lingual natural language inference, cross-lingual sentence retrieval, and cross-lingual question answering.
Cross-Lingual Natural Language Inference
The Cross-Lingual Natural Language Inference corpus (XNLI; Conneau et al. 2018) is a widely used cross-lingual classification benchmark. The goal of NLI is to identify the relationship of an input sentence pair. We evaluate the models under the following two settings. (1) Cross-Lingual Transfer: fine-tuning the model on the English training set and directly evaluating on the multilingual test sets. (2) Translate-Train-All: fine-tuning the model on the English training data and the pseudo data that are translated from English to the other languages.
Cross-Lingual Sentence Retrieval
The goal of the cross-lingual sentence retrieval task is to extract parallel sentences from bilingual comparable corpora. We use the subset of 36 language pairs of the Tatoeba dataset Artetxe and Schwenk (2019) for the task. The dataset is collected from Tatoeba (https://tatoeba.org/eng/), which is an open collection of multilingual parallel sentences in more than 300 languages. Following Hu et al. (2020), we use the averaged hidden vectors in the seventh Transformer layer to compute cosine similarity for sentence retrieval.
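The retrieval procedure can be sketched as follows, assuming the per-token hidden vectors of the chosen layer have already been extracted for every sentence. The helper names and the tiny two-dimensional vectors are hypothetical; a sentence is represented by the average of its token vectors, and each source sentence retrieves the target sentence with the highest cosine similarity.

```python
import math

def mean_pool(token_vectors):
    """Average a list of token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / len(token_vectors) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_accuracy(source_sents, target_sents):
    """source_sents[i] and target_sents[i] are translations of each other."""
    src = [mean_pool(s) for s in source_sents]
    tgt = [mean_pool(t) for t in target_sents]
    hits = 0
    for i, query in enumerate(src):
        best = max(range(len(tgt)), key=lambda j: cosine(query, tgt[j]))
        hits += int(best == i)
    return hits / len(src)

# Hypothetical token vectors for two sentence pairs.
toy_source = [[[1.0, 0.0], [1.0, 0.2]], [[0.0, 1.0], [0.1, 1.0]]]
toy_target = [[[0.9, 0.1]], [[0.2, 0.8]]]
accuracy = top1_accuracy(toy_source, toy_target)
print(accuracy)
```

No fine-tuning is involved in this evaluation, so the score directly reflects how well the pre-trained representations of translations are aligned.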
Cross-Lingual Question Answering
We use the Multilingual Question Answering (MLQA; Lewis et al. 2020) dataset for the cross-lingual QA task. MLQA provides development and test data in seven languages in the format of SQuAD v1.1 Rajpurkar et al. (2016). We follow the fine-tuning method introduced in Devlin et al. (2019) that concatenates the question-passage pair as the input.
Results
We compare InfoXLM with the following pre-trained Transformer models: (1) Multilingual BERT (mBERT; Devlin et al. 2019) is pre-trained with MMLM on Wikipedia in languages; (2) XLM Conneau and Lample (2019) pretrains both MMLM and TLM tasks on Wikipedia in languages; (3) XLM-R Conneau et al. (2020a) scales up MMLM to the large CC-100 corpus in languages with many more training steps; (4) Unicoder Liang et al. (2020) continues training XLM-R with MMLM and TLM; (5) InfoXLM−XlCo continues training XLM-R with MMLM and TLM, using the same pre-training datasets as InfoXLM.
Cross-Lingual Natural Language Inference
Table 1 reports the classification accuracy on each test set of XNLI under the above evaluation settings. The final scores on the test set are averaged over five random seeds. InfoXLM outperforms all baseline models on the two evaluation settings of XNLI. In the cross-lingual transfer setting, InfoXLM achieves 76.5 averaged accuracy, outperforming XLM-R (reimpl) by 1.5. Similar improvements can be observed for large-size models. Moreover, the ablation results “−XlCo” show that cross-lingual contrast is helpful for zero-shot transfer in most languages. We also find that InfoXLM improves the results in the translate-train-all setting.
Cross-Lingual Sentence Retrieval
In Table 2 and Table 3, we report the top-1 accuracy scores of cross-lingual sentence retrieval with the base-size models. The evaluation results demonstrate that InfoXLM produces better aligned cross-lingual sentence representations. On the 14 language pairs that are covered by parallel data, InfoXLM obtains 77.8 and 80.6 averaged top-1 accuracies in the directions of xx → en and en → xx, outperforming XLM-R by 20.2 and 21.1. Even on the 22 language pairs that are not covered by parallel data, InfoXLM outperforms XLM-R on 16 out of 22 language pairs, providing 8.1% improvement in averaged accuracy. In comparison, the ablation variant “−XlCo” (i.e., MMLM+TLM) obtains better results than XLM-R in Table 2, while getting worse performance than XLM-R in Table 3. The results indicate that XlCo encourages the model to learn universal representations even on the language pairs without parallel supervision.
Cross-Lingual Question Answering
Table 4 compares InfoXLM with baseline models on MLQA, where we report the F1 and the exact match (EM) scores on each test set. Both InfoXLM and InfoXLM$_{\text{LARGE}}$ obtain the best results against the four baselines. In addition, the results of the ablation variant “−XlCo” indicate that the proposed cross-lingual contrast is beneficial on MLQA.
Analysis and Discussion
To understand InfoXLM and the cross-lingual contrast task more deeply, we conduct analysis from the perspectives of cross-lingual transfer and cross-lingual representations. Furthermore, we perform comprehensive ablation studies on the major components of InfoXLM, including the cross-lingual pre-training tasks, mixup contrast, the contrast layer, and the momentum contrast. To reduce the computation load, we use InfoXLM15 in our ablation studies, which is trained on 15 languages for 100K steps.
Cross-Lingual Transfer
Cross-lingual transfer gap Hu et al. (2020) is the difference between the performance on the English test set and the averaged performance on the test sets of all other languages. A lower cross-lingual transfer gap score indicates that more end-task knowledge from the English training set is transferred to other languages. In Table 5, we compare the cross-lingual transfer gap scores of InfoXLM with baseline models on MLQA and XNLI. Note that we do not include the results of XLM because it is pre-trained on 15 languages or using #M=N. The results show that InfoXLM reduces the gap scores on both MLQA and XNLI, providing better cross-lingual transferability than the baselines.
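The gap score defined above is a one-line computation. The sketch below uses made-up per-language scores purely to show the arithmetic; the function name and the numbers are hypothetical and are not results from this paper.

```python
def transfer_gap(scores, source_lang="en"):
    """Source-language score minus the average score over all other languages."""
    others = [score for lang, score in scores.items() if lang != source_lang]
    return scores[source_lang] - sum(others) / len(others)

# Hypothetical per-language test scores.
toy_scores = {"en": 80.0, "de": 70.0, "sw": 60.0}
gap = transfer_gap(toy_scores)
print(gap)
```

A model that raises the non-English scores while keeping the English score fixed lowers this number, which is why a smaller gap is read as better transfer.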
Cross-Lingual Representations
In addition to cross-lingual transfer, learning good cross-lingual representations is also the goal of cross-lingual pre-training. In order to analyze how the cross-lingual contrast task affects the alignment of the learned cross-lingual representations, we evaluate the representations of different middle layers on the Tatoeba test sets of the 14 languages that are covered by parallel data. Figure 1 presents the averaged top-1 accuracy of cross-lingual sentence retrieval in the direction of xx → en. InfoXLM outperforms XLM-R on all of the 12 layers, demonstrating that our proposed task improves the cross-lingual alignment of the learned representations. From the results of XLM-R, we observe that the model suffers from a performance drop in the last few layers. The reason is that MMLM encourages the representations of the last hidden layer to be similar to token embeddings, which is contradictory with the goal of learning cross-lingual representations. In contrast, InfoXLM still provides high retrieval accuracy at the last few layers, which indicates that InfoXLM provides better aligned representations than XLM-R. Moreover, we find that the performance is further improved when removing TLM, demonstrating that XlCo is more effective than TLM for aligning cross-lingual representations, although TLM helps to improve zero-shot cross-lingual transfer.
Effect of Cross-Lingual Pre-training Tasks
To better understand the effect of the cross-lingual pre-training tasks, we perform ablation studies on the pre-training tasks of InfoXLM, by removing XlCo, TLM, or both. We present the experimental results in Table 7. Comparing the results of −TLM and −XlCo with the results of −TLM−XlCo, we find that both XlCo and TLM effectively improve cross-lingual transferability of the pre-trained InfoXLM model. TLM is more effective for XNLI while XlCo is more effective for MLQA. Moreover, the performance can be further improved by jointly learning XlCo and TLM.
Effect of Contrast on Universal Layer
We conduct experiments to investigate whether contrast on the universal layer improves cross-lingual pre-training. As shown in Table 6, we compare the evaluation results of four variants of InfoXLM, where XlCo is applied on layer 8 (i.e., the universal layer) or on layer 12 (i.e., the last layer). We find that contrast on layer 8 provides better results for InfoXLM. However, conducting XlCo on layer 12 performs better when the TLM task is excluded. The results show that maximizing token-sequence (TLM) and sequence-level (XlCo) mutual information at the last layer tends to make the two objectives interfere with each other. Thus, we suggest applying XlCo on the universal layer for pre-training InfoXLM.
Effect of Mixup Contrast
We conduct an ablation study on the mixup contrast strategy. We pretrain a model that directly uses translation pairs for XlCo without mixup contrast (−TLM−Mixup). As shown in Table 7, we present the evaluation results on XNLI and MLQA. We observe that mixup contrast improves the performance of InfoXLM on both datasets.
Effect of Momentum Contrast
In order to show whether our pre-trained model benefits from momentum contrast, we pretrain a revised version of InfoXLM without momentum contrast. In other words, the parameters of the key encoder are always the same as those of the query encoder. As shown in Table 7, we report evaluation results (indicated by “−TLM−Momentum”) of removing momentum contrast on XNLI and MLQA. We observe a performance drop after removing the momentum contrast from InfoXLM, which indicates that momentum contrast improves the learned language representations of InfoXLM.
Conclusion
In this paper, we present a cross-lingual pre-trained model InfoXLM that is trained with both monolingual and parallel corpora. The model is motivated by the unified view of cross-lingual pre-training from an information-theoretic perspective. Specifically, in addition to the masked language modeling and translation language modeling tasks, InfoXLM is jointly pre-trained with a newly introduced cross-lingual contrastive learning task. The cross-lingual contrast leverages bilingual pairs as the two views of the same meaning, and encourages their encoded representations to be more similar than the negative examples. Experimental results on several cross-lingual language understanding tasks show that InfoXLM can considerably improve the performance.
Ethical Considerations
Currently, most NLP research and applications are English-centric, which makes it hard for non-English users to access NLP-related services. Our work focuses on cross-lingual language model pre-training. With the pre-trained model, we are able to transfer end-task knowledge from high-resource languages to low-resource languages, which helps to build more accessible NLP applications. Additionally, incorporating parallel corpora into the pre-training procedure improves the training efficiency, which potentially reduces the computational cost of building multilingual NLP applications.
Acknowledgements
We appreciate the helpful discussions with Bo Zheng, Shaohan Huang, Shuming Ma, and Yue Cao.
References
Appendix A Pre-Training Data
We reconstruct CCNet (https://github.com/facebookresearch/cc_net) and follow Conneau et al. (2020a) to reproduce the CC-100 corpus for monolingual texts. The resulting corpus contains languages. Table 8 reports the language codes and data size in our work. Notice that several languages can share the same ISO language code, e.g., zh represents both Simplified Chinese and Traditional Chinese. Moreover, Table 9 shows the statistics of the parallel data.
Appendix B Results of Training From Scratch
We conduct experiments under the setting of training from scratch. The Transformer size and hyperparameters follow BERT-base Devlin et al. (2019). The parameters are randomly initialized from . We optimize the models with Adam using a batch size of for a total of M steps. The learning rate is scheduled with a linear decay with 10K warmup steps, where the peak learning rate is set as . For cross-lingual contrast, we set the queue length as . We use a warmup of K steps for the key encoder and then enable cross-lingual contrast. We use an inverse square root scheduler to set the momentum coefficient, i.e., , where is training step.
Table 10 shows the results of and various ablations. significantly outperforms MMLM on both XNLI and MLQA. We also evaluate the pre-training objectives of InfoXLM, where we ablate XlCo, TLM, and MMLM, respectively. The findings agree with the results in Table 7.
Appendix C Hyperparameters for Pre-Training
Table 11 presents the hyperparameters for pre-training InfoXLM. We use the same vocabulary as XLM-R Conneau et al. (2020a).
Appendix D Hyperparameters for Fine-Tuning
In Table 12 and Table 13, we present the hyperparameters for fine-tuning on XNLI and MLQA. For each task, the hyperparameters are searched on the joint validation set of all languages (#M=1). For XNLI, we evaluate the model every 5,000 steps, and select the model with the best accuracy score on the validation set. For MLQA, we directly use the final learned model. The final scores are averaged over five random seeds.