XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, Ming Zhou

cs.CL

Introduction

Pre-training followed by fine-tuning has become a new NLP paradigm, in which general knowledge is first learnt from a large-scale corpus by self-supervised learning and then transferred to downstream tasks by task-specific fine-tuning. Three types of pre-trained models have been explored recently: monolingual pre-trained models Radford et al. (2018); Devlin et al. (2019); Liu et al. (2019); Yang et al. (2019b); Dong et al. (2019); Lewis et al. (2019a), multilingual and cross-lingual pre-trained models Devlin et al. (2019); Conneau and Lample (2019); Huang et al. (2019); Conneau et al. (2019), and multimodal pre-trained models Lu et al. (2019); Li et al. (2020); Chen et al. (2019); Zhou et al. (2020). In this paper, we focus on cross-lingual pre-trained models because of their importance in alleviating the low-resource issue among languages, where an NLP task often has rich training data in one language (such as English) but little or no training data in other languages (such as French and German). To further advance the development of cross-lingual pre-trained models for various downstream tasks in different languages, this paper introduces XGLUE, a new benchmark dataset that can be used to (i) train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora, and (ii) evaluate the generalization capabilities of cross-lingual pre-trained models across a diverse set of cross-lingual tasks.

The contribution of XGLUE is two-fold. First, it provides 11 diversified cross-lingual tasks covering both understanding and generation scenarios. XTREME Hu et al. (2020) is concurrent work, but it includes cross-lingual understanding tasks only. In addition, XGLUE introduces 6 new tasks selected from Search, Ads and News scenarios, which gives XGLUE more practical value. Second, an extended version of Unicoder Huang et al. (2019) is described and evaluated as a strong cross-lingual pre-trained model baseline on XGLUE for both understanding and generation tasks. We also evaluate the base versions (12-layer) of Multilingual BERT Devlin et al. (2019), XLM Conneau and Lample (2019) and XLM-R Conneau et al. (2019) for comparison.

XGLUE Benchmark (https://microsoft.github.io/XGLUE/)

We collect two corpora, Small Corpus and Large Corpus, with different sizes for cross-lingual pre-training. Table 1 lists the data statistics.

We extract raw sentences from Wikipedia using WikiExtractor. It leads to a 101G multilingual corpus covering 100 languages.

We use an in-house pipeline to extract bilingual sentence pairs from the Web, which leads to a 146G bilingual corpus covering 27 languages, including Arabic, Bulgarian, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Swedish, Swahili, Thai, Turkish, Urdu, Vietnamese and Chinese.

2.1.2 Large Corpus (LC)

Following Wenzek et al. (2019), we construct a clean version of Common Crawl (CC, https://commoncrawl.org/) as the multilingual corpus. First, we use a language identification model trained on Wikipedia to classify the language of each page in CC. Then, we train a language model for each language using the corresponding part of the Wikipedia corpus and use it to filter documents, as Wenzek et al. (2019) did. We use one CC dump for English and twelve CC dumps for other languages. This leads to a 2,500G multilingual corpus covering 89 languages. We also include the 101G multilingual corpus described in Section 2.1.1.

We reuse the bilingual corpus described in Section 2.1.1. We will add CCMatrix Schwenk et al. (2019) in the future.

2.2 Downstream Tasks

We select 11 cross-lingual tasks for XGLUE, which are categorized into 3 groups: single-input understanding tasks, pair-input understanding tasks, and generation tasks. For each task, the training set is available in English only. To perform well on XGLUE, a model must learn a task from its English training set and then transfer this ability to test sets in other languages. Table 2 gives the dataset statistics and Table 3 lists the languages covered by all tasks.

We select a subset of the following two NER tasks, CoNLL-2002 NER Sang (2002) and CoNLL-2003 NER Sang and De Meulder (2003), to form this cross-lingual NER dataset. It covers 4 languages, including English, German, Spanish and Dutch, and 4 types of named entities, including Person, Location, Organization and Miscellaneous entities that do not belong to the previous three types. F1 score is used as the metric.

Following Kim et al. (2017), we select a subset of Universal Dependencies (UD) Treebanks (v2.5) Zeman et al. (2019), which covers 18 languages. Accuracy (ACC) of the predicted POS tags is used as the metric.

This task aims to predict the category of a given news article. It covers 5 languages, including English, Spanish, French, German and Russian. Each labeled instance is a 3-tuple: $<$ news title, news body, category $>$ . There are 10 categories. We crawl this dataset from a commercial news website. Accuracy (ACC) of the multi-class classification is used as the metric.

2.2.2 Pair-input Understanding Tasks

The MLQA Lewis et al. (2019b) is a multilingual machine reading comprehension task, which contains QA annotations labeled in 7 languages, including English, Arabic, German, Spanish, Hindi, Vietnamese and Chinese. F1 score of the predicted answers is used as the metric.

We reuse the original XNLI dataset Conneau et al. (2018) in XGLUE.

The PAWS-X Yang et al. (2019a) is a paraphrase identification dataset, which extends the Wikipedia portion of the PAWS Zhang et al. (2019) evaluation to more languages. We select 4 languages, including English, Spanish, French and German, from the original dataset and use them in XGLUE. Accuracy (ACC) of the binary classification is used as the metric.

This task aims to predict whether an advertisement (ad) is relevant to an input query. It covers 3 languages, including English, French and German. Each labeled instance is a 4-tuple: $<$ query, ad title, ad description, label $>$ . The label indicates whether the ad is relevant to the query (Good), or not (Bad). We construct this dataset based on a commercial search engine. Accuracy (ACC) of the binary classification is used as the metric.

This task aims to predict whether a web page is relevant to an input query. It covers 7 languages, including English, German, French, Spanish, Italian, Portuguese and Chinese. Each labeled instance is a 4-tuple: $<$ query, web page title, web page snippet, label $>$ . The relevance label takes one of 5 ratings: Perfect (4), Excellent (3), Good (2), Fair (1) and Bad (0). We construct this dataset based on a commercial search engine. Normalized Discounted Cumulative Gain (nDCG) is used as the metric.
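As an illustration of how a graded-relevance metric of this kind can be computed for a single query, the sketch below implements one common nDCG variant. The gain function $2^{rel}-1$, the $\log_2$ position discount, and the function names are assumptions made for this example; the text does not specify which nDCG variant or cutoff XGLUE uses.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and log2 discount (assumed variant)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(labels, scores):
    """nDCG for one query: rank documents by predicted score, normalise by the ideal ranking."""
    # Order the graded labels by descending model score.
    ranked = [rel for _, rel in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    ideal_dcg = dcg(sorted(labels, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical labels on the 0 (Bad) to 4 (Perfect) scale for four pages of one query.
labels = [4, 0, 2, 1]
perfect_scores = [0.9, 0.1, 0.7, 0.3]   # ranks pages exactly by their labels
reversed_scores = [0.1, 0.9, 0.3, 0.7]  # ranks pages in the worst possible order
print(ndcg(labels, perfect_scores), ndcg(labels, reversed_scores))
```

A ranking that orders pages by their true labels scores 1.0, and any other ordering scores lower. A benchmark-level number would average this value over queries.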

This task aims to predict whether a $<$ question, passage $>$ pair is a QA pair. It covers 3 languages, including English, French and German. Each labeled instance is a 3-tuple: $<$ question, passage, label $>$ . The label indicates whether the passage is the answer of the question (1), or not (0). We construct this dataset based on a commercial search engine. Accuracy (ACC) of the binary classification is used as the metric.

2.2.3 Generation Tasks

This task aims to generate a question for a given passage. We collect $<$ passage, question $>$ pairs from a commercial search engine. It covers 6 languages, including English, French, German, Spanish, Italian and Portuguese. BLEU-4 score is used as the metric.

This task aims to generate a proper title for a given news body. We collect $<$ news body, news title $>$ pairs from a commercial news website. It covers 5 languages, including German, English, French, Spanish and Russian. BLEU-4 score is used as the metric.

Pre-train Unicoder for Cross-lingual Understanding Tasks

We select Unicoder Huang et al. (2019) as the backbone model. Section 3 introduces a simplified version of Unicoder using two pre-training tasks (MLM and TLM) for cross-lingual understanding tasks. Section 4 describes how to extend Unicoder to cover cross-lingual generation tasks.

The original Unicoder Huang et al. (2019) includes more pre-training tasks besides MLM and TLM. To keep the baseline pre-trained model simple and to reduce the experimental cost, we use only MLM and TLM in this paper. This means that, for understanding tasks, Unicoder is almost identical to XLM, except for some hyper-parameter differences.

Following Devlin et al. (2019), this task extends the masked language model task to multiple languages. At each iteration, a batch is composed of sentences sampled from different languages. The sampling probability of a language $l_{i}$ is defined as $\lambda_{l_{i}}=p_{l_{i}}^{\alpha}/\sum_{l_{j}}p_{l_{j}}^{\alpha}$ , where $p_{l_{i}}$ is the percentage of the language $l_{i}$ in the entire corpus and the smoothing factor $\alpha$ is set to 0.3. For each batch, we randomly sample 15% of the words and then (i) replace them with a special symbol [MASK], (ii) replace them with a random token, or (iii) keep them unchanged, with probability 80%, 10% and 10%, respectively. For each token, we use only its token embedding and position embedding, and discard the segment embedding and language embedding.
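The sampling rule and the 80/10/10 masking scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy corpus shares, and the choice to select each position independently with probability 0.15 (rather than sampling exactly 15% of the words) are assumptions made for the example.

```python
import random

def language_sampling_probs(corpus_share, alpha=0.3):
    """Smooth corpus shares p_l into sampling probabilities p_l^alpha / sum_k p_k^alpha."""
    weights = {lang: share ** alpha for lang, share in corpus_share.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """Pick about 15% of positions; 80% become [MASK], 10% a random token, 10% stay unchanged."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue
        targets[i] = tok  # the model is trained to recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # otherwise the token is kept as-is but still counted as a prediction target
    return corrupted, targets

# Hypothetical corpus shares: one high-resource and two lower-resource languages.
shares = {"en": 0.80, "sw": 0.01, "ur": 0.19}
probs = language_sampling_probs(shares)
print(probs)

rng = random.Random(0)
tokens = [f"w{i}" for i in range(40)]
corrupted, targets = mask_tokens(tokens, vocab=tokens, rng=rng)
print(corrupted, targets)
```

With $\alpha<1$, low-resource languages are sampled more often than their raw corpus share would suggest, and high-resource languages less often.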

3.2 Translation Language Model (TLM)

Following Conneau and Lample (2019), this task extends the MLM task to a bilingual corpus. Given a bilingual sentence pair, TLM first concatenates the two sentences into a single sequence, and then masks words using the same strategy as MLM. The pre-trained model learns to recover each masked word based on the bilingual context. As in MLM, we sample language pairs in each batch with $\alpha=0.3$ .

Pre-train Unicoder for Cross-lingual Generation Tasks

The encoder-decoder architecture is employed to extend Unicoder to generation tasks, where the BPE embeddings are shared between encoder and decoder. Two separate generative tasks are proposed for Unicoder pre-training: Multilingual Denoising Auto-Encoding (xDAE) and Multilingual Future N-gram Prediction (xFNP).

Motivated by BART Lewis et al. (2019a), xDAE aims to predict the original text $X=(x_{1},x_{2},...,x_{|X|})\in l_{i}$ from a language $l_{i}$ based on its corrupted form $c(X)$ , where $c(X)$ is a noising function that corrupts an input text $X$ as its output.

Four different text noising strategies for $c(\cdot)$ are explored in this paper. (1) Shuffle the input text $X$ by adding a noise $\alpha\sim{\rm U}(0,3)$ to the input indices and then re-ordering $X$ based on the rank of the noised indices. (2) Drop words with a probability of 0.1. (3) Replace 10 $\%$ of the input words in $X$ with the [MASK] symbol. (4) Sample a number of token spans from $X$ with span lengths drawn from a Poisson distribution ( $\lambda=3$ ), and then replace each token span with a single [MASK] token. Here, 0-length spans correspond to the insertion of [MASK] tokens. Based on the performance of different noising strategies (Table 10), we select (4) and use it in pre-training. We leave finding better text noising strategies for future work.
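The four noising strategies can be sketched on a list of tokens as below. This is an illustrative approximation under stated assumptions: the function names are invented for the example, strategies (2) and (3) are applied independently per word, and `span_rate` (the probability of starting a span at a position) is a hypothetical knob, since the text does not say how many spans are sampled.

```python
import math
import random

MASK = "[MASK]"

def shuffle_locally(tokens, rng, max_shift=3.0):
    """Strategy (1): add U(0, max_shift) noise to each index and re-order by the noised indices."""
    noised = [i + rng.uniform(0.0, max_shift) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: noised[i])
    return [tokens[i] for i in order]

def drop_words(tokens, rng, p=0.1):
    """Strategy (2): drop each word independently with probability p."""
    return [t for t in tokens if rng.random() >= p]

def mask_words(tokens, rng, p=0.1):
    """Strategy (3): replace each word with [MASK] with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]

def sample_poisson(rng, lam=3.0):
    """Draw a Poisson(lam) integer with Knuth's multiplication method (standard library only)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def infill_spans(tokens, rng, span_rate=0.1, lam=3.0):
    """Strategy (4): replace each sampled span with one [MASK]; a 0-length span inserts a [MASK]."""
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < span_rate:  # span_rate is an assumption, not from the paper
            out.append(MASK)
            length = sample_poisson(rng, lam)
            if length > 0:
                i += length  # the whole span collapses into the single [MASK]
                continue
        out.append(tokens[i])
        i += 1
    return out

rng = random.Random(0)
toks = [f"w{i}" for i in range(50)]
shuffled = shuffle_locally(toks, rng)
dropped = drop_words(toks, rng)
masked = mask_words(toks, rng)
infilled = infill_spans(toks, rng)
print(infilled)
```

Because the shuffle noise is below 3, a token can move at most two positions from where it started, so strategy (1) perturbs word order only locally.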

We train Unicoder using this task by maximizing the following loss function $\mathcal{L}_{xDAE}$ :

$$\mathcal{L}_{xDAE}=\sum_{l_{i}\in L}\sum_{X\in l_{i}}\sum_{t=1}^{|X|}\log p(x_{t}|x_{<t},c(X))$$

where $L=\{l_{1},...,l_{N}\}$ denotes $N$ languages, $X$ is an instance in the $i^{th}$ language $l_{i}$ , and $p(x_{t}|x_{<t},c(X))$ denotes the probability of generating a single token $x_{t}$ at time step $t$ given $c(X)$ and $x_{<t}$ .

4.2 Multilingual Future N-gram Prediction (xFNP)

Motivated by ProphetNet Yan et al. (2020), xFNP introduces a future n-gram prediction mechanism to natural language generation. It encourages the model to plan for the future tokens explicitly and prevents over-fitting on strong local correlations.

Given an input text $X=(x_{1},x_{2},...,x_{|X|})\in l_{i}$ from a language $l_{i}$ , we randomly mask $k$ token spans of $X$ to generate the masked text $X^{\prime}$ as the input, and concatenate all masked token spans into $Y$ as the output. Details of this mask strategy are described in Section 6.1. After this, xFNP first encodes $X^{\prime}$ to $H_{enc}$ with the encoder:

$$H_{enc}=\mathrm{Encoder}(X^{\prime})$$

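Then, instead of predicting only the next token at each time step, xFNP generates $n$ future tokens simultaneously at time step $t$ with the decoder:

$$p(y_{t}|y_{<t},X^{\prime}),\dots,p(y_{t+n-1}|y_{<t},X^{\prime})=\mathrm{Decoder}(y_{<t},H_{enc})$$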

Following Yan et al. (2020), we set $n=2$ .

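We train Unicoder using this task by maximizing the following loss function $\mathcal{L}_{xFNP}$ :

$$\mathcal{L}_{xFNP}=\sum_{l_{i}\in L}\sum_{X\in l_{i}}\Big\{\alpha_{0}\cdot\sum_{t=1}^{|Y|}\log p(y_{t}|y_{<t},X^{\prime})+\alpha_{1}\cdot\sum_{t=1}^{|Y|-1}\log p(y_{t+1}|y_{<t},X^{\prime})\Big\}$$

where $X^{\prime}$ and $Y$ are generated from $X$ based on the method mentioned above. Following Yan et al. (2020), we set $\alpha_{0}=\alpha_{1}=1$ .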
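To make the role of the two weights concrete, here is a toy sketch of a two-stream objective for $n=2$ on a single target sequence. It assumes a ProphetNet-style form: a weighted sum of the log-likelihood of the next token and of the token after it. The function name and the dictionary-based probability tables are hypothetical stand-ins for real decoder outputs.

```python
import math

def xfnp_log_likelihood(target, next_probs, future_probs, alpha0=1.0, alpha1=1.0):
    """Weighted log-likelihood of `target` under two prediction streams (n = 2).

    next_probs[t]   maps token -> probability of y_t     given y_<t and the masked input
    future_probs[t] maps token -> probability of y_{t+1} given y_<t and the masked input
    """
    # Main stream: standard next-token log-likelihood over all positions.
    main = sum(math.log(next_probs[t][target[t]]) for t in range(len(target)))
    # Future stream: one step further ahead, so the last position has no target.
    future = sum(math.log(future_probs[t][target[t + 1]]) for t in range(len(target) - 1))
    return alpha0 * main + alpha1 * future

# Toy two-token target with hand-written probability tables (hypothetical values).
target = ["a", "b"]
next_probs = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
future_probs = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]
value = xfnp_log_likelihood(target, next_probs, future_probs)
print(value)
```

Setting `alpha1=0` in this sketch recovers ordinary next-token training, which is one way to see what the future stream adds.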

Related Work

GLUE Wang et al. (2019) includes 9 natural language understanding tasks that are labeled in English only. Compared with GLUE, XGLUE not only expands task annotations to multiple languages, but also includes natural language generation tasks. XNLI Conneau et al. (2018), NER Sang (2002); Sang and De Meulder (2003), POS Tagging Kim et al. (2017), MLQA Lewis et al. (2019b) and PAWS-X Yang et al. (2019a) are 5 multilingual datasets built for specific tasks. XGLUE not only includes these 5 existing tasks, but also introduces 6 new tasks selected from real-world scenarios (i.e., Search, Ads and News), which gives XGLUE more practical value. XTREME Hu et al. (2020) is concurrent work. Compared with it, XGLUE includes both understanding and generation tasks, which, to the best of our knowledge, is the first such attempt in cross-lingual dataset construction.

Multilingual BERT (M-BERT) Devlin et al. (2019) performs pre-training on a multilingual corpus with the masked language model task. By sharing the model parameters and the vocabulary across all languages, M-BERT obtains cross-lingual capability over 102 languages. XLM Conneau and Lample (2019) performs cross-lingual pre-training on multilingual and bilingual corpora by introducing the translation language model task into pre-training. Based on XLM, Unicoder Huang et al. (2019) uses more cross-lingual pre-training tasks and achieves better results on XNLI. XLM-R Conneau et al. (2019) is a RoBERTa-style Liu et al. (2019) version of XLM that does not use the translation language model in pre-training. It is trained on a much larger multilingual corpus (i.e., Common Crawl) and becomes the new state of the art on XNLI. In this paper, we use both the Common Crawl corpus and the bilingual corpus, aiming to build a stronger baseline model on XGLUE. BART Lewis et al. (2019a) and ProphetNet Yan et al. (2020) are two recent generative pre-trained models. We borrow ideas from these two works and extend Unicoder to cross-lingual generation tasks, which goes a step further to verify and explore different text generation approaches in the cross-lingual scenario.

Experiments

The hyper-parameters are set as follows: 768 hidden units, 12 heads, GELU activation, a dropout rate of 0.1, 512 max input length, 12 layers in encoder.

In the pre-training stage, we first initialize UnicoderLC with XLM-Rbase Conneau et al. (2019), and then continue pre-training with an accumulated batch size of 8,192, obtained through gradient accumulation. We use the Adam optimizer with a linear warm-up Vaswani et al. (2017) and set the learning rate to 3e-5. We select different understanding tasks randomly in different batches.

In the fine-tuning stage, the batch size is set to 32. We use the Adam optimizer Kingma and Ba (2014) with warm-up and set the learning rate to 5e-6. For all sentence classification tasks, we fine-tune for 10 epochs. For POS Tagging and NER, we fine-tune for 20 epochs, and for POS Tagging we set the learning rate to 2e-5. For MLQA, we set the learning rate to 3e-5 and the batch size to 12, and train for 2 epochs, following BERT for SQuAD. After each epoch, we test the fine-tuned model on the dev sets of all languages, and we select the model with the best average result across them.
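The checkpoint selection rule at the end of this paragraph amounts to an argmax over per-epoch averages. A minimal sketch follows; the function name and the dev scores are hypothetical and are not results from the paper.

```python
def select_checkpoint(dev_scores):
    """Return the epoch whose dev score, averaged over all languages, is highest."""
    def average(scores):
        return sum(scores.values()) / len(scores)
    return max(dev_scores, key=lambda epoch: average(dev_scores[epoch]))

# Hypothetical per-epoch dev accuracies for three languages.
dev_scores = {
    1: {"en": 80.0, "de": 70.0, "fr": 72.0},
    2: {"en": 82.0, "de": 73.0, "fr": 74.0},
    3: {"en": 83.0, "de": 71.0, "fr": 73.0},
}
best_epoch = select_checkpoint(dev_scores)
print(best_epoch)
```

In this toy example epoch 3 has the best English score, but epoch 2 is selected because its average over all languages is higher.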

We evaluate Unicoder ${}_{SC}^{xDAE}$ and Unicoder ${}_{SC}^{xFNP}$ as two separate models.

For Unicoder ${}_{SC}^{xDAE}$ , the hyper-parameters are set as follows: 768 hidden units, 12 heads, GELU activation, a dropout rate of 0.1, 512 max input length, 12 layers in encoder, 12 layers in decoder.

In the pre-training stage, we first initialize the encoder and decoder with XLM-R Conneau et al. (2019), and then continue pre-training with a batch size of 1,024. We use the Adam optimizer with warm-up and set the learning rate to 2e-4.

In the fine-tuning stage, the batch size is 1024. We use Adam Optimizer Kingma and Ba (2014) with learning rate 1e-5 and warm-up steps 2000.

For Unicoder ${}_{SC}^{xFNP}$ , the hyper-parameters are set as follows: 1,024 hidden size, 12 layers in encoder, 12 layers in decoder, 512 max input length.

In the pre-training stage, we pre-train the model from scratch and follow ProphetNet Yan et al. (2020) to randomly mask a continuous span (with a fixed length of 9) in every 64 tokens. About 15% of the tokens in the original sequence are masked in this step. We replace 80% of the masked tokens with a special symbol [MASK], keep 10% unchanged, and replace the remaining 10% with random tokens. We set the batch size to 1,024 and the number of training steps to 350,000. The learning rate is set to 1e-4. We set the number of future tokens $n$ to 2.
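The span-masking arithmetic can be checked with a short sketch: one span of 9 tokens per block of 64 masks 9/64, roughly 14%, of the sequence, in line with the "about 15%" figure. The function name is hypothetical, and for brevity this sketch replaces every selected token with [MASK] rather than applying the 80/10/10 split.

```python
import random

def mask_fixed_spans(tokens, rng, block=64, span=9, mask="[MASK]"):
    """Mask one contiguous span of `span` tokens inside every full block of `block` tokens."""
    corrupted, spans = list(tokens), []
    for start in range(0, len(tokens) - block + 1, block):
        begin = start + rng.randrange(block - span + 1)  # span stays inside its block
        spans.append(tokens[begin:begin + span])          # decoder target for this block
        corrupted[begin:begin + span] = [mask] * span
    return corrupted, spans

rng = random.Random(0)
tokens = [f"t{i}" for i in range(512)]  # a toy sequence at the 512-token input limit
corrupted, spans = mask_fixed_spans(tokens, rng)
masked_ratio = corrupted.count("[MASK]") / len(tokens)
print(len(spans), masked_ratio)
```

Concatenating the entries of `spans` gives a target sequence of the kind the decoder is trained to generate.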

In the fine-tuning stage, we use Adam Optimizer Kingma and Ba (2014) and set the learning rate to 1e-4. We set the batch size to 64 and the warm-up steps to 1,000.

6.2 Main Result

Seven cross-lingual pre-trained models are evaluated on XGLUE and compared in Table 4: 12-layer M-BERT Devlin et al. (2019) trained on the Wikipedia corpus for 102 languages, 12-layer XLM Conneau and Lample (2019) trained on Wikipedia and bilingual corpora for 15 languages, 12-layer XLM-Rbase Conneau et al. (2019) trained on the Common Crawl corpus for 100 languages, 12-layer UnicoderSC trained on the small corpus for 100 languages, 12-layer UnicoderLC trained on the large corpus for 100 languages, and 12-layer Unicoder ${}_{SC}^{xDAE}$ and 12-layer Unicoder ${}_{SC}^{xFNP}$ trained on the Wikipedia corpus for 100 languages. Given a downstream task, each pre-trained model is fine-tuned using its English training set and then applied to all test sets in different languages. Note that all results are reproduced in this paper, except that the XLM ${\dagger}$ results on XNLI are taken from Conneau and Lample (2019).

We find that (1) UnicoderLC performs slightly better than M-BERT and XLM-Rbase on the 9 understanding tasks, as it is pre-trained on multilingual and bilingual corpora at the same time and uses TLM; (2) UnicoderLC performs better than UnicoderSC, as it is pre-trained on the larger corpus; (3) Unicoder ${}_{SC}^{xDAE}$ and Unicoder ${}_{SC}^{xFNP}$ show good cross-lingual transfer capabilities and perform significantly better than M-BERT and XLM-Rbase on the 2 generation tasks, which indicates the importance of introducing generation tasks into pre-training for cross-lingual text generation; (4) Unicoder ${}_{SC}^{xFNP}$ performs slightly better than Unicoder ${}_{SC}^{xDAE}$ . This is not a fair comparison, however, because the two models use different text denoising tasks (sentence prediction vs. span prediction) and different generation mechanisms (single-token prediction vs. multi-token prediction). We leave combining these two tasks for future work.

6.3 Ablation Study

We define pivot-language (pl) fine-tuning as fine-tuning a pre-trained model for a downstream task using its labeled data in a pivot language (e.g., English) and then applying the fine-tuned model to all languages. Table 4 chooses English as the pivot language, as all tasks in XGLUE have labeled data in English. But is English always the optimal choice? Would the results improve if we fine-tuned using other pivot languages?

To answer these questions, we evaluate Unicoder on XNLI and NTG using different pivot languages in fine-tuning and list comparison results in Table 5 and Table 6, respectively. (1) For each test set in language $l_{i}$ in Table 5 and Table 6, its best result is often achieved when the model is fine-tuned using $l_{i}$ as the pivot language; (2) For XNLI in Table 5, the best pivot languages are Spanish (es), Greek (el) and Turkish (tr), rather than English (en). For NTG in Table 6, the best pivot language is French (fr) for both Unicoder ${}_{SC}^{xDAE}$ and Unicoder ${}_{SC}^{xFNP}$ . It means the average quality of a cross-lingual pre-trained model could be further improved on a downstream task, by selecting a specific pivot language in fine-tuning.

6.3.2 Multi-language Fine-tuning

We define multi-language (ml) fine-tuning as fine-tuning a pre-trained model for a downstream task using all its available labeled data in different languages. We evaluate Unicoder on XNLI and NTG using this fine-tuning method and list evaluation results in Table 7 and Table 8, respectively.

We find multi-language fine-tuning can achieve better results than pivot-language fine-tuning on both XNLI and NTG. It means the average quality of a cross-lingual pre-trained model could be significantly improved on a downstream task, by using combined labeled data in multiple languages.

6.3.3 Multi-task Fine-tuning

We define multi-task fine-tuning as fine-tuning a pre-trained model for multiple downstream tasks using their combined labeled data. To reduce the experimental cost, we evaluate Unicoder on the following 5 understanding tasks: XNLI, PAWS-X, NC, QAM and QADSM, using their merged English labeled data in fine-tuning. Results are listed in Table 9.

We find PAWS-X and QADSM can benefit from the joint fine-tuning strategy, but XNLI, NC and QAM cannot. We leave discovering relationships between different tasks for better downstream task fine-tuning for future work.

6.3.4 Impacts of Text Noising Strategies

We investigate the impacts of different text noising strategies (Section 4.1) in Unicoder ${}_{SC}^{xDAE}$ , and list comparison results in Table 10, where (1)+(2)+(3) denotes the result of using the first three strategies in pre-training, (4) denotes the result of using the last strategy in pre-training, and (1)+(2)+(3)+(4) denotes the result of using all strategies in pre-training. To reduce the experimental cost, we set the max sequence length to 256 and train for only 60K steps. We find that (4) achieves the best average result on NTG. All results of Unicoder ${}_{SC}^{xDAE}$ reported in this paper are therefore obtained with a model pre-trained using (4) only.

We also compare Unicoder ${}_{SC}^{xDAE}$ with XNLG Chi et al. (2019) on the Abstractive Summarization task. For a fair comparison, we implement xDAE in the same code base and use the same pre-training languages as XNLG. The zero-shot comparison results are listed in Table 11. We can see that, by using xDAE only in pre-training, Unicoder ${}_{SC}^{xDAE}$ significantly outperforms XNLG, which is pre-trained using 4 tasks: MLM, DAE, XMLM and XAE. This verifies the effectiveness of the fourth text noising strategy described in Section 4.1 for generation tasks.

Conclusion

We present XGLUE as a new cross-lingual benchmark and conduct comprehensive evaluations with interesting findings observed. We thank STC-A NLP, Bing Answers, Bing Ads, Bing Relevance and Microsoft News for providing the datasets.