Few-shot Learning with Multilingual Language Models

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li

cs.CL cs.AI

Introduction

Large autoregressive language models such as GPT-3 can be adapted, via few- and zero-shot learning, to a wide range of tasks at significantly lower cost than full fine-tuning Brown et al. (2020); Bommasani et al. (2021). These models have been primarily developed for English. Although the training data of GPT-3 contains a small percentage of non-English text (7%), which allows it to achieve some promising cross-lingual generalization, the model is almost exclusively deployed for use cases in English. Multilingual masked and sequence-to-sequence language models have been studied, including mBERT, XLM-R, mT5, and mBART Devlin et al. (2019); Conneau et al. (2020); Xue et al. (2020); Fedus et al. (2021); Goyal et al. (2021a); Liu et al. (2020). These models are typically fine-tuned on large amounts of labeled data in downstream tasks. Despite notable recent work at smaller scales Zhao and Schütze (2021) and for domain-specific tasks Winata et al. (2021), the multilingual few-shot learning capabilities of language models are less well understood.

In this paper, we train four multilingual generative language models (up to 7.5 billion parameters), XGLMs, and present a comprehensive study of multilingual zero- and in-context few-shot learning. We train the models using a large-scale corpus of 500B tokens that comprises 30 diverse languages, up-sampling the less-resourced languages to render a more balanced language representation. We evaluate the models on multiple multilingual natural language understanding (NLU) tasks, machine translation and a subset of English tasks demonstrated in Brown et al. (2020).

We find that XGLM demonstrates strong cross-lingual capability: using English prompts together with non-English examples yields competitive zero- and few-shot learning performance. Our largest model (XGLM ${}_{\text{7.5B}}$ ) achieves strong zero- and few-shot learning performance on language completion and inference tasks (e.g. XStoryCloze: 65.4% 0-shot, 66.5% 4-shot; XNLI: 46.3% 0-shot, 47.3% 4-shot). It also establishes a new state-of-the-art on few-shot machine translation across a large number of language pairs in the FLORES-101 benchmark Goyal et al. (2021b), significantly outperforming the GPT-3 model of comparable size (6.7 billion parameters). On the other hand, multilingual pre-training causes a performance drop on English. On 8 English NLU tasks, XGLM ${}_{\text{7.5B}}$ underperforms GPT-3 ${}_{\text{6.7B}}$ by 10.9% on average in zero-shot learning. GPT-3 ${}_{\text{6.7B}}$ also surpasses XGLM ${}_{\text{7.5B}}$ in machine translation on several high-resource language pairs, including WMT-14 en $\leftrightarrow$ fr, WMT-16 en $\leftrightarrow$ de and WMT-19 en $\leftrightarrow$ zh.

We conduct an in-depth analysis of different multilingual prompting approaches and examine cross-lingual transfer through template and demonstration examples respectively. We show that non-English templates sometimes yield unexpectedly low zero- and few-shot learning accuracy even if they are crafted by native speakers (§4.3). Both using the English template (§4.4) and adding demonstration examples (§4.5) provide an effective remedy. However, using demonstration examples from another language often cannot further improve the zero-shot learning performance when a strong prompting language (e.g. English) is used, which indicates room for improvement in cross-lingual pre-training and in-context transfer approaches.

Models and Pre-training Data

We extend the pipeline used for mining the CC100 corpus (Conneau et al., 2020; Wenzek et al., 2020) to generate CC100-XL, a significantly larger multilingual dataset covering 68 Common Crawl (CC) snapshots (from Summer 2013 to March/April 2020) and 134 languages. Our pretraining data include 30 languages covering 16 language families. The natural data distribution is skewed, with the number of English tokens being 6 times that of the second largest language. Following previous work on multilingual pre-training Conneau et al. (2020); Liu et al. (2020), we up-sampled the medium and low resource languages to create a more balanced language distribution (Appendix F.1). (We inadvertently over-sampled some of the less resourced languages, which is reflected in the statistics of the ko, fi, th, bg, ca, hi and et languages, as shown in Figure 1. We did not ablate the effect of this mistake due to the extreme computational cost. Studying optimal language balancing is an important area for future work.) Figure 1 shows the language distribution of our pre-training data before (blue) and after (green) up-sampling.

We process all languages with a joint vocabulary of size 250k created through unigram language modeling Kudo (2018), using the SentencePiece library Kudo and Richardson (2018). We train the unigram-LM model using 10 million sentences randomly sampled from the filtered data, according to the multinomial distribution defined in Lample and Conneau (2019) with $\alpha=0.3$ .
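As an illustration of the sampling distribution referenced above, the following minimal sketch computes exponentially smoothed probabilities $q_i = p_i^{\alpha} / \sum_j p_j^{\alpha}$, the form given by Lample and Conneau (2019), where $p_i$ is the raw share of language $i$. The function name and the sentence counts are hypothetical placeholders for illustration, not statistics of CC100-XL.

```python
def smoothed_sampling_probs(sentence_counts, alpha=0.3):
    """Return q_i = p_i**alpha / sum_j p_j**alpha, with p_i the raw share of language i."""
    total = sum(sentence_counts.values())
    raw_share = {lang: n / total for lang, n in sentence_counts.items()}
    weights = {lang: p ** alpha for lang, p in raw_share.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical counts, for illustration only.
counts = {"en": 1_000_000, "sw": 10_000}
probs = smoothed_sampling_probs(counts, alpha=0.3)
```

With $\alpha < 1$ the smaller language receives a larger sampling probability than its raw share, which is the intended smoothing effect.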

2.2 Models and Training

We train decoder-only causal language models with the Transformer architecture similar to GPT-3 Brown et al. (2020). This allows us to study the effect of scaling up model size along both width and depth dimensions. As a result, we compare four models with 564M, 1.7B, 2.9B and 7.5B parameters, respectively. The architecture details are summarized in Table 1. Our models match the GPT-3 models except for the additional embedding parameters from a larger vocabulary. (For XGLM 2.9B we used the optimal depth-to-width parameter allocation for GPT-3 architectures based on rank bottleneck analysis Levine et al. (2020). This allocation is expected to have improved training efficiency. However, it did not converge for XGLM 7.5B in our experiments, and we fell back to the original GPT-3 setup.) All models are trained for up to 500B tokens, with a context length of 2048 tokens. Further training details are described in Appendix A.
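The effect of the larger vocabulary on parameter count can be estimated with a back-of-the-envelope calculation. The sketch below is only illustrative: the hidden size (4096) and the GPT-3 vocabulary size (50,257) are assumptions not stated in this section, and the 250k vocabulary is the rounded figure from the text.

```python
# Assumed values (not stated in this section).
D_MODEL = 4096        # assumed hidden size of the 6.7B-scale architecture
GPT3_VOCAB = 50_257   # assumed GPT-3 BPE vocabulary size
XGLM_VOCAB = 250_000  # "250k" joint vocabulary from the text (rounded)


def embedding_params(vocab_size, d_model):
    """Number of parameters in a token embedding matrix."""
    return vocab_size * d_model


extra = embedding_params(XGLM_VOCAB, D_MODEL) - embedding_params(GPT3_VOCAB, D_MODEL)
extra_billions = extra / 1e9
```

Under these assumptions the larger vocabulary adds roughly 0.8B embedding parameters, which is in line with the difference between a 6.7B and a 7.5B model.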

Multilingual In-context Learning

We measure the performance of our multilingual language models on downstream tasks in different languages given the tasks and few-shot demonstrations specified via prompts without further parameter updates (Appendix B).

Previous work on English in-context learning has shown that performance heavily depends on the prompt construction, and it is challenging to find the optimal prompt for a given language model Gao et al. (2021); Perez et al. (2021). This problem is further complicated in the multilingual setting, where we need to find the optimal prompts for examples in different languages.

In this work, we consider three approaches for obtaining the prompts for non-English tasks.

The first approach is to ask native speakers of the target language to handcraft the prompts. Prompts created this way are expected to have the most natural surface form. However, language expertise is expensive and we further consider two alternatives.

The second approach is to translate from English prompts. We assume high-quality prompts of a task can be easily sourced in English Sanh et al. (2021); Mishra et al. (2021). Non-verbal prompts do not contain words in any particular language (e.g. the StoryCloze and WMT prompts shown in Table 2), while verbal prompts have different realizations in different languages (Table 3). If the prompt is non-verbal, we simply apply it to the other languages. If the prompt is verbal, we translate it into the other languages using automatic translation APIs.

The third approach directly applies the prompts in English (or another high-resource language) to non-English examples. We expect this approach to be competitive, as a result of the cross-lingual capability of the model after being trained on a diverse set of languages.

3.2 Learning from Cross-lingual Demonstrations

The cross-lingual nature of multilingual language models further makes it possible to learn from a different language in context without parameter updates. To do so, we simply append examples from another language as the demonstration examples in the language model context. Such a capability enables cheap transfer from high-resource languages to the low-resource target languages.
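A minimal sketch of how such a context could be assembled is given below. The template string, the field names and the example sentences are hypothetical and chosen only for illustration; they are not the prompts used in the experiments.

```python
# Hypothetical English template; the field names are illustrative assumptions.
TEMPLATE = "Premise: {premise} Hypothesis: {hypothesis} Answer: "


def build_context(template, demonstrations, test_example):
    """Concatenate labeled demonstrations, then the unlabeled test example."""
    parts = [template.format(**demo) + demo["label"] for demo in demonstrations]
    parts.append(template.format(**test_example))
    return "\n".join(parts)


# Demonstrations in one language (English), test example in another (Spanish).
english_demos = [
    {"premise": "A dog is running.", "hypothesis": "An animal is moving.", "label": "Yes"},
]
spanish_test = {"premise": "Un niño lee un libro.", "hypothesis": "El niño está durmiendo."}

context = build_context(TEMPLATE, english_demos, spanish_test)
```

The language model would then score or complete the final, unlabeled line; no parameters are updated at any point.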

Experiments and Results

We evaluate the zero-shot and in-context few-shot learning capabilities Brown et al. (2020) of XGLM on a spectrum of downstream tasks (Table 4).

We select four multilingual tasks spanning commonsense reasoning (XCOPA), anaphora resolution (XWinograd), natural language inference (XNLI) and paraphrasing (PAWS-X). We also created a new dataset, XStoryCloze, by professionally translating the validation split of the English StoryCloze dataset (Spring 2016 version) to 10 other typologically diverse languages (ru, zh Simplified, es Latin American, ar, hi, id, te, sw, eu, my). We further split the translated data into train and test (20% vs. 80%, respectively) for each language, keeping the parallel sentence mapping in both splits. (For all of our multilingual NLU datasets, the non-English sections of the data are professionally translated from the English section. Despite being the dominant approach adopted by the community Ruder et al. (2021), it was previously shown to introduce data artifacts that inflate the measured cross-lingual transfer of models Artetxe et al. (2020). We leave collecting native multilingual datasets that include non-English data as future work, and strongly encourage the community to also adopt this practice.) In addition, we evaluate our models on machine translation (§4.8) and multilingual social value tasks (Appendix E.1).
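One way to obtain a 20%/80% split that keeps parallel examples aligned across languages is to draw a single index partition and reuse it for every language. The sketch below assumes a random partition with a fixed seed; the text does not state how the partition was chosen, and the function name and toy data are hypothetical.

```python
import random


def parallel_split(data_by_lang, train_fraction=0.2, seed=0):
    """Split parallel datasets with one shared index partition, so translations
    of the same example land in the same split for every language."""
    sizes = {len(examples) for examples in data_by_lang.values()}
    if len(sizes) != 1:
        raise ValueError("all languages must contain the same number of examples")
    n = sizes.pop()
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_fraction)
    train_idx = sorted(indices[:n_train])
    test_idx = sorted(indices[n_train:])
    return {
        lang: {
            "train": [examples[i] for i in train_idx],
            "test": [examples[i] for i in test_idx],
        }
        for lang, examples in data_by_lang.items()
    }


# Toy parallel data: item i in "en" is the translation of item i in "sw".
toy = {
    "en": [f"en-{i}" for i in range(10)],
    "sw": [f"sw-{i}" for i in range(10)],
}
splits = parallel_split(toy)
```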

We also evaluate our models on English commonsense reasoning and QA, a subset of benchmark tasks used by Brown et al. (2020), and compare the performance to state-of-the-art English-centric few-shot learning models. The tasks are detailed in Table A1.

4.2 Setup

We follow the guidelines suggested by Perez et al. (2021) and adopt a cross-task generalization setting Triantafillou et al. (2020) to select our scoring function. We reserve three held-out tasks (XNLI, XCOPA and XStoryCloze) to perform the selection based on their development set performance, and directly apply the selected settings to the rest of the tasks. In the end, we use the averaged per-token log-probabilities ignoring the common prefix of different candidates as the scoring function for all multilingual tasks with no additional calibration or normalization. Appendix C.2 details the selection.
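The scoring function described above can be sketched as follows. This is an illustrative reading of the description, not the evaluation code itself: the per-token log-probabilities would come from the language model, and here they are supplied by hand with made-up values.

```python
def common_prefix_len(candidates):
    """Length of the token prefix shared by all candidate token sequences."""
    length = 0
    for tokens in zip(*candidates):
        if len(set(tokens)) != 1:
            break
        length += 1
    return length


def score_candidates(candidates, token_logprobs):
    """Average per-token log-probability of each candidate, skipping the shared prefix."""
    skip = common_prefix_len(candidates)
    scores = []
    for logprobs in token_logprobs:
        tail = logprobs[skip:]
        # Guard against a candidate fully covered by the shared prefix.
        scores.append(sum(tail) / len(tail) if tail else float("-inf"))
    return scores


def predict(candidates, token_logprobs):
    """Index of the candidate with the highest score."""
    scores = score_candidates(candidates, token_logprobs)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up tokens and log-probabilities, for illustration only.
candidates = [
    ["The", "sky", "is", "blue"],
    ["The", "sky", "is", "made", "of", "cheese"],
]
token_logprobs = [
    [-1.0, -0.5, -0.2, -0.4],
    [-1.0, -0.5, -0.2, -3.0, -0.1, -2.9],
]
scores = score_candidates(candidates, token_logprobs)
best = predict(candidates, token_logprobs)
```

Averaging over tokens keeps longer candidates from being penalized simply for their length, and skipping the common prefix restricts the comparison to the tokens where the candidates actually differ.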

We focus on benchmarking the 0- and 4-shot learning performance of the models on all tasks. For cross-lingual demonstration (§4.5), scaling law (§4.9) and translation (§4.8) we also report 1-shot and 32-shot performance. We report the average results across 5 runs, randomly sampling a different set of few-shot examples each time. Unless otherwise specified, we use few-shot examples in the same language as the target example. Appendix C.3 details our complete evaluation protocol.
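The averaging protocol can be sketched as below. The `evaluate` callable, the seeding scheme and the dummy evaluator are hypothetical stand-ins; a real evaluator would run the model on the test set with the given demonstrations and return its accuracy.

```python
import random
from statistics import mean


def few_shot_accuracy(evaluate, train_pool, k=4, num_runs=5, base_seed=0):
    """Average accuracy over runs, each with a freshly sampled set of k demonstrations."""
    accuracies = []
    for run in range(num_runs):
        rng = random.Random(base_seed + run)
        demos = rng.sample(train_pool, k) if k > 0 else []
        accuracies.append(evaluate(demos))
    return mean(accuracies)


# Dummy evaluator that records the sampled demonstrations and returns a fixed accuracy.
seen_demo_sets = []


def dummy_evaluate(demos):
    seen_demo_sets.append(tuple(demos))
    return 0.5


average = few_shot_accuracy(dummy_evaluate, train_pool=list(range(100)), k=4)
```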

4.3 Comparing Prompting Approaches

We first compare different multilingual prompting approaches proposed in §3.1 using XGLM ${}_{\text{7.5B}}$ on XNLI and XCOPA. (The original XCOPA release Ponti et al. (2020b) does not contain the English section. We added the English release from SuperGLUE Wang et al. (2019) to facilitate cross-lingual experiments.) Native speakers among the authors handcrafted the prompts for the following tasks: XNLI (en, zh, es and hi) and XCOPA (en, zh), as shown in Table 3. (The native speakers were instructed to create a prompt that converts the task into a natural cloze-style question in their native language, with no further restrictions.) We compare the performance of these human-written prompts to English prompts, machine-translated (MT) prompts and human-translated (HT) prompts.

Tables 5 and 6 show the performance of different prompting approaches. (Appendix D.1 provides the comparison between English prompts and the MT and HT prompts on the complete dev sets of XNLI and XCOPA.) English templates perform the best on average across languages for both tasks, except for the 4-shot setting of XCOPA, where they slightly underperform the machine-translated templates. On the XNLI task, the English template significantly improves the performance of Chinese (zh) and Hindi (hi) over their native templates and translated templates. Similar trends are observed for Thai (th) and Swahili (sw) on XCOPA. (The strong performance of English templates may be partially attributed to the fact that the non-English evaluation data on XNLI and XCOPA are obtained from translation. Testing how well the English templates perform on native non-English test sets is interesting future work.) For both tasks there exist languages where the native templates strongly outperform the English templates (Spanish (es) for XNLI and Chinese for XCOPA), indicating significant room for future work on language-specific prompt engineering.

4.4 Cross-lingual Transfer through Templates

We further examine whether the ability of universal prompting is English-specific and, in addition, what characterizes a language pair for which cross-lingual prompting can work. To this end, we apply each of the human-written non-English templates to the rest of the languages. As shown in Tables 5 and 6, using the Spanish prompt yields competitive 0- and 4-shot performance across all languages, with the 4-shot average performance being comparable to that of the English template. The Hindi template also achieves significantly above-random performance on the XNLI task for most languages (especially en). The Chinese template, however, achieves close-to-random performance for all languages on XNLI, as well as close-to-random performance for Thai (0-shot) and Swahili (0-shot) on XCOPA. We hypothesize that the common sub-tokens and the amount of code-switching text in the pre-training data play a significant role in enabling cross-lingual prompting. In general, high-resource languages with large amounts of pre-training data and vocabulary overlap with other languages act as better universal prompting languages. We leave a more systematic verification of this hypothesis to future work.

4.5 Cross-lingual Transfer through Demonstration Examples

We examine the capabilities of learning from cross-lingual demonstration examples (§3.2) of XGLM ${}_{\text{7.5B}}$ on XNLI. We examine two settings for each train-eval language pair: same-language-prompting, where the prompt templates and the example are in the same language, and source-language-prompting, where the prompt templates for both the demo and test examples are in the source language. We use the human-translated prompts for same-language-prompting.

Table 7 shows results on a subset of language pairs of XNLI, where we evaluate transfer through in-context demonstration examples from high-resource languages to lower-resourced ones, and between languages that are typologically similar. We report the difference between the 32-shot learning results and the 0-shot learning results. The non-English templates in this experiment are obtained via human translation. While they typically underperform the in-language few-shot setting (Figure A2), most cross-lingual few-shot settings significantly improve over the 0-shot setting for the target language. Bulgarian is an exception, as it does not benefit from Russian despite being in the same language family. Another language that does not work well in the cross-lingual settings is Swahili (low resource), for which we examined transfer from English (high resource) and Arabic (medium resource). In contrast, Thai (medium resource) and Urdu (low resource) significantly benefit from cross-lingual demonstrations. (Both Thai and Urdu obtained close-to-random zero-shot learning performance using the translated templates, which might make them easier to improve further. Besides, there is inherent code switching in these languages (English presence in Thai and Urdu, both lexical and morphological). Turkish and Arabic also have influence on Urdu. We hypothesize that these factors also positively impacted the cross-lingual in-context learning performance.)

We also observe that the benefit of cross-lingual transfer from demonstration examples is generally canceled if a better prompt (e.g. the English prompt) is used for the target language. We report the cross-lingual demonstration experiments between all pairs of languages for XNLI, XCOPA and XStoryCloze and provide more discussion in Appendix D.2.

4.6 Performance on Multi-lingual Tasks

Using English as the universal prompting language, we characterize the zero- and few-shot in-context learning capabilities of XGLM ${}_{\text{7.5B}}$ on XNLI, XCOPA and XStoryCloze and compare them to English-centric language models of comparable size.

We compare XGLM ${}_{\text{7.5B}}$ to GPT-3 ${}_{\text{6.7B}}$ on high, medium, low and extremely low resource languages. (We use GPT-3 Curie: https://blog.eleuther.ai/gpt3-model-sizes/.) The results are summarized in Tables 9 and 10. On all three tasks, XGLM ${}_{\text{7.5B}}$ outperforms GPT-3 ${}_{\text{6.7B}}$ by a large margin according to the average performance across languages, especially on medium, low and extremely low resource languages. On XNLI, GPT-3 ${}_{\text{6.7B}}$ performs well on English and similar languages, surpassing XGLM ${}_{\text{7.5B}}$ on en, de (4-shot), es (4-shot), fr (0-shot). A possible explanation is that these languages have significant presence in the GPT-3 training data (fr: 1.8%, de: 1.5%, es: 0.8% as shown in Figure 1) and can benefit more from the lexical cognates from English.

We also create a translate-test baseline, where we translate the non-English examples of the multilingual tasks to English using the Google Cloud Translation API (https://cloud.google.com/translate) and use GPT-3 ${}_{\text{6.7B}}$ repl., an in-house replication of GPT-3 ${}_{\text{6.7B}}$ , to perform inference. We found that translate-test is a strong baseline for multilingual zero- and few-shot learning, as shown in Tables 9 and 10. Across all three tasks, it significantly narrows the performance gap between English and other languages, especially on XNLI. (The performance of translate-test baselines might be inflated, given that MT systems are often trained on backtranslations, which makes them good at translating translationese (Edunov et al., 2019), which commonly exists in non-English evaluation data. Besides, the translate-test approach relies on high-quality machine translation (MT) systems trained on large amounts of parallel data.)
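The translate-test pipeline amounts to composing a translation step with an English-only predictor, as in the sketch below. Both callables are hypothetical placeholders: in the experiments the first role is played by a machine translation service and the second by an English-centric language model, neither of which is reproduced here.

```python
def translate_test(examples, translate_to_english, english_predict):
    """Translate every text field of each example to English, then predict in English."""
    predictions = []
    for example in examples:
        translated = {field: translate_to_english(text) for field, text in example.items()}
        predictions.append(english_predict(translated))
    return predictions


# Toy stand-ins so that the sketch runs end to end; both are hypothetical.
toy_dictionary = {"hola": "hello", "adiós": "goodbye"}


def toy_translate(text):
    return " ".join(toy_dictionary.get(word, word) for word in text.split())


def toy_predict(example):
    return "greeting" if "hello" in example["text"] else "other"


predictions = translate_test([{"text": "hola"}, {"text": "adiós"}], toy_translate, toy_predict)
```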

4.7 Performance on English Tasks

We also benchmark the performance of XGLM ${}_{\text{7.5B}}$ on English tasks. Figure 2 shows the comparison between XGLM ${}_{\text{7.5B}}$ , GPT-3 ${}_{\text{6.7B}}$ and GPT-3 ${}_{\text{6.7B}}$ repl. on a subset of English tasks used by Brown et al. (2020). Our replication of GPT-3 ${}_{\text{6.7B}}$ , GPT-3 ${}_{\text{6.7B}}$ repl., performs better than or close to GPT-3 ${}_{\text{6.7B}}$ on all tasks. While XGLM ${}_{\text{7.5B}}$ performs competitively on all tasks, there remains a considerable performance gap compared to GPT-3 ${}_{\text{6.7B}}$ and GPT-3 ${}_{\text{6.7B}}$ repl. On most tasks XGLM ${}_{\text{7.5B}}$ and GPT-3 ${}_{\text{6.7B}}$ repl. show similar performance trends as $k$ changes. For example, both models show a performance dip at 1-shot on HellaSwag and PIQA, and at 128-shot on COPA.

There are multiple reasons why XGLM ${}_{\text{7.5B}}$ underperforms English-centric models on the English tasks. First, only 32.6% of XGLM ${}_{\text{7.5B}}$ ’s 500B-token training data is English, while both English-centric models are trained on close to 300B English tokens. Second, the model capacity of XGLM ${}_{\text{7.5B}}$ is shared by 30 languages, and the “curse of multilinguality” can degrade the performance across all languages Conneau et al. (2020). Further scaling up the model capacity and training data can potentially close this gap. The differences between the training corpora of the three models may have also contributed to the performance difference. While both English-centric models incorporate high-quality English monolingual corpora such as BookCorpus Zhu et al. (2019) in their training data (GPT-3 ${}_{\text{6.7B}}$ also upsamples such high-quality data), XGLM ${}_{\text{7.5B}}$ is trained solely on data extracted from Common Crawl. However, we do not expect this to be the main factor. Scao et al. (2022) conducted a similar experiment showing that a multilingual model (1.3B parameters) pre-trained over 13 languages also significantly underperforms an English model trained from the same data source in terms of zero-shot generalization.

4.8 Performance on Machine Translation

We report machine translation results on popular WMT pairs in Table 11, and a subset of FLORES-101 (Goyal et al., 2021b) in Table 12. We use greedy decoding for both GPT-3 and our own model, and use the same 32 examples for few-shot learning in each case.

GPT-3 yields strong results on a few languages that are best represented in its training data, narrowly surpassing our model on WMT French-English, German-English, and Chinese-English, as well as on a few pairs in the FLORES-101 set. GPT-3 is particularly strong when English is the target language, presumably due to its strong English language modeling capability. However, it does poorly on the broader set of less-resourced languages. For instance, GPT-3 fails completely when translating into Korean, Arabic, Swahili, Hindi, Burmese and Tamil in FLORES-101, with a spBLEU score of 1.2 in the best case.

In contrast, our model obtains solid results across the board. In addition to surpassing GPT-3 in 171 out of 182 language pairs in the FLORES-101 set, our model is also competitive with the official supervised baseline for this dataset, even surpassing it in 45 language pairs. This suggests that large-scale multilingual language models have a great potential for building machine translation systems for low-resource languages, even if little or no parallel data is available.

4.9 Scaling up Model Size

Finally, we study the impact of scaling up the model parameter size on its 0- and few-shot learning capabilities. Figure 3 shows the performance ( $k=0,4,32,128$ ) of the four XGLM models (564M, 1.7B, 2.9B, 7.5B) on the five multilingual tasks. The $y$ -axis represents the average accuracy across languages for each task. On commonsense reasoning tasks (XStoryCloze, XCOPA, XWinograd), the performance of all models increases as $k$ increases from 0 to 32. The performance gain from demonstration examples also gets larger as the model size increases, indicating bigger models can better leverage the in-context examples. On XNLI, the performance of all models increases as $k$ increases from 0 to 4, but decreases for $k$ at 32 and above. With the same number of demonstration examples, larger models do not always benefit more. PAWS-X is a task where in-context learning struggles – the performance of all models oscillates near random (50%) as $k$ changes. A possible reason is the adversarial nature of PAWS-X, where the paraphrase and non-paraphrase pairs by design have high lexical overlap. We expect scaling to be an effective recipe for building stronger multilingual language models, given the current trend.

Related Work

Brown et al. (2020) first demonstrated in-context few-shot learning using the GPT-3 model. This method removes the need for task-specific updates to the model parameters: the few-shot examples that one would normally use for fine-tuning are provided at inference time to the same model for each task. On several high-resource Latin language pairs, GPT-3 achieves machine translation performance that is close to or better than state-of-the-art supervised models, given only a handful of demonstration examples. (A study shows that language contamination in pre-training data can effectively boost the cross-lingual capability of English-centric language models Blevins and Zettlemoyer (2022).) With a heavier tail of deliberately introduced multilingual data, PALM-540B Chowdhery et al. (2022) later achieves even stronger few-shot machine translation performance. Such a change in the learning paradigm raises new questions about multilinguality, which has not been studied as extensively. Winata et al. (2021) evaluate the in-context few-shot learning abilities of several GPT-2, GPT ${}_{\text{NEO}}$ and T5 models on three additional languages (de, es, fr) using multiple NLU tasks, considering monolingual prompts as well as cross-lingual prompts, and demonstrate the multilingual in-context learning skills of the English GPT and T5 models. Zhao and Schütze (2021) evaluate different fine-tuning and prompt-tuning Liu et al. (2021) approaches on XLM-R and demonstrate the effectiveness of prompting in few-shot cross-lingual transfer and in-language training of a multilingual masked language model.

Early multilingual pre-training work trained word embeddings over multilingual corpora Mikolov et al. (2013). The multilingual versions of contextualized embedding models such as BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), BART Lewis et al. (2019) and T5 Raffel et al. (2020) were also developed: mBERT Devlin et al. (2019), XLM-R Conneau et al. (2020), mBART Liu et al. (2020), and mT5 Xue et al. (2020). Such models were trained on a single, multilingual text corpus such as mC4 Xue et al. (2020) or CC25 Liu et al. (2020).

Several approaches have been developed to facilitate cross-lingual transfer, including sub-word tokenizers which enabled efficient, shared vocabulary learning across languages Kudo and Richardson (2018), and joint training for efficient knowledge transfer across languages Pires et al. (2019); Jiang et al. (2020); Kassner et al. (2021). A notable concurrent work is BLOOM (https://bigscience.huggingface.co/blog/bloom), which scales multilingual pre-training to 46 languages and 175 billion parameters.

Conclusion

We introduce four multilingual generative language models (XGLMs) at different scales, and study their in-context few- and zero-shot learning capabilities. We show that the few-shot learning capability of XGLM steadily improves as it scales. Our largest model (7.5B parameters) sets a new state of the art for few-shot learning in more than 20 languages (including mid- and low-resource languages) on commonsense reasoning, NLI and machine translation tasks. An in-depth analysis shows the models are highly cross-lingual, which leads to strong few-shot learning performance in non-English languages.

Limitations

Although the multilingual language model is an important step towards building inclusive general-purpose foundation models, our current models have the following limitations.

Our models are trained on a static multilingual corpus extracted from CommonCrawl, with English text comprising 32.6% of the total number of tokens, corresponding to 163B tokens. The English data portion of the corpus corresponds to only roughly 54% of GPT-3’s training data. We applied several data filtering strategies as proxies for data quality assurance (see a comprehensive list in the Data Card in Appendix F), such as removing duplicated documents and paragraphs by URLs, filtering out paragraphs with a high ratio of digits and punctuation, removing paragraphs with profanity, and filtering by maximum number of URLs and minimum length. Such filtering may potentially result in bias of the remaining data used in pretraining, which would need further analysis to understand. Furthermore, the raw data were taken from static CommonCrawl snapshots, which may not include entities and events beyond the time span of the snapshots (until March 2020), such as COVID-19. We also note the potential difference in genres between CommonCrawl and the data used in GPT-3, which comprises, in addition to CommonCrawl, corpora such as BookCorpus and Wikipedia.
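The token counts quoted in this paragraph can be checked with a short calculation, taking the 500B-token corpus size and treating the roughly 300B English tokens reported for the English-centric models as an approximate figure:

$$0.326 \times 500\text{B} = 163\text{B}, \qquad \frac{163\text{B}}{300\text{B}} \approx 0.54.$$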

Moreover, GPT-3 is trained on 118 languages despite the fact that 93% of the data is English (https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv). In contrast, our models are trained on 30 languages after rigorous language identification and filtering.

As is shown in Section 4.7 and Figure 2, our model underperforms English-centric models on eight tasks ranging from commonsense reasoning to QA. There are several factors which could be contributing to this gap, such as

Difference in training data quality (XGLM is trained on filtered CommonCrawl data only, while the English-centric models are trained on data including both CommonCrawl and high-quality corpora such as BookCorpus and Wikipedia) and quantity (as described in the previous paragraph, the multilingual model was trained on 54% of the English data used in English-centric models);

Curse of multilinguality. Previous work in multilingual training has shown that increasing the number of languages in a model with shared parameters hurts performance on all training languages, e.g. English Conneau et al. (2020).

Additional experiments controlling for these factors would shed more light on the observed gap.

In this work, we only experimented with causal language models with a decoder-only architecture, which had previously demonstrated promising few-shot learning capabilities Brown et al. (2020). However, such architecture and pretraining objective do not leverage bidirectional context such as those used by masked language models (MLM), or sequence-to-sequence architectures with denoising autoencoder pretraining objectives.

We compare our language models to the baselines primarily in the in-context learning paradigm, using the same prompts for all language models in the comparison unless explicitly specified. Despite minimal effort engineering the prompts for any model, it is possible that the prompts work better with some models than the others, which introduces bias to the evaluation. However, we expect this factor to have small impact and the relative strengths of the models can be reliably measured given the volume of tasks they were evaluated on.

We evaluate and analyze the models’ performance on hate speech detection and gender bias for professional occupations. These studies are limited by the available evaluation datasets. We are limited in our study as we only investigate this problem space for six languages (English, French, Spanish, Italian, Portuguese, and Polish), where a majority of them (4) belong to the Romance language family. It would be pertinent to investigate the impact of multilingual models on social value tasks across a wider and more diversified set of languages before drawing solid conclusions. Moreover, we contend that studies on other tasks such as stereotype Nangia et al. (2020); Nadeem et al. (2021) and ethics Hendrycks et al. (2020) would provide a more comprehensive view of model behavior for social value tasks.

Ethical Considerations

Devising multilingual pre-trained language models can serve as a powerful tool in the NLP arsenal for multiple reasons.

From an engineering perspective, XGLM belongs to a family of single unified models that cater to many languages and are widely applicable. Such a unified single model saves on carbon footprint as well as energy consumption (compared to the alternative: separate models for different languages), leading to more energy efficiency. A single model, despite having the risk of being a single point of failure, has the powerful incentive of being easier to maintain, access, distribute, and track.

Models such as XGLM represent a paradigm shift from the Anglo-centric view of the world of NLP to being able to cater to all languages on an equal footing. Paying attention to the design of such models is critical to ensure equitability and inclusion, exemplified here by attempting to balance language representation. The further power of XGLM specifically is its ability to perform comparably to Anglo-centric models in zero to few shot settings. Possessing powerful multilingual models that can perform well in such settings especially for medium to extremely low resource languages helps alleviate the burden of creating supervised data for such languages especially for economically challenged languages (medium to low digital presence typically goes hand in hand with economic disparities). Moreover, having such models catering to scarcer languages spurs scientific research in such languages leading to more diversified NLP, and more diversified science in the broader sense.

We further investigate the impact of our models on social value problems such as hate speech detection and bias (Appendix §E). Despite inconclusive results overall (bordering on negative), we note that in the relatively scarcer data setting (Polish) the multilingual models outperform the Anglo-centric models, suggesting that XGLM can be performant for less-resourced languages. This is especially significant for social value tasks, where obtaining training data is problematic due to the inherent expense of obtaining high-quality annotated data.

In the spirit of transparency and accountability for large-scale language modeling, we include a detailed model card and data card with the model and paper release.

References

Appendix A Pretraining Details

All models are trained with the Fairseq library Ott et al. (2019). We use the Adam optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.98$ , $\epsilon=1e-8$ . We adjust the learning rate based on model size, e.g. $1.5e-3$ for the 564M and 1.7B models, $7.5e-4$ for the 2.9B model, and $1.2e-4$ for the 7.5B model. Learning rates were adjusted with 2000 warm-up updates followed by a polynomial decay schedule. All models are trained with data parallelism and an effective batch size of 4M tokens. The XGLM 7.5B model was trained on 256 A100 GPUs for about 3 weeks, at a speed of 311.6k words per second. (On 256 A100 GPUs, the inference speed can reach 1.47 million words per second. Inference can also be done with significantly fewer resources: for example, using 8 V100 GPUs, it took 6 hours to evaluate XGLM 7.5B on XStoryCloze.)
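The warm-up and decay schedule described above can be sketched as follows. This is a minimal illustration, not the Fairseq implementation: the linear warm-up shape, the decay power of 1, the end learning rate of 0, and the `total_updates` value are all assumptions that the text does not state, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(step, peak_lr, warmup_updates=2000, total_updates=100_000,
                  end_lr=0.0, power=1.0):
    """Learning rate at a given update (hypothetical helper, not Fairseq's API)."""
    if step < warmup_updates:
        # Assumed linear warm-up from 0 to the peak learning rate.
        return peak_lr * step / warmup_updates
    if step >= total_updates:
        return end_lr
    # Polynomial decay from peak_lr towards end_lr over the remaining updates.
    remaining = 1.0 - (step - warmup_updates) / (total_updates - warmup_updates)
    return (peak_lr - end_lr) * remaining ** power + end_lr


# Peak learning rates quoted in the text, keyed by model size.
PEAK_LR = {"564M": 1.5e-3, "1.7B": 1.5e-3, "2.9B": 7.5e-4, "7.5B": 1.2e-4}

if __name__ == "__main__":
    for step in (0, 1000, 2000, 51_000, 100_000):
        print(step, learning_rate(step, PEAK_LR["7.5B"]))
```

With these assumed settings, the rate peaks exactly at update 2000 and falls to half the peak midway through the decay phase.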

We replicate the GPT-3 ${}_{\text{6.7B}}$ architecture and optimization hyperparameters to the best of our knowledge for training this model. The most significant difference between this model and GPT-3 ${}_{\text{6.7B}}$ is in the training data. The training data used by GPT-3 ${}_{\text{6.7B}}$ repl. is a combination of six English-language datasets, totaling 453GB and 112B tokens (which we up-sampled to 300B tokens):

BookCorpus (Zhu et al., 2019), a dataset consisting of more than 10K unpublished books (4GB);

English Wikipedia, excluding lists, tables and headers (12GB);

CC-News (Nagel, 2016), a dataset containing 63 million English news articles crawled between September 2016 and February 2019 (76GB);

OpenWebText (Gokaslan and Cohen, 2019), an open source recreation of the WebText dataset used to train GPT-2 (38GB);

CC-Stories (Trinh and Le, 2018), a dataset containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas (31GB);

English CC100 (Wenzek et al., 2020), a dataset extracted from CommonCrawl snapshots between January 2018 and December 2018, filtered to match the style of Wikipedia (292GB).

The data are encoded using the same Byte-Pair Encoding (BPE) as GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019) with a vocabulary of 50K subword units.

A.1 Validation Perplexity

We use in-domain validation perplexity to validate the convergence status of the models. Figure A1 shows the average perplexity of the four models evaluated using a validation dataset sampled from CC100-XL. The validation data contains 30k sentences for each language that do not overlap with the pre-training data. We group the results by resource level.

Appendix B Multilingual In-context Learning Formulation

We extend the in-context learning framework proposed by Brown et al. (2020) to the multilingual setting. Let $\mathcal{M}$ be a causal language model and $\mathcal{D}$ be a task. $\mathcal{D}=(\mathcal{P},\mathcal{E})$ consists of a task description $\mathcal{P}$ and a few demonstration examples in one or more languages

We consider the setting where the task description comes in the form of a prompt $\mathcal{P}=(\mathcal{T},v)$ . $\mathcal{T}$ is a cloze-style template that converts an example input $x$ into a string $\mathcal{T}(x)$ that contains a [Mask] symbol. (We relaxed the prompt format of GPT-3 by allowing the [Mask] symbol to appear anywhere in $\mathcal{T}(x)$ instead of only at the end. This additional flexibility leads to better performance on some tasks and is inspired by the masked language modeling prompts constructed in recent work Schick and Schütze (2021); Zhang et al. (2021).) For classification and multiple-choice problems, $v:\mathcal{Y}\rightarrow\mathcal{V}^{*}$ is a verbalizer that maps each candidate label or choice $y\in\mathcal{Y}$ into a string $v(y)$ . Both $\mathcal{T}(x)$ and $v(y)$ can be tokenized into a sequence of one or more tokens in the language model vocabulary $\mathcal{V}$ . An instantiated prompt $\mathcal{P}(x,y)$ is obtained by substituting the [Mask] symbol in $\mathcal{T}(x)$ with $v(y)$ . Table 2 shows the prompts used by all tasks in our main experiments.

This general formulation can cover most NLP tasks. For classification problems, $v$ is a mapping from classes to strings; for multiple-choice problems, $v$ is an identity function that maps each candidate choice to itself. For text generation problems, $v$ is identity and we decode free-form text from [Mask], which in this case is positioned at the end of $\mathcal{T}(x)$ .

Suppose we have $k$ demonstration examples available in a source language:

In this case, we concatenate the instantiated prompts of the demonstration examples $\{\mathcal{P}(x^{s}_{i},y_{i})\}^{k}_{i=1}$ and make it the prefix of the input string used in the zero-shot learning setting to form the objective:

where [Sep] is a separator symbol chosen empirically.

When $s=t$ , we have the in-language few-shot learning setup.

When $s\neq t$ , we have the cross-lingual few-shot learning setup.
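The construction of instantiated prompts and their concatenation into a few-shot input can be sketched as below. The template, the field names `premise` and `hypothesis`, the verbalizer strings, and the newline default for the separator are hypothetical choices for illustration; the actual prompts are those listed in Table 2, and [Sep] was chosen empirically.

```python
MASK = "[Mask]"

# Hypothetical XNLI-style template and verbalizer, for illustration only.
TEMPLATE = "{premise}, right? [Mask], {hypothesis}"
VERBALIZER = {"entailment": "Yes", "neutral": "Also", "contradiction": "No"}


def instantiate(template, verbalizer, x, y):
    """Build P(x, y): fill T with the input x, then replace [Mask] by v(y)."""
    return template.format(**x).replace(MASK, verbalizer[y])


def few_shot_input(template, verbalizer, demos, x_test, y_candidate, sep="\n"):
    """Prefix the instantiated demonstrations to the instantiated test prompt."""
    parts = [instantiate(template, verbalizer, x, y) for x, y in demos]
    parts.append(instantiate(template, verbalizer, x_test, y_candidate))
    return sep.join(parts)


if __name__ == "__main__":
    # English demonstration, French test example: the cross-lingual setup.
    demos = [({"premise": "It is raining", "hypothesis": "the ground is wet"},
              "entailment")]
    x_test = {"premise": "Il pleut", "hypothesis": "le sol est sec"}
    for label in VERBALIZER:
        print(repr(few_shot_input(TEMPLATE, VERBALIZER, demos, x_test, label)))
```

One such string is built per candidate label, and each string is then scored by the language model; passing demonstrations in the target language gives the in-language setup instead.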

Appendix C Evaluation Details

Table A1 shows all the English tasks used in our evaluation.

C.2 Scoring Functions

We considered the following functions for scoring an instantiated prompt using a language model:

average of per-token log probabilities, ignoring the common prefix of different candidates.

We also considered the calibration approach proposed by Zhao et al. (2021) and character normalization proposed by Lieber et al. (2021).

In the end, we use the average of per-token log-probabilities ignoring the common prefix of different candidates as the scoring function for all multilingual tasks. This is selected based on the development set performance of StoryCloze and XNLI.

For English tasks, we use the same modeling choices as Brown et al. (2020). Specifically, we use the task prompts as detailed in Appendix G of Brown et al. (2020), and a single newline as the separator for few-shot learning. For WinoGrande, we take the log-likelihood of the common suffix of the different candidates as the scoring function. For ARC-easy, ARC-challenge and OpenBookQA, we normalize by the unconditional probability of each candidate by taking $\frac{p(\mathtt{completion}|\mathtt{context})}{p(\mathtt{completion}|\mathtt{answer\_context})}$ , where we use the string “Answer: ” as answer_context. For all the other tasks, we take the average of per-token log-probabilities, ignoring the common prefix of the different candidates.
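As an illustration of the two scoring rules discussed above, the following sketch computes the prefix-ignoring average log-probability and the log-domain form of the unconditional normalization. It assumes per-token log-probabilities have already been obtained from a language model; all function names are hypothetical, and candidates are assumed to differ in at least one token after the shared prefix.

```python
def common_prefix_len(candidates):
    """Number of leading tokens shared by every candidate token sequence."""
    n = 0
    for column in zip(*candidates):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def mean_logprob_after_prefix(token_logprobs, prefix_len):
    """Average per-token log-probability, skipping the shared prefix."""
    tail = token_logprobs[prefix_len:]
    if not tail:
        raise ValueError("candidate has no tokens beyond the common prefix")
    return sum(tail) / len(tail)


def predict(candidate_tokens, candidate_logprobs):
    """Index of the candidate with the highest prefix-ignoring average score."""
    k = common_prefix_len(candidate_tokens)
    scores = [mean_logprob_after_prefix(lp, k) for lp in candidate_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


def unconditional_normalized_score(logp_given_context, logp_given_answer_context):
    """log of p(completion | context) / p(completion | answer_context)."""
    return logp_given_context - logp_given_answer_context
```

Averaging over only the differing tokens keeps a long shared context from diluting the difference between candidates.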

C.3 Evaluation Protocol

All few-shot learning results are obtained with the in-language setting (both the training and test examples are in the same language) unless otherwise specified. We report results on the test set for all multilingual tasks (including the held-out tasks). For English tasks, we report results on the test set for ARC-easy, ARC-challenge, OpenBookQA and StoryCloze, and on the development set for the rest, following Brown et al. (2020). For few-shot learning, we report the average results across 5 runs, randomly sampling a different set of few-shot examples each time. For tasks with a training set, we sample the few-shot examples from the training set; for tasks with no training set, we sample from the dev set and report evaluation results on the test set; when evaluating on dev-set examples of XNLI and XCOPA, we sample few-shot examples from the test set, since these two tasks do not have training sets for all languages. While Brown et al. (2020) tuned the few-shot value $k$ as a hyperparameter on the dev set, we pre-selected a few $k$ values (0, 1, 4, 32, 128) and report the corresponding results.

Following Brown et al. (2020), we truncate the inputs so that they fit the maximum context length of XGLM ( $n_{\text{ctx}}=2048$ ) and preserve only the complete demonstration examples after truncation. For each task, we report results up to the $k$ ’s corresponding to the maximum fit. (XWinograd has only a test split, and we sampled few-shot examples directly from it, following the practice used by Brown et al. (2020) for evaluating GPT-3 on Winograd. As a result, we only report 0-, 1- and 4-shot results for XWinograd to minimize inflating the few-shot performance by training and testing on the same examples.) Table A2 shows the average number of demonstration examples that fit the maximum context length of XGLM ( $n_{\text{ctx}}=2048$ ) for each task in our experiments.
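The "maximum fit" can be sketched as a simple token-budget computation. The sketch below assumes that the test example is always kept, that demonstrations are retained in order from the first one, and that each demonstration is followed by a separator of `sep_length` tokens; the text does not specify these details, and `fit_demonstrations` is a hypothetical helper.

```python
def fit_demonstrations(demo_lengths, test_length, sep_length=1, n_ctx=2048):
    """Count the leading demonstrations that fit completely within n_ctx tokens.

    demo_lengths: token length of each instantiated demonstration, in order.
    test_length:  token length of the instantiated test prompt (always kept).
    """
    budget = n_ctx - test_length
    used = 0
    kept = 0
    for length in demo_lengths:
        cost = length + sep_length
        if used + cost > budget:
            break  # a partial demonstration is never kept
        used += cost
        kept += 1
    return kept


if __name__ == "__main__":
    # 32 demonstrations of 100 tokens each and a 100-token test example.
    print(fit_demonstrations([100] * 32, test_length=100))
```

Because the budget is counted in tokens, the same set of parallel examples can yield a different maximum fit in each language.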

We observe that the language model tends to fit more examples in context for a high-resource language than for a low-resource language. (XStoryCloze, XCOPA, XNLI and PAWS-X all contain parallel examples, which allows us to compare the maximum fit of the same set of examples across different languages.) English, as the highest-resourced language (Table A10), always fits the most examples. This reflects the unequal representation of different languages in our joint multilingual BPE vocabulary (§2.1). With this vocabulary induction scheme Sennrich et al. (2015), the underrepresented languages tend to have smaller sub-word units and higher fertility (defined as the number of subwords per linguistic word), making it more challenging to learn word- and higher-level semantics for such languages. Other factors can also impact the tokenization granularity. For example, sharing sub-strings with other high-resource languages can boost the granularity of a language, and some languages have smaller tokenization granularity as a result of their writing system (e.g. Chinese has an average sub-word length of 1.4, indicating the dominance of single-character tokens, despite being the third largest language in our pre-training data according to disk size).
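Fertility and average sub-word length, as defined above, can be computed from any word-level segmentation. The sketch below takes a list of words, each already split into subwords; the example segmentation is made up for illustration and does not come from the XGLM vocabulary.

```python
def fertility(tokenized_words):
    """Average number of subwords per linguistic word."""
    return sum(len(pieces) for pieces in tokenized_words) / len(tokenized_words)


def mean_subword_length(tokenized_words):
    """Average number of characters per subword."""
    pieces = [piece for word in tokenized_words for piece in word]
    return sum(len(piece) for piece in pieces) / len(pieces)


if __name__ == "__main__":
    # Hypothetical segmentation of the two words "multilingual model".
    segmented = [["multi", "lingual"], ["model"]]
    print(fertility(segmented), mean_subword_length(segmented))
```

Higher fertility means more tokens per word, so fewer demonstration examples fit within the same context length.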

Appendix D Additional Results

We compare the performance of English prompts with that of machine-translated (MT) and human-translated (HT) prompts on two of our held-out tasks, XNLI and XCOPA, using their development sets. For MT prompts, we translate the English prompts into the target languages using the Google Cloud Translation API. We use the exact prompts as shown in Table 2 as the input of the translation API and manually recover the placeholders in the API output based on bracket markers (e.g. “{Sentence 1} because [Mask]” is translated to “{Sentence 1}因为[Mask]”). When the candidate set is closed, we replace [Mask] with each verbalized label and translate them separately. For example, “{Sentence 1}, right? Yes, {Sentence 2}” is translated to “{Sentence 1}，对吗？是的，{Sentence 2}”. On XNLI, we also compared to prompts manually translated from English to eliminate the impact of translation noise on the comparison. (We asked native speakers to translate the English template into zh, es, fr, el, hi, vi, ar and bg. For the rest of the languages, one of the authors verified and corrected the machine-translated templates using bilingual dictionaries.)

As shown in Table A3 and Table A4, the in-context learning performance is sensitive to the prompting choices across all languages. For both XNLI and XCOPA, using the English prompts on average yields significantly better performance than using the machine-translated prompts. For XNLI, human-translated (HT) prompts significantly improve over machine-translated (MT) prompts for most languages. Surprisingly, the performance of human-translated prompts lags behind that of the English prompts in the 0-shot and 4-shot settings.

Further examination of the per-language performance reveals that the relative strengths of different prompting approaches vary across languages. For es and de, HT prompts offer large gains compared to the MT prompts and the English prompts. However, for zh and ur, using translated prompts (either HT or MT) significantly hurts the performance. For zh, fr, vi, ar and hi, using native-speaker translated prompts still yields significantly lower performance compared to using the English prompts in at least one setting, suggesting that translation error is not the sole cause of the performance drop.

D.2 Full Results on Learning from Cross-lingual Demonstrations

We evaluated XGLM ${}_{\text{7.5B}}$ on XNLI in the learning from cross-lingual demonstration setting, using both the same-language-prompting and English-prompting setups. In same-language-prompting, the prompt fields and the examples are always in the same language. In English-prompting, English prompts are used for all examples. All few-shot performances in this section are obtained using the $k$ -shot per label setting as described in §D.4.

As shown in Figure A2, for many language pairs, transferring from source-language demonstrations can significantly improve over the zero-shot performance in the target language when human-translated templates are used. The improvement is especially significant for languages such as Chinese (zh), Thai (th) and Urdu (ur), whose zero-shot performance is close to random with human-translated templates. However, we found that the effects of cross-lingual transfer from the template and cross-lingual transfer from the demonstration examples typically do not add up. As shown in Figure A3, using the English template significantly improves the zero-shot performance of most languages, including Chinese, Thai and Urdu. In this case, the demonstration examples in general do not help unless they are in the same language as the target example (diagonals).

Figure A5 shows the results on XStoryCloze, where we observed almost no improvement for any language pair. A possible reason for the poor transfer results on XStoryCloze is that the task requires reasoning about implicit relations between multiple sentences, which is much harder to do, especially in a cross-lingual setting.

D.3 Full Results in FLORES-101

Table A5 reports our full results in FLORES-101.

D.4 Majority Label Bias

In the main paper, we define $k$ -shot learning as learning from $k$ unique examples randomly drawn from the entire training population. This setting may lead to skewed few-shot training sets, especially when $k$ is small. As shown in Table A6, the XNLI task is a three-way classification problem where the model needs to judge whether the relationship between a pair of sentences is “entailment”, “neutral” or “contradiction”. While the original XNLI dev set has a uniform class distribution, the few-shot training sets randomly sampled from it often have a much more skewed class distribution. (We implement our random sampling procedure using the numpy.random.choice function: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html.)

For a $|\mathcal{Y}|$ -way classification task, a skewed training set distribution can cause the model to score the majority class as disproportionately more likely than the other classes. This was shown by Zhao et al. (2021) as the majority label bias problem. As a result, previous work such as Zhao and Schütze (2021) adopts a $k$ -shot per class setting, where $k$ unique examples are randomly drawn from each class to form a training set of size $k\times|\mathcal{Y}|$ .

We compare learning from a uniform class distribution (randomly sampling $k$ examples per class) to learning from a more skewed distribution (randomly sampling $k\times|\mathcal{Y}|$ examples from the total population) on the XNLI task. We use the 24-shot and maximum fit (truncated 48-shot) settings. As shown in Table A7, for both settings, learning from a uniform class distribution leads to significantly higher accuracy in all languages compared to learning from the skewed distributions. de, tr, bg and hi suffer the most from learning from the skewed distributions ( $>2$ absolute accuracy gap in the 24-shot setting), while es suffers the least. Moreover, the variances among few-shot trials using different random seeds shrink considerably when the training set class distribution is uniform. These results highlight the severity of the majority label bias issue in the multilingual in-context learning framework.
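The two sampling schemes compared above can be sketched as follows. This sketch uses Python's standard `random` module in place of the numpy.random.choice function mentioned earlier; the function names and the final shuffle of the per-class sample are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def sample_from_population(examples, k, num_classes, rng):
    """Draw k * |Y| unique examples from the whole pool (labels may be skewed)."""
    return rng.sample(examples, k * num_classes)


def sample_per_class(examples, k, rng):
    """Draw k unique examples from each class (uniform class distribution)."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    picked = []
    for label in sorted(by_label):
        picked.extend(rng.sample(by_label[label], k))
    rng.shuffle(picked)  # assumed: demonstration order is randomized
    return picked


if __name__ == "__main__":
    labels = ("entailment", "neutral", "contradiction")
    pool = [(i, label) for label in labels for i in range(100)]
    rng = random.Random(0)
    skewed = sample_from_population(pool, 8, len(labels), rng)
    uniform = sample_per_class(pool, 8, rng)
    print(Counter(label for _, label in skewed))
    print(Counter(label for _, label in uniform))
```

Both schemes return $k\times|\mathcal{Y}|$ examples; only the per-class variant guarantees equal label counts.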

D.5 Knowledge Probing

We evaluate to what extent our multilingual language model can effectively store factual knowledge in different languages. To this end, we evaluate knowledge triplet completion using the mLAMA dataset (Kassner et al., 2021), which was translated from the English benchmark LAMA Petroni et al. (2019) using Google Translate. The data is from TREx Elsahar et al. (2018) with triples of the format $\langle$ subject, relation, object $\rangle$ . Following the convention of LAMA, triples are converted to templates for querying the language model. For example, a triple like $\langle$ Paris, capital-of, France $\rangle$ is converted to the template “Paris is the capital of [MASK]”. While each query in the original mLAMA dataset contains hundreds of candidates on average, we restrict it to three candidates, one of which is the ground truth candidate while the other two are randomly sampled, to ensure fast inference and save API cost. Following the evaluation protocol of mLAMA, we report precision@1 averaged over all relations per language.
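The candidate restriction and the per-language metric can be sketched as below. The helper names are hypothetical, the relation names in the example are made up, and the sketch reads "averaged over all relations" as a macro-average of per-relation precision@1, which is an assumption about the protocol.

```python
import random


def restrict_candidates(gold, all_candidates, rng, num_distractors=2):
    """Keep the ground-truth answer plus randomly sampled distractors."""
    pool = [c for c in all_candidates if c != gold]
    options = rng.sample(pool, num_distractors) + [gold]
    rng.shuffle(options)
    return options


def precision_at_1(results_by_relation):
    """Macro-average of per-relation precision@1 for one language.

    results_by_relation maps a relation name to a list of booleans, one per
    query, that are True when the top-scored candidate is the ground truth.
    """
    per_relation = [sum(hits) / len(hits) for hits in results_by_relation.values()]
    return sum(per_relation) / len(per_relation)


if __name__ == "__main__":
    rng = random.Random(0)
    print(restrict_candidates("France", ["France", "Spain", "Peru", "Japan"], rng))
    print(precision_at_1({"capital-of": [True, True, False, True],
                          "born-in": [True, False]}))
```

With three candidates per query, a model that guesses uniformly at random would score about one third.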

We evaluate on the 25 languages covered in XGLM’s pre-training data. We compare to the GPT-3 ${}_{\text{6.7B}}$ model. As shown in Figure A6, both our multilingual model and GPT-3 Curie perform well on English. For non-English languages, our multilingual model maintains performance (above 0.6) while GPT-3 Curie drops drastically, especially for medium- and low-resource languages. Overall, compared to an English-centric language model, our multilingual language model is better at retaining factual knowledge on a wider range of languages, with $+7.1$ points on average.

Appendix E Safety and Bias Analysis

Given the centrality of large-scale language models, it is important to ensure such powerful models are used responsibly. Accordingly, we further examine XGLM’s behavior on two tasks:

Hate speech detection: A safety task to test language models’ ability to identify hateful and offensive text;

Occupation Identification: A bias task to study language models’ performance disparity between different gender groups on the task of occupation identification.

Through extensive experiments, we have the following findings. First, hate speech detection in an in-context learning setting is quite challenging; moreover, language models do not effectively leverage few-shot examples to improve performance. Second, although language models have relatively good performance on the occupation identification task, they run the risk of exhibiting strong gender bias for certain occupations.

We adopt datasets introduced by Huang et al. (2020) that include hate speech data from Twitter in five languages: English, Italian, Portuguese, Polish and Spanish. All hyperlinks, usernames and hashtags are replaced with generic symbols (URL, USER, HASHTAG) to anonymize user information. We remove tweets containing more than $2$ generic symbols to encourage more informative examples. We further filter out tweets of length less than $5$ tokens or more than $30$ tokens. In the spirit of creating balanced data, we randomly sample $500$ positive (hate speech) and $500$ negative (not hate speech) examples for each language. For further comparison, we translate non-English data into English using Google Translate and then evaluate the English models’ performance on the task.
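The filtering and balancing steps above can be sketched as follows. Whitespace tokenization, counting the generic symbols towards the tweet length, and the helper names are assumptions; the original preprocessing may differ in these details.

```python
import random

GENERIC_SYMBOLS = {"URL", "USER", "HASHTAG"}


def keep_tweet(text, max_generic=2, min_tokens=5, max_tokens=30):
    """Apply the generic-symbol and length filters to one anonymized tweet."""
    tokens = text.split()  # assumed whitespace tokenization
    num_generic = sum(token in GENERIC_SYMBOLS for token in tokens)
    return num_generic <= max_generic and min_tokens <= len(tokens) <= max_tokens


def balanced_sample(tweets, per_class, rng):
    """tweets: list of (text, is_hate_speech). Returns per_class of each label."""
    kept = [tweet for tweet in tweets if keep_tweet(tweet[0])]
    positive = [tweet for tweet in kept if tweet[1]]
    negative = [tweet for tweet in kept if not tweet[1]]
    return rng.sample(positive, per_class) + rng.sample(negative, per_class)
```

Calling `balanced_sample(tweets, 500, random.Random(0))` would give the 1,000-example balanced set per language described above, provided enough tweets of each class survive the filters.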

We evaluate two approaches to prompting, similar to the cross-lingual capability experiments in the main paper. For English prompts, we prefix “The sentence is <candidate>” to the input sentence to form a prompt. We consider $10$ verbalization candidates including $5$ negative (normal., common., ok., usual., acceptable.) corresponding to classification of not hate speech and $5$ positive (sexist., racist., offensive., abusive., hateful.) representing classification of hate speech. For code-switched prompts, we translate the English prefix and candidates into the corresponding target language using Google Translate. For example, “The sentence is normal” is translated into “Questa frase è normale.” for Italian. For few-shot learning, we randomly draw examples from the training data and report the average performance across $5$ runs.

We compute precision, recall and accuracy for all experimental conditions. Since the test data is balanced, the accuracy of a random baseline is 50%.
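For reference, the three metrics can be computed from binary labels as in the sketch below, where 1 denotes hate speech. Returning 0.0 when a denominator is zero is a convention assumed here, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(gold, pred):
    """Precision, recall and accuracy for binary labels (1 = hate speech)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


if __name__ == "__main__":
    gold = [1, 1, 0, 0]
    # A model that labels everything as "not hate speech" on balanced data.
    print(binary_metrics(gold, [0, 0, 0, 0]))
```

On balanced data, a model that predicts the negative class for every input reaches the 50% accuracy baseline with zero recall, which is why recall is worth reporting alongside accuracy.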

E.1.2 Results

We show accuracy and recall scores in Table A8. Bolded results are the highest in the table and those with an (*) are statistically significantly better than other comparable conditions. Hate speech detection is a challenging task for all models. We observe that across the five languages, in-context learning results are only slightly better than random ( $50\%$ ). The results are also unstable and sensitive to prompt changes. Overall, the XGLM ${}_{\text{7.5B}}$ model has better recall compared to the English-centric models. For example, the XGLM ${}_{\text{6.7B}}$ En-only model has a very low recall score in the zero-shot setting with the language condition set as “same language”, indicating that it blindly predicts almost everything as negative (not hate speech). Another interesting observation is that most few-shot results are worse than zero-shot, which indicates that, with the prefix described above, language models are not able to utilize the examples. We also find that in one-shot experiments models tend to copy the label of the given example instead of predicting based on the input tweet. This further demonstrates that language models struggle to learn from few-shot examples in this task.

E.2 Gender Bias in Occupation Identification

Datasets We use the English bio dataset introduced in De-Arteaga et al. (2019) to study gender bias based on identifying a person’s occupation from their bios. For multilingual bio datasets we use those created by Zhao et al. (2020). Originally there are $28$ occupations in English, $69$ occupations in Spanish and $27$ occupations in French. To ensure we have plenty of test data for each occupation, we only keep occupations with at least $1000$ male examples and $1000$ female examples. This leads to $16$ occupations in English, $6$ occupations in Spanish and $4$ occupations in French. We follow the setup in Zhao et al. (2020) where people’s names and pronouns are removed from the bios. We then prefix “The occupation of this person is <candidate>” to the input bio to form a prompt. The candidate set consists of five occupations, including the ground truth one and four other randomly sampled male and female occupations (two male and two female). Male (female) occupations refer to ones having predominantly more male (female) samples.

Metrics Similar to the metric for hate speech detection, we first obtain the scores for the 5 candidates and consider a prediction correct if the ground truth candidate yields the highest score among the five candidates. We then compute the bias score as the absolute gap between the accuracy scores on the male and female samples, averaged across all occupations. (We only consider gaps that are statistically significant.) A lower bias score indicates that a model has less divergence in identifying occupations for men and women.
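The candidate construction and the bias score can be sketched as below. The helper names and the example accuracies are hypothetical, and the sketch omits the statistical significance test applied to the gaps, so it averages every per-occupation gap.

```python
import random


def build_candidates(gold, male_occupations, female_occupations, rng):
    """Ground truth plus two male and two female occupations, in random order."""
    males = rng.sample([o for o in male_occupations if o != gold], 2)
    females = rng.sample([o for o in female_occupations if o != gold], 2)
    options = [gold] + males + females
    rng.shuffle(options)
    return options


def bias_score(accuracy_by_occupation):
    """Mean absolute gap between male and female accuracy across occupations.

    accuracy_by_occupation maps an occupation to (male_accuracy, female_accuracy).
    """
    gaps = [abs(male - female) for male, female in accuracy_by_occupation.values()]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Made-up accuracies, for illustration only.
    print(bias_score({"nurse": (0.80, 0.90), "surgeon": (0.85, 0.75)}))
```

Because the gaps are taken in absolute value, disparities that favor different genders in different occupations do not cancel out.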

E.2.2 Results

We present the overall accuracy scores and the bias scores (|Diff|) in Table A9. Results indicate that the XGLM ${}_{\text{6.7B}}$ En-only model achieves the best performance on English and Spanish, while the GPT-3 ${}_{\text{6.7B}}$ model achieves the best performance on French. The XGLM ${}_{\text{7.5B}}$ model, in contrast, falls behind on all three languages, especially on Spanish and French. We think this is potentially because all pronouns and people’s names are removed from the test data but not from the training data. The training data for XGLM ${}_{\text{7.5B}}$ contains more Spanish and French compared to the other two models. Thus, XGLM ${}_{\text{7.5B}}$ may have a more severe morphological mismatch on Spanish and French. Regarding the bias score, the GPT-3 ${}_{\text{6.7B}}$ model is the most biased model on both English and Spanish but the least biased on French. XGLM ${}_{\text{6.7B}}$ En-only and XGLM ${}_{\text{7.5B}}$ exhibit the least bias on Spanish and English, respectively.

Appendix F Data Card

We follow the recommendations of Gebru et al. (2018) and provide a data card for the dataset used to train XGLM, which is a subset of CC100-XL, a larger multilingual dataset we curated.

Following the recent success of multilingual self-supervised pre-training Devlin et al. (2019); Lample and Conneau (2019); Conneau et al. (2020); Xue et al. (2020); Goyal et al. (2021a); Liu et al. (2020), we train our language models on a mixture of monolingual text of different languages. We extend the pipeline used for mining the CC100 corpus (Conneau et al., 2020; Wenzek et al., 2020) to generate CC100-XL, a significantly larger multilingual dataset covering 68 Common Crawl (CC) snapshots (from Summer 2013 to March/April 2020) and 134 languages. As the first step to balance the language distribution, we sampled 30% of the data from the languages that contain more than 15 billion tokens and more than 20 million documents. This resulted in an 8.4 TB multilingual corpus with 1.9 trillion tokens.

F.2 Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description. The CC100-XL dataset was collected to create a high quality monolingual dataset for at least 100 languages. It was mainly used for training foundation multilingual language models which may be applied to a broad list of language tasks, including neural machine translation, speech translation, question answering, etc. CC100-XL involves sentence level filtering, preserves context, improves the filtering mechanism, and paves a way for mining 200+ languages.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? Meta AI.

Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number. Meta AI.

F.3 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description. The instances are textual documents sampled from Common Crawl snapshots.

How many instances are there in total (of each type, if appropriate)? The training dataset of XGLM contains 1.74 billion documents in total.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable). The dataset is a subset of CC100-XL. For each language, the data is either a full set or a random subset of CC100-XL data. In particular, the medium- and low-resource languages are upsampled. In terms of language representation, the CC100-XL dataset contains 134 languages extracted using fastText (https://fasttext.cc/docs/en/language-identification.html) from Common Crawl snapshots. We further selected a subset of 30 languages to train XGLM, taking geo-location, language family and typology diversity of the languages into account.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description. Each instance consists of raw text data.

Is there a label or target associated with each instance? If so, please provide a description. No.

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit. A small percentage of document instances (<2%) are duplicated. Other than that, there are no relationships between individual instances.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them. This dataset is split into training and validation only. For each high-resource language, at least 5,000 randomly selected documents and 30,000 lines were split into the validation set, and the remaining documents are used for training; for low-resource languages, at least 100 randomly selected documents and 1,000 lines (a couple of very low resource languages contain 80 documents) were split into the validation set, leaving the rest for training. There are 3.5 million lines of text in total in the validation set. This split is mainly to ensure a good size of validation data with coverage and balance over all languages, while keeping the validation set small enough not to affect the overall training speed.

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. 10% of the Russian sample was lost during internal data transfer. Therefore, we ended up taking a 26.7% random subset of the whole Russian data from CC100-XL.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? It’s self-contained.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description. CC100-XL is exclusively extracted from Common Crawl; and the information in it is not considered confidential.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. CC100-XL is a subset of public Common Crawl data, which could contain sentences that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety.

Does the dataset relate to people? If not, you may skip the remaining questions in this section. Some documents of this data relate to people, such as news articles, Wikipedia descriptions, etc.

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset. No.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how. Beyond individuals who are celebrities, politicians, etc., and have their own Wikipedia pages, it is possible to identify other individuals by their names, Twitter account names, etc. However, we built personally identifiable information (PII) identification tools following the guidelines of the General Data Protection Regulation (GDPR) and the National Institute of Standards and Technology (NIST) and ran them against this dataset; we did not find highly sensitive PII, such as U.S. social security numbers, login credentials, etc.

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description. We use a curated special word list covering 100 languages, which includes profanities, hate speech, bullying language, common slang and profane multi-word expressions (MWEs), to tag paragraphs and remove the documents containing them. Given the size of this data, it could still contain such sensitive information (as the above lists may not be exhaustive), but this should be a very small percentage of instances.

F.4 Collection Process

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/ derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how. Please refer to the main document for details.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated? Please refer to the main document for details.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Please refer to the main document for details.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? This data is mined, filtered and sampled by machines.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The data was collected from 68 Common Crawl (CC) snapshots (from Summer 2013 to March/April 2020). Therefore, it does not contain a lot of information about recent events such as COVID-19.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation. No.

Does the dataset relate to people? If not, you may skip the remainder of the questions in this section. No.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? N/A

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself. N/A

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented. N/A

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate). N/A

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation. Some responsible AI related evaluations were performed. Please refer to the main document.

F.5 Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section. Yes, the detailed steps are as below:

Downloading and Sharding Common Crawl Snapshots We downloaded 68 Common Crawl snapshots and divided the data into 240 shards based on web domain. At this stage, textual data is extracted from the WET files provided by Common Crawl, which involves cleaning excessive tabs and newlines.
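As a minimal sketch of domain-based sharding, the snippet below assigns a document to one of 240 shards from its URL. The text only states that sharding is based on web domain; the use of a SHA-1 hash of the host name and the function name shard_for_url are assumptions made for illustration, not the pipeline's actual code.

```python
import hashlib
from urllib.parse import urlparse

NUM_SHARDS = 240  # shard count stated in the text


def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard index via a stable hash of its host name.

    Assumption: hashing the host is one simple way to keep all pages of a
    web domain in the same shard; the original assignment rule is not given.
    """
    domain = urlparse(url).netloc.lower()
    digest = hashlib.sha1(domain.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the shard index depends only on the host name, two pages from the same site always land in the same shard, regardless of path.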

Language Identification (LID) at Document Level For this stage, we used the fastText language identification (LID) model on the entire document which helped further divide the data by language. In addition to the original languages supported by fastText, we also added support for 28 romanized languages. In total, the data for each language contains 240 shards.

Deduplicating Documents based on URL We aggregated the data based on URL, which yielded a 60% reduction in volume. When two documents had the same URL, we kept the document with the more recent text content.
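The URL-level deduplication can be sketched as follows. The record layout (the Doc tuple and its crawl_date field) is hypothetical; the text only says that the more recent content is kept when two documents share a URL.

```python
from typing import Dict, Iterable, List, NamedTuple


class Doc(NamedTuple):
    url: str
    crawl_date: str  # hypothetical field: ISO date string, e.g. "2020-03-01"
    text: str


def dedup_by_url(docs: Iterable[Doc]) -> List[Doc]:
    """Keep one document per URL, preferring the most recent crawl date."""
    latest: Dict[str, Doc] = {}
    for doc in docs:
        kept = latest.get(doc.url)
        # ISO date strings compare correctly as plain strings.
        if kept is None or doc.crawl_date > kept.crawl_date:
            latest[doc.url] = doc
    return list(latest.values())
```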

Document Splitting and LID at Paragraph Level We segmented the documents based on newline and also stored the information about the order in which the paragraphs were appearing in the original document (i.e. seq_num). Next, we performed LID at the paragraph level again in order to divide the original documents into clusters of paragraphs where each cluster represents sentences belonging to a particular language.
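A small sketch of this step is given below. The language detector is passed in as a callable (detect_language is a hypothetical stand-in for the fastText LID model), and skipping blank lines before numbering is an assumption rather than a documented detail.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def split_and_cluster(
    text: str, detect_language: Callable[[str], str]
) -> Dict[str, List[Tuple[int, str]]]:
    """Split a document on newlines and group paragraphs by detected language.

    Each paragraph keeps its position in the original document (seq_num),
    so the original order can be restored later.
    """
    clusters: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    seq_num = 0
    for line in text.split("\n"):
        paragraph = line.strip()
        if not paragraph:
            continue  # assumption: blank lines are dropped and not numbered
        clusters[detect_language(paragraph)].append((seq_num, paragraph))
        seq_num += 1
    return dict(clusters)
```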

Deduplicating Paragraphs Data extracted from Common Crawl snapshots still contains a lot of duplicate text, even across different documents. To tackle this, we applied the normalization function from CCNet Wenzek et al. (2020) and then computed a SHA-1 hash of the normalized text. This helped reduce the content by 88%. Choosing which <paragraph, url> combination to keep can be tricky, as it can lead to a lot of fragmented documents. We therefore devised a strategy to choose documents based on sorted <url, seq_num> order, which helps prevent fragmentation as much as possible.
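The hash-based paragraph deduplication can be illustrated with the sketch below. The normalize function here is a simplified stand-in and not the CCNet normalization function; keeping the first occurrence in sorted (url, seq_num) order is one reading of the selection strategy described above.

```python
import hashlib
import re
from typing import Iterable, List, Tuple

Paragraph = Tuple[str, int, str]  # (url, seq_num, text)


def normalize(text: str) -> str:
    """Simplified stand-in normalizer (NOT the CCNet function):
    lowercase, map every digit to 0, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d", "0", text)
    return re.sub(r"\s+", " ", text).strip()


def dedup_paragraphs(paragraphs: Iterable[Paragraph]) -> List[Paragraph]:
    """Keep the first occurrence of each normalized paragraph,
    visiting paragraphs in sorted (url, seq_num) order."""
    seen = set()
    kept: List[Paragraph] = []
    for url, seq_num, text in sorted(paragraphs, key=lambda p: (p[0], p[1])):
        digest = hashlib.sha1(normalize(text).encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((url, seq_num, text))
    return kept
```

Visiting paragraphs in a fixed (url, seq_num) order means that a repeated paragraph is consistently retained in the same document, so the surviving copies tend to stay concentrated in fewer documents instead of being scattered.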

Language Model Scores We scored every paragraph using a 4-gram KenLM Heafield (2011) language model trained on data collected from OPUS Tiedemann (2012) (monolingual data collected from the available bitexts). Note that since the LMs were not trained on data belonging to a specific domain, this feature helped in eliminating general non-fluent sentences.

Heuristic-based approaches We use the following techniques to further refine the filtering step (especially useful for low-resource languages with no LM or a poor-quality LM):

Ratio of digit+punctuation to total characters (current threshold <0.25)

Maximum number of URLs per sentence (current value 1)

Type-token ratio (current threshold >0.6 + removing bottom 1% per language)

Minimum number of tokens per sentence (current value 3; not applied for agglutinative languages)

Tagging profane words and removing instances containing such words
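The per-sentence heuristics listed above can be sketched as a single predicate. The thresholds are the ones quoted in the list; whitespace tokenization, the URL regular expression, ASCII-only punctuation, and the function name passes_heuristics are assumptions. The per-language bottom-1% type-token cut and the profanity tagging are omitted because they need corpus-level statistics and a word list that are not part of the text.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")  # assumption: simple URL pattern


def passes_heuristics(
    sentence: str,
    max_digit_punct_ratio: float = 0.25,
    max_urls: int = 1,
    min_ttr: float = 0.6,
    min_tokens: int = 3,
    agglutinative: bool = False,
) -> bool:
    """Return True if the sentence survives all per-sentence filters."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    if not tokens:
        return False
    # The minimum-length filter is not applied to agglutinative languages.
    if not agglutinative and len(tokens) < min_tokens:
        return False
    # Share of digits and ASCII punctuation among all characters.
    noisy = sum(ch.isdigit() or ch in string.punctuation for ch in sentence)
    if noisy / len(sentence) >= max_digit_punct_ratio:
        return False
    if len(URL_RE.findall(sentence)) > max_urls:
        return False
    # Type-token ratio: distinct tokens divided by total tokens.
    return len(set(tokens)) / len(tokens) > min_ttr
```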

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data. The “raw” data is publicly available at https://commoncrawl.org/the-data.

Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point. The software is proprietary to Meta Platforms and currently unavailable publicly.

F.6 Uses

Has the dataset been used for any tasks already? If so, please provide a description. Yes, this dataset and its precursor CC100 have been used to train machine translation models and multilingual language models, which are foundational to many downstream language tasks.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point. No.

What (other) tasks could the dataset be used for? This data can be used to pretrain multilingual language models, which are foundational to many current and future language tasks.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms? The pipeline for creating this dataset paves the way for building a scalable infrastructure for mining datasets to be used for training large-scale models.

Are there tasks for which the dataset should not be used? If so, please provide a description. No.

F.7 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description. No.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)? N/A

When will the dataset be distributed? No.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions. No.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions. No.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation. N/A

F.8 Maintenance

Who is supporting/hosting/maintaining the dataset? Meta AI.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Refer to the main document.

Is there an erratum? If so, please provide a link or other access point. Currently no.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)? No plan for updating.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced. N/A

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users. N/A

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/ distributing these contributions to other users? If so, please provide a description. No.