XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation

Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, Junjie Hu, Dan Garrette, Graham Neubig, Melvin Johnson

Introduction

Most research in natural language processing (NLP) to date has focused on developing methods that work well for English and a small set of other high-resource languages (Joshi et al., 2020). In contrast, methods for other languages can be vastly more beneficial as they enable access to language technology for more than three billion speakers of low-resource languages and prevent the NLP community from overfitting to English. Motivated by these benefits, the area of multilingual NLP has attracted increasing interest recently.

However, evaluating multilingual models is challenging as it requires assessing performance on a wide range of typologically distinct languages in the face of limited heterogeneous data sources. Recently, large-scale benchmarks such as xtreme (Hu et al., 2020) and xglue (Liang et al., 2020) have been introduced, consolidating existing multilingual tasks and covering tens of languages. When xtreme was released, the gap between the best-performing baseline, XLM-R Large (Conneau et al., 2020), and human-level performance was roughly 25 points. This has since shrunk to less than 12 points, a much smaller but still substantial gap compared to the difference from human-level performance observed in English transfer learning (Wang et al., 2019a), which has recently been closed entirely on some evaluation suites (He et al., 2021).

In order to examine the nature of this progress, we first perform an analysis of state-of-the-art multilingual models on xtreme. We observe that progress has not been uniform, but concentrated on cross-lingual retrieval tasks where fine-tuning on other tasks and pre-training with parallel data lead to large gains. On other task categories improvements are more modest. Models still generally perform poorly on languages with limited data and non-Latin scripts. Fine-tuning on additional translated data generally leads to the best performance.

Based on this analysis, we propose xtreme-r (xtreme revisited), a new benchmark with the dual purpose of ensuring that research in multilingual NLP focuses on the most challenging problems and equipping researchers with a broader set of tools to better understand their models (see Table 1 for a brief overview). xtreme-r follows in its predecessor’s footsteps by being massively multilingual, diverse, and accessible. It expands on xtreme by covering 50 typologically diverse languages and 10 challenging, diverse tasks. To make retrieval more difficult, we introduce two new tasks that focus on “language-agnostic” retrieval Roy et al. (2020), where targets must be retrieved from a large multilingual candidate pool. We additionally establish new state-of-the-art mT5 Xue et al. (2021) and translate-train baselines for our tasks.

xtreme-r aims to move away from a single aggregate metric summarizing a model’s performance and towards a more nuanced evaluation and comparison of multilingual models Ethayarajh and Jurafsky (2020); Linzen (2020). To this end, we introduce an extensible multilingual diagnostic and evaluation suite that consists of two main components: a) MultiCheckList, a test suite (Ribeiro et al., 2020) for probing question answering capabilities in 50 languages. This test suite is the first of its kind and enables direct evaluation of fine-grained capabilities in a massively multilingual setting. b) We extend the multi-dataset evaluation framework ExplainaBoard (Fu et al., 2020; Liu et al., 2021) to additional tasks and the multilingual setting. This framework allows us to break down performance based on language and task-specific attributes, which enables a more nuanced diagnosis of a model’s behaviour.

We also make several logistical improvements to increase xtreme-r’s utility as a leaderboard. To make it easier to choose the best model for a use case, each submission is required to provide metadata such as the number of parameters and amount of pre-training data, which we make available via an interactive leaderboard. We also introduce task and language-specific sub-leaderboards to invite submissions of dedicated models.

In sum, we make the following contributions: a) an analysis of progress in cross-lingual modeling; b) an improved benchmark covering 50 languages, including a newly created retrieval task (Mewsli-X); c) a massively multilingual diagnostic suite; d) fine-grained evaluation capabilities; e) experiments and analyses of state-of-the-art models; and f) an interactive metadata-rich leaderboard.

Examining the State of Multilingual Benchmarking

Benchmarking is critical to evaluate general-purpose language understanding technologies. To this end, benchmarks like GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) provide a way to assess the transfer learning capabilities of various models. However, these benchmarks focus only on English. On the other hand, cross-lingual approaches have been evaluated on a wide range of disparate tasks (Hu et al., 2020). xtreme was proposed as a platform to unify this fragmented evaluation landscape and to catalyze advances in cross-lingual learning by including a diverse set of tasks and languages. It consists of 9 tasks covering 40 diverse languages, which can be grouped into 4 broad task types (see §3.1 for details): classification (XNLI, PAWS-X), structured prediction (UD-POS, WikiANN-NER), question answering (XQuAD, MLQA, TyDiQA-GoldP), and retrieval (Tatoeba, BUCC). xtreme focuses on zero-shot cross-lingual transfer, i.e. models can be pre-trained on any multilingual data and are fine-tuned only in English. Similarly, xglue (Liang et al., 2020), another cross-lingual benchmark, focuses on a smaller number of less typologically diverse languages and includes generation tasks. Other non-English benchmarks focus on specific linguistic phenomena, e.g. code-switching (Khanuja et al., 2020); languages, e.g. Indonesian (Willie et al., 2020) and Persian (Khashabi et al., 2020); and language families, e.g. Indian languages (Kakwani et al., 2020).

2 An Analysis of xtreme

As of April 15, 2021, all submissions to the xtreme leaderboard are large-scale Transformers (Vaswani et al., 2017) trained with masked language modeling (MLM; see Appendix A for further details). We analyze the performance of these models on the xtreme leaderboard in Figure 1. This evaluation compares “models + data”, as models employ different types of data during training; we document this information in Appendix H. Overall, multilingual models have improved the average performance on xtreme from 55.8 to 81.4. Much of this improvement is concentrated in the retrieval-based tasks where performance increased from 47.7 (mBERT) to 92.7 (VECO). In contrast, performance on question answering and structured prediction tasks has improved only slightly.

Breaking down performance by language family, on Tatoeba (Figure 1(c)) recent models still struggle with a few low-resource languages. Models perform well for most other languages and their scores are concentrated in a relatively small range. On MLQA (Figure 1(b)), scores have increased slightly but remain well below performance on English. On POS tagging (Figure 1(d)), scores remain largely the same; performance is lower for some languages with non-Latin scripts and low-resource languages. We show the scores for the remaining tasks in Appendix B. The remaining gap to English performance on these tasks is partially an artefact of the evaluation setup: zero-shot cross-lingual transfer from English favors English representations whereas models fine-tuned on in-language monolingual data perform more similarly across languages Clark et al. (2020); Hu et al. (2020).

Overall, representations from token-level MLM pre-training are of limited use for cross-lingual sentence retrieval, as evidenced by the comparatively poor performance of the mBERT and XLM-R models. Fine-tuning on sentence-level tasks Phang et al. (2020); Fang et al. (2021) can mitigate this. The strong performance of recent models such as VECO and ERNIE-M on the retrieval tasks can be attributed to a combination of parallel data and new pre-training objectives that make use of it. Pre-training on parallel data improves performance on retrieval by making the pre-training task more similar to the downstream setting but does not significantly improve performance on other tasks. Fine-tuning on automatically translated task-specific data yields strong gains and is used by most recent models to achieve the best performance Hu et al. (2020); Ouyang et al. (2020); Luo et al. (2020). Nevertheless, key challenges such as how to learn robust cross-lingual syntactic and semantic processing capabilities during pre-training remain.

XTREME-R

In order to encourage the NLP community to tackle challenging research directions in pursuit of better cross-lingual model generalization, we propose xtreme-r (xtreme revisited). xtreme-r shares its predecessor’s core design principles for creating an accessible benchmark to evaluate cross-lingual transfer but makes some key changes.

First, xtreme-r focuses on the tasks that have proven to be hardest for current multilingual models. To this end, it drops xtreme’s PAWS-X and BUCC tasks since recent advances have left less room for further improvement, and they cover only a small number of less diverse languages. They are replaced instead by three new, more challenging tasks: one focusing on causal commonsense reasoning (§3.2.1) and two focusing on harder retrieval scenarios (§3.2.2), as this has been the task category where gains have been easiest to achieve. We retain xtreme’s seven other tasks as each still presents substantial challenges for state-of-the-art cross-lingual models (§3.1). Overall, xtreme-r includes 10 diverse tasks, summarized in Table 2.

We also make changes to the structured prediction tasks, NER and POS. Instead of only providing examples as lists of tokens, xtreme-r always provides the full text of an input sentence, thus ensuring that the entire benchmark now supports research on models that operate directly from the raw input string Clark et al. (2021). Furthermore, xtreme-r adopts a more realistic version of the NER task in which no gold tokenization is provided at all, meaning that systems will either have to use model-predicted tokens or embrace tokenization-free approaches. Finally, xtreme-r provides a multilingual diagnostic and evaluation suite (§3.4).

We retain the XNLI (Conneau et al., 2018), UD-POS (Nivre et al., 2018), WikiANN-NER (Pan et al., 2017), XQuAD (Artetxe et al., 2020a), MLQA (Lewis et al., 2020), TyDiQA-GoldP (Clark et al., 2020), and Tatoeba (Artetxe and Schwenk, 2019) tasks from xtreme (see Appendix C).

3.2 New Tasks

XCOPA The Cross-lingual Choice of Plausible Alternatives (Ponti et al., 2020) dataset asks models to decide which of two sentences causally follows a premise sentence. The XCOPA authors translated and re-annotated the validation and test sets of the English COPA (Roemmele et al., 2011) dataset into 11 languages, which we use for evaluation. The English COPA training set together with the Social IQa (Sap et al., 2019) training data are used for training. While accuracy on the English COPA recently reached 94.8% (Raffel et al., 2020), the state-of-the-art on XCOPA is only around 70%.

3.2.2 Retrieval from a Multilingual Pool

Many previous retrieval benchmarks assume that the entire candidate pool is in a single language. For instance, a French query will be used to search over only English candidates. However, practical settings often violate this assumption, e.g. the answer to a question may be available in any number of languages, possibly different from the query language. Models that cannot compare the appropriateness of retrieval results across languages are thus ineffective in such real-world scenarios.

xtreme-r includes two new related cross-lingual retrieval tasks. The first seeks to measure the extent to which cross-lingual representations are “strongly aligned” (Roy et al., 2020), i.e. they place the semantically most related text pairs (e.g. a question and its answer) closest together in representation space, regardless of their language identities. The second analogously frames entity linking as retrieving from a multilingual pool of entity descriptions, given an entity mention in context (Botha et al., 2020). For both, we report performance as mean average precision at 20 (mAP@20).
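To make the metric concrete, the following is a minimal sketch of mean average precision at a cutoff, written for illustration only. It is not the benchmark's official scorer: the function names are hypothetical, and normalizing by min(number of relevant items, k) is an assumed convention that the text above does not specify.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(ranked: Sequence[Hashable],
                           relevant: Set[Hashable],
                           k: int = 20) -> float:
    """Average precision over the top-k ranked candidates of one query."""
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    # Assumption: normalize by min(|relevant|, k); other conventions exist.
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(rankings: Sequence[Sequence[Hashable]],
                                relevants: Sequence[Set[Hashable]],
                                k: int = 20) -> float:
    """Mean of the per-query average precision values."""
    scores = [average_precision_at_k(r, rel, k)
              for r, rel in zip(rankings, relevants)]
    if not scores:
        raise ValueError("at least one query is required")
    return sum(scores) / len(scores)
```

In a LAReQA-style setting, `relevant` would hold the identifiers of all correct answers to a query across languages, so a model is rewarded only when it ranks correct answers in every language near the top.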

LAReQA Language Agnostic Retrieval Question Answering (Roy et al., 2020) is a sentence retrieval task. Each query has target answers in multiple languages, and models are expected to rank all correct answers above all incorrect answers, regardless of language. We use the LAReQA XQuAD-R dataset which contains 13,090 questions each of which has 11 target answers (in 11 distinct languages) within the pool of 13,014 candidate answer sentences. Following Roy et al. (2020), we fine-tune models on the SQuAD v1.1 train set. The fine-tuned model is used to rank the 13K candidates for each question.

Mewsli-X Mewsli (Multilingual Entities in News, linked) is an automatically extracted dataset that requires linking a contextual entity mention to its entry in a language-agnostic knowledge base by retrieving the entity’s description from a multilingual candidate pool (Botha et al., 2020). For xtreme-r, we derive Mewsli-X as a new variant of Mewsli-9, still linking against WikiData (Vrandečić and Krötzsch, 2014). Mewsli-X features 15K mentions in 11 languages: given a mention in context, the task is to retrieve the single correct target entity description from a candidate pool ranging over 1M candidates across all 50 languages of xtreme-r. Fine-tuning is done on a predefined set of English-only mention-entity pairs randomly sampled from English Wikipedia hyperlinks (see Appendix E for further details).

For our baseline systems on both tasks, we follow previous work Roy et al. (2020); Botha et al. (2020) and train a dual encoder initialized from the pre-trained model weights, optimizing for an in-batch sampled softmax loss (Gillick et al., 2018).
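The in-batch softmax objective can be sketched as follows. This is a simplified, framework-free illustration under stated assumptions: embeddings are plain Python lists, similarity is an unscaled dot product, and the loss is computed in the query-to-candidate direction only. The exact formulation used by Gillick et al. (2018) or by our baselines may differ, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product used as the similarity score (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(queries: List[Vector],
                          candidates: List[Vector]) -> float:
    """Mean cross-entropy where candidates[i] is the positive for queries[i]
    and every other candidate in the batch acts as a negative."""
    if not queries or len(queries) != len(candidates):
        raise ValueError("need a non-empty batch of aligned pairs")
    total = 0.0
    for i, query in enumerate(queries):
        scores = [dot(query, cand) for cand in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the paired (positive) candidate.
        total += log_norm - scores[i]
    return total / len(queries)
```

Because negatives come from the other pairs in the same batch, no separate negative sampling step is needed; the loss decreases as each query embedding moves closer to its own candidate than to the rest of the batch.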

3.3 Languages

xtreme-r adds the following ten languages to xtreme: Haitian Creole, Cusco Quechuan, Wolof, Lithuanian, Punjabi, Gujarati, Polish, Ukrainian, Azerbaijani, and Romanian. In total, xtreme-r covers the following 50 languages (shown using their ISO 639-1 codes for brevity), belonging to 14 language families and two isolates: af, ar, az, bg, bn, de, el, en, es, et, eu, fa, fi, fr, gu, he, hi, ht, hu, id, it, ja, jv, ka, kk, ko, lt, ml, mr, ms, my, nl, pa, pl, pt, qu, ro, ru, sw, ta, te, th, tl, tr, uk, ur, vi, wo, yo, zh. The new languages are covered in the new tasks as well as in UD-POS, WikiANN-NER, and Tatoeba. xtreme-r is about as typologically and genealogically diverse as xtreme while covering a larger number of languages (see Appendix D).

3.4 Diagnostic and evaluation suite

To increase the language coverage of low-resource languages in xtreme-r and to enable us to systematically evaluate a model’s cross-lingual generalization ability, we augment xtreme-r with a massively multilingual diagnostic and evaluation suite. Challenge sets and diagnostic suites in NLP (Wang et al., 2019a, b; Belinkov and Glass, 2019) are mostly limited to English, with a few exceptions (Gulordava et al., 2018). As challenge sets are generally created with a human in the loop, the main challenge for creating a large multilingual diagnostic suite is to scale the annotation or translation effort to many languages and to deal with each language’s idiosyncrasies.

MultiCheckList To address this, we build on the CheckList (Ribeiro et al., 2020) framework, which facilitates creating parameterized tests for models. CheckList enables the creation of test cases using templates, which test for specific behavioral capabilities of a model with regard to a downstream task. Importantly, by relying on template-based tests, we can efficiently generate a large number of diverse multilingual test cases by creating a relatively small number of templates in 50 languages. (In contrast, translating an existing test suite or dataset or annotating individual examples for 50 languages would be prohibitively expensive.) We focus on translating English tests, which consist of templates and their fill-in values. (Templates could alternatively be created by native speakers in each language.) To study the feasibility of creating multilingual test cases at scale, we translate the minimum functionality tests (MFT) of CheckList, which probe for general vocabulary and taxonomic knowledge in question answering. We instruct translators to create separate variants of a template to disambiguate linguistic phenomena, such as gender of fill-in values, question type, semantics of properties, etc. We automatically fill names in each language based on data from Wikidata and programmatically consolidate different templates in each language. We show examples of templates and the tests that they generate in different languages in Table 3.
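The way a small number of templates yields many test cases can be illustrated with a short sketch. It does not use the CheckList library; the helper `expand_template`, the template fields, and the fill-in values are all hypothetical and serve only to show the mechanism of slot filling.

```python
from itertools import product
from typing import Dict, Iterator, List


def expand_template(template: Dict[str, str],
                    fill_ins: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one test case per combination of fill-in values.

    `template` maps field names (context, question, answer) to strings
    with named slots; `fill_ins` maps each slot to its candidate values.
    """
    slots = sorted(fill_ins)
    for values in product(*(fill_ins[slot] for slot in slots)):
        binding = dict(zip(slots, values))
        yield {field: text.format(**binding)
               for field, text in template.items()}


# Hypothetical English comparison template with made-up fill-in values.
template_en = {
    "context": "{first} is {comparative} than {second}.",
    "question": "Who is {comparative}?",
    "answer": "{first}",
}
fill_ins_en = {
    "first": ["Anna", "Luis"],
    "second": ["Omar"],
    "comparative": ["taller", "older"],
}
cases = list(expand_template(template_en, fill_ins_en))
```

Here one template and five fill-in values produce four test cases. A translated suite would keep one such template and fill-in dictionary per language, with separate template variants wherever agreement (e.g. gender) requires them.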

We highlight statistics of the dataset and translation process, instructions to translators, and general challenges of template translation in Appendix F. We believe that parameterized tests are a powerful tool to obtain diverse diagnostics data for otherwise resource-starved languages. We view participatory research (∀ et al., 2020) with native speakers to create template-based test cases testing for language-specific behaviour as particularly promising.

Multilingual ExplainaBoard The standard practice in leaderboards is to average performance across different settings Wang et al. (2019b, a). While this provides discriminative power, it has limited utility for examining the relative advantages of systems, the characteristics of different datasets and languages, and how these factors relate to each other. To provide more granular evaluation capabilities, we extend Fu et al. (2020); Liu et al. (2021)’s ExplainaBoard to the task categories and languages in xtreme-r. ExplainaBoard provides a more nuanced impression of a model’s performance on a task by defining task-specific attributes (e.g. entity length for NER). The test set is partitioned into different buckets based on the defined attributes and performance is broken down over different attribute values. We define new task-specific attributes for the four task types as well as task-independent attributes (see Appendix K).
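The bucketed breakdown described above can be sketched in a few lines. This is an illustration under assumptions, not ExplainaBoard's API: the function name, the per-example `correct` flag, and the use of fixed inclusive upper bounds as bucket edges are all hypothetical choices.

```python
import bisect
from collections import defaultdict
from typing import Callable, Dict, List


def bucket_accuracy(examples: List[dict],
                    attribute: Callable[[dict], float],
                    upper_bounds: List[float],
                    names: List[str]) -> Dict[str, float]:
    """Accuracy per attribute bucket.

    Bucket i holds values v with upper_bounds[i-1] < v <= upper_bounds[i];
    the last bucket holds everything above the final bound, so `names`
    needs exactly len(upper_bounds) + 1 entries.
    """
    if len(names) != len(upper_bounds) + 1:
        raise ValueError("need one more bucket name than upper bounds")
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for example in examples:
        index = bisect.bisect_left(upper_bounds, attribute(example))
        totals[names[index]] += 1
        correct[names[index]] += int(example["correct"])
    # Empty buckets are omitted rather than reported as zero.
    return {name: correct[name] / totals[name]
            for name in names if totals[name]}


# Toy predictions: answer length in tokens and whether the system was right.
predictions = [
    {"answer_len": 1, "correct": True},
    {"answer_len": 2, "correct": True},
    {"answer_len": 4, "correct": False},
    {"answer_len": 5, "correct": True},
    {"answer_len": 9, "correct": False},
]
by_length = bucket_accuracy(predictions, lambda ex: ex["answer_len"],
                            upper_bounds=[2, 5],
                            names=["short", "medium", "long"])
```

Running the same breakdown for two systems on the same buckets gives the pairwise comparison: a system with a higher aggregate score can still lose on individual buckets.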

Metadata We additionally would like to enable practitioners to rank submissions based on other information. To this end, we ask each submission to xtreme-r for relevant metadata such as the number of parameters, the amount of pre-training data, etc. We will show this information in an interactive leaderboard (see Appendix H for the metadata of current xtreme submissions).

Experiments

xtreme-r focuses on zero-shot cross-lingual transfer from English. While recent work (Hu et al., 2020; Lauscher et al., 2020; Hedderich et al., 2020) demonstrates the benefits of fine-tuning on in-language data, we believe the zero-shot scenario remains the most effective way to evaluate the amount of a priori multilingual knowledge a pre-trained model captures. Due to variation in cross-lingual evaluation (Keung et al., 2020), we recommend that researchers use the validation set of a single target language for development (Artetxe et al., 2020b).

1 Baselines

We employ established pre-trained multilingual models and models using translations as baselines.

mBERT Multilingual BERT Devlin et al. (2019) has been pretrained on the Wikipedias of 104 languages using MLM.

XLM-R XLM-R Large (Conneau et al., 2020) uses the same MLM objective with a larger model, and was trained on an order of magnitude more web data from 100 languages.

mT5 Multilingual T5 Xue et al. (2021) is an encoder-decoder transformer that frames NLP tasks in a “text-to-text” format. It was pre-trained with MLM on a large multilingual web corpus covering 101 languages. We employ the largest mT5-XXL variant with 13B parameters.

Translate-train To evaluate the impact of MT, we fine-tune mBERT on translations of English training data from Hu et al. (2020). We create new translations for the XCOPA and SIQa data using an in-house MT system. (We are unable to produce translations for Quechua.)

Translate-train multilingual In addition, we fine-tune both mBERT and mT5 on the combined translated training data of all languages (including the original English data) jointly.

Human performance We use the human performance estimates from xtreme for the retained tasks. For XCOPA we average the proportion of annotated labels disagreeing with the majority label across all languages Ponti et al. (2020). We are not able to obtain human performance estimates for the new retrieval tasks as identifying a translation among a large number of candidates is too time-consuming for a human to perform.

2 Results

We show the main results in Table 4. As in prior work, XLM-R Large generally outperforms mBERT. Fine-tuning helps significantly on Tatoeba compared to the zero-shot setting (Hu et al., 2020). The new tasks are challenging for current models, which show relatively lower performance compared to other tasks. XCOPA presents a challenging classification task that requires cross-lingual common sense reasoning while the language-agnostic nature of Mewsli-X and LAReQA puts the cross-lingual alignment of multilingual representations to the test. Analysis of the language-agnostic retrieval results shows that a large gap remains between cross-lingual and same-language test cases. XLM-R Large improves significantly over mBERT on the cross-lingual case in exchange for a slight drop for the same-language case. This points to XLM-R Large inducing more “strongly-aligned” representations (see Appendix I for details). The state-of-the-art mT5 improves performance on classification and QA tasks but performs less well on structured prediction and retrieval, highlighting settings where advances beyond scale are needed. Due to compute limitations, we extract mT5 embeddings by averaging the encoder outputs of a frozen mT5 model fine-tuned on SQuAD v1.1, as opposed to fine-tuning a dual encoder. For this reason, the mT5 language-agnostic retrieval scores are not directly comparable to those of mBERT and XLM-R. Training on task-specific translations is beneficial in all cases and generally performs best, although improvements on QA tasks are diminishing. To obtain a more fine-grained understanding of the performance of current models, we conduct several analyses using our multilingual diagnostic suite.
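The embedding extraction used for mT5, averaging the encoder outputs, amounts to mean pooling over token vectors. The sketch below illustrates the operation on plain Python lists; the function name is hypothetical, and excluding padded positions via a mask is an assumption, since the text above states only that encoder outputs are averaged.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]],
              mask: List[int]) -> List[float]:
    """Average the encoder output vectors of non-padding tokens.

    `mask[i]` is 1 for a real token and 0 for padding (assumed convention).
    """
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("cannot pool a sequence with no real tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
```

The resulting fixed-size vector can be compared across sentences, but unlike a fine-tuned dual encoder it is not trained for retrieval, which is why the resulting scores are not directly comparable.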

Analyses

We show the results of XLM-R fine-tuned on English SQuAD v1.1 on the 6 tests of MultiCheckList in Table 5 (see Appendix F for the full results, example failure cases, and mBERT results). While mBERT’s average error rate is greater than 85% on 4/6 test categories, XLM-R demonstrates a substantially more robust cross-lingual understanding ability. XLM-R performs worst on tests in low-resource languages with limited or no pre-training data such as gu, ha, ht, qu, sw, wo, and yo and in languages with non-Latin scripts such as he, ja, th, and zh. In addition, XLM-R displays interesting variation across languages, for instance failing in modeling comparisons in some languages, like Basque (eu), where it otherwise succeeds. We release the tests and test outputs to encourage deeper analysis and extension to other tasks and languages.

2 Nuanced Multilingual Evaluation

We showcase how nuanced multilingual evaluation enables us to perform single and pairwise system diagnosis on XQuAD in Table 6 (see Appendix K for analyses of the other tasks). We choose two systems: ERNIE-M, one of the top systems on xtreme, and XLM-R in eight languages: English, Chinese, Hindi, Greek, Russian, Turkish, Arabic, and Vietnamese (en, zh, hi, el, ru, tr, ar, vi).

Attributes We denote $(\mathbf{X}_{c},\mathbf{X}_{q},\mathbf{X}_{a})$ as a tuple of a context, question and answer, and refer to cLen, qLen, aLen as their lengths (i.e., the number of tokens). We use BLEU (Papineni et al., 2002) to measure lexical overlap between $(\mathbf{X}_{a},\mathbf{X}_{q})$ and $(\mathbf{X}_{q},\mathbf{X}_{c})$ as BLEU-AQ and BLEU-QC. We report the top 5 most frequent question types (qType), which cover 85% of questions in the training set.

Single System Analysis For almost all languages, ERNIE-M achieves the highest performance on shorter answers (XS), but the worst performance on longer answers (XL). Especially in el, the performance difference between long and short answers is more than 40 absolute points. The influence of question and context length is language-dependent. For example, in zh the system favors long questions and contexts while in hi, it is the opposite. If the answer is lexically similar to the question (larger BLEU-AQ), the system tends to make more mistakes in all eight languages. However, a higher lexical overlap between questions and contexts (BLEU-QC) is helpful for some languages: el, ru, ar. Surprisingly, ERNIE-M struggles to answer relatively frequent question types (i.e., what, and how), while it performs better on less frequent questions, indicating that although questions about person, place and choice are less frequent, they are easier than abstract questions.

Pairwise System Analysis Although ERNIE-M outperforms XLM-R by a large margin, it is surpassed by XLM-R on a few buckets. In en, XLM-R is better at dealing with longer answers and questions. In tr, XLM-R surpasses ERNIE-M on samples with shorter answers and contexts. In zh, XLM-R performs better when dealing with questions that are lexically similar to the answers.

Conclusions

Our analyses and experiments have shed light on important directions where scale alone is not sufficient such as “strong” alignment, syntactic transfer, fine-grained natural language understanding, and answering of abstract questions. We encourage the development of better inductive biases, pre-training objectives, and evaluation resources. We make our data, translations, evaluation resources, and interactive leaderboard supporting detailed comparative analyses available to help the community gain a better understanding of multilingual models.

Ethical Considerations

xtreme-r seeks to improve language representation and language diversity in NLP research, which has been identified as a large challenge (Joshi et al., 2020). We tried to cover a set of languages that is as diverse as possible, while still providing access to evaluation data in multiple tasks for each language. Despite this, xtreme-r has little representation of languages of the Americas and Africa due to a lack of labeled datasets for these languages. In addition, some languages included in xtreme-r with few data available online are only covered in a small number of datasets (see Table 7). To ameliorate this, we release training data of tasks translated into other languages, as well as the new MultiCheckList. We reiterate the on-going need for creating labeled datasets for diverse tasks in under-represented languages, to facilitate the development and evaluation of NLP models for such languages. We emphasize the importance of participatory research (∀ et al., 2020) as a modus operandi for such work in order to involve marginalized communities in the research process.

2 Leaderboard chasing

New benchmarks incentivize researchers to hill-climb on aggregate metrics (Ethayarajh and Jurafsky, 2020). In addition, new benchmarks create new opportunities for models to reach “superhuman” performance, which may lead people outside the field to erroneously conclude that some model has “solved language”. We hope that our inclusion of ExplainaBoard and MultiCheckList helps to prevent such a fallacy, by enabling more fine-grained evaluation that goes beyond a single aggregate metric.

3 Biases in multilingual models

Multilingual models have been shown to reflect biases similar to their monolingual counterparts Zhao et al. (2020). In addition, multilingual models are biased towards languages with more pre-training data Hu et al. (2020). Zero-shot cross-lingual transfer additionally introduces a bias towards the source language Søgaard et al. (2018); Anastasopoulos and Neubig (2020). Due to the paucity of training data in other languages, we nevertheless focus on English-centric transfer and encourage future dataset creation efforts to include training data in multiple languages.

4 Environmental concerns

xtreme-r aims to enable efficient evaluation of multilingual models. To this end, we created a new dataset, Mewsli-X, that captures the essence of multilingual entity linking against a diverse knowledge base but is computationally cheaper to evaluate than the large-scale Mewsli-9 Botha et al. (2020). Nevertheless, the models that perform best on benchmarks like xtreme-r are generally large-scale Transformer models pre-trained on large amounts of data, which comes at a high cost Strubell et al. (2019). We thus particularly encourage the development of efficient methods to adapt existing models to new languages Pfeiffer et al. (2020) rather than training multilingual models entirely from scratch.

Acknowledgements

We thank Marco Tulio Ribeiro for advice on CheckList. We are grateful to Laura Rimell and Jon Clark for valuable feedback on drafts of this paper, and to Dan Gillick for feedback on the Mewsli-X dataset design. We thank Hila Gonen, Bidisha Samantha, and Partha Talukdar for advice on Arabic, Bengali, and Hebrew CheckList examples.

References

Appendix

Appendix A Details of xtreme models

All submissions to the xtreme leaderboard are large-scale Transformers (Vaswani et al., 2017) trained with masked language modeling (MLM), many of which extend monolingual models. Multilingual BERT (mBERT), XLM-RoBERTa (XLM-R; Conneau et al., 2020) and multilingual T5 (mT5; Xue et al., 2021) extend BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020) respectively. Rebalanced mBERT (RemBERT; Chung et al., 2021) is a more efficient, scaled-up reparameterization of mBERT. These models have been pre-trained on unlabeled data in around 100 languages from Wikipedia (mBERT) and CommonCrawl. XLM-R was the strongest baseline in xtreme (Hu et al., 2020) and is the foundation for some subsequent work. It has been fine-tuned on English data of a related task prior to task-specific fine-tuning (STILTs; Phang et al., 2020). The following models furthermore propose new methods to leverage parallel data during pre-training or fine-tuning. FILTER (Fang et al., 2021), based on XLM-R, fuses representations in different languages. VECO (Luo et al., 2020) is a 24-layer encoder-decoder model that uses additional MLM variants during pre-training. T-URLv2 and HiCTL (Wei et al., 2021), based on InfoXLM (Chi et al., 2021) and XLM-R respectively, employ contrastive losses. ERNIE-M (Ouyang et al., 2020) incorporates back-translation into language modeling. We are not aware of the technical details of Polyglot and the anonymous submission.

Appendix B Task scores on xtreme

We show the performance of the models on the xtreme leaderboard broken down by language family on the remaining xtreme tasks in Figure 2.

Appendix C xtreme tasks retained in xtreme-r

XNLI The Cross-lingual Natural Language Inference corpus (Conneau et al., 2018) requires a model to determine whether a premise sentence entails, contradicts, or is neutral with respect to a hypothesis sentence. We use the crowd-sourced English data that was professionally translated to 14 other languages for evaluation and the MultiNLI (Williams et al., 2018) train set for training.

UD-POS We employ the part-of-speech tagging data from the Universal Dependencies v2.7 treebanks (Nivre et al., 2018) covering 104 languages. We use the English training data for training and evaluate on the test sets of the target languages.

WikiANN-NER For named entity recognition, we use the WikiANN dataset (Pan et al., 2017), in which proper noun spans in Wikipedia text have been automatically annotated as either location, person, or organization. We use the balanced train, dev, and test splits from Rahimi et al. (2019).

XQuAD The Cross-lingual Question Answering Dataset (Artetxe et al., 2020a) requires identifying the answer to a question as a span in the corresponding paragraph. A subset of the English SQuAD v1.1 (Rajpurkar et al., 2016) dev set was professionally translated into ten other languages for XQuAD.

MLQA Similarly to XQuAD, the Multilingual Question Answering dataset (Lewis et al., 2020) is another cross-lingual question answering task. The evaluation data in seven languages was automatically mined from Wikipedia, annotations were crowd-sourced, and answer spans aligned. For both XQuAD and MLQA, we use their respective data for evaluation and train on SQuAD v1.1.

TyDiQA-GoldP We use the gold passage (GoldP) version of TyDiQA (Clark et al., 2020), a benchmark for information-seeking question answering, which covers nine typologically diverse languages. The GoldP version is a simplification of the primary task, using only the gold passage as context and excluding unanswerable questions. We use the English training data for training and evaluate on the test sets of the target languages.

Tatoeba We evaluate on the Tatoeba dataset (Artetxe and Schwenk, 2019), which consists of up to 1,000 English-aligned sentence pairs covering 122 languages. We find the nearest neighbor using cosine similarity. To make the setting more realistic, we move away from zero-shot retrieval and fine-tune models on SQuAD v1.1.
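The nearest-neighbor retrieval step can be illustrated with a minimal sketch. It assumes that sentence embeddings are already available as plain lists of floats and that the gold match of query i is candidate i; the function names and toy vectors are hypothetical and not taken from the benchmark code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(queries, candidates):
    """For each query, return the index of the most similar candidate."""
    return [
        max(range(len(candidates)), key=lambda j: cosine(q, candidates[j]))
        for q in queries
    ]


def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose nearest neighbor is the aligned candidate.

    Assumption: query i is aligned with candidate i.
    """
    predictions = nearest_neighbors(queries, candidates)
    hits = sum(int(p == i) for i, p in enumerate(predictions))
    return hits / len(queries)


# Toy 2-d "sentence embeddings" (hypothetical values).
src = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7]]
tgt = [[0.9, 0.0], [0.1, 0.8], [0.5, 0.6]]
accuracy = retrieval_accuracy(src, tgt)
```

In practice the embeddings would come from the fine-tuned encoder, and the exhaustive search would typically be replaced by a batched matrix product.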

Appendix D Languages

Language characteristics We show a detailed overview of languages in xtreme-r including interesting typological differences in Table 7. Wikipedia information is taken from Wikipedia (https://meta.wikimedia.org/wiki/List_of_Wikipedias) and linguistic information from WALS Online (https://wals.info/languoid). xtreme-r includes members of the Afro-Asiatic, Austro-Asiatic, Austronesian, Dravidian, Indo-European, Japonic, Kartvelian, Kra-Dai, Niger-Congo, Sino-Tibetan, Turkic, Uralic, Creole, and Quechuan language families as well as of two isolates, Basque and Korean. For simplicity, we treat Creole as a distinct language family (Bakker et al., 2011).

Language diversity indices We measure the language diversity of xtreme-r according to the typology and language family indices of Ponti et al. (2020), which we show in Table 8 for xtreme-r, xtreme (Hu et al., 2020), and xglue (Liang et al., 2020). The typology index is based on the mean entropy of the distribution over 103 typological features from URIEL (Littell et al., 2017) across the languages while the family index consists of the number of distinct language families divided by the total number of languages. xtreme-r is similarly diverse while covering a larger number of languages.
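The two indices can be sketched as follows. This is an illustrative reading of the description above, not the reference implementation of Ponti et al. (2020): it assumes each language is given as a list of discrete feature values, computes the entropy of each feature's value distribution across languages, and averages over features. The toy families and feature values are hypothetical.

```python
import math
from collections import Counter


def family_index(families):
    """Number of distinct language families divided by the number of languages."""
    return len(set(families)) / len(families)


def typology_index(feature_matrix):
    """Mean per-feature entropy (in bits) across languages.

    feature_matrix[l][f] is the value of typological feature f for language l.
    """
    n_langs = len(feature_matrix)
    n_feats = len(feature_matrix[0])
    entropies = []
    for f in range(n_feats):
        counts = Counter(row[f] for row in feature_matrix)
        entropy = -sum(
            (c / n_langs) * math.log2(c / n_langs) for c in counts.values()
        )
        entropies.append(entropy)
    return sum(entropies) / n_feats


# Toy example: four languages, two binary features (hypothetical values).
families = ["Indo-European", "Indo-European", "Turkic", "Uralic"]
features = [[1, 0], [1, 1], [1, 0], [1, 1]]
fam = family_index(families)
typ = typology_index(features)
```

Here the first feature is constant across languages and contributes zero entropy, while the second is evenly split and contributes one bit, so a more typologically varied sample yields a higher index.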

Appendix E Mewsli-X Dataset

Mewsli-X is constructed specifically for xtreme-r and is a more carefully sampled variant of the Mewsli-9 dataset (Botha et al., 2020), derived from WikiNews in the same way. Compared to Mewsli-9, Serbian is dropped and Polish, Romanian and Ukrainian are added to obtain 11 languages, while the entity descriptions to be retrieved range over all 50 languages in xtreme-r (Table 9). To broaden accessibility, the mention queries, candidates and Wikipedia-based training instances are all downsampled compared to the previous work.

The resolved, viable mentions were filtered to drop duplicate surface mentions of an entity in the same article, and mentions of years (e.g. 2014), which are commonly linked in WikiNews but not of great interest. We then performed stratified sampling by both mention language and entity frequency bins, seeking uniform sizes across strata. Entity frequency is estimated as the number of times an entity is referenced on pages in the 50-language Wikipedia collection, and then binned into five intervals: $[0,1)$, $[1,10)$, $[10,100)$, $[100,1000)$, $[1000,\infty)$.
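The binning and stratified sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: mentions are dictionaries with hypothetical keys `lang` and `entity_count`, and "uniform sizes across strata" is approximated by capping each (language, frequency bin) stratum at a fixed size. The actual sampling procedure used for Mewsli-X may differ.

```python
import random
from collections import defaultdict

# Lower bounds of the five intervals [0,1), [1,10), [10,100), [100,1000), [1000,inf).
BIN_LOWER_BOUNDS = [0, 1, 10, 100, 1000]


def frequency_bin(count):
    """Return the index (0-4) of the frequency interval containing count."""
    index = 0
    for i, lower in enumerate(BIN_LOWER_BOUNDS):
        if count >= lower:
            index = i
    return index


def stratified_sample(mentions, per_stratum, seed=0):
    """Sample up to per_stratum mentions from each (language, bin) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for mention in mentions:
        key = (mention["lang"], frequency_bin(mention["entity_count"]))
        strata[key].append(mention)
    sample = []
    for key in sorted(strata):
        items = strata[key]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Toy mentions (hypothetical counts).
mentions = [{"lang": "de", "entity_count": c} for c in [0, 0, 0, 5, 5000]]
mentions += [{"lang": "pl", "entity_count": c} for c in [3, 7, 8, 9]]
sampled = stratified_sample(mentions, per_stratum=2)
```

Strata with fewer than `per_stratum` mentions are kept whole, so sizes are only uniform where enough data is available.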

Appendix F MultiCheckList

General statistics Creating MultiCheckList involved translating around 550 words (templates and fill-in values) into 49 languages. Some languages required a larger number of words due to the creation of additional templates (Russian required translating 1043 words). In total, the translation effort cost $4,360. For each test category and each language, we automatically generate 200 test cases. Depending on the number of possible variations for each test, each test case can consist of 2 (Comparisons) to 12 (Intensifiers) examples. Note that the number of variations does not correlate with the test success rate, e.g. the “Job vs nationality” test has 8 variations for each test case and the highest success rate overall.

Instructions to translators We sent the following guidelines to annotators for the translation:

We would like you to translate the following templates and their corresponding fill-in values into other languages. Each template contains some text enclosed in curly brackets { }. These are the names of the fields that will be substituted in the template. We ask you not to translate the text within such curly brackets.

Have a look at the text in the other fields to get a better sense what can be substituted in the template. For instance, by referring to the lines with “adj”, we can see that “{first_name} is {adj} than {first_name1}” is a comparison between two people.

If there is not a literal translation or the same translation has already been used for another word, feel free to use a translation that is similar in meaning.

If the translation of the template differs based on the substituted words, please create multiple translations for the template and indicate which substituted words correspond to it. For instance, if one translation of the template assumes that some substituted words have male gender but others have female gender, create a separate translation of the template that conforms to the female gender.

In one template, {p.p1} and {p.p2} refer to a property (either “shape”, “color”, “size”, “age”, or “material”). {p.v1} and {p.v2} refer to an attribute of each property such as “old”, “new”, “red”, “blue”, etc.

Challenges of template translation We highlight some of the challenges and linguistic phenomena we encountered during the process of creating MultiCheckList in the following. Unless specified otherwise, we create separate templates to disambiguate each phenomenon.

Gender agreement: Adjectives and nationalities need to be declined to match the gender of their referring expression. To keep the translation effort manageable and avoid creating separate templates for each gender, we control for gender and restrict fill-in values to male names for the affected tests (3/6). We sample genders equally for the other tests. We welcome dedicated test suites analyzing multilingual gender bias as future extensions.

Declension: In Russian, animal and vehicle names require the accusative in some contexts and the nominative in others.

Normalization: For appropriate substitution, fill-in values often need to include articles. We normalize answers and predictions by removing language-specific articles in order to ensure a consistent comparison.
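A minimal sketch of such article-stripping normalization is shown below. The per-language article lists are illustrative assumptions and are neither complete nor the lists used for MultiCheckList; the lowercasing and whitespace handling follow common SQuAD-style answer normalization.

```python
import re

# Hypothetical, incomplete article lists for illustration only.
ARTICLES = {
    "en": ["a", "an", "the"],
    "de": ["der", "die", "das", "ein", "eine"],
    "es": ["el", "la", "los", "las", "un", "una"],
}


def normalize(text, lang):
    """Lowercase, drop language-specific articles, and collapse whitespace."""
    text = text.lower()
    articles = ARTICLES.get(lang, [])
    if articles:
        pattern = r"\b(?:" + "|".join(re.escape(a) for a in articles) + r")\b"
        text = re.sub(pattern, " ", text)
    return " ".join(text.split())


def exact_match(prediction, answer, lang):
    """Compare a prediction and a gold answer after normalization."""
    return normalize(prediction, lang) == normalize(answer, lang)
```

For example, `exact_match("The Eiffel Tower", "eiffel tower", "en")` holds because the article is removed from both sides before comparison.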

Names: Our use of names based on data in Wikidata leads to certain biases. Names that are more common in Wikidata are more likely to be chosen. In some cases, names in Wikidata are not written in the native script. Japanese names from Wikidata are often written in hiragana or katakana rather than kanji. Our choice of using the first name is also not applicable to all languages. In Japanese, people are usually not referred to by their first name, e.g. Masanori Suzuki would be called Suzuki-san instead of Masanori.

Declension of names: In some languages, a suffix is appended to a name depending on its spelling. For instance, in Turkish the suffix changes based on the vowel of the last syllable, e.g. Ahmet’in, Ali’nin, Umut’un, Şeyma’nın, Özge’nin, etc., and Ahmet’ten, Ali’den, Umut’tan, Şeyma’dan, Özge’den, etc. In Finnish, names are appended with a variation of “lla”, e.g. Peterillä, Lisalla, Mattilla, etc.

Professions: Words for certain professions are gendered (e.g. waiter/waitress), so they only occur with male or female names.

Question syntax: In some languages, the syntax of the question changes depending on the property or adjective one asks about.

Syntax of adjectives: In some languages, the syntax changes depending on what adjective is used. In German, the translations of “happy”, “excited”, and “passionate” require different prepositions.

Stilted language: Some text appears stilted when values are filled into the translated templates. For instance, the question “どちらの方が冷静でないですか。” is an unusual way to do negation in Japanese; if directly translated to English, it would mean “Who is more not calm?”.

We tried to address most of these challenges by instructing translators to create additional templates to disambiguate linguistic phenomena and by consolidating different templates programmatically. However, as this process was relatively labor-intensive, we recommend the usage of morphologically aware templates similar to Jiang et al. (2020) for future work. Note that morphologically aware templates may not be able to resolve some of the finer linguistic differences. For this reason, we also advise working closely with native speakers to design tests that reflect natural language as closely as possible.

Full results We show the full results of XLM-R and mBERT on the MultiCheckList tests in Tables 10 and 11 respectively. mBERT only shows limited amounts of cross-lingual taxonomic knowledge. While it is able to distinguish between job and nationality and animals and vehicles in some languages, it fails to do this consistently across all languages. In addition, it completely fails to distinguish between different properties and intensifiers and is not able to perform comparisons. In contrast, while XLM-R struggles with intensifiers, it demonstrates the other capabilities much more consistently across languages.

We provide example failure cases of XLM-R on a subset of languages in Table 12. We will publicly release a comprehensive list of failure cases for XLM-R and mBERT, the complete tests and model outputs for further analysis.

Appendix G Hyper-parameters

mBERT We use the cased version, which covers 104 languages, has 12 layers, 768 hidden units per layer, 12 attention heads, a 110k shared WordPiece vocabulary, and 110M parameters (https://github.com/google-research/bert/blob/master/multilingual.md). The model was trained using Wikipedia data in all 104 languages, oversampling low-resource languages with an exponential smoothing factor of 0.7. We generally fine-tune mBERT for two epochs, with a training batch size of 32 and a learning rate of 2e-5. We build on the Transformers library (Wolf et al., 2019) for training on each task.
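The effect of exponential smoothing on language sampling can be sketched as follows: each language's share of the data is raised to the power 0.7 and the result is renormalized, which shifts probability mass toward low-resource languages. The corpus sizes below are hypothetical, and the function name is ours.

```python
def smoothed_sampling_probs(sizes, s=0.7):
    """Exponentially smoothed sampling distribution over languages.

    sizes maps a language code to its corpus size; s is the smoothing factor.
    With s=1.0 the original proportions are recovered.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** s for lang, n in sizes.items()}
    normalizer = sum(weights.values())
    return {lang: w / normalizer for lang, w in weights.items()}


# Hypothetical corpus sizes for a high- and a low-resource language.
sizes = {"en": 1000, "is": 10}
probs = smoothed_sampling_probs(sizes)
```

With these toy sizes, the low-resource language is sampled several times more often than its raw share of roughly 1% would suggest.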

XLM-R We use the XLM-R Large version that covers 100 languages, uses a 200k shared BPE vocabulary, and has been trained with masked language modelling (https://github.com/facebookresearch/XLM). We fine-tune XLM-R generally for two epochs with a learning rate of 3e-5 and an effective batch size of 16. We use the Transformers library for training XLM-R on all tasks.

mT5 We use the publicly released mT5-XXL version that has nearly 13 billion parameters with a vocabulary size of 250k (Xue et al., 2021). It has been trained on the multilingual C4 (mC4) corpus, which has 6.3 trillion tokens spanning 101 languages (https://www.tensorflow.org/datasets/catalog/c4#c4multilingual). For all downstream tasks, we fine-tune mT5-XXL for 10k steps with a constant learning rate of 0.001, a dropout rate of 0.1 and a batch size of $2^{17}$ tokens. For early stopping, we save checkpoints every 200 steps and choose the checkpoint with the highest performance on the validation set.

Appendix H Metadata

We intend to ask each submission to xtreme-r for relevant metadata. Such metadata includes the number of parameters, amount of pre-training data, amount of fine-tuning data, etc. We are doing this to enhance transparency and to increase utility of our benchmark for practitioners with varying needs. As a first step in this direction, we provide information about the number of parameters and the amount of monolingual and parallel pre-training data used by all submissions to xtreme in Table 13. Note that the different systems report their training data in different ways (e.g. number of tokens, number of examples, size of the data). We plan to standardize this by asking submissions to xtreme-r to report training data in terms of number of tokens seen.

Appendix I Language-agnostic Retrieval Results

The multiway cross-language nature of Mewsli-X and LAReQA enables closer analysis of model performance by input and target language pairs. Mewsli-X can directly be split by language pair as it has a single correct target per input mention. For LAReQA, we follow the “Limit to One Target” strategy of Roy et al. (2020): instead of asking the model to retrieve all correct answers in one pass, we evaluate on each target separately, with all the other correct answers removed from the candidate pool, allowing us to report splits by language pair.
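The “Limit to One Target” evaluation can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code of Roy et al. (2020): model scores for one query are given as a dictionary from candidate id to score, and because exactly one correct answer remains in each pool, average precision at 20 reduces to the reciprocal rank of that target (or zero if it falls outside the top 20). All names and scores are hypothetical.

```python
def average_precision_single_target(ranked_ids, target_id, k=20):
    """AP@k when exactly one candidate is relevant: 1/rank, or 0 beyond k."""
    for rank, candidate_id in enumerate(ranked_ids[:k], start=1):
        if candidate_id == target_id:
            return 1.0 / rank
    return 0.0


def limit_to_one_target(scores, correct_ids, k=20):
    """Evaluate each correct target separately for a single query.

    For each target, all other correct answers are removed from the
    candidate pool before ranking.
    """
    results = {}
    for target in correct_ids:
        pool = [c for c in scores if c == target or c not in correct_ids]
        ranked = sorted(pool, key=lambda c: scores[c], reverse=True)
        results[target] = average_precision_single_target(ranked, target, k)
    return results


# One query with two correct answers in different languages (toy scores).
scores = {"en_ans": 0.9, "de_ans": 0.4, "distractor1": 0.6, "distractor2": 0.2}
per_target = limit_to_one_target(scores, {"en_ans", "de_ans"})
```

Averaging the per-target values grouped by (query language, target language) then yields the scores split by language pair.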

Table 14 summarizes these pairwise mAP@20 scores (here, micro-averaged), showing that XLM-R Large improves substantially over mBERT on the cross-lingual case (+38% on Mewsli-X and +137% for LAReQA) in exchange for a slight drop for the same-language case. Even so, performance on the cross-lingual case is still low at 29–36 mAP@20, and remains a challenging area for future work. Figures 3 and 4 show the detailed breakdowns.

Appendix J Detailed results

We show the detailed results for each task and language in Tables 15 (XNLI), 16 (XCOPA), 17 (UD-POS), 18 (WikiANN-NER), 19 (XQuAD), 20 (MLQA), 21 (TyDiQA-GoldP), 22 (Tatoeba), 23 (Mewsli-X), and 24 (LAReQA).

Appendix K Nuanced Multilingual Evaluation

We perform nuanced multilingual evaluations by categorizing testing examples into different attribute buckets and measuring the system performance on each attribute bucket. In the following, we describe the available attributes for tasks in xtreme-r and provide additional analysis on different attributes.

QA We denote $(\mathbf{X}_{c},\mathbf{X}_{q},\mathbf{X}_{a})$ as a tuple of the corresponding context, question and answer, and refer to cLen, qLen, aLen as their lengths (i.e., the number of tokens). We use BLEU (Papineni et al., 2002) to measure the lexical overlap between $(\mathbf{X}_{a},\mathbf{X}_{q})$ and $(\mathbf{X}_{q},\mathbf{X}_{c})$ as BLEU-AQ and BLEU-QC, respectively. We classify questions based on their first tokens and report the top 5 most frequent question types as qType (i.e., what, how, when, where, which), which cover 85% of questions in the training set. We list the six attributes as follows.

$\phi_{\texttt{BLEU}_{aq}}(\mathbf{X}_{a},\mathbf{X}_{q})=\texttt{BLEU}(\mathbf{X}_{a},\mathbf{X}_{q})$,

$\phi_{\texttt{BLEU}_{qc}}(\mathbf{X}_{q},\mathbf{X}_{c})=\texttt{BLEU}(\mathbf{X}_{q},\mathbf{X}_{c})$,
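The length and question-type attributes described above can be computed with a few lines of code. The sketch below assumes whitespace tokenization and maps every question whose first token is not among the five listed types to a hypothetical "other" bucket; neither choice is confirmed by the text. The BLEU-based attributes are omitted because they require an external BLEU implementation.

```python
QUESTION_TYPES = ("what", "how", "when", "where", "which")


def q_type(question):
    """Question type based on the first token; 'other' is our assumption."""
    tokens = question.lower().split()
    first = tokens[0] if tokens else ""
    return first if first in QUESTION_TYPES else "other"


def length_attributes(context, question, answer):
    """Token lengths of context, question and answer (whitespace tokens)."""
    return {
        "cLen": len(context.split()),
        "qLen": len(question.split()),
        "aLen": len(answer.split()),
    }


example = length_attributes(
    "Paris is the capital of France .", "What is the capital ?", "Paris"
)
```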

Structured Prediction Given a sentence $\mathbf{X}$, we define the $i$-th word token as $x_{i}$ and a span of words in the range of $[i,j)$ as $\mathbf{X}_{i:j}$ in the sentence. We then define five attributes including the label of a span (tag), the token length of a sentence (sLen), the token length of an entity span (eLen), the character length of an entity span (tLen) and the relative token position of an entity (rPos) in the sentence as follows.

$\phi_{\texttt{tLen}}(\mathbf{X}_{i:j})=|\mathbf{X}_{i:j}|$

$\phi_{\texttt{rPos}}(\mathbf{X}_{i:j})=i/\phi_{\texttt{sLen}}(\mathbf{X})$, relative position

where $|x|$ represents the number of characters.
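As an illustration, the five span attributes can be computed as in the sketch below. It assumes a tokenized sentence and a half-open span $[i,j)$, and it counts the characters of the span's tokens joined by single spaces for tLen; whether separators are counted is our assumption and is not specified above.

```python
def span_attributes(tokens, i, j, tag):
    """Attributes of the entity span tokens[i:j] in a tokenized sentence."""
    span = tokens[i:j]
    s_len = len(tokens)
    return {
        "tag": tag,                   # label of the span
        "sLen": s_len,                # token length of the sentence
        "eLen": j - i,                # token length of the entity span
        "tLen": len(" ".join(span)),  # character length (spaces included)
        "rPos": i / s_len,            # relative token position of the span
    }


tokens = ["Barack", "Obama", "visited", "Paris"]
person = span_attributes(tokens, 0, 2, "PER")
location = span_attributes(tokens, 3, 4, "LOC")
```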

K.2 Attribute Buckets

We bucket all test examples into different attribute buckets for a given attribute. Specifically, for an attribute defined for a task, we measure the attribute value of the test examples (see Section K.1), then determine $N$ attribute buckets for all test examples ($N=4$ by default), and finally we measure the system performance on the test examples falling in each attribute interval to observe the performance change over different attribute buckets. Since the attribute values can be either continuous (e.g., answer length aLen) or discrete (e.g., question type qType), we perform different strategies for creating attribute buckets for them.

Continuous Attribute Values For the attributes with continuous attribute values (e.g., $\phi_{\texttt{aLen}}$, $\phi_{\texttt{qLen}}$, $\phi_{\texttt{cLen}}$, $\phi_{\texttt{BLEU}_{aq}}$, $\phi_{\texttt{BLEU}_{qc}}$), we divide the test examples into different intervals where the numbers of the test samples in all attribute intervals are equal.

Discrete Attribute Values For attributes with discrete attribute values (e.g., $\phi_{\texttt{type}}$), test samples with the same type are put into the same attribute bucket.

Tables 25 and 26 show the detailed attribute intervals for each category on the XQuAD task and WikiANN-NER task, respectively.
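The two bucketing strategies can be sketched as follows. This is a minimal illustration, not the ExplainaBoard implementation: continuous values are sorted and split into $N$ chunks of near-equal size (ties may therefore be split across neighboring buckets, which is a simplification), and discrete values are grouped by identity. Function names and toy values are hypothetical.

```python
from collections import defaultdict


def equal_frequency_buckets(values, n_buckets=4):
    """Split example indices into n_buckets of (near-)equal size by value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = []
    for b in range(n_buckets):
        start = b * len(order) // n_buckets
        end = (b + 1) * len(order) // n_buckets
        buckets.append(order[start:end])
    return buckets


def discrete_buckets(values):
    """Group example indices by their discrete attribute value."""
    buckets = defaultdict(list)
    for index, value in enumerate(values):
        buckets[value].append(index)
    return dict(buckets)


def bucket_accuracy(bucket, correct):
    """Accuracy over the examples in one bucket; correct is a list of 0/1."""
    return sum(correct[i] for i in bucket) / len(bucket)


answer_lengths = [5, 1, 9, 3, 7, 2, 8, 4]                   # continuous attribute
question_types = ["what", "how", "what", "when"]            # discrete attribute
length_buckets = equal_frequency_buckets(answer_lengths)    # N = 4
type_buckets = discrete_buckets(question_types)
```

Per-bucket performance is then obtained by applying a task metric, such as `bucket_accuracy`, to each list of indices.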

K.3 Additional Nuanced Analysis

Tables 27 and 28 illustrate the single-system diagnosis of ERNIE-M and XLM-R respectively on the WikiANN-NER task in three languages (i.e., en, es, fr). We make the following observations.

ERNIE-M In Table 27, first, we observe that the effects of some attributes for ERNIE-M are language-dependent. For example, based on the attribute rPos, the system is good at predicting entities located within the first third of English sentences, while it is relatively poor at predicting entities within the first third of sentences in the other languages. Second, the system favors long sentences based on the attribute sLen; performance even increases with sentence length on es and fr. Third, across all languages, the system performs relatively poorly at predicting long entities (eLen) and entities belonging to the organization class (tag). Finally, the system is good at predicting sentences with fewer entities based on the attribute for entity density (eDen).

XLM-R In Table 28, we observe that the influence of some attributes such as sLen, eLen, and eDen on system performance is similar between ERNIE-M and XLM-R, although ERNIE-M performs significantly better than XLM-R at generalizing its predictions on es and fr.

K.3.2 QA

Table 29 shows the pairwise system analysis of ERNIE-M and T-URLv2 for the XQuAD task. We find that although T-URLv2 outperforms ERNIE-M overall, it is surpassed by ERNIE-M on a few buckets. For example, in zh, ERNIE-M is better at dealing with samples that have long answers, long questions, and a high lexical overlap between questions and answers. In ru, ERNIE-M is better at dealing with samples with long answers, long questions, and lower lexical overlap between questions and answers and between questions and contexts.

K.4 ExplainaBoard Demonstration

Figure 5 shows the interface of ExplainaBoard containing possible selection options to observe the fine-grained analysis for submitted systems on xtreme. We also demonstrate how to perform Single System and Pair Systems analysis in Figures 6 and 7 respectively.

Specifically, to generate a fine-grained overview, we first select models in the table, then click one of the three Analysis Buttons, which generates a fine-grained analysis such as in Figure 6 (single system analysis) and Figure 7 (pair-wise system analysis).