MEGA: Multilingual Evaluation of Generative AI

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

cs.CL

Introduction

Large Language Models (LLMs) such as ChatGPT and GPT-4 have created a lot of interest in the AI community and beyond, due to the step jump in their capabilities, such as maintaining context over conversations, fluency of generation, and reasoning. Many users have reported having tested these systems on languages other than English, with varying results, and recent demos of these models Warren (2023) have been shown in multiple (albeit high-resource) languages. Recently, the GPT-4 model OpenAI (2023) was evaluated on the MMLU multiple choice questions benchmark by automatically translating it into 26 languages, and the results for some low-resource languages in the Latin script were found to be quite promising.

The multilingual capabilities of these models can be traced to their pre-training data, where even the predominantly English large-scale corpora contain hundreds of millions of non-English tokens Blevins and Zettlemoyer (2022). For GPT-3, the unlabeled pre-training data has been documented to contain 119 languages Brown et al. (2020), where roughly 93% of the tokens are in English (https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv). Other LLMs like BLOOM Scao et al. (2022) and PaLM Chowdhery et al. (2022) have a better multilingual representation, with 60% and 18% non-English data respectively for pre-training. While these models have been trained on multiple languages with varying distributions in the pre-training data, it is not clear how well they perform relative to each other across diverse tasks and languages, due to a lack of comprehensive analysis across all models with the same experimental setup.

Recently, there has been a lot of interest in evaluating the different capabilities of LLMs, with comprehensive studies like HELM Liang et al. (2022) that evaluate these models on a wide variety of capabilities. However, such studies are largely performed on English language data and there is a lack of such large-scale evaluation of LLMs for their multilingual capabilities. Given the current pace at which new language technologies that use LLMs are being developed, the importance of such an evaluation cannot be overstated, as inequalities in the performance of previous-generation models across languages have been well documented Blasi et al. (2022).

In our work, we present the first large-scale Multilingual Evaluation of Generative AI models (MEGA), spanning 16 different datasets, $70$ typologically diverse languages, and four LLMs, i.e., the GPT-3.5 models text-davinci-003 and gpt-3.5-turbo, GPT-4 (gpt-4-32k), and BLOOMZ Muennighoff et al. (2022). We also compare these models with models fine-tuned on these datasets, like TULRv6 Patra et al. (2022) and MuRIL Khanuja et al. (2021), which are SoTA on different multilingual benchmarks.

Through our evaluation, we aim to answer three research questions: (1) How well do LLMs fare on multilingual benchmarks compared to fine-tuned SOTA models? (2) What languages do these models perform well in, and can we explain the trends in performance for these models across languages? (3) What prompting strategies should be used for using LLMs for non-English languages?

Our study highlights that there is a significant disparity between the performance of LLMs in English vs non-English languages, especially low-resource languages with non-Latin scripts, for which fine-tuned models perform significantly better. While GPT-4 bridges this gap to some extent, the discrepancy still exists. Further, we find that for these languages it is often difficult to do better than simply machine-translating the input from the target language into English and then sending it to the LLM for prediction (translate-test). We also discuss how different prompt-design choices like prompt-tuning, use of explanations, and number of few-shot examples impact multilingual performance. Finally, we perform some initial analysis to test the possibility of test data contamination in the LLMs that we evaluate and discuss its implications on our findings. Our work provides a blueprint for strategies that can be used for building systems using generative AI for multilingual users. We also release our code (https://aka.ms/MEGA) for the community to scale up the multilingual evaluation of generative models.

MEGA

In this section, we discuss different components of our benchmarking exercise to measure the multilingual capabilities of LLMs. We start by discussing different NLP tasks and datasets that we evaluate these models on, along with their linguistic diversity. We provide an overview of the models we evaluate, baselines for comparison, and describe our evaluation scheme and prompting strategies.

We broadly consider five families of NLP tasks in our experiments covering 16 different datasets:

Classification Tasks. Here, we further have four different sub-tasks: i) Natural Language Inference (classify if a hypothesis is entailed in the premise, contradicts it, or neither), which includes XNLI Conneau et al. (2018), Indic-XNLI Aggarwal et al. (2022) (a version of XNLI translated to 11 Indian languages), and GLUECoS NLI Khanuja et al. (2020b) for English-Hindi code-mixed data; ii) Commonsense Reasoning datasets, including the causal commonsense reasoning benchmark XCOPA Ponti et al. (2020) and XStoryCloze Lin et al. (2022a), where the correct ending of a story with four sentences is to be predicted; iii) the Paraphrase Identification task PAWS-X Yang et al. (2019a), where given two sentences, the model must predict if the two have the same meaning; iv) the EN-ES-CS dataset for Sentiment Analysis on English-Spanish code-mixed tweets.

Question Answering (QA). For QA we consider Span-Prediction tasks, where the answer to a question is to be predicted within a piece of context provided. We evaluate on XQuAD Artetxe et al. (2020), MLQA Lewis et al. (2020), TyDiQA-GoldP Clark et al. (2020), and IndicQA Doddapaneni et al. (2022).

Sequence Labeling. This task involves classifying each token in a piece of text and we consider Named Entity Recognition dataset PAN-X Pan et al. (2017) (also called WikiANN) and UDPOS Nivre et al. (2018) for Part of Speech Tagging.

Natural Language Generation (NLG). For NLG we consider the multilingual Abstractive Summarization dataset XL-Sum.

Responsible AI (RAI). We consider the multilingual Toxicity Prediction dataset Jigsaw Kivlichan et al. (2020), and Wino-MT to measure Gender Bias in MT systems.

All the datasets with the number of languages they include are listed in Figure 1(a). These 16 datasets encompass a total of 70 languages covering 21 different language families, with Indo-Aryan and Afro-Asiatic languages in the majority (see Figure 1(b)). Note that for tasks with $>30$ languages i.e. UDPOS, PAN-X, and XL-Sum, we run evaluations on the first 1000 examples of the test sets. For tasks where no public test sets are available (like XQUAD, TyDiQA-GoldP, and IndicQA), we evaluate on validation data. Refer to Appendix §A.1 for a detailed description of all the datasets.

2.2 Models

OpenAI Models. We conduct all benchmarking experiments on the GPT-3.5 models text-davinci-003 (denoted as DV003 in the paper) and gpt-3.5-turbo Ouyang et al. (2022) (GPT-3.5-Turbo), as well as on the GPT-4 model gpt-4-32k OpenAI (2023). The text-davinci-003 model has a maximum context size of 4096 tokens, while gpt-3.5-turbo and gpt-4-32k support context sizes of 16k and 32k respectively.

Baselines. We compare the performance of OpenAI models with two classes of baselines: i) Prompt-Based baselines, which like the OpenAI models are evaluated by prompting the model directly for solving a task, and ii) Fine-tuned Baselines, which are fine-tuned on task-specific training data. For the former we consider BLOOMZ Muennighoff et al. (2022), a multi-task fine-tuned version of the BLOOM Scao et al. (2022) model, which is a 176 billion parameter model trained on 46 natural languages and 13 programming languages. For fine-tuned baselines, we consider TULRv6 Patra et al. (2022) (the current SoTA on the XTREME benchmark), XLMR Conneau et al. (2020), multilingual BERT Devlin et al. (2019), and mT5 Xue et al. (2021). For Indic datasets we also compare with MuRIL Khanuja et al. (2021), a multilingual BERT model trained on 16 Indic languages that obtains SOTA performance on many Indic benchmarks. All of these models (excluding mT5 for XL-Sum and XCOPA) were fine-tuned with English data and then evaluated in a zero-shot cross-lingual fashion on other target languages.

2.3 Evaluation Methodology

LLMs exhibit two remarkable properties that make them effective at solving a variety of NLP tasks. The first is in-context learning Brown et al. (2020), where the model learns to solve a task through the few input-output examples provided as part of the context without any weight updates. The second is the ability to follow instructions Mishra et al. (2022); Wei et al. (2021); Ouyang et al. (2022), a property of instruction-tuned LLMs, where the models can be prompted to solve new tasks based on the textual instructions provided in context.

The choice of prompt significantly influences the performance of LLMs, and these models have been shown to be brittle to simple prompting variations, such as the choice of prompt template and the training examples, or even the ordering of examples Zhao et al. (2021). For multilingual setups, as highlighted in Lin et al. (2022a) and Shi et al. (2022), some additional variations to consider include the choice of the language of the few-shot examples, the language of the prompt template, and the language of the test examples.

In this work, we evaluate models using three types of prompting strategies:

Monolingual Prompting. In this setup, the $k$ randomly selected examples are of the same language as the test examples.

Zero-Shot Cross-Lingual. Here, we evaluate generative models’ zero-shot cross-lingual transfer ability during in-context learning. We use $k$-shot examples from a pivot language (always English in our experiments) which is different from the language of the test example.

Translate-Test. In this setup also, the few-shot examples are sampled from English data. However, the test example itself is modified by translating it into English. We use Bing Translator to translate the test examples into English. We do not perform evaluations with Translate-Test prompting for QA and Sequence Labelling tasks, where there is no trivial alignment between the labels in the translated text and the native-language text.

To limit costs, for GPT-4 we only run evaluations with the Monolingual prompting strategy except for a couple of datasets, which we explicitly discuss later in §3. Irrespective of the prompting strategy, we use prompt templates written in English (see Appendix §A.7 for the impact of this choice).
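As a minimal sketch, the three strategies differ only in the language of the few-shot pool and in whether the test input is translated. The `Example` class, the `make_prompt` function, the template format, and the `translate_to_english` callable are hypothetical names introduced for illustration; they are not the interface of the released MEGA code or of Bing Translator.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    text: str
    label: str
    lang: str  # language code such as "en" or "hi" (illustrative convention)


def make_prompt(strategy: str, test: Example, pool: List[Example], k: int,
                template: str, translate_to_english: Callable[[str], str],
                seed: int = 0) -> str:
    """Build a k-shot prompt for one test example under a given strategy."""
    if strategy == "monolingual":
        # Few-shot examples come from the same language as the test example.
        demo_lang, test_text = test.lang, test.text
    elif strategy == "zero_shot_cross_lingual":
        # Few-shot examples come from the pivot language (English).
        demo_lang, test_text = "en", test.text
    elif strategy == "translate_test":
        # English few-shot examples, and the test input is translated too.
        demo_lang, test_text = "en", translate_to_english(test.text)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    candidates = [ex for ex in pool if ex.lang == demo_lang]
    demos = random.Random(seed).sample(candidates, min(k, len(candidates)))

    # The template itself is always written in English; only the filled-in
    # text changes language.
    blocks = [template.format(text=ex.text) + " " + ex.label for ex in demos]
    blocks.append(template.format(text=test_text))
    return "\n\n".join(blocks)
```

Keeping the strategy choice in one function makes it explicit that the prompt template is shared across all three setups, so differences in results can be attributed to the language of the examples and of the test input.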

We use PromptSource Bach et al. (2022) for a database of existing prompts to use for our experiments. In order to select the best prompt for a dataset (to appropriately measure the capabilities of these models), we evaluate the performance of available English templates on PromptSource on the English validation set and select the prompt that gives the best performance. This prompt template is then used to evaluate models on the test sets for all languages and prompt strategies. While it would be ideal to tune the prompts separately for each language, the scale of our experiments and the computational costs of these models make it prohibitive. We investigate the impact this choice has on our results in §4.1. We perform separate prompt-tuning for DV003 and GPT-3.5-Turbo models, and to keep the costs in check, we use the prompts obtained for the latter for GPT-4 as well. Final prompts selected are included in Appendix §A.4.
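The template selection step can be sketched as a simple argmax over validation accuracy. This is an illustration under stated assumptions: `select_best_template`, the `{text}` placeholder, and the `predict` callable (standing in for a call to the LLM) are hypothetical, and the sketch does not use the PromptSource API.

```python
from typing import Callable, Dict, List, Tuple


def select_best_template(templates: Dict[str, str],
                         validation: List[Tuple[str, str]],
                         predict: Callable[[str], str]) -> Tuple[str, float]:
    """Return the name and accuracy of the best template on English validation data.

    `templates` maps a template name to a format string with a {text} slot,
    `validation` holds (input, gold label) pairs, and `predict` maps a fully
    rendered prompt to the model's output string.
    """
    if not templates or not validation:
        raise ValueError("need at least one template and one validation example")

    best_name, best_acc = "", -1.0
    for name, template in templates.items():
        correct = sum(
            predict(template.format(text=x)).strip() == y for x, y in validation
        )
        acc = correct / len(validation)
        if acc > best_acc:  # ties keep the first template encountered
            best_name, best_acc = name, acc
    return best_name, best_acc
```

The selected template would then be reused unchanged for every language and prompting strategy, which is the cost-saving choice described above.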

In all our experiments, we choose few-shot examples randomly from the training or validation set (depending on what’s available) of a dataset. For most datasets, we use 8 few-shot examples, excluding tasks with longer contexts like QA and summarization tasks where we use $k=4$.

Results and Analysis

In this section, we analyze the results of our benchmarking exercise across tasks and languages. Broadly, we cover the comparison between the effectiveness of various prompting strategies (§3.1), followed by the performance comparison of GPT-3.5 and GPT-4 models with appropriate baselines (§3.2). We conclude with an examination of the factors that affect the performance of these models (§3.3).

In Figure 2, we compare the performance of the three prompting strategies. We find that translate-test often improves the performance over the monolingual strategy, especially so in the case of DV003. We also find that for datasets that include many low-resource and non-Latin script languages, like IndicXNLI and XStoryCloze, the gains with translate-test are even more substantial for both models. In Figure 3, we present the average (over different tasks) relative improvement by Translate-Test over Monolingual on GPT-3.5-Turbo for different languages, and observe that for languages like Burmese, Tamil, and Telugu the relative improvement can be $>30\%$. In general, we see that for low-resource languages, translate-test results in substantial improvement in performance, while for high-resource languages the two perform similarly. While we do not evaluate GPT-4 Translate-Test exhaustively for all tasks, we do run the tests for the XStoryCloze and XCOPA datasets. Based on these two, we observe that GPT-4’s Monolingual prompting performance is often much closer to Translate-Test and many times even better. However, for low-resource languages we again see Translate-Test perform much better, e.g., in XStoryCloze GPT-4’s accuracy on Burmese is $77.6\%$ vs $93.2\%$ for Monolingual and Translate-Test respectively (Figures 10(b) and 10(d) in Appendix).

Note that while Translate-Test substantially improves performance on low-resource languages, a large gap to English performance remains even after translation. For example, using translate-test with GPT-3.5-Turbo for Urdu in XNLI results in $54\%$ accuracy compared to $49.1\%$ for monolingual. However, this is still far below the $76.2\%$ accuracy that the same model achieves in English.
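To make the arithmetic behind these comparisons concrete, the sketch below computes relative improvement and the remaining gap to English from the accuracies quoted above. The formula `100 * (new - old) / old` is an assumption about how relative improvement is defined; the function names are illustrative.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


def absolute_gap(reference: float, other: float) -> float:
    """Absolute gap in accuracy points between a reference and another score."""
    return reference - other


# Accuracies quoted in the text (in percent).
urdu_mono, urdu_tt, english = 49.1, 54.0, 76.2   # GPT-3.5-Turbo, XNLI
burmese_mono, burmese_tt = 77.6, 93.2            # GPT-4, XStoryCloze

print(f"Urdu XNLI, Translate-Test vs Monolingual: "
      f"{relative_improvement(urdu_mono, urdu_tt):.2f}% relative")
print(f"Urdu XNLI, remaining gap to English: "
      f"{absolute_gap(english, urdu_tt):.1f} points")
print(f"Burmese XStoryCloze, Translate-Test vs Monolingual: "
      f"{relative_improvement(burmese_mono, burmese_tt):.2f}% relative")
```

Under this definition the Urdu gain is roughly 10% relative, while about 22 accuracy points still separate the Translate-Test result from English; the Burmese gain for GPT-4 is roughly 20% relative.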

Zero-Shot Cross-Lingual prompting for DV003 often performs on par with Monolingual but for GPT-3.5-Turbo, there is a drop in performance, especially so for tasks like XCOPA which have some extremely low resource languages: Quechua and Haitian Creole. For these languages, we observed that when provided few-shot examples in English, GPT-3.5-Turbo would often resort to predicting outputs like "I’m sorry, but the premise is not in a language that I understand.". However, by providing examples in the language, we are able to ground the model to these languages and we almost never observe such predictions in that case.

3.2 Comparing different models

The aggregated results comparing different models and prompting strategies are provided in Table 1 and Table 7 (for Indic Datasets). Excluding the commonsense reasoning tasks XCOPA and XStoryCloze, the OpenAI models generally lag behind the fine-tuned baseline TULRv6 for most tasks, often by a significant margin, and are often only slightly better than some of the smaller fine-tuned multilingual models, i.e., mBERT and mT5-base. Between OpenAI models and BLOOMZ, the former tend to outperform the latter (despite the latter having a larger proportion of multilingual pre-training data), except for datasets like PAWS-X, XQUAD, and TyDiQA-GoldP, where BLOOMZ performs better. However, it must be noted that all three of these datasets were present in the multi-task fine-tuning stage for BLOOMZ, especially XQUAD and TyDiQA-GoldP, for which the validation data that we use for evaluation is also likely to be included in the fine-tuning data (note that this can be a possibility for OpenAI models as well; we discuss this in more detail in §4.2).

Between the OpenAI models, DV003 and GPT-3.5-Turbo generally perform on par, with the Translate-Test performance of DV003 being generally better than that of GPT-3.5-Turbo, and the other way around for Monolingual performance. However, we do observe a notable exception to this for the QA tasks, where GPT-3.5-Turbo performs substantially better than DV003, especially so for IndicQA. We attribute this to the fact that, in order to fit the prompt in the 4096 context size for DV003, we had to resort to a retrieve-then-prompt strategy, and imperfect retrieval for low-resource languages leads to worse performance. Please check §A.5 of the Appendix for more details on this. For GPT-4, on the other hand, we consistently observe substantial improvements, with it being Pareto Optimal Choudhury and Deshpande (2021) compared to the two GPT-3.5 models for all datasets with the exception of XL-Sum, where for some languages GPT-3.5-Turbo performs better. For the detailed results spanning all models, tasks, and languages, please refer to Appendix §A.8.

3.3 Factors Explaining Performance Trends

In this section, we try to understand what factors influence our observed trends in multilingual LLM capabilities. We begin by investigating the Fertility of the tokenizers used by different models, which is defined as the average number of sub-words produced per tokenized word (higher means worse quality), as that has been shown to critically impact the downstream task performance of pre-trained multilingual models Rust et al. (2021). In Figure 4, we plot the tokenizer fertility of different models. We observe that the tokenizers for the OpenAI models are substantially worse for low-resource, non-Latin script languages: the fertility for languages like Malayalam and Tamil is so high ($\sim 10$) that the tokenizer essentially operates as a byte-level tokenizer for these languages. Note that this means that for low-resource languages, a substantially larger number of tokens is needed to encode the inputs as well as for generation, which results in significant additional API costs. Ahia et al. (2023) discuss how this phenomenon leads to large socio-economic disparities for speakers of underrepresented languages. We study if these discrepancies in the tokenizer’s quality across languages have any effect on the performance. As can be seen in Figure 5, for six tasks we observe statistically significant (negative) correlations between the tokenizer’s fertility and dataset-specific performance, i.e., the models obtain worse performance on languages for which the tokenizer is of poor quality, and vice-versa.
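Fertility as defined above can be sketched in a few lines. The sketch assumes that words are obtained by whitespace splitting, which does not hold for languages written without spaces, and it uses a toy UTF-8 byte tokenizer as a stand-in for a real sub-word tokenizer; `fertility` and `utf8_byte_tokenizer` are illustrative names, not part of any model's tooling.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str],
              tokenize_word: Callable[[str], List]) -> float:
    """Average number of sub-word tokens produced per whitespace-separated word."""
    n_words = 0
    n_subwords = 0
    for sentence in sentences:
        for word in sentence.split():
            n_words += 1
            n_subwords += len(tokenize_word(word))
    if n_words == 0:
        raise ValueError("no words found")
    return n_subwords / n_words


def utf8_byte_tokenizer(word: str) -> List[bytes]:
    """Toy tokenizer that emits one token per UTF-8 byte (a worst case)."""
    return [bytes([b]) for b in word.encode("utf-8")]


# The Tamil word for "Tamil": five code points, three UTF-8 bytes each.
tamil_word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

print(fertility(["hello world"], utf8_byte_tokenizer))
print(fertility([tamil_word], utf8_byte_tokenizer))
```

The toy example illustrates why byte-level behaviour is costly for non-Latin scripts: each character of the Tamil word expands to three tokens, whereas each ASCII character expands to one.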

We also study the effect that the amount of data available for each language during pre-training Wu and Dredze (2020); Lauscher et al. (2020) has on the multilingual performance of these models. We measure the correlations between the language-wise number of tokens present in the pre-training data and the language-wise performance on each dataset. While the exact language-wise pre-training data distribution for GPT-3.5 and GPT-4 models is not available, we use GPT-3’s language-wise pre-training distribution as a proxy. We observe statistically significant positive correlations between the pre-training data size and performance for four tasks (PAWS-X, XNLI, XCOPA, and XQuAD). Note that the amount of pre-training data and tokenizer fertility are highly likely to be correlated with each other. However, we do see that using pre-training data we are able to explain some trends that are not explained by tokenizer fertility alone. For example, even though the OpenAI models have similar tokenizer fertilities for both French and Japanese, these models perform much better in French than they do in Japanese (72.1% accuracy vs 67% accuracy for GPT-3.5-Turbo) on PAWS-X. However, when we take into consideration the amount of pre-training data for these languages (roughly 3.5B French tokens in the pre-training data versus 214M for Japanese), we can partially explain this discrepancy.
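The correlation analyses in this section can be sketched as a correlation coefficient computed over per-language pairs, for example (fertility, accuracy) or (pre-training token count, accuracy). The sketch assumes Pearson correlation, which the text does not specify, and the `pearson` function is an illustrative helper rather than the code used for the reported results.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Each position in `xs` and `ys` would correspond to one language of a dataset. Because token counts span several orders of magnitude (3.5B versus 214M in the example above), taking their logarithm before correlating is a common choice, although whether that was done here is not stated. A significance test is also needed to support a claim of statistical significance; `scipy.stats.pearsonr` returns a p-value alongside the coefficient.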

However, we must note that these two factors correlate well with only a subset of the tasks and what we are measuring is the correlation which might not imply causation. Investigating different factors that together more holistically explain multilingual capabilities is an important direction that we leave for future work. Please check Appendix §A.6 for detailed results from this section.

Challenges in Multilingual Evaluation

In this section, we examine some of the challenges and consequent limitations of a large-scale multilingual evaluation like ours.

There are various moving parts when evaluating LLMs using prompting-based approaches, including the choice of prompt templates, instructions, and few-shot examples Liu et al. (2022); Lu et al. (2022); Zhao et al. (2021), different prompting strategies Wei et al. (2023); Nye et al. (2021); Ye and Durrett (2022a), using external tools Schick et al. (2023), the language of prompts Shi et al. (2022); Lin et al. (2022a), as well as different decoding-specific hyper-parameters Shih et al. (2023), which can have varying degrees of impact on the performance, sometimes in unexpected ways. Holistically exploring these choices for all the datasets and languages can quickly get out of hand, especially given the excessive computational cost of running these models. In order to understand the sensitivity of our observations to the choices we make in §3, we re-evaluate our setups on a subset of datasets and languages for a varying set of parameters. Our findings are summarized in Figure 6, where we see that having a large few-shot size generally helps improve performance; however, the performance is often stable beyond $k=8$. Further, language-specific prompt-tuning can help improve the performance, as we see for Haitian Creole in XCOPA, but for Tamil we actually observe the accuracy going down, which might be attributed to the small size of the validation set (100 in the case of XCOPA). Finally, on the XStoryCloze dataset (and also for XCOPA), we see that using explanations to prompt the models has negligible impact on the performance. Overall, these experiments indicate that the existing prompting approaches might not be sufficient to address the performance gap that exists for non-English languages (especially mid-to-low resource languages), and there is a pressing need to propose new methods as well as to improve the representation of different languages in these models’ pre-training (and instruction-tuning) data.

4.2 Test data contamination

Given the massive amount of online data that LLMs are trained with, it is critical to factor in the possibility of contamination of test datasets Sainz et al. (2023). Accordingly, we attempt to verify if the performances we observed are in fact representative of the capabilities of these models or merely a result of memorization. Given the lack of transparency in the training distribution of recent models like GPT-4, we perform some preliminary investigations into this phenomenon. Specifically, we consider three factors: i) the LLM’s knowledge of the dataset, ii) availability of test datasets on the internet, and iii) dataset release date.

To measure the LLM’s (we do this for GPT-4) memory of the dataset, we prompt it to fill the dataset cards for each of MEGA’s datasets (denoted as Card Fill). This involves filling templatic information like the task’s supported languages, input-output structure and description. If the model fills a dataset card correctly (Full), we note this as suspicion of contamination. If it fills the card partially correct (Partial) i.e. detecting either the correct task structure or correct set of languages, we mark it as partial evidence, and if it succeeds in neither, we mark it as no suspicion (None). For test dataset availability, we check if the test dataset can be accessed online directly without downloading either as part of the official release from the authors or via other sources such as Hugging Face dataset viewer (Data Acc. w/o Down.). For release date, we check if the dataset was made public after the cut-off date of September 2021.
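The grading rules for the three contamination signals can be written down as a small sketch. Two simplifications are assumptions made here: a card counts as Full only when both the task structure and the language set are judged correct (the description field is ignored), and release dates are compared at month granularity against the September 2021 cut-off. All names are illustrative.

```python
from enum import Enum
from typing import Tuple


class CardFill(Enum):
    FULL = "Full"        # card filled correctly: suspicion of contamination
    PARTIAL = "Partial"  # task structure or language set correct: partial evidence
    NONE = "None"        # neither correct: no suspicion


def grade_card_fill(task_structure_correct: bool,
                    languages_correct: bool) -> CardFill:
    """Map the manual judgement of a model-filled dataset card to a label."""
    if task_structure_correct and languages_correct:
        return CardFill.FULL
    if task_structure_correct or languages_correct:
        return CardFill.PARTIAL
    return CardFill.NONE


def released_after_cutoff(year: int, month: int,
                          cutoff: Tuple[int, int] = (2021, 9)) -> bool:
    """True if the dataset was released strictly after the cut-off month."""
    return (year, month) > cutoff
```

The judgement of whether a filled card is correct is still made by a human reader; the sketch only fixes how those judgements are combined into the labels reported in the results table.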

The overall results from this analysis are provided in Table 2. We see that for a majority of datasets, GPT-4 can fill in the dataset card correctly; on the more recent datasets like XLSum and XStoryCloze it is only partially successful, while on Jigsaw and the code-mixing datasets it fails to correctly fill the cards. Note that except for XStoryCloze, Jigsaw, and the code-mixing datasets, evaluation sets for all other datasets are directly accessible online. Collectively, this suggests that for tasks like XStoryCloze and IndicQA there is only a weak suspicion of contamination, while all other tasks are highly likely to be contaminated (except Jigsaw and the code-mixed datasets).

Our analysis implies a notable chance of the test data appearing in the training datasets of these LLMs. The contamination of test datasets is a serious problem for works centered around LLM evaluation (including ours), as it might lead to an overestimation of the capabilities of these models. However, we would like to highlight that despite the possibility of contamination, LLMs still vastly underperform on non-English (especially low-resource) languages. These observations about data contamination indicate that the disparity in performance between English and non-English languages might be even greater than what we observe in our work.

Related Work

Evaluation of LLMs. A growing interest in the evaluation of LLMs has given rise to several efforts towards the holistic evaluation of their capabilities. While work like BIG-bench Srivastava et al. (2023) covers a diverse range of tasks, the non-English tasks are mostly translation-oriented, which limits the more general task-based inferences that can be drawn from such an evaluation. Similarly, Liang et al. (2022) propose a taxonomy of scenarios and metrics in Holistic Evaluation of Language Models (HELM) to define the space of LLM evaluation, and evaluate 30 language models on 42 scenarios and 7 metrics. However, all the scenarios are focused on datasets in standard English or its dialects.

Benchmarks for multilingual evaluation, such as XTREME Hu et al. (2020), XTREME-R Ruder et al. (2021), and XGLUE Liang et al. (2020), have been proposed to measure cross-lingual transfer in pre-trained language models. Following their popularity, benchmarks covering specific language families have also been developed, such as IndicXTREME Doddapaneni et al. (2022) for Indian languages, Adelani et al. (2022) for African languages, and Wilie et al. (2020) for Indonesian languages. The evaluations on these benchmarks have mainly focused on pre-train-then-fine-tune setups. Particularly for prompting-style evaluation, Bang et al. (2023) evaluate the multilingual capabilities of ChatGPT and show that it fails to generalize to low-resource languages with non-Latin scripts. However, their multilingual evaluation is performed only on a few tasks, and a subset of 50-100 examples is used for testing the model. Hendy et al. (2023) evaluate the translation abilities of GPT-3.5 models and find that, while these models perform well in translating high-resource languages, their capabilities for low-resource languages are limited. Concurrent work BUFFET Asai et al. (2023) and Lai et al. (2023) also perform multilingual benchmarking of large language models; however, they evaluate the performance of ChatGPT and BLOOMZ in their work, while our evaluation also spans GPT-4.

While most work on prompting or in-context learning in LLMs focuses on English data, recently there has been some interest in prompting them with non-English data. Zhao and Schütze (2021), for instance, use discrete and soft prompting techniques to evaluate XLM-RoBERTa and show that prompting can be more effective compared to fine-tuning when the amount of labeled data is limited. Lin et al. (2022a) show that English prompts perform better than prompts written in the target language (both hand-written and translated). Finally, Shi et al. (2022) show that chain-of-thought (CoT) prompting leads to striking multilingual reasoning capabilities in LLMs, even in under-represented languages, especially when prompted with English CoT.

Conclusion

In this work, we conduct an evaluation across different prompting strategies, models, tasks, and languages to investigate the multilingual capabilities of LLMs. We also investigate underlying properties like tokenizer quality and size of pretraining data to explain the trends in performance that we observe. Our investigation shows the consistent performance gap between high-resource, Latin script, and under-resourced languages in addition to highlighting the efficacy, yet limited sufficiency of methods like translate-test prompting. Through our evaluation, we present evidence of the need to prioritize automatic benchmarking and human evaluation across as many languages as possible. We hope that this work spurs research in meeting this goal.

Limitations

Although we compare the evaluation results of GPT-3.5 and GPT-4 with BLOOMZ and SOTA models, we could not evaluate other closed models such as PaLM, which also contains training data in many languages. A limitation of our study is that we do not evaluate on all the multilingual datasets that are available, and we plan to scale up our evaluation in future versions of the study with the help of the research community. Even if we do evaluate all available multilingual datasets, they do not cover many typologically diverse and under-resourced languages, which is a fundamental limitation of trying to scale up multilingual evaluation today. For example, there is very little representation from African languages, Indigenous languages of the Americas etc. in any of the evaluation benchmarks available today. Finally, we restrict ourselves to the performance metrics and to some extent gender bias dimension of evaluation for this study - however, we plan to include evaluation of calibration, toxicity, bias, robustness, etc. in future work.

Acknowledgments

The authors would like to thank Barun Patra and Vishrav Chaudhary for their help with TULR evaluation results. We also thank the anonymous reviewers for their helpful feedback, which helped us improve the quality of our paper.

References

Appendix A

In our experiments, we consider 16 tasks spanning the following task types: classification, sequence to sequence labeling and generation. Below we review the experimental setups and datasets used for benchmarking for these task types. A list of all the datasets with the languages covered by them can be found in Table 3.

Classification tasks involve classifying a single sentence or a group of sentences into a finite number of discrete labels. For each dataset, we measure the performance of different models in terms of classification accuracy. For prompt-based models in particular, since we add no constraint on the output space of the LLM, we compute the exact match between the generated output and a verbalized label to determine if the example was classified correctly. We run experiments for all the prompting strategies that we discussed in the previous sections for each dataset. The details of each dataset that we use for benchmarking are given below:

1. Natural Language Inference: XNLI Conneau et al. (2018) is a dataset for cross-lingual Natural Language Inference, which consists of professional translations of the MNLI Wang et al. (2018) corpus into 14 languages. We also consider IndicXNLI Aggarwal et al. (2022) that translates the XNLI dataset into 11 Indic languages by using Machine Translation, followed by validation by native speakers.

2. Paraphrase Identification: PAWS-X Yang et al. (2019b) is a paraphrase identification dataset professionally translated from the PAWS Zhang et al. (2019) dataset into six typologically diverse languages.

3. Commonsense Reasoning: XCOPA Ponti et al. (2020) is a commonsense reasoning dataset, which is a translation of the COPA Roemmele et al. (2011) dataset into 11 typologically diverse languages, including very low-resource languages such as Eastern Apurímac Quechua and Haitian Creole.

XStoryCloze Lin et al. (2022b) is created by translating the English StoryCloze Mostafazadeh et al. (2017) dataset using professional translators into 10 typologically diverse languages.

A.1.2 Question Answering

We focus on Span Prediction type of Question Answering (QA) tasks in our experiments, where given a context and a question the task is to predict the answer within the context. One major challenge that we come across for multilingual evaluation of QA tasks is that for many languages we often cannot fit the context and question pairs for the few-shot and test examples in the maximum context size of 4096 for the DV003 model. This is mainly attributed to the poor performance of GPT’s tokenizer on many non-Latin script languages, which results in over-tokenizing the words in these languages.

To overcome this issue we follow two steps. First, for the few-shot examples we only provide the line within the paragraph containing the answer as the context. Second, for the test example, we index the chunks of the context using the embeddings from the text-embedding-ada-002 model. Given the question, the closest chunk in the full context is retrieved and used in the prompt for the test example. We use a maximum chunk size of 100 in our experiments and use the implementation for retrieval provided in the LangChain library (https://github.com/hwchase17/langchain). By doing this, we minimize the space taken by the context tokens in our prompt.
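The second step (retrieving the chunk closest to the question) can be illustrated with the sketch below. It is not the LangChain-based implementation used in our experiments: `toy_embed` is a bag-of-words stand-in for the text-embedding-ada-002 embeddings, the chunk size is counted in whitespace-separated words (the unit is an assumption), and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_words(text: str, max_chunk_size: int = 100) -> List[str]:
    """Split text into consecutive chunks of at most max_chunk_size words (unit assumed)."""
    words = text.split()
    return [" ".join(words[i:i + max_chunk_size]) for i in range(0, len(words), max_chunk_size)]


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_closest_chunk(context: str, question: str, max_chunk_size: int = 100) -> str:
    """Return the chunk of the context most similar to the question ('' if the context is empty)."""
    chunks = chunk_words(context, max_chunk_size)
    if not chunks:
        return ""
    query = toy_embed(question)
    return max(chunks, key=lambda chunk: cosine(toy_embed(chunk), query))
```

Only the retrieved chunk is then placed in the prompt in place of the full context, which is what keeps the test example within the context window.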

Note that for the newer GPT models, i.e., GPT-3.5-Turbo and GPT-4, which support longer context lengths, we do not use this retrieval strategy for QA tasks and instead prompt the models to obtain the answers directly. For each task, we calculate the Exact Match and F1 score as defined in Rajpurkar et al. (2016a). For our experiments we consider the following four tasks:

1. TyDiQA Clark et al. (2020) is a QA dataset covering 11 typologically diverse languages. The task consists of two sub-tasks - passage selection and minimum answer span (Gold-P). For our experiments, we consider the Gold-P task and evaluate Monolingual and Zero-Shot Cross-Lingual prompting strategies. Since the labels do not directly transfer one-to-one across translation for QA tasks as they do for classification and require the use of alignment algorithms, we skip translate-test prompting for this task.

2. MLQA Lewis et al. (2020) is an extractive QA dataset translated into 7 languages by professional translators. The task has two variants, the first where the question, context, and answer are all in the same language; and the second, where the question is in a different language than the context and answer. We consider the former variant of the task in our experiments. For MLQA, translate-test splits are also available, where each language’s test data has been translated into English with answers aligned using the attention scores. There is no training data available for MLQA, and we use SQuAD’s Rajpurkar et al. (2016a) training data for selecting few-shot examples in English and validation data for MLQA in other languages to get their few-shot examples. This way, we are able to evaluate for all three prompting setups.

3. XQuAD Artetxe et al. (2020) consists of professional translations of a subset of the SQuAD dataset Rajpurkar et al. (2016b) into 10 languages. XQuAD only has validation datasets available publicly, hence we evaluate the models on them. As with MLQA, we use English SQuAD data for few-shot examples, and since we cannot use validation data in other languages for few-shot, we only evaluate the zero-shot cross-lingual setup for this task.

4. IndicQA Doddapaneni et al. (2022) is a manually curated cloze-style reading comprehension dataset that can be used for evaluating question-answering models in 11 Indic languages. The context paragraphs are chosen from Wikipedia articles whose topics are closely related to Indic culture, history, etc. The publicly available test set has about 2000 sentences, on which we carry out our evaluation.
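The Exact Match and F1 scores used for all four QA tasks can be sketched as below. The sketch follows the general shape of SQuAD-style evaluation (token-overlap F1 after answer normalization); the specific normalization shown here (lowercasing, removing ASCII punctuation and English articles) is a simplifying assumption and is not necessarily what was applied to every language.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace (assumed)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, F1 is 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

When an example has several gold answers, the usual convention is to take the maximum score over them; that aggregation is omitted here for brevity.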

A.2 Sequence Labeling

In the sequence labeling task, a sequence of tokens (such as words) to be labeled is provided to the system.

UDPOS Zeman et al. (2020) is a dataset for Part of Speech Tagging taken from the Universal Dependencies 2.5 from the XTREME Hu et al. (2020) benchmark. We benchmark a subset of the languages available in UDPOS.

A.2.2 Named Entity Recognition

PANX Pan et al. (2017) or WikiANN is a Named Entity Recognition dataset consisting of Wikipedia sentences tagged with Person, Organization and Location.

For both tasks we use the linguistic structure prompting approach of Blevins et al. (2022) to define the prompts. The exact prompts used can be found in §A.4. Given the nature of both tasks, which would involve token alignment across the translation, we do not evaluate the translate-test prompting strategies for these setups. Also, since both tasks involve $>30$ languages, to make the best use of the compute resources we only evaluate GPT-3.5-Turbo in a monolingual setup for these two tasks. Finally, we evaluate the first 1000 examples for each language for these datasets given the large number of languages. We have recomputed all baselines with this specification as well.

A.3 Generation

The XLSum Hasan et al. (2021a) dataset contains article-summary pairs across 44 typologically diverse languages, ranging from high to very low-resource.

For a similar reason as the tagging datasets, we only evaluate on the first 1000 examples of the test sets in different languages and recompute the baselines on the same test set using the weights of the XLSUM pretrained model, open-sourced by the authors Hasan et al. (2021b).

A.3.2 Code-switching datasets

All the datasets we consider so far are monolingual. However, a majority of the world’s population speaks more than one language, leading to language contact phenomena such as code-switching Doğruöz et al. (2021); Sitaram et al. (2019). We include two code-switching datasets in MEGA to benchmark the performance of generative models.

GLUECoS-NLI Khanuja et al. (2020a) is a code-mixed NLI dataset in Hindi-English, consisting of Bollywood (Hindi) movie conversations as premises, with manually created hypotheses.

The EN-ES-CS Sentiment Analysis dataset Vilares et al. (2016), part of the GLUECoS benchmark Khanuja et al. (2020b) is a code-mixed dataset consisting of English-Spanish Tweets annotated with SentiStrength Thelwall (2017) scores.

A.3.3 RAI datasets

We include two datasets that measure the Responsible AI (RAI) dimensions of fairness and toxicity - Jigsaw (https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data) for toxic comment classification and WinoMT for gender bias.

The Jigsaw dataset contains online comments sourced from Wikipedia. The training data, which is in English, contains labels pertaining to the toxicity of the comment and any relevant identity mentions contained in the comment. We use the test dataset, which contains these comments for 6 languages as illustrated in Table 3 for evaluation. The test dataset contains a binary label indicating whether or not the comment is toxic. Our objective is to assess the performance of these models across multiple languages and observe the disparity in this performance that could arise due to a number of factors, a prominent one being the source data that these models are trained on. Using English prompts from PromptSource for the original monolingual Jigsaw task, we task the model with classifying a comment as toxic or non-toxic. We perform crosslingual few-shot prompting and translate-test experiments for the test sets of all 6 languages, and report the results excluding content violations in Table 21.

The WinoMT dataset Stanovsky et al. (2019) is created by concatenating the WinoGender Rudinger et al. (2018) and WinoBias Zhao et al. (2018) datasets. The WinoMT dataset consists of 3888 English sentences with an equal distribution of Male and Female genders. It is also equally balanced between stereotypical and non-stereotypical gender role assignments. We follow the method reported by Stanovsky et al. (2019) in their paper. We perform zero-shot monolingual prompting of all sentences in the dataset to translate them into 8 target languages. Further, using fast_align, we map the English entity to its translation. Finally, we extract the target-side entity’s gender using off-the-shelf tools for each target language. The extracted translated gender can then be compared against the gold annotations for English.
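The alignment and comparison steps above can be sketched as follows. This is only an illustration under stated assumptions: it takes word alignments in the `i-j` (source-target index) format emitted by fast_align, picks the first aligned target token for the English entity, and leaves the language-specific gender extraction to an external tool. All function names are hypothetical.

```python
from typing import Dict, List, Optional


def parse_alignment(line: str) -> Dict[int, List[int]]:
    """Parse an 'i-j' alignment line (e.g. '0-0 1-1 2-3') into source -> target indices."""
    mapping: Dict[int, List[int]] = {}
    for pair in line.split():
        src, tgt = pair.split("-")
        mapping.setdefault(int(src), []).append(int(tgt))
    return mapping


def aligned_target_token(
    alignment_line: str, target_tokens: List[str], source_index: int
) -> Optional[str]:
    """Return the first target token aligned to the English entity, or None if unaligned."""
    targets = parse_alignment(alignment_line).get(source_index, [])
    return target_tokens[targets[0]] if targets else None


def gender_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of sentences whose extracted target-side gender matches the gold gender."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

The predicted genders passed to `gender_accuracy` would come from running a morphological analyzer (or similar off-the-shelf tool) on the token returned by `aligned_target_token`; that step is deliberately not shown.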

A.4 Prompts

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an NLP assistant whose purpose is to solve Natural Language Inference (NLI) problems. NLI is the task of determining the inference relation between two (short, ordered) texts: entailment, contradiction, or neutral. Answer as concisely as possible in the same format as the examples below:
Template $f_{temp}$: {premise} Question: {hypothesis} True, False, or Neither?
Verbalizer $f_{verb}$: Entailment: True, Contradiction: False, Neutral: Neither

Models: DV003
Template $f_{temp}$: {premise} Based on previous passage is it true that {hypothesis} ? Yes, No, or Maybe?
Verbalizer $f_{verb}$: Entailment: Yes, Contradiction: No, Neutral: Maybe
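To make the roles of the task instruction $\mathcal{I}$, the template $f_{temp}$ and the verbalizer $f_{verb}$ concrete, the sketch below assembles a few-shot prompt from the GPT-3.5-Turbo/GPT-4 NLI template above. How the pieces are joined (here, blank lines in a single string rather than separate chat messages) is an assumption made for illustration, as are the function names.

```python
from typing import List, Optional, Tuple

# Template and verbalizer copied from the GPT-3.5-Turbo / GPT-4 NLI prompt above.
TEMPLATE = "{premise} Question: {hypothesis} True, False, or Neither?"
VERBALIZER = {"entailment": "True", "contradiction": "False", "neutral": "Neither"}


def render_example(premise: str, hypothesis: str, label: Optional[str] = None) -> str:
    """Fill the template; append the verbalized label for in-context (labeled) examples."""
    text = TEMPLATE.format(premise=premise, hypothesis=hypothesis)
    return text if label is None else f"{text} {VERBALIZER[label]}"


def build_prompt(
    instruction: str,
    few_shot: List[Tuple[str, str, str]],
    test_example: Tuple[str, str],
) -> str:
    """Concatenate instruction, labeled few-shot examples, and the unlabeled test example."""
    parts = [instruction]
    parts += [render_example(p, h, label) for p, h, label in few_shot]
    parts.append(render_example(*test_example))
    return "\n\n".join(parts)
```

The same pattern applies to the other tasks in this section by swapping in their templates and verbalizers.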

A.4.2 PAWS-X

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an NLP assistant whose purpose is to perform Paraphrase Identification. The goal of Paraphrase Identification is to determine whether a pair of sentences have the same meaning. Answer as concisely as possible in the same format as the examples below:
Template $f_{temp}$: {sentence1} Question: {sentence2} True or False?

Models: DV003
Template $f_{temp}$: Sentence 1: {sentence1} Sentence 2: {sentence2} Question: Does Sentence 1 paraphrase Sentence 2 ? Yes or No?
Verbalizer $f_{verb}$: Positive: Yes, Negative: No

A.4.3 XCOPA

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an AI assistant whose purpose is to perform open-domain commonsense causal reasoning. You will be provided a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. Answer as concisely as possible in the same format as the examples below:
Template $f_{temp}$: { premise } {% if question == "cause" %} This happened because… {% else %} As a consequence… {% endif %} Help me pick the more plausible option: - {choice1} - {choice2}

Models: DV003
Template $f_{temp}$: { premise } {% if question == "cause" %} This happened because… {% else %} As a consequence… {% endif %} Help me pick the more plausible option: - choice1: {choice1}, choice2: {choice2}
Verbalizer $f_{verb}$: choice1: {choice1} choice2: {choice2}

A.4.4 XQUAD, TyDiQA, MLQA

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an NLP assistant whose purpose is to solve reading comprehension problems. You will be provided questions on a set of passages and you will need to provide the answer as it appears in the passage. The answer should be in the same language as the question and the passage.
Template $f_{temp}$: {context} Q: {question} Referring to the passage above, the correct answer to the given question is: {answer}

Models: DV003
Template $f_{temp}$: {context} Q: {question} Referring to the passage above, the correct answer to the given question is: {answer}

A.4.5 IndicQA

Models: DV003
Template $f_{temp}$: {context} Q: {question} Referring to the passage above, the correct answer to the given question is: {answer}

A.4.6 XStoryCloze

Models: DV003, GPT-3.5-Turbo, GPT-4
Template $f_{temp}$: {input_sentence_1} {input_sentence_2} {input_sentence_3} {input_sentence_4} What is a possible continuation for the story given the following options ? Option1: {sentence_quiz1} Option2: {sentence_quiz2}
Verbalizer $f_{verb}$: {sentence_quiz1}: Option1, {sentence_quiz2}: Option2

A.4.7 PANX

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an NLP assistant whose purpose is to perform Named Entity Recognition (NER). NER involves identifying and classifying named entities in a text into predefined categories such as person names, organizations, locations, and others. You will need to use the tags defined below: O means the word doesn’t correspond to any entity. B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity. B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity. B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity. Do not try to answer the question! Just tag each token in the sentence.
Template $f_{temp}$: {token_1 token_2 ... token_n}
Verbalizer $f_{verb}$: {tag_1} {tag_2} ... {tag_n}: {token_1}_{tag_1} {token_2}_{tag_2} ... {token_n}_{tag_n}

A.4.8 UDPOS

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an NLP assistant whose purpose is to perform Part of Speech (PoS) Tagging. PoS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. You will need to use the tags defined below:
Template $f_{temp}$: {token_1 token_2 ... token_n}
Verbalizer $f_{verb}$: {tag_1} {tag_2} ... {tag_n}: {token_1}_{tag_1} {token_2}_{tag_2} ... {token_n}_{tag_n}
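For the two tagging tasks (PANX and UDPOS), the verbalizer asks the model to emit each token joined to its tag by an underscore. A minimal sketch of how such an output could be mapped back to a tag sequence is given below. The fallback behaviour (assigning a default tag when a position is missing or the token does not match) and the token-level accuracy shown here are assumptions for illustration, not a description of the scoring used in our experiments.

```python
from typing import List


def parse_tagged_output(output: str, tokens: List[str], default_tag: str = "O") -> List[str]:
    """Recover one tag per input token from a 'token_tag token_tag ...' model output."""
    pieces = output.split()
    tags: List[str] = []
    for i, token in enumerate(tokens):
        tag = default_tag
        if i < len(pieces) and "_" in pieces[i]:
            # Split on the last underscore so tokens containing '_' are handled.
            word, _, candidate = pieces[i].rpartition("_")
            if word == token:
                tag = candidate
        tags.append(tag)
    return tags


def token_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Aligning by position keeps the predicted sequence the same length as the input even when the model truncates or rewrites part of the sentence.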

A.4.9 GLUECoS Sentiment Analysis

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an NLP assistant whose purpose is to solve Sentiment Analysis problems. Sentiment Analysis is the task of determining whether the sentiment, opinion or emotion expressed in a textual data is: positive, negative, or neutral. Answer as concisely as possible in the same format as the examples below:
Template $f_{temp}$: Does the following sentence have a positive, negative or neutral sentiment? {text}

Models: DV003
Template $f_{temp}$: Does the following sentence have a positive, negative or neutral sentiment? {text}

A.4.10 XLSum

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an NLP assistant whose purpose is to summarize any given article. You should summarize all important information concisely in the same language in which you have been provided the document. Following the examples provided below:
Template $f_{temp}$: {document} === Write a summary of the text above :

A.4.11 Jigsaw

Models: GPT-3.5-Turbo, GPT-4
Task Instruction $\mathcal{I}$: You are an NLP assistant whose purpose is to solve the Jigsaw Toxicity Prompt problem. Toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion. The objective is to identify toxic comments. Answer as concisely as possible in the same format as the examples below:
Template $f_{temp}$: {text} Should this online comment be removed for its toxicity? Yes or No?

Models: DV003
Template $f_{temp}$: {text} Someone posted this comment online. Do you find it toxic? Yes or No?

A.4.12 WinoMT

Models: GPT-3.5-Turbo, GPT-4
Template $f_{temp}$: Translate the following English text to {target_language}: {sentence}

A.5 Handling Long Contexts

As discussed in §2.2, the models we study have limited context lengths, and for QA tasks in particular, fitting the entire prompt containing the few-shot examples is often not feasible for low-resource languages where the tokenizers of these models are found to over-tokenize the text (nearly resulting in byte-level tokens). To overcome this issue we follow two steps. First, for the few-shot examples we only provide the line within the paragraph containing the answer as the context. Second, for the test example, we index the chunks of the context using the embeddings from the text-embedding-ada-002 model. Given the question, the closest chunk in the full context is retrieved and used in the prompt for the test example. We use a maximum chunk size of 100 in our experiments and use the implementation for retrieval provided in the LangChain library (https://github.com/hwchase17/langchain). By doing this, we minimize the space taken by the context tokens in our prompt. Note that the newer GPT models, i.e., GPT-3.5-Turbo and GPT-4, support longer context lengths; hence we only use this retrieval strategy for DV003 on QA tasks.

We attribute the significantly worse performance of DV003 on IndicQA to imperfect retrieval, while for GPT-3.5-Turbo we do not rely on retrieval due to the larger context size. We provide the retrieval accuracies for DV003 (i.e., whether the retrieved chunk contains the answer) in Appendix Table 4, where we clearly see that for low-resource languages like Telugu, the accuracies can be as low as $5\%$. While beyond the scope of this work, alternate retrieval strategies, such as using better embeddings from multilingual models for retrieval, can be explored to close this gap Nambi et al. (2023).
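Retrieval accuracy as described above (whether the retrieved chunk contains the answer) can be computed with a sketch like the following. Plain substring matching against any of the gold answers is an assumption made here; the function name is hypothetical.

```python
from typing import List


def retrieval_accuracy(retrieved_chunks: List[str], gold_answers: List[List[str]]) -> float:
    """Fraction of examples whose retrieved chunk contains at least one gold answer string."""
    if not retrieved_chunks:
        return 0.0
    hits = sum(
        any(answer in chunk for answer in answers)
        for chunk, answers in zip(retrieved_chunks, gold_answers)
    )
    return hits / len(retrieved_chunks)
```

A low value of this quantity bounds the achievable QA score from above, since the model cannot extract an answer span that is absent from its prompt.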

A.6 Factors Explaining Multilingual Capabilities of LLMs

We provide correlation plots in Figures 7 (between performance and fertility) and 8 (between performance and pre-training size) for both GPT-3.5-Turbo and GPT-4. The exact values of the correlations for all tasks and the two models are provided in Table 5.
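As a rough illustration of the quantities involved, the sketch below computes tokenizer fertility and a correlation coefficient between per-language fertility and performance. It assumes fertility is measured as the average number of tokens per word and uses the Pearson coefficient; both are assumptions made for this example, and the function names are hypothetical.

```python
import math
from typing import List


def fertility(token_counts: List[int], word_counts: List[int]) -> float:
    """Average number of tokenizer tokens per word over a sample of sentences (assumed definition)."""
    total_words = sum(word_counts)
    return sum(token_counts) / total_words if total_words else 0.0


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length lists (0.0 if either has no variance)."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold one fertility (or pre-training size) value per language and `ys` the corresponding task score for a given model.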

A.7 Challenges in Multilingual Evaluation

Effect of number of in-context examples $k$. Our main experiments were conducted with $k=8$ or $k=4$, depending on the task. Here, we evaluate what effect different numbers of in-context examples have on XNLI and XCOPA for three languages in Figures 6(a) and 6(b). We observe that while the performance increases sharply when moving from 0 to 2-4 examples, it is fairly stable after $k\geq 8$, with the exception of Haitian Creole in XCOPA, where it continues to improve.

Effect of language-specific prompt tuning. As discussed in §2.3.1, we use English validation data for prompt selection in each dataset that we use for all languages. Here, we explore whether separately tuning the prompts for each language helps. For XNLI, we run this experiment on Urdu and Swahili, tuning over ten different prompt templates from Prompt-Source, but find that the same prompt that was tuned for English gets picked up for these two languages as well. For XCOPA however, different prompts are chosen when tuned on Haitian Creole and Tamil. This leads to an improvement in the test performance for Haitian Creole (from 72% to 75.6%, see Figure 6(c)). Interestingly for Tamil, we see the test performance actually drops slightly compared to the accuracy obtained with prompt selected on English data, which we conjecture might be due to the fact that the validation sets in XCOPA have only 100 examples that may not be sufficient for selecting optimal prompts.

Effect of Explanations. Ye and Durrett (2022b) showed for text-davinci-002 that prompting the model with explanations before the outputs (Explain-then-Predict) in the in-context examples can substantially improve few-shot performance on English language datasets. Hence, here we evaluate whether explanations help improve the multilingual performance of the GPT-3.5-Turbo model as well. We perform experiments on the XStoryCloze and XCOPA datasets and use the explanations available in Super-NaturalInstructions (SNI) Wang et al. (2022). (At the time of writing this paper, XStoryCloze wasn’t included in SNI, hence we use the few-shot examples and explanations available for the StoryCloze dataset Mostafazadeh et al. (2016), making the prompting setup Zero-Shot Cross-Lingual.) All the explanations that we used were written in English. For XStoryCloze, the results are plotted in Figure 6(d), and we observe that while there is a slight gain upon using explanations for Telugu, for all other languages the performance remains largely unchanged if not slightly worse. Interestingly, upon manual inspection of the model’s predictions, we observe that the model often first translates the problem to English and then proceeds with the explanation, without having been prompted to do so. We have similar observations for the XCOPA dataset as well, where adding explanations doesn’t help improve performance and ends up hurting the performance by a slight margin (Figure 9).

Effect of the language of the prompt templates. While all our experiments were run using prompt templates written in English, we initially evaluated DV003 on Native-Language-Templates as well, which were obtained by translating English templates using Bing-Translator. As can be seen in Table 6, the performance is much worse when using templates in the native language compared to English. This is consistent with the results in Muennighoff et al. (2022) for BLOOMZ and Lin et al. (2022a) for XGLM, which also show better performance when using prompt templates in English.

A.8 Detailed Results

The results across all tasks, languages and models included in our benchmarking exercise are provided in Figures 10 (for Classification tasks), 11 (for QA Tasks), 12 (for XLSum), 14 (for PAN-X), 13 (for UDPOS), 15 (for Jigsaw), and finally 16 (for Wino-MT). The results for the Indic Datasets and the two code-mixed datasets GLUECoS NLI and En-ES-CS are provided in Tables 7 and 8 respectively.