BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer

Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, Hannaneh Hajishirzi

cs.CL

Introduction

Recent advances in NLP primarily focus on the English language Blasi et al. (2022). Due to the lack of sufficient training data in most of the world’s languages Yu et al. (2022), prior work explores direct transfer of pretrained language models to new languages after fine-tuning on resource-rich languages (zero-shot cross-lingual transfer; Hu et al. 2020b). Transferring after training a model on a few examples (few-shot cross-lingual transfer) often boosts performance, especially in languages that are distant from the source language Lauscher et al. (2020); Hedderich et al. (2020).

In English, zero- or few-shot learning via in-context learning is an active area of research Beltagy et al. (2022); Schick and Schütze (2021a); Shin et al. (2020). In this learning paradigm, one prompts a large language model (LLM) with few-shot demonstrations or natural language instructions to adapt to a new task, without any parameter updates. Yet, few-shot transfer across languages is still under-explored Lin et al. (2021) in a wide range of tasks and languages. Moreover, it is unclear how effectively in-context learning performs in comparison to widely-used fine-tuning-based transfer methods under a comparable setup.

This work introduces a new benchmark called BUFFET: Benchmark of Unified Format FEw-shot Transfer Evaluation (Figure 1) to enable rigorous evaluations and advance research on few-shot cross-lingual transfer. Similar to a rich buffet, BUFFET curates a diverse mix of tasks: 15 different tasks—including classification, structured prediction, and natural language generation—across 54 languages. BUFFET has several unique characteristics that are not present in prior multi-task multilingual benchmarks (summarized in Table 1):

BUFFET provides a fixed set of few-shot examples for training and validation, allowing for fair comparisons across LMs and transfer methods.

BUFFET includes datasets annotated in each language or covering under-represented languages, which are often not included in existing multi-task benchmarks.

BUFFET combines diverse tasks into a unified text-to-text format and provides a set of English and machine-translated instructions for each task, removing the burdens of task-specific architecture changes or prompt engineering.

Using this new benchmark, we extensively evaluate the current state-of-the-art multilingual large language models (LLMs), including mT5 Xue et al. (2021), mT0 Muennighoff et al. (2022), BLOOM Scao et al. (2022), BLOOMZ Muennighoff et al. (2022), and ChatGPT Ouyang et al. (2022), using both fine-tuning and in-context learning approaches. In particular, BUFFET enables us to investigate the following research questions:

(RQ1) Is in-context learning competitive with fine-tuning in few-shot cross-lingual transfer? Notably, given the same small numbers of examples in the target languages, in-context learning on LLMs (including ChatGPT, the most powerful model we evaluate in this work) often under-performs much smaller specialized mT5-base models, as shown in Figure 1 (bottom left).

(RQ2) How well do different transfer methods perform across tasks and languages? The performance gap between in-context learning-based baselines and fine-tuning-based baselines is larger in under-represented languages (Figure 1 bottom center). On NLI in indigenous languages of the Americas, ChatGPT or mT0-11B using in-context learning performs barely above random, while fine-tuned mT5-base models with 580 million parameters retain strong performance. In contrast, these LLMs perform well on generative tasks where a smaller task-specific model struggles, demonstrating their superiority in generating fluent text for diverse languages without abundant training data.

(RQ3) How does the choice of transfer setup affect different transfer strategies? BUFFET also enables us to perform an in-depth and extensive analysis of the effects of diverse demonstrations and instructions on the downstream transfer quality. Our observations indicate that the choice of few-shot training examples has a substantial influence on a model’s performance, particularly, with greater variability in in-context learning, compared to fine-tuning. We note that optimal transfer settings may differ across models. For example, instruction-tuned models often face challenges in effectively utilizing few-shot samples and their performance deteriorates as the number of demonstrations increases, possibly because they are optimized for the zero-shot instruction-tuned training scheme. This highlights the need for a standardized benchmark to facilitate fair comparisons and further studies to assess such transfer dynamics in non-English data.

Grounded in our analysis, we suggest avenues for future research in few-shot cross-lingual transfer for both dataset creation and model development. Our data and code are available online at https://buffetfs.github.io/.

Background and Related Work

Due to the lack of annotated training data in many languages Blasi et al. (2022); Yu et al. (2022); Joshi et al. (2020), transferring models trained on resource-rich languages (e.g., English) to other languages has been actively studied in multilingual NLP. In this paper, our main focus is on few-shot cross-lingual transfer Lauscher et al. (2020), where a model is adapted using only a limited number of training or validation examples in the target language $L$. Another popular paradigm is zero-shot cross-lingual transfer Artetxe et al. (2020a); Hu et al. (2020b) from English, where a model has access to training sets or instructions in English but not in the target language.

Various transfer methods have been investigated in the field, including the in-context learning methods (Section 2.3). Yet, limited research explores different transfer methods under comparable conditions. With our new benchmark, BUFFET, we facilitate fair comparisons between models and learning methods, establishing a basis for studying the dynamics of few-shot transfer across various languages (Section 2.2).

2.2 Benchmarks for Cross-lingual Transfer

To enable a scalable and rigorous evaluation across multiple tasks, prior work has proposed multi-task benchmarks that unify diverse existing datasets. XTREME Hu et al. (2020b), XTREME-R Ruder et al. (2021) and XGLUE Liang et al. (2020) focus on zero-shot transfer of models fine-tuned on English datasets. Although English few-shot evaluation benchmarks exist, such as CrossFit Ye et al. (2021), few-shot cross-lingual transfer lacks a standardized evaluation benchmark that facilitates the comparison of models and learning methods at scale. BUFFET provides the first large-scale few-shot cross-lingual transfer suite to address this gap. Importantly, to mitigate the effects of the high performance variance in few-shot cross-lingual transfer Zhao et al. (2021), we curate and aggregate results from multiple fixed $k$-shot training instances for each task and language. Concurrent with our work, MEGA Ahuja et al. (2023) conducts experiments on few-shot cross-lingual transfer with a focus on classification and question answering tasks. BUFFET unifies diverse tasks including both discriminative and generative tasks. We also include datasets covering languages underrepresented in prior work (e.g., African and indigenous languages). Table 1 summarizes the key differences between BUFFET and prior benchmarks.

2.3 Methods for Cross-lingual Transfer

Multilingual pre-trained models Devlin et al. (2019); Xue et al. (2021); Conneau et al. (2020a) have the ability to adapt to new languages with no or few training instances in a target language Conneau et al. (2020b); Hu et al. (2020b); Wu and Dredze (2019). Lauscher et al. (2020) and Hedderich et al. (2020) report that, particularly in languages that are distant from the source language, further fine-tuning a model on few-shot samples greatly improves performance.

In-context learning Brown et al. (2020) aims at making an LM learn a new task by conditioning on a task description (instruction) and training examples (demonstrations). Despite active research on in-context learning Schick and Schütze (2021b); Min et al. (2022b), most prior work focuses only on English. Recent work Lin et al. (2021); Muennighoff et al. (2022) introduces pre-trained LMs trained on more multilingual pre-training corpora or translated datasets and shows improved results. While prior evaluations often focus on classification or translation tasks Zhu et al. (2023); Vilar et al. (2022), more recently Shi et al. (2023) evaluate the use of instructions, demonstrations, and rationales in different languages across multiple reasoning tasks. However, how well in-context learning with LLMs competes with the aforementioned fine-tuning approaches in a comparable setup and at scale has yet to be investigated, as they often use a large number of training examples in target languages Bang et al. (2023). We demonstrate that, even with a small number of training examples, fine-tuning methods are competitive with in-context learning for cross-lingual transfer.

Benchmark: BUFFET

We introduce a new standardized few-shot cross-lingual evaluation benchmark: BUFFET (Benchmark of Unified Format Few-shot Transfer Evaluation). BUFFET unifies diverse NLP tasks and provides fixed sets of few-shot samples per task to facilitate consistent comparisons (Table 2).

We create the BUFFET benchmark to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer across a broad range of tasks and languages. We adhere to the following design principles with our benchmark.

BUFFET provides three different training and validation sets of $k$-shots (e.g., $k=32$) per task for a non-classification task, or per class for a classification task, for each language.

Existing cross-lingual benchmarks often focus on classification or retrieval Hu et al. (2020b); Ruder et al. (2021); Liang et al. (2020). BUFFET encompasses a broad range of task types, such as classification, generation, extraction, and structured prediction tasks. By converting all tasks into the same text-to-text format, we eliminate the need for task-specific model modifications or template conversions.

BUFFET covers 54 typologically diverse languages, spanning 24 language families, including under-represented languages (e.g., indigenous languages of the Americas, African languages). Of the 54 languages, 36 are not Indo-European. A full list of languages is available in Appendix Table 5.

Prior few- or zero-shot evaluations were often conducted on widely-used datasets translated from English (e.g., XNLI; Conneau et al. 2018, XCOPA; Ponti et al. 2020). Those datasets might exhibit undesired biases, such as translation artifacts or unnatural topic distributions Clark et al. (2020); Artetxe et al. (2020b). We collect both translation-based datasets and datasets that are annotated directly in each language (Table 2, Data curation).

3.2 BUFFET Construction Process

Following Ye et al. (2021), we unify all tasks into the same text-to-text format, where a model is expected to directly generate the desired outputs given diverse inputs Raffel et al. (2020). For each dataset in BUFFET, we unify the instance representations, which comprise an instruction and $k$-shot instances for training and validation. Each training instance consists of an input and an output. Figure 2 shows an overview. Section 3.2.1 provides the outline of the unification, and Section 3.2.2 describes the task-specific process.

By default, we use all of the languages included in the original datasets. For automatically aligned datasets with many test languages, such as XLSUM or WikiANN, we filter out languages that are not included in any other BUFFET datasets, following the suggestions of Yu et al. (2022). On XLSUM, we further reduce the number of languages to reduce the inference costs while maintaining language diversity. For each language in each dataset, we randomly sample $k$-shot instances (or demonstrations) for training and validation sets using the same random seeds (we use 100, 13, and 21 as seed numbers, following Ye et al. (2021)). Once we sample the instances, we fix the training and validation sets. With large-scale automatically aligned datasets, we randomly sample 1,000 test instances in XLSUM and WikiANN and 2,000 test instances for Amazon Review, to reduce inference time costs across many languages and multiple sets of demonstrations.
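The fixed-seed sampling described above can be sketched as follows. This is a minimal illustration, not the released BUFFET code: the function name `sample_k_shot`, the `label` field, and the choice to draw disjoint training and validation sets from one pool are assumptions made for the example; only the seed values come from the text.

```python
import random
from collections import defaultdict

SEEDS = (100, 13, 21)  # seed values stated in the text


def sample_k_shot(pool, k, seed, per_class=False):
    """Return (train, valid) lists of sampled instances (hypothetical helper).

    `pool` is a list of dicts. When `per_class` is True, k instances are
    drawn for every value of the assumed "label" field, for each split.
    """
    rng = random.Random(seed)  # local RNG, so each seed is reproducible
    if per_class:
        groups = defaultdict(list)
        for example in pool:
            groups[example["label"]].append(example)
        buckets = [groups[key] for key in sorted(groups)]
    else:
        buckets = [list(pool)]
    train, valid = [], []
    for bucket in buckets:
        if len(bucket) < 2 * k:
            raise ValueError("not enough instances to draw 2*k examples")
        picked = rng.sample(bucket, 2 * k)  # sampling without replacement
        train.extend(picked[:k])
        valid.extend(picked[k:])
    return train, valid


# Toy pool with two classes; one fixed (train, valid) pair per seed.
pool = [{"id": i, "label": i % 2} for i in range(200)]
splits = {seed: sample_k_shot(pool, k=16, seed=seed, per_class=True) for seed in SEEDS}
```

Because each call builds its own `random.Random(seed)`, re-running the sampler with the same seed and pool reproduces the same sets, which is the property that makes the few-shot sets fixed.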

We use English instructions from SuperNaturalInstructions Wang et al. (2022b) and PromptSource Bach et al. (2022). Among multiple annotated instructions, we sample the first instruction for a similar task that suits our text-to-text scheme. For some tasks, we modify the original instruction to make labels consistent with the names used in BUFFET or to remove task-specific dependencies in the input data field. For example, an instruction for PAWS-X says the class names are “repeated/not repeated” while in BUFFET we use “duplicated/not_duplicated” as labels, so we change the labels in the original instruction. See Appendix Table 6 for the full list of instructions.

Despite rapid progress of instruction-tuning in English LLMs Wei et al. (2022); Sanh et al. (2022); Mishra et al. (2022); Wang et al. (2022b), cross-lingual setups still lag behind due to a lack of instructions in the target languages. Prior work often translates instructions for the target tasks Lin et al. (2021); Shi et al. (2023). We provide translated instructions for 15 datasets in 54 target languages, translated by NLLB Costa-jussà et al. (2022), and manually translate the instructions into five languages. Manual translations are performed by bilingual volunteers.

3.2.2 Tasks and Dataset Curation

We first select eight popular NLP tasks and, for each task, we identify available datasets using a careful survey of multilingual datasets by Yu et al. (2022). Appendix Table 6 shows examples.

Natural Language Inference (NLI) involves determining the logical relationship (i.e., entailment, contradiction, neutral) between two text fragments, i.e., a premise and a hypothesis. In addition to the widely used XNLI Conneau et al. (2018), we gather NLI datasets that are annotated in each language or designed to cover extremely under-represented languages: AmericasNLI Ebrahimi et al. (2022), ParsiNLU-Entailment Khashabi et al. (2021), KLUE-NLI Park et al. (2021), and OCNLI Hu et al. (2020a). We use 16 examples for each class.

Paraphrase detection is the task of identifying whether two sentences have the same meaning (duplicated or not duplicated). We adopt PAWS-X Yang et al. (2019) and include 16 shots for each class as few-shot training and validation data.

Binary sentiment analysis identifies whether a text (e.g., a product review from Amazon) expresses positive or negative sentiment towards a topic. We use the Multilingual Amazon Review dataset Keung et al. (2020) and IndicNLU-Sentiment Aggarwal et al. (2022). For the former, we discard the neutral class (the reviews with a score of 3) and assign reviews with scores of 4 and 5 to the positive class and reviews with scores of 1 and 2 to the negative class. For both datasets, we sample 16 demonstrations per class.
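The star-rating binarization for the Amazon Review data can be written as a small mapping. The following is a sketch under assumptions: the function name, the `stars` and `text` field names, and the literal label strings are illustrative and may differ from the released data.

```python
def star_to_sentiment(stars):
    """Map a 1-5 star rating to a binary label; None means the review is discarded."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    if stars == 3:
        return None  # neutral reviews are dropped
    raise ValueError(f"unexpected star rating: {stars!r}")


# Toy reviews (not taken from the dataset).
reviews = [
    {"text": "great", "stars": 5},
    {"text": "meh", "stars": 3},
    {"text": "bad", "stars": 1},
]
labeled = [
    {"text": review["text"], "label": star_to_sentiment(review["stars"])}
    for review in reviews
    if star_to_sentiment(review["stars"]) is not None
]
```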

We use two commonsense reasoning datasets, XCOPA Ponti et al. (2020) and XWinograd Muennighoff et al. (2022). Given a sentence and two options, a model selects one of the option labels, (A) or (B), based on which is better suited to the given context. Due to the smaller scale of the datasets, we sample 16 and 8 training instances in total for XCOPA and XWinograd, respectively.

Question Answering (QA) is the task of answering a question given a paragraph, where the answer is a sub-span of the paragraph. We use TyDiQA-GoldP Clark et al. (2020), which we refer to as TyDiQA for simplicity. Due to the longer average input length, we limit the number of exemplars to 8.

Named Entity Recognition (NER) is a representative sequence labeling task, where a system detects and classifies named entities in an input sentence. We adopt WikiANN Pan et al. (2017) and MasakhaNER Adelani et al. (2021). Though WikiANN covers 216 languages, we exclude languages that are covered only by WikiANN or XLSUM due to the aforementioned issues. We convert the task into a text-to-text format, where given an input sentence, a model extracts all named entities with named entity tags: <location>, <person>, and <organization>. This is more challenging than the standard sequence labeling setup since the model must reproduce the entity spans and generate appropriate tags. For example, the output for “Obama served as the 44th president of the United States.” would be “Obama <person> United States <location>.” Although MasakhaNER supports other named entity tags and distinguishes the beginning and middle/end of the named entities, we discard named entity categories beyond the three types and merge the beginning and middle/end entity tags to make the task formulation consistent with WikiANN. We use 32 instances overall for few-shot transfer.
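To make the text-to-text NER target concrete, the sketch below converts BIO-tagged tokens into a string of entity spans followed by their types, in the spirit of the Obama example. It rests on assumptions: the input is taken to be BIO tags with `PER`/`LOC`/`ORG` suffixes, and the mapping `TAG_NAMES` and function `bio_to_text` are hypothetical names, not part of the benchmark code.

```python
# Assumed mapping from tag suffixes to the tag words used in the target string.
TAG_NAMES = {"PER": "person", "LOC": "location", "ORG": "organization"}


def bio_to_text(tokens, tags):
    """Convert BIO-tagged tokens into an 'entity <type>' target string."""
    spans = []       # (entity_type, [words]) pairs in sentence order
    is_open = False  # True while the most recent span can still be extended
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and is_open and spans[-1][0] == entity_type:
            spans[-1][1].append(token)  # merge middle/end tags into the span
        elif prefix in ("B", "I"):
            spans.append((entity_type, [token]))
            is_open = True
        else:
            is_open = False
    return " ".join(
        f"{' '.join(words)} <{TAG_NAMES[entity_type]}>"
        for entity_type, words in spans
        if entity_type in TAG_NAMES  # drop categories beyond the three types
    )


tokens = "Obama served as the 44th president of the United States .".split()
tags = ["B-PER", "O", "O", "O", "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]
target = bio_to_text(tokens, tags)
```

Entities whose type is outside the three kept categories are filtered out at the end, so a span tagged with another category contributes nothing to the target string.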

We use the XLSum Hasan et al. (2021) dataset to benchmark models’ ability to generate a summary given a news article. Due to the context window limit, we use only 1 shot for training in this task.

Question generation generates a question according to a given input passage and a corresponding answer Xiao et al. (2021). We convert the TyDiQA-GoldP dataset into a question generation task, which we refer to as TyDiQA-QG. Given the gold paragraph and an answer, the system generates the original question. We use 8 examples for few-shot training.
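The conversion from extractive QA to question generation amounts to swapping which field is the target. The snippet below is an illustrative sketch: the field names (`context`, `question`, `answer`) and the input template are assumptions, not the format used in the released benchmark.

```python
def qa_to_qg(example):
    """Turn an extractive QA record into a question-generation record.

    The paragraph and the answer become the input; the original
    question becomes the output the model must generate.
    """
    return {
        "input": f"Paragraph: {example['context']} Answer: {example['answer']}",
        "output": example["question"],
    }


# Toy record (not taken from TyDiQA).
qa_example = {
    "context": "Helsinki is the capital of Finland.",
    "question": "What is the capital of Finland?",
    "answer": "Helsinki",
}
qg_example = qa_to_qg(qa_example)
```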

3.3 BUFFET Evaluation

Table 2 (Metric) lists task-specific metrics. To mitigate the variance from different few-shot samples, for each language included in each task, we take the average of a model’s performance given three different sets of $k$-shot instances. Subsequently, each dataset score is calculated as a macro-average of the per-language scores Clark et al. (2020). Finally, following Liang et al. (2020), we take two separate average scores: (a) Avg. class score of all classification and QA tasks, and (b) Avg. generation score of all generation tasks.
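The three-level aggregation described above (mean over $k$-shot sets, macro-average over languages, then two benchmark-level averages) can be sketched as follows. The function names and the toy numbers are illustrative assumptions, and the scores are not results from the paper.

```python
from statistics import mean


def dataset_score(per_language_runs):
    """Macro-average over languages of the mean score across k-shot sets.

    `per_language_runs` maps a language code to the scores obtained with
    each fixed k-shot set (three per language in BUFFET).
    """
    return mean(mean(runs) for runs in per_language_runs.values())


def benchmark_averages(dataset_scores, generation_datasets):
    """Split dataset scores into Avg. class and Avg. generation."""
    cls = [s for name, s in dataset_scores.items() if name not in generation_datasets]
    gen = [s for name, s in dataset_scores.items() if name in generation_datasets]
    return {"avg_class": mean(cls), "avg_generation": mean(gen)}


# Toy scores: dataset -> language -> one score per k-shot set.
runs = {
    "xnli": {"en": [60.0, 62.0, 61.0], "sw": [40.0, 42.0, 41.0]},
    "tydiqa-qg": {"fi": [10.0, 12.0, 11.0]},
}
scores = {name: dataset_score(langs) for name, langs in runs.items()}
summary = benchmark_averages(scores, generation_datasets={"tydiqa-qg"})
```

Averaging per language first means every language counts equally within a dataset, regardless of how many test instances it has.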

3.3.2 BUFFET-Light

Conducting a comprehensive evaluation covering a wide range of languages and tasks in BUFFET, while undoubtedly necessary, can be a time-consuming process. We introduce BUFFET-light, which contains a representative subset of languages and tasks for a rapid assessment even in resource-limited scenarios. We carefully select languages and datasets to ensure that we cover a diverse range of languages and output formats, assuming limited resources. See the overview of BUFFET-light in Appendix Section A.2.

Benchmarking LMs on BUFFET

In this study, we investigate various transfer methods with and without parameter updates. To assess the benefit of $k$-shot training examples in the target language, we also conduct experiments on zero-shot transfer methods. We assume that the model can optionally use instructions in the target language or another language, or full training sets in a high-resource language like English. This assumption is reasonable given the abundance of labeled datasets in high-resource languages Yu et al. (2022); Joshi et al. (2020) and the cheaper costs of instruction annotations. Table 3 provides an overview of different approaches, categorized according to the optional inputs they use during training or inference.

We explore several transfer approaches that require parameter updates.

Target fine-tuning (Target FT) trains models on few-shot samples for each language.

English fine-tuning (English FT) trains models on a source language (i.e., English) only and uses no target language data.

English+Target fine-tuning (Eng.+Tgt. FT) first trains models on large-scale English datasets and then fine-tunes models on few-shot samples of target languages.

We explore several in-context learning methods.

English in-context learning (English ICL) uses English instructions and demonstrations in the target languages.

Target in-context learning (Target ICL) uses both instructions and demonstrations in the target language.

Zero-shot English In-context learning (Z-EICL) uses only English instructions without demonstrations (neither in English nor in the target language), as in zero-shot transfer.

Unlike in English, where abundant instructions and instance annotations are available, for many languages we often lack annotated instructions Wang et al. (2022b). We use machine-translated instructions in BUFFET as the main baseline.

4.2 Language Models

A key aspect of language models is their pretraining strategies. In addition to conventional pretraining using unlabeled corpora Devlin et al. (2019); Brown et al. (2020), instruction-tuning has been actively studied; this approach trains an LLM on a massive number of tasks with instructions Muennighoff et al. (2022); Ouyang et al. (2022); Wei et al. (2022). In this work, we evaluate six diverse models pretrained with different strategies (Table 3).

Due to the high costs of fine-tuning for every $k$-shot setting, we experiment with an efficient yet competitive mT5-base with 580 million parameters Xue et al. (2021).

We experiment with BLOOM-7B (7 billion parameters; Scao et al., 2022) and mT5-xxl (13 billion parameters; Xue et al., 2021). We also experiment with their instruction-tuned variants: BLOOMZ-7B and mT0-xxl Muennighoff et al. (2022), as well as the current state-of-the-art ChatGPT (gpt-3.5-turbo; Ouyang et al. 2022). Note that these models are trained on some of the datasets included in BUFFET. We do not exclude such overlapping datasets, but we indicate such seen tasks with ∗ in the main result table. It is unclear which datasets ChatGPT is trained on.

4.3 Experiment Details

In all settings, we fine-tune models on few-shot samples for 300 epochs for Target FT and 200 epochs for Eng.+Tgt. FT. When fine-tuning LMs on large-scale English datasets (for both Eng.+Tgt. FT and English FT), we train for three epochs. We use representative English datasets following Hu et al. (2020b): SQuAD Rajpurkar et al. (2016) for QA, MNLI Williams et al. (2017) for NLI, PAWS Zhang et al. (2019) for paraphrase detection, XLSUM Hasan et al. (2021) for summarization, COPA Arun and Balakrishnan (2018) for XCOPA, Winograd for XWinograd, the Amazon Multilingual Review English set for sentiment analysis, and the TyDiQA-QG English set for question generation.

We prompt LLMs with instructions and $k$-shot demonstrations available in BUFFET. Different models have different maximum context window sizes: mT0 only accepts up to 1024 tokens, while BLOOMZ and ChatGPT accept up to 2048 and 4096, respectively. We add training instances up to the maximum token length for each model and discard instances that do not fit the context window. We found that mT0 often performs well given zero or a small number of few-shot samples. We use 4-shots for mT0 English ICL and Target ICL by default. We use greedy decoding for predictions. For tasks with a fixed set of pre-specified answer candidates, we compute the probability of the option tokens by iterating over all options, except for ChatGPT, which does not provide access to token probabilities. Due to the high inference costs, we evaluate ChatGPT only on BUFFET-Light.
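The context-window handling can be illustrated with a small prompt builder. This is a sketch under stated assumptions rather than the evaluation code: the `Input:`/`Output:` template, the function name `build_prompt`, and the whitespace token counter are placeholders (a real run would count tokens with each model's own tokenizer), and skipping an oversized demonstration while continuing with later ones is one possible reading of "discard instances that do not fit".

```python
def build_prompt(instruction, demonstrations, test_input, max_tokens,
                 count_tokens=lambda text: len(text.split())):
    """Pack the instruction, as many demonstrations as fit, and the query."""
    query = f"Input: {test_input}\nOutput:"
    # Reserve room for the parts that must always be present.
    budget = max_tokens - count_tokens(instruction) - count_tokens(query)
    kept = []
    for demo in demonstrations:
        block = f"Input: {demo['input']}\nOutput: {demo['output']}"
        cost = count_tokens(block)
        if cost > budget:
            continue  # discard a demonstration that does not fit
        kept.append(block)
        budget -= cost
    return "\n\n".join([instruction, *kept, query])


demos = [
    {"input": "a b", "output": "positive"},
    {"input": "c d", "output": "negative"},
]
few_shot = build_prompt("Classify.", demos, "x", max_tokens=100)
truncated = build_prompt("Classify.", demos, "x", max_tokens=10)
zero_shot = build_prompt("Classify.", [], "x", max_tokens=100)
```

Passing an empty demonstration list yields the zero-shot (Z-EICL-style) prompt, so the same helper covers both settings.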

Results and Analysis

Table 4 shows aggregated results of fine-tuned and in-context learning-based LMs on BUFFET. We show full experiment results on each task in the Appendix. Below, we summarize the key findings.

While in-context learning has shown remarkable performance in English, our comparison shows that few-shot cross-lingual transfer via in-context learning remains challenging; English ICL using BLOOM, BLOOMZ (7 billion) and mT0 (13 billion) often under-performs mT5-base (580 million) fine-tuned on English datasets (English FT or Eng.+Tgt. FT). However, when abundant English task data is not available, mT5-based fine-tuning methods (Target FT, or Eng.+Tgt. FT on XCOPA and XWinograd) often perform poorly and are outperformed by English ICL or Target ICL baselines. This implies that when lacking task-specific training data, prompting LLMs can be more effective.

Table 10 demonstrates that the zero-shot performance of instruction-tuned models is significantly higher than the zero-shot performance of non-instruction-tuned models: on average, both mT0-xxl and BLOOMZ-7B Z-EICL demonstrate significantly better performance than their non-instruction-tuned counterparts, namely mT5-xxl and BLOOM-7B Z-EICL, with margins of 12.7 and 23.9 points in Avg. class, respectively. It is worth noting that while the performance improvements on seen tasks (indicated by *) contribute to these gains, mT0-xxl Z-EICL also exhibits substantial improvements on unseen tasks. This further confirms the effectiveness of instruction-tuning in zero-shot transfer, as discussed in prior studies Muennighoff et al. (2022); Wei et al. (2022); Mishra et al. (2022).

However, our study also highlights a surprising performance deterioration when moving from zero-shot to few-shot settings for instruction-tuned models: across tasks, mT0 performs worse in few-shot settings than in zero-shot settings (English ICL vs. Z-EICL). BLOOMZ shows performance gains from few-shot demonstrations; BLOOMZ English ICL achieves 44.3, outperforming BLOOMZ Z-EICL by 5 points in Avg. class score. Yet, it also exhibits large performance declines on the tasks that are used during its instruction-tuning (TyDiQA, PAWS-X). Our hypothesis is that such instruction-tuned models are optimized to execute a new task solely based on an instruction, with no prior demonstrations Muennighoff et al. (2022), and may struggle to learn in context from few-shot demonstrations. We conduct controlled experiments in Section 5.2 for further analysis.

Figure 3 illustrates the performance of models on NER (WikiANN and MasakhaNER), NLI (XNLI, AmericasNLI), and QA (TyDiQA) tasks across different languages. The languages are sorted based on the token availability in the mC4 corpus, with high-resource languages positioned on the left side. (We use the token count statistics available at https://github.com/allenai/allennlp/discussions/5265. For languages that are not included during pretraining, we sort the language names alphabetically.) Our results indicate that the zero- or few-shot transferability of the model is often constrained in understudied languages. In NER and NLI tasks, a noticeable decrease occurs in performance from high-resource to low-resource languages. It is important to note that several languages included in MasakhaNER or AmericasNLI are not part of the pretraining process. Models such as mT5 English FT or ChatGPT English ICL exhibit strong performance in high-resource languages. However, their performance significantly drops in less-represented languages. For instance, in Aymara (aym), ChatGPT performs only slightly better than a random baseline and is outperformed by mT5 Eng.+Tgt. FT by 13%. mT5 Eng.+Tgt. FT also significantly outperforms mT5 English FT in lower-resource languages, as indicated by the performance gap between the orange and blue lines in Figure 3. Notably, mT5 Eng.+Tgt. FT outperforms mT5 English FT by 30% in Hausa on MasakhaNER. This indicates that fine-tuning with only $k$ instances in target languages can still greatly help in less-represented languages.

We also observe performance drops in Finnish, Korean, and Russian for BLOOM and BLOOMZ in TyDiQA. Finnish, Korean, and Russian are excluded from BLOOM pretraining (https://huggingface.co/bigscience/bloom), to which we attribute these performance drops. Conversely, mT5 fine-tuning-based methods consistently display strong performance across languages. Interestingly, in Bengali, which is often considered less represented, BLOOMZ achieves performance comparable to mT5 fine-tuned models. We also observe the same trends in BLOOMZ. These results suggest that the pretraining setup may strongly affect downstream task performance even after instruction tuning.

As discussed, though ChatGPT significantly outperforms other LLMs with in-context learning, its performance often lags behind fine-tuning-based methods in some discriminative tasks, particularly in less-represented languages. ChatGPT, however, significantly outperforms fine-tuned models on tasks that require target language generations (e.g., question generation, QA), with the exception of summarization (XLSUM). On XLSUM, we found that ChatGPT often generates semantically correct summaries in English rather than in the input article language, resulting in low ROUGE-2 scores. We do not observe that phenomenon in other LLMs (e.g., BLOOMZ); we show some ChatGPT output examples in Appendix Table 25. Though more prompt engineering can boost ChatGPT’s performance in summarization Huang et al. (2023), we use the same prompts throughout the evaluations for a fair comparison. We also observe that when instructions are given in the target language, ChatGPT often outputs a summary in that language, as shown by the improved XLSUM performance of ChatGPT Target ICL.

5.2 Analysis

Figure 4 shows model performance across the three different $k$-shots and reveals a significant performance disparity in many of the tasks and languages. We observe significant variance in fine-tuning-based transfer across different $k$-shot samples, confirming Zhao et al. (2021). Importantly, we show that in-context learning is even more sensitive to different demonstrations than few-shot fine-tuning. For instance, for Amazon Review, the standard deviation for BLOOM English ICL and mT5 Eng.+Tgt. fine-tuning is 2.2 and 0.2, respectively. We also analyze whether a demonstration set $k$ that achieves the best performance with one model also leads to the optimal performance for another model. Specifically, we compare the best $k$-shots for each task and language for BLOOM and BLOOMZ English ICL. We found that in 49.7% of the cases, their optimal $k$-shot demonstrations differ. These results emphasize the difficulty of comparing model performance in the absence of standardized $k$-shot samples. On the bright side, these results provide insights into potential approaches for identifying optimal demonstrations that can enhance few-shot ICL performance.
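The two quantities discussed above, the spread across demonstration sets and how often two models disagree on their best set, can be computed as in the following sketch. The helper names and the toy scores are assumptions made for illustration (the numbers are not results from the paper), and the population standard deviation is used here because the text does not specify which estimator was applied.

```python
from statistics import pstdev


def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def best_set_disagreement(scores_a, scores_b):
    """Fraction of shared (task, language) cells whose best k-shot set differs."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    differ = sum(argmax(scores_a[c]) != argmax(scores_b[c]) for c in shared)
    return differ / len(shared)


# Toy scores for the three fixed demonstration sets of each cell.
model_a = {("sentiment", "de"): [71.0, 75.5, 70.2], ("nli", "sw"): [40.1, 38.7, 41.9]}
model_b = {("sentiment", "de"): [80.3, 79.9, 78.8], ("nli", "sw"): [45.0, 44.2, 47.6]}

# Per-cell spread across demonstration sets for one model.
spread = {cell: pstdev(runs) for cell, runs in model_a.items()}
disagreement = best_set_disagreement(model_a, model_b)
```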

Figure 5 demonstrates the impact of increasing the number of few-shot samples for in-context learning and fine-tuning, on four tasks: TyDiQA, TyDiQA-QG, WikiANN, and Amazon Review. Full results on the four tasks in a subset of the languages are available in Appendix D.3. Specifically, we vary the number of few-shot demonstrations, including 1, 4, and 8 (for the tasks with more than 8 shots), and assess the performance of BLOOM English ICL, BLOOMZ English ICL, mT0 English ICL and mT5 Eng.+Tgt. FT.

Increasing the number of few-shot examples has a notable positive impact on fine-tuning (mT5 fine-tuning) across different tasks. Similarly, non-instruction-tuned BLOOM also benefits from the inclusion of few-shot samples on most of the tasks. However, for instruction-tuned models (mT0 and BLOOMZ), we observe a significant decline in performance when additional demonstrations are added, which aligns with the findings in Table 4. Specifically, on mT0, we observe that the zero-shot performance surpasses the few-shot performance on TyDiQA and Amazon Review. Surprisingly, even on previously unseen tasks such as TyDiQA-QG and WikiANN, the addition of more than four demonstrations leads to a significant decline in performance.

It is worth noting that mT0 and BLOOMZ were exclusively trained with instructions and did not utilize demonstrations during training Muennighoff et al. (2022). We hypothesize that this training approach may cause the models to overfit the zero-shot instruction-based in-context learning scenario, thereby hindering their ability to effectively learn in-context information through few-shot demonstrations. Wei et al. (2022) also find that while few-shot demonstrations mitigate high variance of the zero-shot inference with instructions only, the optimal zero-shot performance with the best template often outperforms the best few-shot performance.

Figure 6 shows the performance of BLOOM with 560 million, 1 billion, and 7 billion parameters on a subset of the tasks. The transfer method is English ICL. As the model scales, the overall performance on few-shot in-context learning significantly improves, as found in English Brown et al. (2020), indicating that models’ cross-lingual few-shot transfer performance via in-context learning may improve as the model size increases. These findings are consistent with the results reported by Lin et al. (2021) on a set of classification tasks.

Conclusion and Discussion

In this work, we introduce BUFFET, a few-shot cross-lingual transfer benchmark that encompasses a diverse range of discriminative and generative tasks across a variety of typologically distinct languages. Through our comprehensive evaluation, involving six different transfer methods and various LLMs, we offer valuable insights into the strengths and limitations of these transfer methods and LLMs. Our analysis reveals that while LLMs utilizing in-context learning excel in generation tasks, they are often surpassed by smaller fine-tuned models specifically trained for target tasks. Furthermore, our findings highlight significant performance variations dependent on different transfer setups (e.g., demonstrations).

Moving forward, our findings suggest the following exciting opportunities for future research in the field of few-shot learning transfer across diverse languages:

Although instruction tuning can be beneficial for zero-shot transfer, certain models, such as mT0, may become overly specialized for zero-shot instruction-tuning scenarios, leading to lower average few-shot performance than the optimal zero-shot performance. Although these models demonstrate impressive zero-shot performance, even on tasks they haven’t encountered before (such as XCOPA), they face challenges when it comes to tasks that involve generating outputs in less commonly used formats (like structured predictions). We believe that developing multilingual instruction-following models capable of effectively utilizing both instructions and demonstrations is crucial. Recent studies demonstrate that incorporating both instructions and demonstrations during instruction-tuning on English data can enhance the model’s performance Chung et al. (2022), allowing it to learn within context Min et al. (2022a). This type of training may potentially mitigate the issue of overfitting to specific formats. Hence, it is necessary to explore various instruction-tuning setups to further improve few-shot in-context learning, with a focus on cross-lingual transfer.

Additionally, while high-quality human-translated instructions are effective, numerous instruction repositories are still dominated by English instructions. Therefore, community efforts to increase the availability of multilingual instructions may assist in the development of more generalizable multilingual large-language models.

Our research reveals that smaller task-specific fine-tuned models, with intermediate training in English, can still outperform ChatGPT on discriminative tasks that require strict output formats. Conversely, ChatGPT outperforms fine-tuned models on tasks that necessitate more open-ended generations, such as question generation. In recent studies, InstructGPT Ouyang et al. (2022) has exhibited the ability to generate high-quality generations in English, even outperforming humans on some tasks Goyal et al. (2022). This impressive capacity for flexible generations has prompted active investigations into generating training instances from such LLMs, which have predominantly focused on English Wang et al. (2022a); Honovich et al. (2022). Some preliminary attempts have been made to explore task-specific data generation in certain target tasks, such as question answering Agrawal et al. (2022). However, there remains limited exploration on how to generate diverse task instructions and outputs for a variety of typologically diverse languages. We believe that using LLMs to generate data offers a promising solution to obtaining more annotated data for under-represented languages.

The impact of various instructions and demonstrations has been extensively examined in the context of English in-context learning, highlighting critical concerns such as sensitivity to prompt order Lu et al. (2022) and/or motivating methods for identifying optimal demonstrations Su et al. (2022). This research has found that demonstrations or instructions that are optimal for one model may not necessarily result in the best performance for another model. We anticipate that our benchmark will inspire and assist in further research into the relationship between language and instruction/demonstration for cross-lingual in-context learning.

Many of the diverse world languages are often excluded in widely used cross-lingual evaluation benchmarks, where recent papers show strong cross-lingual transfer capabilities. However, through our comprehensive analysis, we have discovered that even the most advanced LLMs currently available still face difficulties when dealing with less-represented languages. The most competitive instruction-tuned models, ChatGPT or mT0, show significant performance declines when it comes to indigenous languages, reaching a level akin to a random baseline.

We advocate for conducting more studies on diverse local languages, including under-represented languages and their dialects, as emphasized in previous works such as Aji et al. (2022); Kakwani et al. (2020). We note that datasets in such languages are often translated from English Yu et al. (2022), which may introduce translation biases Artetxe et al. (2020b) and fail to capture the linguistic nuances and interests of native speakers Clark et al. (2020); Asai et al. (2021). To address these challenges, it is important that further work be done to develop cross-cultural Natural Language Processing Hershcovich et al. (2022).

Most recent research on multilingual in-context learning predominantly focuses on discriminative tasks Muennighoff et al. (2022); Ahuja et al. (2023) or translation tasks Lin et al. (2021). Further exploration can expand these evaluations to more diverse and complex tasks, such as MTOP Li et al. (2021) or MGSM Shi et al. (2023), or knowledge-intensive tasks Asai et al. (2021) as new multilingual benchmarks are developed.

Limitations

As the first step toward standardized evaluation for few-shot cross-lingual transfer, BUFFET focuses on popular discriminative tasks and some generative tasks. It does not include many datasets that require complex reasoning, as noted above. Since our main focus is to benchmark different LLMs and learning methods in a comparable format, we do not explore sophisticated prompting methods, which can further boost performance. We anticipate that BUFFET will encourage the LLM community to explore new methods to further improve in-context learning beyond English. We use instructions translated by NLLB Costa-jussà et al. (2022) for Target ICL; such machine-translated instructions are prone to errors, especially in less-represented languages, and these errors can affect the final performance.

Ethics Statement

While there has been significant research on in-context learning with LLMs, most of the focus has been on the English language. This raises questions about the applicability of findings from English few-shot NLP to few-shot cross-lingual transfer scenarios. To address this gap, BUFFET aims to provide a comprehensive and less biased evaluation framework. However, it is important to note that our benchmark dataset currently covers only 57 out of the approximately 6,000 world languages. Moreover, we do not specifically focus on finer-grained language varieties and dialects that are commonly spoken by underrepresented populations. In light of these limitations, we encourage future research to explore the effectiveness and limitations of widely-used transfer methods in a more diverse range of languages. This will help us gain a deeper understanding of the generalizability of transfer learning techniques across different linguistic contexts.

Acknowledgements

This research was supported by NSF IIS-2044660, ONR N00014-18-1-2826, ONR MURI N00014-18-1-2670, DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and Allen Distinguished Award. AA is supported by the IBM fellowship. We are grateful to Orevaoghene Ahia for her help with ChatGPT evaluations. We thank our volunteer translators, Joongwon Kim, Usharani Injeti, and Sven Dorkenwald, for their help with translating instructions into different languages. Finally, we extend our appreciation to Jonathan H. Clark, Orevaoghene Ahia, Sandy Kaplan, and UW NLP researchers for their feedback on this draft.

References

Appendix

Appendix A Benchmark Details

This section provides further details of the BUFFET construction process.

We show the list of the 55 languages included in BUFFET in Table 5. BUFFET covers 25 different language families and is also geographically diverse.

Table 6 shows the input and output examples in BUFFET. We reformulate all of the tasks with diverse formats into the same text-to-text format.

The full list of the instructions written in English is available in Table 7.

Table 8 shows the full list of the datasets with language names included in BUFFET.

A.2 BUFFET-Light

The goal of building the BUFFET-Light subset is to enable quick multilingual evaluation without losing the language and task diversity in the original BUFFET. To this end, we filter BUFFET so that we evaluate between 3 and 7 languages per task, and each language is included in at most three tasks. In addition to the high-resource languages per task, we also include low-resource languages when available (e.g., for NLI) so that BUFFET-Light scores are not unfairly inflated. This design choice allows us to consider 31 diverse languages across all tasks in BUFFET while reducing the number of evaluation settings by 66%.
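The two selection constraints described above (3 to 7 languages per task, and each language in at most three tasks) can be verified mechanically. The sketch below is a hypothetical checker, not the authors' filtering procedure: the function name, the task-to-languages mapping, and the language codes are all invented for illustration and do not reproduce the actual BUFFET-Light selection.

```python
from collections import Counter

MIN_LANGS, MAX_LANGS = 3, 7   # languages evaluated per task
MAX_TASKS_PER_LANG = 3        # tasks any single language may appear in


def check_light_subset(subset):
    """Return a list of constraint violations for a task -> languages mapping."""
    problems = []
    for task, langs in subset.items():
        n_langs = len(set(langs))
        if not MIN_LANGS <= n_langs <= MAX_LANGS:
            problems.append(f"{task}: {n_langs} languages")
    # Count in how many tasks each language appears.
    usage = Counter(lang for langs in subset.values() for lang in set(langs))
    for lang, count in sorted(usage.items()):
        if count > MAX_TASKS_PER_LANG:
            problems.append(f"{lang}: used in {count} tasks")
    return problems


# Hypothetical selection, for illustration only (not the actual BUFFET-Light list).
candidate = {
    "nli": ["ar", "el", "ur", "ay"],
    "ner": ["fi", "ko", "ar"],
    "qa": ["ar", "ko", "ru"],
    "sentiment": ["ar", "de"],
}
violations = check_light_subset(candidate)
print(violations)
```

In this made-up candidate, the sentiment task has too few languages and one language appears in four tasks, so both constraints are flagged.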

The full list of tasks and languages in BUFFET-Light is in Table 9.

Appendix B More Experimental Details

For English FT, we limit the number of English training samples to 100,000 and fine-tune mT5-base Xue et al. (2021) for 3 epochs. For the English FT baseline, we transfer this model directly to new languages, while for Eng.+Tgt. FT, we initialize the model with the trained weights and further fine-tune it on few-shot samples for 300 epochs.

We set the maximum token length to 15 except for XLSum and TyDiQA-QG. For XLSum, we set the maximum token length to 100, and for TyDiQA-QG, we set the maximum token length to 50. We use greedy decoding throughout the experiments. For BLOOM-based model evaluations, we use a single RTX-100 GPU with 24 GB GPU memory. We use 8-bit (int8) quantization to avoid GPU out-of-memory errors. To evaluate mT5 and mT0, we use TPU v3-8.
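The per-task length limits and the greedy decoding choice above can be collected into one small lookup. This is a minimal sketch, not the authors' evaluation code: the helper name is hypothetical, and the dictionary keys are chosen to mirror the `max_new_tokens`, `do_sample`, and `num_beams` generation arguments in Hugging Face Transformers. Whether the stated maximum token length corresponds to `max_new_tokens` rather than a total-length limit is an assumption.

```python
# Length limits stated in the text: 15 by default, 100 for XLSum, 50 for TyDiQA-QG.
DEFAULT_MAX_TOKENS = 15
MAX_TOKENS_BY_TASK = {"XLSum": 100, "TyDiQA-QG": 50}


def generation_settings(task):
    """Decoding settings for one task (hypothetical helper).

    Greedy decoding is expressed as no sampling with a single beam.
    """
    return {
        # Assumption: the limit applies to newly generated tokens only.
        "max_new_tokens": MAX_TOKENS_BY_TASK.get(task, DEFAULT_MAX_TOKENS),
        "do_sample": False,
        "num_beams": 1,
    }


for task in ("XNLI", "XLSum", "TyDiQA-QG"):
    print(task, generation_settings(task))
```

Keeping the short default for classification-style tasks limits wasted decoding steps, while the two generation tasks get the longer budgets they need.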

Appendix C Detailed BUFFET Results

This section includes the full list of the experimental results. Overall results on the full BUFFET are available in Table 10, and Figure 4 summarizes overall performance across the eight tasks on the BUFFET-Light subset. The overall trends on BUFFET-Light remain the same as on the original BUFFET. This indicates that BUFFET-Light is a reliable and more efficient alternative for holistic evaluations of few-shot cross-lingual transfer.

Below, we present the performance breakdown for each dataset.

Table 11 shows the full results on AmericasNLI. Table 12 shows the full results on XNLI. Table 13 presents the full results on the other three entailment datasets annotated in each language, KLUENLI, OCNLI, and ParsiNLUEntailment.

On XNLI, English FT (zero-shot transfer) shows strong performance and often outperforms Eng.+Tgt. FT (few-shot transfer). Among ICL baselines, mT0 ZICL shows the best macro performance on XNLI. However, on AmericasNLI, all methods struggle, and Eng.+Tgt. FT shows the best macro performance. The performance gap between English FT and Eng.+Tgt. FT gets significantly larger, with the largest gap in Aymara (5.5%). Despite its strong performance on XNLI, mT0 ZICL struggles on AmericasNLI (33.7% on average).

While mT0 ZICL shows robust performance across XNLI languages, ChatGPT shows a large performance gap between higher-resource languages and low-resource languages (57% in Greek vs. 33% in Urdu).

C.2 Paraphrase Detection

The results on PAWS-X are available in Table 14. Eng. FT shows the best performance on this task among non-instruction-tuned models. We hypothesize that because the languages included in PAWS-X are all relatively well represented and the task is relatively simple, Eng. FT, which is not trained in the target languages, can achieve high performance. mT0 ZICL shows quite high performance, likely because the model is trained on PAWS-X Muennighoff et al. (2022).

C.3 Sentiment Analysis

The experimental results on Amazon Review Multi and Indic Sentiment are available in Tables 15 and 16. On both datasets, all models yield high accuracy across languages, except for mT5 ZEICL.

C.4 Commonsense

The experimental results on XCOPA are available in Table 17. On XCOPA, ChatGPT and mT0 (Z EICL) yield high performance across languages. ChatGPT achieves particularly notable performance in Italian (91.2%). On the other hand, all of the fine-tuning-based methods struggle because of the small size of the source datasets in English. This result indicates that for a task that lacks a large-scale training dataset even in high-resource languages, LLMs using in-context learning may often result in higher performance. We observed that mT5 English FT faces difficulties when applied to XCOPA. This could be attributed to the limited size of the XCOPA English set, which might not provide enough data for a smaller mT5-base model to acquire comprehensive task knowledge.

The experimental results on XWinograd are available in Table 18. Similar to XCOPA, on XWinograd, fine-tuning-based methods often struggle, while in-context learning with competitive LLMs yields strong performance.

C.5 Question Answering

TyDiQA experimental results are available in Table 19. Both the fine-tuning and ICL methods exhibit commendable performance on this particular task. It is intriguing to note that both mT0 and BLOOMZ demonstrate relatively lower efficacy in Korean, Finnish, and Russian. This can be attributed to the fact that these languages were not included during the pretraining phase.

C.6 Named Entity Recognition

Table 20 contains the results for WikiANN. We specifically present the few-shot results since we discovered that zero-shot baselines consistently exhibit extremely poor performance, often close to zero, primarily because generating the answer in the precise output format proves to be challenging.

It’s important to acknowledge that the BUFFET-Light WikiANN subset comprises languages that are relatively high-resource, which could potentially lead to an overestimation of ChatGPT’s performance. When comparing the best fine-tuning method with ChatGPT in the BUFFET-Light languages, they generally perform competitively, with the exception of Finnish.

Results on MasakhaNER are available in Table 21. In this benchmark, all ICL methods, including ChatGPT, encounter difficulties, whereas Target FT and Eng.+Tgt. FT consistently demonstrate strong performance across various languages. Notably, by incorporating an additional 32 training examples, Eng.+Tgt. FT achieves a significant 34% improvement in performance for Hausa. These improvements underscore the effectiveness of fine-tuning a specialized model on a limited set of training samples in target languages.

C.7 Generation

The experimental results for TyDiQA-QG are available in Table 22. On this task, ChatGPT and mT0 English ICL outperform smaller fine-tuned models, demonstrating their competitiveness in generating fluent text in target languages.

XLSum results are available in Table 23. Despite strong generation capability, ChatGPT English ICL performance remains low. We found that when instructed in English, ChatGPT often generates summaries in English, not in the article language. We haven’t observed such behaviors on other tasks or other LLMs. ChatGPT Target ICL shows large improvements from English ICL, which has not been observed in other tasks. When instructions in the target language are given, ChatGPT almost always generates a summary in the target language.

Among non-instruction-tuned models, Eng.+Tgt. FT yields the highest average performance. It should be noted that mT0 and BLOOMZ are trained on XLSum. Nevertheless, their performance in some languages remains low.

Appendix D More Analysis

Figure 8 shows performance across languages on the three tasks, NLI, NER, and QA, adding two more LLMs: BLOOMZ and mT0.

D.2 Variances of Different $k$-shots

In Section 5, we show that different sets of demonstrations can cause significant performance differences. We provide the full visualization results across different tasks.

D.3 Variances of the Varying Number of $k$

We provide the full experimental results with different values of $k$. We report results for BLOOM English ICL, BLOOMZ English ICL, mT0 English ICL, and mT5 Eng.+Tgt. FT on Amazon Review, TyDiQA, TyDiQA-QG, and WikiANN in Figures 9, 10, 11, and 12, respectively.

On Amazon Review, all models except for BLOOM (pretraining only) show competitive zero-shot performance. BLOOM English ICL benefits from few-shot demonstrations, while mT0 English ICL exhibits performance deterioration across languages as more demonstrations are added.

On TyDiQA, among English ICL baselines, mT0 shows strong performance up to four demonstrations, although its performance drops sharply once more demonstrations are added. Similar deterioration happens in BLOOMZ. In contrast, BLOOM performance improves as more shots are added. Despite using only 32 shots.

Unlike on Amazon Review or TyDiQA, on TyDiQA-QG BLOOMZ English ICL shows performance improvements with more demonstrations in Arabic and Bengali, reaching the highest QG performance in Bengali with four demonstrations. In contrast, both BLOOM and BLOOMZ show poor QG performance in Korean and Russian, possibly due to the lack of those languages during pretraining.

On WikiANN, all of the models improve when at least one demonstration is added, possibly due to the difficulty of learning the exact expected output format from the instruction alone. As in other datasets, mT0 reaches its highest performance with four demonstrations. mT5 Eng.+Tgt. FT exhibits performance drops with one shot, possibly because it overfits to the single example.

D.4 Variances of Different Instructions

We investigate the effectiveness of different English instructions on question generation for TyDiQA in zero- and four-shot settings, using mT0 and BLOOM as base models (Table 24). We compare four relevant instructions and one irrelevant instruction (an instruction for Amazon Review). In the zero-shot setting, the choice of instruction does not make much difference for either the instruction-tuned or the non-instruction-tuned model: irrelevant instructions are sometimes better than relevant ones.

In the four-shot setting, whether the instruction is relevant does not make a large difference for BLOOM, where we observed that random seeds affect performance more. In contrast, performance drops sharply when an irrelevant instruction is used with the instruction-tuned model. We also found that different models may favor different instructions for different languages. For example, in Swahili, four-shot BLOOM favors the first instruction, while mT0 favors the fourth instruction.

D.5 Qualitative Results for Generation Tasks

Table 25 shows some qualitative results of ChatGPT English ICL and Target ICL on XLSum and TyDiQA. Given English instructions, ChatGPT often generates summaries in English rather than in the article language. On the other hand, such cross-lingual generation behavior does not occur in QA tasks, and the model’s predictions with Target ICL and English ICL exhibit high overlap with each other. We hypothesize that this cross-lingual summarization behavior of ChatGPT may be related to its private training corpus, and future work can further investigate this issue.