Language Models are Few-shot Multilingual Learners

Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Rosanne Liu, Jason Yosinski, Pascale Fung

Introduction

The progress in language model (LM) pre-training Peters et al. (2018); Devlin et al. (2019); Radford et al. (2019); Yang et al. (2019); Liu et al. (2019a); Brown et al. (2020); Liu et al. (2020a); Lewis et al. (2020); Raffel et al. (2020); Gao et al. (2020a) has led to the possibility of conducting few-shot learning, that is, learning a new task using a small number of examples without any further training or gradient computation. Few-shot learning alleviates the cost for extensive labeled data, which is beneficial since collecting high-quality labeled data is resource-intensive and expensive. It also reduces the cost for model fine-tuning, which requires tremendous GPU or TPU resources. Few-shot learning can be seen as a one-for-all plug-and-play computational model that can be applied to various natural language tasks, from sentiment analysis for text classification to story generation, provided only a small context Brown et al. (2020).

The idea of few-shot learning is also relevant to addressing the low-resource issue in non-English languages. Few-shot learning has been applied to NLP tasks Brown et al. (2020); Madotto et al. (2020b); Lu et al. (2021); Perez et al. (2021); Liu et al. (2021a, b); Cahyawijaya et al. (2021a). Common approaches to the low-resource issue are to pre-train models with self-supervised learning using unlabelled monolingual text data collected from various resources available online Wilie et al. (2020); Le et al. (2020); Martin et al. (2020); Eddine et al. (2020); Nguyen and Nguyen (2020); Scheible et al. (2020); Bhattacharjee et al. (2021); Lee et al. (2020); Cahyawijaya et al. (2021b); Park et al. (2021) and then to pre-train on the source language and fine-tune on the target languages Schuster et al. (2019); Lin et al. (2019); Winata et al. (2019, 2021); Pfeiffer et al. (2020); Zheng et al. (2021); Lin et al. (2021b). In contrast, few-shot learning does not need any training on the source or target languages. Figure 1 shows that it is possible to utilize pre-trained models on non-English languages, such as Spanish: the performance is better than random, and it increases as the models are given more samples. We conjecture that pre-trained models may be able to adapt to languages that are similar to English. However, for many language tasks, it is difficult to collect a large supervised training dataset as language experts (e.g., linguists or native speakers) are required to annotate the data.

Another line of work is to apply cross-lingual transfer on English with the same task as the target languages Ponti et al. (2018); Artetxe and Schwenk (2019); Liu et al. (2019b); Lauscher et al. (2020); Liu et al. (2020b, 2021c); Chen et al. (2021). However, such methods still need a fine-tuning step to update the model for fast adaptation, which can be challenging for large pre-trained models (some of which require substantial memory capacity), since the models have to be trained on high-performing machines. Unlike the aforementioned methods, in-context learning using an LM does not involve any parameter updates. Thus, the process does not need to compute and store the gradients for backward propagation.

In this work, we investigate the practicality of applying few-shot learning in the multilingual setting for four languages, English, French, German, and Spanish, on natural language understanding intent prediction tasks using publicly available LMs that are mainly trained on English data. We show that, given a few English examples as context, pre-trained LMs can predict not only English test samples, but also non-English ones (Figure 2). To the best of our knowledge, no existing works have studied these tasks in multilingual settings. We conjecture that the English LMs can still produce good results on languages that are closely related to English. We construct the inference for the multi-class prediction setup by extending the idea from Madotto et al. (2020b) of applying multiple binary predictions on each class. Instead of guiding the model to generate true or false as in their work, which is not consistent and sometimes generates other words, we introduce maximum confidence prediction. This method uses the confidence of predicting a certain label to provide a prediction. We design this as a multiple-choice task in which the confidence of the prediction for all possible classes is compared. Each class’s confidence score is computed by normalizing the logits of generating the next boolean token given the prompt as the context. This method is more scalable than the simple $k$ -way few-shot learning, where all data must be put in a single prompt, for two reasons: the maximum sequence length is fixed, and, in deployment, each forward step can be run in parallel to speed up the process. To increase the difficulty of the challenge, we also propose a cross-lingual task, where the context and query are in different languages.

Overall, we find that conditional generative LMs, such as the GPT-2 Radford et al. (2019), GPT ${}_{\text{NEO}}$ models Gao et al. (2020a), and T5 models Raffel et al. (2020), are able to make predictions on non-English languages. Adding more shots and using larger models yields a substantial improvement in performance, making it significantly better than random, which indicates that the models are able to understand the prompt. We only focus on GPT and T5 models. T5 models do not perform as well as GPT models, which might be caused by the pre-training strategy. Experimental results in the cross-lingual setting demonstrate that pre-trained LMs make correct predictions. To summarize, our contributions are as follows:

We study few-shot learning in the multilingual setting on four languages without any gradient updates. We use the publicly available GPT and T5 LMs, and compare the results to those from the zero-shot and fine-tuning approaches.

We propose a simple and straightforward approach to perform few-shot learning on multi-class classification by applying binary prediction and considering the confidence of predicting the boolean tokens.

We display the zero-shot, one-shot, and many-shot proficiency of the LMs in the cross-lingual setting when the language of the prompt is different from the target language.

Few-shot Multilingual Learners

First, we briefly define the notation of the input and output of the task, and then we introduce our method to design prompts for few-shot in-context learning. The code is released at https://github.com/gentaiscool/few-shot-lm.

Let us define $D$ as the distribution over the dataset and $P$ as the prompt that we use as the input of the LM $\theta$ . The prompt $P=[D_{pos},D_{neg},Q]$ is a concatenation of few-shot samples: positive samples $D_{pos}$ , negative samples $D_{neg}$ , and the query $Q$ , where $D_{pos}$ , $D_{neg}$ $\sim$ $D$ . $D_{pos}$ is a sample with a label that is the same as the query, and $D_{neg}$ is a sample that is taken from the dataset $D$ with a label other than the query. $\theta$ takes $P$ as the input of the model, and the LM generates a word $y$ . We define the task $T_{s\rightarrow t}$ , where $s$ is the source language and $t$ is the target language.

In this paper, we focus on the intent detection task in the monolingual and cross-lingual settings. In the monolingual setting, the source language is the same as the target language, and in the cross-lingual setting, we take the source language as different from the target language ( $s\neq t$ ). We design our task as a multiple-choice problem, in which each sample has a label $l\in L$ , where $L$ is the set of possible labels. We predict the boolean (true or false) for each sample and take the highest prediction confidence.

2 Prompt Generation

We define the task by designing prompts to perform few-shot learning. We design our task as a binary classification for multi-class prediction by following Madotto et al. (2020b). The idea is to guide the model to predict the boolean tokens, true and false. We examine the usage of two types of LMs, GPT and T5 models, and we construct prompts specific to each model. We use a specific way to probe the LMs to perform the few-shot prediction since they are trained with different learning objectives. Table 1 shows the format of the prefix we use for the GPT and T5 models.

$X_{i}$ is one of the few-shot samples, and $X_{i}^{*}$ is the sample from other classes. For the GPT models, we only input the prefix by concatenating positive and negative samples with the query. Specifically for the T5 models, we add an additional token after the query and let the model predict that particular token during the generation step.

Figure 2 shows an example of how we generate the prompt in $k$ -shot settings. We create $L$ prompts and apply $L$ forward steps for each sample. For each prompt, $k$ positive and negative samples are randomly drawn from the dataset. It is worthwhile to note that the sampling method is similar to $k$ -way few-shot learning, but the samples are not merged into a single prompt. We do this because we want to give more shots as the prompt to the LMs as they have a limitation on the number of tokens they can accept as input (1,024 tokens in GPT-2 ${}_{\text{XL}}$ and 2,048 tokens in GPT ${}_{\text{NEO}}$ ). We add a special token \n as a separator between each sample, as shown in Table 1.
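The prompt construction described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `build_prompts`, the `=>` delimiter, and the lowercase `true`/`false` strings are assumptions made for this example (the actual prefix format is the one given in Table 1). Only the `\n` separator and the one-prompt-per-label scheme come from the text. The sketch places positive samples before negative ones.

```python
import random


def build_prompts(query, pool, labels, k, seed=0, sep="\n"):
    """Build one binary prompt per candidate label.

    pool:   list of (text, label) pairs to draw the few-shot samples from.
    labels: the set of candidate labels L.
    Returns a dict {label: prompt string}, i.e. |L| prompts for one query.
    """
    rng = random.Random(seed)
    prompts = {}
    for label in labels:
        pos = [text for text, lab in pool if lab == label]
        neg = [text for text, lab in pool if lab != label]
        # Draw up to k positive and k negative samples for this label.
        shots = [(t, "true") for t in rng.sample(pos, min(k, len(pos)))]
        shots += [(t, "false") for t in rng.sample(neg, min(k, len(neg)))]
        # Hypothetical template: "<text> => <true|false>", query left open.
        lines = [f"{text} => {flag}" for text, flag in shots]
        lines.append(f"{query} =>")
        prompts[label] = sep.join(lines)
    return prompts
```

Each of the returned prompts would then be passed through the LM in a separate forward step, so the number of shots per prompt is bounded by the input length rather than by the number of classes.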

3 Maximum Confidence Prediction

To get the final prediction of each sample, first, we compute the score of predicting the next boolean (true or false) given the prompt $X_{i}$ for label $i$ : $P_{\theta}(y=\texttt{true}|X_{i})$ and $P_{\theta}(y=\texttt{false}|X_{i})$ from the prediction distribution. Then, we normalize the score to get the probability of generating the true token to measure how much confidence the LM has to predict label $i$ . We collect all the confidence scores over all label options and choose the highest confidence score among them, as follows:

$$\text{MC}(X,L)=\underset{i\in L}{\operatorname{argmax}}\;\frac{P_{\theta}(y=\texttt{true}|X_{i})}{\sum_{b}P_{\theta}(y=b|X_{i})},$$

where $b\in\{\texttt{true},\texttt{false}\}$ . We take the label with the highest confidence score as $\text{MC}(X,L)$ .
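As a rough sketch of maximum confidence prediction, the snippet below takes, for each label, the LM's next-token logits for the two boolean tokens and normalizes them against each other. Normalizing the two logits with a softmax is equivalent to dividing the probability of `true` by the sum of the probabilities of `true` and `false`, since the vocabulary-wide normalizer cancels. The function names and the input layout (a dict of logit pairs) are assumptions for illustration.

```python
import math


def confidence(true_logit, false_logit):
    """Probability of `true` after normalizing over the two boolean tokens."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


def max_confidence_prediction(scores):
    """scores: {label: (true_logit, false_logit)}, one entry per prompt X_i.

    Returns the label with the highest confidence and all confidences.
    """
    conf = {label: confidence(t, f) for label, (t, f) in scores.items()}
    return max(conf, key=conf.get), conf
```

Because each label's confidence is computed from its own prompt, the per-label forward passes are independent of one another and can be batched or run in parallel.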

4 Choices of Samples

For in-context learning, choosing the order of samples is essential Lu et al. (2021). Here, we examine the impact of the order of the samples. We construct the probing set in two ways: (1) shuffle the few-shot samples and measure the variance in performance after changing their order, and (2) arrange the positive samples before the negative samples. We find that the latter works well, specifically on the T5 models.
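The two ways of constructing the probing set can be illustrated with a small helper. This is a sketch under assumed names (`order_shots`, the `strategy` strings) and an assumed data layout in which each few-shot sample carries a flag saying whether it is positive.

```python
import random


def order_shots(shots, strategy, seed=0):
    """shots: list of (text, is_positive) pairs. Returns a new ordered list."""
    if strategy == "shuffle":
        out = list(shots)
        random.Random(seed).shuffle(out)  # different seeds give different orders
        return out
    if strategy == "positive_first":
        # sorted() is stable, so the original order inside each group is kept.
        return sorted(shots, key=lambda s: not s[1])
    raise ValueError(f"unknown strategy: {strategy}")
```

Running the `shuffle` strategy with several seeds and comparing accuracies is one way to measure the variance that comes from sample order alone.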

Baselines

In this work, we compare the few-shot learning performance with other common approaches: zero-shot, zero-shot cross-task, and fine-tuning.

One way to solve zero-shot prediction is by using entailment models to calculate the entailment score between sequences and labels. Given a pre-trained LM $\psi$ with an entailment head, a set of hypotheses $H$ , and possible labels $L$ , the model accepts two inputs, the hypothesis $h\in H$ and label $l\in L$ , and generates the entailment score $P_{\psi}(y=\texttt{entail}|h,l)$ for any combination of hypothesis and label.
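A minimal sketch of this zero-shot cross-task scheme is given below. It assumes that the predicted label is the one with the highest entailment score, which is the usual choice but is stated here as an assumption. The scorer is passed in as a function; `toy_entail_score` is a hypothetical word-overlap stand-in used only to make the example runnable, not a real entailment model.

```python
def zero_shot_entailment(hypothesis, labels, entail_score):
    """Score every (hypothesis, label) pair and return the best label.

    entail_score: callable (h, l) -> float, standing in for P_psi(y=entail | h, l).
    """
    scores = {label: entail_score(hypothesis, label) for label in labels}
    return max(scores, key=scores.get), scores


def toy_entail_score(hypothesis, label):
    """Toy stand-in scorer: fraction of the label's words that occur in h."""
    words = label.lower().split()
    h_words = set(hypothesis.lower().split())
    return sum(w in h_words for w in words) / len(words)
```

In practice, `entail_score` would wrap a model with an entailment head, such as the XNLI or MNLI fine-tuned models used as baselines in the experiments.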

2 Zero-shot In-Context Learning

This approach is very similar to our few-shot approach. It does not need any samples, and the model is only given natural language instruction. However, instead of using the prompt like in the few-shot setting, we can set up the prompt in a question-and-answer (Q&A) format as follows:

3 Fine-tuning

Fine-tuning is the most common approach to updating a pre-trained model’s weights when training with a labeled dataset. The advantage of this approach is strong performance since we give supervised signals with the correct labels to the model. For fine-tuning, we use the same sets of few-shot samples as in the in-context learning. In Section 4.2, we provide the hyper-parameters used in the experiments.

Experiments

We use an English natural language understanding (NLU) dataset, SNIPS Coucke et al. (2018), and two multilingual NLU datasets, MTOP Li et al. (2021) and Multilingual NLU (MultiNLU) Schuster et al. (2019). MTOP includes four languages, English (en), French (fr), German (de), and Spanish (es), and Multilingual NLU includes two languages, English (en) and Spanish (es). We measure the model performance by calculating the average and standard deviation of the accuracy with three runs.
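The reported metric can be computed as in the sketch below. Whether the standard deviation is the sample (n-1) or population form is not stated, so the sample form used here is an assumption, as is the function name.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

For example, three runs with accuracies of 80.0, 82.0, and 84.0 would be reported as 82.0 with a standard deviation of 2.0.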

2 Experiment Settings

We set up the experiment in two settings: monolingual and cross-lingual. In the monolingual setting, we test the ability of the model to conduct few-shot in-context learning on four languages: English (en), French (fr), German (de), and Spanish (es). In the cross-lingual setting, we test its ability to predict a query from a non-English language with the English context (en $\rightarrow$ XX). In the few-shot in-context learning, we use $k$ -way few-shot classification, taking $k$ samples. For each model, we take $k\in\{0,5,K\}$ , where $K\leq 40$ is the largest number of few-shot samples that can be passed to the model as input and is divisible by 10 without exceeding the maximum input token limit. We utilize an NVIDIA Tesla V100 16GB GPU to run the inference so that the model is ensured to fit in a single GPU, and we use 16-bit precision.
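The choice of $K$ can be sketched as a search over multiples of 10 up to 40. The helper below is an illustration only: `largest_k` and the `prompt_len` callback are hypothetical names, and a real run would measure prompt length with the model's own tokenizer.

```python
def largest_k(prompt_len, max_tokens, k_cap=40, step=10):
    """Largest multiple of `step` (at most `k_cap`) whose prompt fits.

    prompt_len: callable k -> number of tokens in a k-shot prompt.
    Returns 0 if not even `step` shots fit within `max_tokens`.
    """
    best = 0
    for k in range(step, k_cap + 1, step):
        if prompt_len(k) <= max_tokens:
            best = k
    return best
```

As a made-up example, if every sample and the query took 30 tokens, a prompt with $k$ positive and $k$ negative samples would need `2 * k * 30 + 30` tokens, giving a value of 10 for a 1,024-token limit and 30 for a 2,048-token limit.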

We run experiments on a variety of publicly available models: four sizes of GPT-2 models (0.1B, 0.3B, 0.8B and 1.6B), three sizes of GPT ${}_{\text{NEO}}$ models (1.3B, 2.7B, and 6B), and two sizes of T5 models (0.8B and 3B). The models except GPT ${}_{\text{NEO-J}}$ are taken from https://huggingface.co/, and the GPT ${}_{\text{NEO-J}}$ model is taken from https://github.com/kingoflolz/mesh-transformer-jax/. Table 3 shows the details of each pre-trained model.

Baselines

We use the same sets of few-shot samples for the baselines. We run fine-tuning on the pre-trained models mBERT Devlin et al. (2019) and XLM-R Conneau et al. (2020). We also compare our models with the zero-shot cross-task models using pre-trained models XLM-R, fine-tuned on XNLI Conneau et al. (2018), and BART, fine-tuned on MNLI Williams et al. (2018); a random baseline; and state-of-the-art results reported on each dataset. The XLM-R model fine-tuned with XNLI data can be accessed at https://huggingface.co/joeddav/xlm-roberta-large-xnli, and the BART model fine-tuned with MNLI data can be accessed at https://huggingface.co/facebook/bart-large-mnli. For the fine-tuning, we use a learning rate of 5e-5 with a decay of 0.9 for every epoch, and a batch size of 32. We apply early stopping after 5 epochs without any improvement on the validation set.
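The fine-tuning schedule can be illustrated with the sketch below. It assumes that the decay is applied multiplicatively once per epoch starting from epoch 0, and that "improvement" means a strictly higher validation score. Both are interpretations of the description above, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=5e-5, decay=0.9):
    """Learning rate used during `epoch` (0-indexed) under per-epoch decay."""
    return base_lr * decay ** epoch


def stopping_epoch(val_scores, patience=5):
    """Index of the epoch at which training stops.

    Stops once `patience` consecutive epochs pass without beating the best
    validation score; otherwise runs through the last epoch in `val_scores`.
    """
    best, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_scores) - 1
```

Under these assumptions the learning rate in the third epoch is 5e-5 × 0.9² = 4.05e-5.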

Results and Analysis

Tables 2 and 4 show the results in the monolingual and cross-lingual settings, respectively. The tables show that the performance improvement is highly related to the size of the pre-trained model, and the performance gap between the fully trained state-of-the-art model and the few-shot learning models is decreasing when we use larger models, indicating the usefulness of utilizing models of bigger sizes. The performance of the models with few-shot learning is considered promising as they are not trained at all and the best model’s performance gap with the fine-tuned model is less than 10%.

Comparing the generative models to the fine-tuned models, we can clearly achieve higher accuracy without any training. However, we acknowledge that the GPT and T5 models we use for in-context learning are larger than the models we fine-tune. Few-shot learning is also much more memory-efficient, since the models are not required to compute and store gradients for backward propagation. In terms of inference speed, the few-shot models require more time to run an inference step, which may cause a bottleneck when the number of few-shot samples is relatively large. This is a limitation of the method, and reducing the inference time of in-context learning is an open research area.

Zero-shot cross-task baselines.

Surprisingly, the zero-shot cross-task models are able to predict the samples much better than the random baseline, particularly on English tasks. Overall, the XLM-R ${}_{\text{LARGE}}$ model performs better than the BART ${}_{\text{LARGE}}$ models in all tasks except SNIPS.

GPT vs. T5 models.

In general, the GPT models outperform the T5 models in all language pairs and datasets in a head-to-head comparison: GPT-2 ${}_{\text{LARGE}}$ and T5 ${}_{\text{LARGE}}$ have a similar number of parameters (0.8B), but they differ significantly in performance. A similar pattern can also be observed on larger models, such as GPT ${}_{\text{NEO}}$ 2.7B and T5 ${}_{\text{3B}}$ . Although the T5 models perform worse than the GPT models, they do not have a maximum token size for the input, as the GPT models do, which is one of the advantages of using them. On the other hand, we find that changing the sample order tremendously affects the performance of the T5 models. As shown in Tables 2 and 4, the performance increases substantially when we sort the few-shot samples based on their label (i.e., first all positive and then all negative examples). Conversely, the GPT models suffer a loss in performance. Thus, we conclude that changing the sample order may produce high variance in the results, as also shown in Lu et al. (2021).

Effectiveness on non-English languages.

Based on the results, the performance of the models is lower in the non-English languages than in English. These results are expected since the pre-trained models are mostly trained on English data. However, the differences in performance are marginal. This finding may indicate that our few-shot learning method can be effectively utilized for languages that are in the same language family as English, such as French, German, and Spanish, but this will require further investigation in the future.

Cross-lingual results.

Based on the results in Table 4, we can see that the generative models are able to use the context from English to predict the sample in non-English languages. The cross-lingual setting is considered harder than the monolingual one since the models need to contextualize and understand both the source and target languages to predict the test samples correctly. In general, the trend of the results in the cross-lingual setting is similar to that in the monolingual setting. In the MTOP dataset, we find that the models generally achieve higher performance for en $\rightarrow$ es than for the other two target languages (de and fr). In MultiNLU, our GPT ${}_{\text{NEO-J}}$ nearly closes the gap with the existing state-of-the-art fine-tuned baseline from Liu et al. (2020b), underperforming it by only around 4.2%, and its performance is less than 3% worse than that of the Translate-Train model. These results show a promising new direction in zero-shot cross-lingual research that can be applied to other datasets and language pairs.

2 Ablation Study

To further understand how much data we need for the in-context learning, we conduct experiments with different numbers of few-shot samples, including zero-shot experiments on the MTOP and MultiNLU datasets.

Figures 3, 4, 5, and 6 illustrate the results with different numbers of samples on the MTOP dataset in the monolingual setting. We show a different set of $k$ -shot results for each model according to the maximum number of samples that the model can take as input. The results consistently improve as the number of shots increases. Interestingly, the QA-style zero-shot strategy outperforms random prediction on only two or three models in each language, and it is worse than random on the others. The fine-tuning results on MTOP are far worse than those of few-shot learning.

MultiNLU dataset.

Figures 7 and 8 illustrate the results with different numbers of samples on the MultiNLU dataset in the monolingual setting. The results on MultiNLU for the models with fine-tuning are closer to those of few-shot learning than those on the MTOP dataset. The reason may be the difference in the number of labels between the MTOP dataset and MultiNLU. We also observe that the zero-shot performance of the GPT models is sometimes worse than that of the random baseline.

Related Work

Recent work on few-shot in-context learning uses LMs to solve NLP tasks Petroni et al. (2019); Brown et al. (2020); Gao et al. (2020b); Madotto et al. (2020b); Zhao et al. (2021); Schick and Schütze (2021); Lin et al. (2021a). In this approach, we select the appropriate prompts to trigger the LMs to behave so that they can predict the desired output Liu et al. (2021b). However, the prompts have to be engineered to allow the LM to generate a text appropriate to solve the task. Learning to calibrate the few-shot results is also essential to reduce the model’s performance variance Zhao et al. (2021), and the selection criteria in choosing the prompts are also important Perez et al. (2021). In another stream of work, Shin et al. (2020); Li and Liang (2021) proposed an automated method to create prompts for a diverse set of tasks by gradient-based tuning instead of manually searching for a good prompt. Although such a method may make it easier to find an optimal prompt, it is still very difficult to discover the optimal prompts for complicated natural language processing tasks, such as semantic parsing Liu et al. (2021b).

2 Pre-trained Language Models

Recent advances in pre-trained LMs have focused on building pre-trained encoders, such as BERT Devlin et al. (2019), RoBERTa Liu et al. (2019a), ELMo Peters et al. (2018), ULMFiT Howard and Ruder (2018), ELECTRA Clark et al. (2019), XLM Conneau and Lample (2019), and XLM-R Conneau et al. (2020); Goyal et al. (2021), decoder-only models, such as GPT models Radford et al. (2019); Brown et al. (2020), and encoder-decoder models, such as T5 Raffel et al. (2020), BART Lewis et al. (2020), and their multilingual versions, mT5 Xue et al. (2021) and mBART Liu et al. (2020a).

Pre-trained encoders have been used to improve the contextualized representations of multilingual systems in various NLP tasks, for example, dialogue systems Liu et al. (2020b, 2021d); Li et al. (2021), code-switching sequence labeling Aguilar et al. (2020); Winata et al. (2021); Winata (2021), and multilingual speech recognition Datta et al. (2020); Winata et al. (2020). Meanwhile, the pre-trained encoder-decoder models have been used for various sequence generation tasks, such as summarization Raffel et al. (2020), conversational agents Lin et al. (2020b, a); Madotto et al. (2020a); Wu and Xiong (2020); Hosseini-Asl et al. (2020); Lin et al. (2021b), and knowledge grounding Chen et al. (2020); Zhao et al. (2020).

Conclusion

This paper demonstrates the multilingual skills of pre-trained LMs, GPT and T5, in conducting in-context learning without parameter updates. This work is our initial attempt to show the effectiveness of in-context learning in the multilingual and cross-lingual setting. It covers four different languages and explores the possibility of conducting efficient inference on low-resource tasks. We find that LMs can predict samples correctly, significantly better than random prediction, in cross-lingual tasks with no training examples of the target languages. We would like to further investigate the applicability of this method to other tasks and languages in future work.

Acknowledgment

We want to thank Bryan Wilie and Samuel Cahyawijaya for their support in accessing the cloud service. We also sincerely thank Zihan Liu and ML Collective members for helping with the discussion about this project.

References

Appendix A Full k-shot Results

This appendix shows the results on few-shot monolingual and cross-lingual settings on SNIPS, MTOP, and multilingual NLU datasets over a different number of samples.