Multilingual LLMs are Better Cross-lingual In-context Learners with Alignment

Eshaan Tanwar, Subhabrata Dutta, Manish Borthakur, Tanmoy Chakraborty

Introduction

The emergence of large-scale, pretrained, Transformer-based language models (LLMs) has marked the start of a new era in NLP. Departing from traditional neural language learning, with its temporally separated training and testing phases for downstream tasks, pretrained LLMs have shown the ability to infer labels from test inputs conditioned on the training data within a single pass. This is known as in-context learning (ICL): an LLM is prompted with a few input-output pairs from the training data (commonly referred to as demonstrations) followed by the test input. For generative tasks (summarization, text-to-code, chain-of-thought reasoning, etc.), the LLM is then required to produce an output; for classification tasks, the probabilities of the next tokens predicted by the LLM are mapped to the label space. All of this is done without updating the parameters of the LLM. ICL is particularly promising in two respects. First, it reduces the need for task-specific training data and thus the cost of human annotation. Second, although the LLM itself was trained in a compute-intensive environment, removing the need for task-specific gradient-based weight updates can significantly reduce the carbon footprint of automated NLP/NLU, since inference-time compute is orders of magnitude smaller than that of the training or fine-tuning phases. Multiple recent advancements have been proposed to optimize the ICL ability of LLMs (Lin et al., 2021; Chowdhery et al., 2022; Liu et al., 2022; Zhang et al., 2021).

Challenges in cross-lingual ICL: Given the order-of-magnitude discrepancy in the availability of annotated data between high-resource and low-resource languages, the ability to learn from a high-resource source context to solve tasks in low-resource targets is enticing. Yet the application of ICL in a cross-lingual setting remains largely unexplored. Previous attempts at multilingual ICL (Zhang et al., 2021; Winata et al., 2021) use randomly selected input-label pairs to construct the prompt-context. This limits the ability of an LLM to infer from the context. As Xie et al. (2022) suggested, ICL emerges as the ability to infer target labels from the pretraining distribution conditioned upon the context; each input-label pair in the prompt-context is, in turn, sampled from the prompt token distribution. Theoretically, the expected prediction error decreases as the number of examples in the prompt increases. However, the infinitely long prompts this would require are practically infeasible. Xie et al. (2022) posited that distinguishability of the prompt-concept, shared across the prompt-examples, from all other possible concepts is essential for an optimal predictor. A random sampling of prompt examples is unlikely to construct a prompt with distinguishable concepts. Furthermore, given $(x_{i},y_{i})$ and $(x_{i+1},y_{i+1})$ as two consecutive input-label pairs in the prompt-context, the transition from $y_{i}$ to $x_{i+1}$ has low probability under the pretraining distribution (Xie et al., 2022). The transition becomes even more improbable if we simply append a test example in a different language to the prompt-context. Consider the following example of ICL prompting for cross-lingual sentiment classification:

The text segments are concatenated from left to right and top to bottom; therefore, two English input-label pairs are followed by a Spanish test input. There are irremovable, token-level low-probability transitions from the labels to the next input sentences. On top of this, three completely unrelated sentences are juxtaposed with an abrupt change in language. Intuitively, an LLM is less likely to map the third input to its correct label, positiva (positive in Spanish), when it has to follow the convoluted patterns presented in English.

Proposed approach: We seek to develop prompt-design strategies for ICL in a cross-lingual setting that can overcome the foregoing challenges. A two-way alignment of the source and target examples is proposed. We start by injecting semantic coherence into the prompt-context by selecting similar examples; this aligns the labeled demonstrations as well as the test inputs to share a set of common concepts. Next, we seek to enforce an alignment of task-level signals across languages. We introduce manually designed task-specific mappings from the source language to the target language, thereby providing the LLM with a ‘natural’ transition from the former to the latter. Together, these two approaches constitute our proposed prompt-selection strategy, X-InSTA (Cross-lingual In-context Source-Target Alignment, see Figure 1 for working examples). X-InSTA shows an 18% relative improvement over random prompt selection, averaged across three different text classification tasks in multiple languages with English as the source language. Careful perturbations to these alignment methods disclose the importance of the label space structure induced by LLMs for cross-lingual ICL.

Our contributions are summarized below (code available at https://github.com/EshaanT/X-InSTA):

We propose X-InSTA, a novel method of aligning prompt examples in a cross-lingual scenario. To the best of our knowledge, this is the first attempt to push prompt design techniques for ICL in cross-lingual settings beyond the trivial strategy of random example selection.

We present the first in-depth analysis of the role of semantic similarity between prompt examples for cross-lingual ICL.

A novel concept of task-based prompt alignment is presented. We show its efficacy with 44 different source-target language pairs and empirically relate this to the underlying structures of multilingual representations of the LLM.

Prompting Techniques

In this section, we lay out a step-by-step approach to aligning semantic coherence and task-based signals across source-target examples for ICL prompts.

Let $D_{s}=\{(x^{i}_{s},y_{s}^{i})\}_{i}$ be a monolingual labeled dataset in language $s$ , realized as a collection of input examples and their labels, $x^{i}_{s}\in X_{s}$ and $y_{s}^{i}\in Y_{s}$ , respectively. Here $Y_{s}$ is the natural language label space in language $s$ . We have another collection of input examples, $D_{t}=\{x^{i}_{t}\}_{i}$ , with examples in language $t$ . One can define a cross-lingual text classification task with source and target languages being $s$ and $t$ in the following manner. First, we select $k$ input-label pairs from $D_{s}$ to construct the prompt-context, $C$ :

where $[sep]$ denotes a separator token (e.g., newlines), and $\oplus$ denotes the concatenation operator. The problem of in-context prediction then translates to inferring the label $y_{t}\in Y_{t}$ , where $Y_{t}$ is the natural language label space in language $t$ corresponding to the test input $x_{t}\in D_{t}$ conditioned on the prompt-context $C$ , as follows:

i.e., among the labels in the target label space, we select the one to which the model assigns the maximum probability as the next token after the test input $x_{t}$ appended to the context $C$. The source and target label spaces, $Y_{s}$ and $Y_{t}$, are in one-to-one correspondence under translation from $s$ to $t$.
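The two display equations referred to in this passage are not reproduced in this text. A reconstruction consistent with the surrounding definitions is sketched below; the exact placement of the separator tokens is an assumption rather than something the text confirms.

$$C = x_{s}^{1} \oplus y_{s}^{1} \oplus [sep] \oplus x_{s}^{2} \oplus y_{s}^{2} \oplus [sep] \oplus \cdots \oplus x_{s}^{k} \oplus y_{s}^{k} \oplus [sep]$$

$$y_{t} = \underset{y \in Y_{t}}{\arg\max}\; p\left(y \mid C \oplus x_{t}\right)$$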

One of the most widely used methods of constructing the context $C$, which we will henceforth call random prompting, is to randomly select $k$ pairs $(x^{i}_{s},y^{i}_{s})$ from $D_{s}$ and concatenate them. We explore this method in our analysis, and it serves as the baseline for our experiments.
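As a minimal sketch of random prompting, the snippet below samples $k$ source pairs uniformly and concatenates them ahead of the test input. The function names, the space between an input and its label, and the newline separator are illustrative assumptions; the text only says the separator can be, e.g., a newline.

```python
import random

SEP = "\n"  # assumed separator token; the text only says "e.g., newlines"


def build_context(pairs, sep=SEP):
    """Concatenate (input, label) pairs into a prompt-context string."""
    return "".join(f"{x} {y}{sep}" for x, y in pairs)


def random_prompt(source_data, test_input, k, seed=0):
    """Random prompting: sample k source pairs uniformly, then append the test input."""
    rng = random.Random(seed)          # seeded only to make the sketch reproducible
    demos = rng.sample(source_data, k)  # sampling without replacement
    return build_context(demos) + test_input
```

The resulting string would then be passed to the LLM, and the next-token probabilities of the target-language labels compared.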

2.2 Semantic Alignment

Chang et al. (2022) showed that multilingual models encode different languages in a shared embedding space while still preserving language-sensitive semantic information. Despite the language difference between source and target inputs, $x_{s}$ and $x_{t}$, it is then likely that their semantic similarities will be reflected in the hidden representations constructed by the LLM. Therefore, we hypothesize that choosing semantically similar examples to construct the prompt-context would help the model do in-context inference. That is, if ${\bf e}_{t}$ is the embedding of the target and ${\bf e}_{s}$ that of the source, the higher the similarity score between them, the better the sentence $x_{s}$ will serve as a demonstration for the target sentence $x_{t}$.

Inspired by Liu et al. (2022), we extract prompt examples directly dependent on the test input distribution. Here we utilize multilingual sentence-transformers (Reimers and Gurevych, 2020) to extract the sentence embeddings of the test input $x_{t}\in D_{t}$ and the source inputs $X_{s}$. Based on the cosine similarity between the target input $x_{t}$ and each source input $x^{i}_{s}\in X_{s}$, we then extract the top $k$ demonstrations (see Algorithm 1). While the target input and the demonstrations differ in language, we hypothesize that by pairing semantically similar context demonstrations and input sentences, the LLM would be able to improve its reasoning ability and, subsequently, the final task performance (see Table 11 in Appendix D for examples of such aligned demonstrations).
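The selection step can be sketched as follows, assuming the sentence embeddings have already been computed by a multilingual sentence encoder and are available as plain lists of floats. The function names are hypothetical, and this is an illustration of top-$k$ cosine-similarity retrieval rather than a reproduction of Algorithm 1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(target_emb, source_embs, k):
    """Return indices of the k source embeddings most similar to the target,
    ordered from most to least similar."""
    scores = [cosine(e, target_emb) for e in source_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

The returned indices would pick the demonstrations $(x^{i}_{s}, y^{i}_{s})$ from $D_{s}$ that form the prompt-context for a given $x_{t}$.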

2.3 Task-based Alignment

Despite the semantic coherence enforced within the prompt-context via the previously mentioned method, the source and target label spaces, $Y_{s}$ and $Y_{t}$, remain superficially disconnected. For fine-tuning, techniques like meta-learning (Nooralahzadeh et al., 2020) and adapters (Parović et al., 2022) have been used to bridge this gap. For in-context prompting, in which context matters the most, we propose to do so by adding a manually designed statement that gives the LLM task-specific information such as the target language and the target label space.

Task-based alignment is done by appending a manually designed statement, called a task aligner, to the context. This aligner is meant to inform the LLM about the mapping from the source label space $Y_{s}$ to the target label space $Y_{t}$. We do task alignment by first manually creating $D_{l}=\{L_{s,t}\}$ for a given task and each source-target language pair $(s,t)$, as a collection of statements in the source language that emphasize what the target labels and language are. For example, when the source is English and the target is Spanish, “In Española bad means malo and good means bueno” will be the said task aligner, which conveys that the target language is Española (Spanish) and the target labels are malo and bueno (bad and good, respectively). Next, we construct the prompt-context by randomly selecting $k$ source-language examples, followed by the task aligner for this source-target pair from $D_{l}$ (see Algorithm 2). For more examples of task-aligned prompt design, please refer to Tables 11 and 12 in Appendix D.
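A minimal sketch of a task-aligned prompt is given below. The dictionary `ALIGNERS` stands in for $D_{l}$ and holds only the English-to-Spanish statement quoted in the text; the language codes, the function name, and the separator placed after the aligner are assumptions made for illustration.

```python
# Hypothetical stand-in for D_l, keyed by (source, target) language codes.
# The single entry is the English-to-Spanish aligner quoted in the text.
ALIGNERS = {
    ("en", "es"): "In Española bad means malo and good means bueno",
}


def task_aligned_prompt(demos, test_input, source, target, sep="\n"):
    """Concatenate source demonstrations, then the task aligner, then the test input."""
    context = "".join(f"{x} {y}{sep}" for x, y in demos)
    aligner = ALIGNERS[(source, target)]  # raises KeyError for an unknown pair
    return context + aligner + sep + test_input
```

Because the aligner depends only on the task and the language pair, it can be written once and reused for every test input.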

2.4 X-InSTA

Finally, our proposed method, X-InSTA, combines semantic alignment with task-based alignment. It first selects the source examples from $D_{s}$ with the top-$k$ similarity scores, as described in Section 2.2. Additionally, we select the task aligner from $D_{l}$ depending on the source and target languages and the task. Finally, we construct the prompt-context by concatenating the selected examples followed by the task aligner. The final label inference can be described as

where $\operatorname{sim}(x^{i}_{s},x_{t})\geq\operatorname{sim}(x^{i+1}_{s},x_{t})$ , and $L_{s,t}\in D_{l}$ is the task aligner for source and target languages $s$ and $t$ , respectively for the given task.
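The display equation that this clause qualifies is not reproduced in this text. A reconstruction consistent with the definitions of the prompt-context, the similarity ordering, and the task aligner is sketched below; the positions of the separator tokens around $L_{s,t}$ are an assumption.

$$y_{t} = \underset{y \in Y_{t}}{\arg\max}\; p\left(y \mid x_{s}^{1} \oplus y_{s}^{1} \oplus [sep] \oplus \cdots \oplus x_{s}^{k} \oplus y_{s}^{k} \oplus [sep] \oplus L_{s,t} \oplus [sep] \oplus x_{t}\right)$$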

Results and Analysis

We experiment on three datasets: Multilingual Amazon Reviews Corpus (MARC; Keung et al., 2020), Cross-language sentiment classification (CLS; Prettenhofer and Stein, 2010), and HatEval (Basile et al., 2019), spanning twelve language-task pairs and totalling $44$ cross-lingual setups (refer to Appendix A for further description of the datasets). The results on MARC, CLS and HatEval are shown in Tables 1, 2, and 3, respectively. For our main experiments, we use the 7.5-billion-parameter variant of XGLM (Lin et al., 2021). We experimented with various models under random prompting and selected XGLM 7.5B for its superior performance on various tasks (refer to Table 8 in Appendix B). For further details on the experimental setup, please refer to Appendix C, and to Table 10 for the language abbreviations used.

Semantic Alignment: The improvement introduced by semantic alignment of the prompt-context over randomly selected source examples is evident in Tables 1, 2, and 3. On the MARC dataset, we observe a 14% improvement in macro F1 scores averaged across different languages. The other datasets show gains as well: 10% on HatEval and 6% on CLS. This improvement over random example selection is consistent across all language pairs considered in this experiment, except English-to-German in CLS. This is particularly noteworthy, and it might lead one to conclude that dynamically selecting prompt examples based on semantic similarity aligns the LLM to become a better in-context learner irrespective of the task and the languages.

Task-based Alignment: Just by adding a task aligner, we not only outperform random prompts but also bring substantial improvements for similarity prompting, even though it is not dynamically varying with input sentences. The improvement is 18% in CLS, 8% in HatEval, and 15% in MARC, in terms of macro F1 scores averaged over different language pairs.
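For readers who want to reproduce this kind of comparison, the sketch below computes a macro F1 score and a relative gain over a baseline. It assumes the reported percentages are relative improvements, as the introduction states for the overall figure; the plain-Python `macro_f1` is a stand-in for the scikit-learn call mentioned in Appendix C, and both function names are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score
```

Averaging such relative gains over language pairs yields one summary figure per dataset.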

However, some languages, such as German in MARC and English in HatEval, produce near-random predictions in all the setups we experimented with. This might be due to the model’s inability to perform ICL on these tasks in a cross-lingual manner for these languages. Previous studies observed such phenomena in monolingual ICL (Webson and Pavlick, 2022; Lin et al., 2021); cross-lingual ICL has added nuances that make it even more difficult.

We also see a performance drop in the case of Mandarin in MARC (Table 1) while adding a task aligner. We investigate the performance drop and near-random results of German further.

X-InSTA: This prompting mechanism inherits the benefits of both semantic and task-based alignment, and hence gives the best results for most language pairs. However, like task-based alignment, X-InSTA performs poorly on some target languages. The improvement is 23% on MARC, 22% on CLS, and 14% on HatEval. We also note that no single language is consistently the best source language.

3.2 Why does Task Alignment Work?

Next, we seek to validate the performance boost achieved via task-based aligners along with an attempt to explain the drop in performance with Mandarin and German. We vary the task aligner and note its effect on the output. We do so in five different variations along with the original method (see Table 12 in Appendix D for detailed examples of each scenario):

No aligner prompt added: Same as random prompting.

Making the label space uniform: Across all source-target setups, we set the source-label distribution as output for the target too, reducing the need for task alignment.

Only language information: Only giving the language information to LLM, without providing any further label information. An example of such an aligner would be ‘The following post is in French language’, in a case when the source is English, and the target is French.

Providing aligner but of a third unrelated language: We set the aligner of a third language. For example ‘In Spanish bad means malo and good means bueno.’, in a case when the source is English and the target is French.

Incorrect aligner: Making the aligner incorrect corresponding to the label space. For example ‘In French bad means bien and good means mal.’, in a case when the source is English and the target is French.
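The aligner strings for these variations can be generated programmatically, as in the sketch below. The function name and dictionary keys are hypothetical; the templates follow the example statements quoted in the list. The uniform-label-space variation is not an aligner string (it changes the target label space itself), so it is left out of the sketch.

```python
def aligner_variants(target_lang, labels, third_lang, third_labels):
    """Build aligner strings for the ablation.

    labels and third_labels map each source label to its translation
    in the target language and in an unrelated third language, respectively.
    """
    def statement(lang, mapping):
        parts = [f"{src} means {tgt}" for src, tgt in mapping.items()]
        return f"In {lang} " + " and ".join(parts) + "."

    # Incorrect aligner: reverse the order of the translations across the labels.
    swapped = dict(zip(labels.keys(), reversed(list(labels.values()))))
    return {
        "no_aligner": "",  # same as random prompting
        "language_only": f"The following post is in {target_lang} language",
        "third_language": statement(third_lang, third_labels),
        "incorrect": statement(target_lang, swapped),
        "original": statement(target_lang, labels),
    }
```

For an English source and a French target, `aligner_variants("French", {"bad": "mal", "good": "bien"}, "Spanish", {"bad": "malo", "good": "bueno"})` yields the example statements listed above.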

It’s all about the label information: In Table 4, we note the importance of label space information. Providing the model with language information does improve the performance; however, the improvement is minuscule compared to the improvement achieved via task aligners. Label information, even when it comes from an unrelated third language, still helps the model predict better. This might be because the model attends more closely to the label space during inference. This showcases the importance of label information when going cross-lingual.

Why the drop in some languages? It is noteworthy that in Table 4, the task aligner works best for all target languages except German and Mandarin. Both of these languages give the best results with a uniform label space, i.e., when $y_{t}$ is made the same as $y_{s}$. This points to the inability of the LLM to align the label spaces of the different source languages with these target languages. In making the label space uniform, we lose certain language-specific signals, but this may also be seen as a way of reducing the need for task alignment. Only for German and Mandarin is this trade-off beneficial; in all other cases, the loss of language-specific features of $y_{t}$ leads to a drop in performance.

3.3 Role of semantic alignment

To understand the role of semantic alignment, we ran an experiment in which, instead of choosing the $k$ nearest neighbors of $x_{t}$, we chose the $k$ most dissimilar sentences. Table 5 shows a sharp decrease in performance compared to random prompting for all languages, with German as an exception. The average fall is 8%, whereas using semantic alignment gives a gain of 10% w.r.t. random prompting.
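This ablation only flips the direction of the ranking. A minimal sketch, assuming the similarity scores between $x_{t}$ and every source sentence are already available as a list (the function name is hypothetical):

```python
def select_by_similarity(scores, k, most_similar=True):
    """Return the indices of the k highest-scoring (or lowest-scoring) source sentences."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=most_similar)
    return order[:k]
```

Calling it with `most_similar=False` gives the dissimilar-example setting; the default gives semantic alignment.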

3.4 Automated aligner generation

We also expand our analyses to automatically generate the aligner using mT5 (Xue et al., 2021). mT5 is trained with a span-generation objective on sentences like ‘Paris <mask> France’, where the model learns to fill the mask token by generating spans like ‘is capital of’. In our usage, mT5 fills the mask between the prompt-context $C$ in the source language and the target test input $x_{t}$ to align the semantics of both. We summarize our procedure for automatic alignment generation in Algorithm 3.
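The input handed to the span-filling model can be sketched as below. It assumes that the mask is written with the first T5-style sentinel token, `<extra_id_0>`, and that the context precedes the test input; the exact input format used for Algorithm 3 is not given in this text, and the function name is hypothetical.

```python
SENTINEL = "<extra_id_0>"  # first sentinel token in the T5/mT5 vocabulary


def infilling_input(context, test_input):
    """Place a single mask between the source-language context and the target input."""
    return f"{context.strip()} {SENTINEL} {test_input.strip()}"
```

The span generated for the sentinel would then be inserted between $C$ and $x_{t}$ as the automated aligner.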

Due to the computational cost of generating the intermediate prompt for each source-target input pair, we experiment with English as the only source language in all three datasets. Table 6 summarizes the results of using an automated aligner. We note that the automated aligner leads to better results than random prompting, and delivers results competitive to semantic prompting in some languages. However, it fails to incorporate any task-specific signals, therefore failing to beat task-based alignment. One can note the limitations of this approach in terms of the different pretraining distributions of the in-context learner and the aligner generator (XGLM and mT5, respectively, in this scenario). The hypothesized role of the aligner was to construct a ‘natural’ transition from the source context to the target input for a particular task. Since mT5 generates these aligners independently without any access to the pretraining distribution of XGLM, the disparity manifests with sub-optimal results.

3.5 Error Analysis

We present four examples in Table 7, highlighting the four major errors we notice while using X-InSTA, stemming from the following factors:

Static task-aligner: In example #1, slurs are used in all the posts. In the context examples, they are used as hate speech, whereas in the target the slur is not directed at any individual and therefore should not be identified as hate speech. However, the model labels it otherwise. Here, the apparent semantic similarity misdirects the model, and the static nature of the task aligners cannot guide it to understand the nuances of the task.

Cultural differences: None of the alignment methods introduces common knowledge or cultural knowledge in the prompt. To classify the tweet in example #2, one must have a grasp of hate focused on migration.

Input length: Both the context prompt and the input sentence are simply too long in example #3. In this case, no matter how well we design the aligner, we cannot fit the prompt within the maximum input length of $1024$ tokens. One cannot keep increasing the maximum length to work around this pitfall, as that leads to higher computation costs. A possible solution can be found in the direction of Transformer architectures suited to longer input sequences.
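One hypothetical mitigation, which is not part of X-InSTA as described, is to drop the least similar demonstrations until the prompt fits the token budget. The sketch below assumes the demonstrations are already formatted strings ordered from most to least similar, and it counts whitespace-separated words as a crude stand-in for the model tokenizer; both are assumptions.

```python
def fit_to_budget(demos, aligner, test_input, max_tokens=1024, count_tokens=None):
    """Drop trailing demonstrations until the prompt fits within max_tokens.

    Returns the prompt and the number of demonstrations kept. If even the
    aligner plus the test input exceed the budget, that prompt is returned as-is.
    """
    # Whitespace word count is only a placeholder for a real tokenizer.
    count = count_tokens or (lambda text: len(text.split()))
    kept = list(demos)
    while True:
        prompt = "\n".join(kept + [aligner, test_input])
        if count(prompt) <= max_tokens or not kept:
            return prompt, len(kept)
        kept.pop()  # remove the last, i.e. least similar, demonstration
```

This trades demonstrations for length and does not help when the test input alone exceeds the limit, which is the harder case described above.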

Lack of human-like commonsense: In example #4, aligning the semantics and the task produced a good prompt, but the model predicted the wrong label because it was confused by the sarcasm in the first demonstration. To address this pitfall, we need to bring in more knowledge of humor or commonsense so that the model understands what is obvious to us.

It should be noted that the majority of these errors stem from limitations of the LLM itself. Advancements in language model design may lead to improvements in future models.

Related Works

In-context learning (ICL): Brown et al. (2020) introduced a new approach, called in-context few-shot learning using the GPT-3 model. Subsequent efforts have been made to enhance the effectiveness of ICL. Hendrycks et al. (2020) evaluated the breadth and depth of model understanding to determine its weaknesses and strengths. Techniques such as selecting semantically-similar examples, using differentiable soft prompts for backpropagation, and adjusting prompts to eliminate bias in predictions have been implemented to optimize the input prompt (Liu et al., 2022; Zhang et al., 2021; Zhao et al., 2021). These efforts have primarily been directed toward improving the performance of ICL in a monolingual setting.

Multiple recent studies have sought to explain the emergence of ICL by assigning different roles to the LLM. Xie et al. (2022) provided the notion of LLMs doing Bayesian inference conditioned upon the prompt context to predict the test label. Our work is much in line with this hypothetical model since alignment over the semantics and the task-based signals across languages are motivated by the quest for better alignment between the prompt and the pretraining distribution and warranting a shared, distinguishable concept as Xie et al. (2022) argued. Additionally, von Oswald et al. (2022) sought to identify LLMs doing gradient-descent as meta-optimizers while learning in context. Li et al. (2023) described ICL as implicit model selection.

Multilingual models: Recent studies on multilingual tasks have focused on creating multilingual versions of popular pre-trained language models. These include mBERT (Devlin et al., 2018), mBART (Liu et al., 2020), XLM-R (Conneau et al., 2020), and mT5 (Xue et al., 2020), which are derived from models like BERT (Devlin et al., 2018), BART (Lewis et al., 2020), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2019), respectively. However, fine-tuning these large models for each task is infeasible due to computational limitations. While ICL has been attempted for cross-lingual downstream tasks, these methods only involve random sampling of demonstrations for prompt construction (Zhang et al., 2021; Winata et al., 2021). Shi et al. (2022) addressed the problem of cross-lingual text-to-SQL conversion using ICL. However, their method relies on translating the input text in the source language to the target language before generating the corresponding SQL code. Agrawal et al. (2022) demonstrated the effects of similar-example selection in a few-shot machine translation setting, which is closely related to our proposed semantic alignment. To the best of our knowledge, there is no study on optimizing prompts for cross-lingual NLP tasks using ICL.

Conclusion

In this work, we described the first-ever attempt in the direction of cross-lingual prompt design for in-context learning. We found that a random selection of labeled training examples to construct the prompt-context limits the capability of a multilingual LLM to infer target labels. Instead, aligning the semantics as well as the task-specific textual signals across the source and the target language inputs in the prompt demonstrates superior performance in cross-lingual text classification. Based on these findings, we introduced X-InSTA, a novel method of in-context prompt design for cross-lingual text classification. X-InSTA improves upon random prompt selection substantially across multiple different cross-lingual tasks.

We found that the dynamicity of similarity-based example selection is able to guide the LLM to learn better in-context predictors irrespective of the language pair under consideration. On the other hand, language pairs with proper alignment in the label space get more out of the task-based alignment. These findings may serve as paving stones toward better cross-lingual ICL methods that incorporate an automated, dynamic transition from the source to target distributions.

Limitations

Since this work relies on the in-context learning ability of large language models, the challenges associated with computational resources to load an LLM ensue. Due to resource constraints, we could not use larger or commercially available LLMs to validate if the advantages of X-InSTA translate to those models as well.

As we observed in Section 3.5, the static nature of the aligners poses a limitation on X-InSTA. Moreover, these aligners are manually designed. Therefore, task-specific, trial-and-error style manual intervention is needed. We believe a better understanding of the pretraining distribution of the multilingual LLMs can pave the way toward better automated alignment methods.

Multiple shortcomings of monolingual ICL carry over to its cross-lingual counterpart, and X-InSTA does not address them; these include knowledge hallucination, limited common-sense reasoning, and inconsistency in retrieving factual associations.

Ethics statement

Our proposed method, X-InSTA, delivers improvements in cross-lingual in-context learning. Since in-context learning ability is emergent in language models over billion parameters in size, this can cause potential discrimination in the usage of these methods based on the availability of access to computational resources. Research groups with limited access to computational resources will be handicapped while resourceful groups will be able to investigate and advance the future directions of this research.

We did not use any private or sensitive information throughout this research. However, if any private information was leaked to an LLM during the pretraining stage, X-InSTA does not provide any privacy filtration. Therefore, privacy concerns of the underlying model can potentially manifest with the outputs provided by X-InSTA.

As we saw when dissecting the erroneous predictions in Section 3.5, the lack of knowledge of cultural differences among languages is a serious challenge within the LLM, and this limits the performance of X-InSTA. Therefore, any potential deployment of our proposed method should be done under the lens of such considerations. This is even more delicate in the case of tasks like hate-speech classification, which was one of the tasks that we explored in this work. Wrongly identifying hate speech as non-hate, or vice versa, in a low-resource target language based on culturally different language-usage cues present in a prompt-context in a high-resource language is a possibility; this may lead to unwarranted cultural appropriation and/or undemocratic gatekeeping.

Appendix A Dataset Details

Multilingual Amazon Reviews Corpus: MARC Keung et al. (2020) is a large-scale multilingual corpus of Amazon reviews of customers. The corpus consists of six distinct languages – German, English, Spanish, French, Japanese, and Mandarin. Each language has a training set of size $200K$ that we use for selecting our demonstrations and a test set of $40,000$ reviews classified as positive or negative.

Cross-language sentiment classification: CLS Prettenhofer and Stein (2010) is a multilingual corpus of four languages – German, English, French, and Japanese. It consists of reviews on DVD, music, and books, with a training set and a test set of $2,000$ sentences for each language classified into negative and positive.

HatEval: HatEval (Basile et al., 2019) consists of two languages, English and Spanish, with posts classified as hate or non-hate. The test set contains $3,000$ posts for English and $1,600$ for Spanish, with the training set size being $5,000$ for Spanish and $10,000$ for English.

Appendix B Model Variants

We experiment with multiple different LMs in their base versions (i.e., random prompting) to gauge their ability, namely XGLM 7.5B, XGLM 1.7B, and Bloom 7.1B. Table 8 contains the performance of these models on a subset of the test data used (namely, CLS and HatEval with English as the source language). As we can see, XGLM 7.5B appears to outperform other models by a significant margin on multiple different tasks, and therefore, is used for the rest of the experiments.

Appendix C Hyperparameters

All code was written using PyTorch. We used the Hugging Face libraries to load the LLM and the sentence transformer used for computing semantic similarity. Scikit-learn was used for calculating the F1 score. Table 9 describes the values of the different hyperparameters and the compute resources used.

Appendix D Miscellaneous

D.2 Prompt Examples

We show a few example prompts (demonstrations and test input) in Table 11. Additionally, in Table 12, we demonstrate a few examples of different task-aligners used for the analysis in Section 3.2.