Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation

Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, Timothy Baldwin

Introduction

Fine-tuning large language models (LLMs) with instruction–response pair datasets has demonstrated remarkable zero-shot generalization capabilities for open-source and closed-source models (Sanh et al., 2022; Wei et al., 2022; Ouyang et al., 2022; OpenAI, 2023). Although the LLMs are often pre-trained using multilingual texts, the instruction-tuning for open-source models is restricted to English (Taori et al., 2023; Chiang et al., 2023; Wu et al., 2023), bringing into question its multilingual generalizability. Closed-source models such as OpenAI GPT-4 (OpenAI, 2023) and Google BARD (https://bard.google.com/), despite performing impressively over high-resource languages, are still lacking in terms of multilingual generalizability under monolingual instruction tuning.

The scarcity of instruction–response pair datasets in languages beyond English hinders multilingual instruction tuning. The existing xP3 dataset (Muennighoff et al., 2022), which was used to fine-tune BLOOM and mT5, employs English instructions. Although Muennighoff et al. (2022) also experiment with xP3mt (machine-translated instructions), it focuses on classic NLP tasks such as summarization and question answering, rather than general instructions. Additionally, both xP3 and xP3mt use template-based prompts, and hence lack variation.

To investigate general instruction tuning in a multilingual setting, we introduce Bactrian-X, containing parallel instruction–response pairs across 52 languages that were automatically constructed by translating instructions from Alpaca (Taori et al., 2023) and Dolly (Conover et al., 2023) via the Google Translate API (https://translate.google.com/). As we detail in Section 3, we use the output distillation trick to obtain corresponding responses by leveraging ChatGPT outputs, conditioned on the translated instructions. With 67K instruction–response pairs for each language, the total number of instances in Bactrian-X reaches 3.4M.

In contrast to previous multilingual instruction models such as BLOOMZ (Muennighoff et al., 2022), which are subject to full fine-tuning via parameter updates across all layers, this study highlights the potential of parameter-efficient fine-tuning techniques, specifically LoRA (Hu et al., 2022). LoRA uses adapters with substantially fewer parameters than base LLMs, making them more practical and adaptable for real-world applications. Specifically, in this work, we introduce BX ${}_{\text{BLOOM}}$ and BX ${}_{\text{LLaMA}}$ models, which build upon the BLOOM (Scao et al., 2022) and LLaMA (Touvron et al., 2023) models, and find them to be better than the associated instruction-tuned models: BLOOMZ (Muennighoff et al., 2022) and Alpaca (Taori et al., 2023).

We conduct a comprehensive series of experiments covering a range of zero-shot multilingual NLP tasks, including XCOPA (Ponti et al., 2020), XStoryCloze (Lin et al., 2022), XWinograd (Muennighoff et al., 2022), our own multilingual sentiment analysis dataset SentimentX, and EXAMS (Hardalov et al., 2020). The consistently high results across these tasks highlight the effectiveness of our multilingual instruction dataset and adapter technique for instruction tuning in languages beyond English. To further validate our findings, we use GPT-4 as an evaluator based on the methodology proposed by Chiang et al. (2023), and additionally conduct human evaluation with native speakers. All results confirm that our proposed models outperform the vanilla foundation models and existing instruction-tuned models.

Related Work

LLMs such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), and LLaMA (Touvron et al., 2023), among others (Hoffmann et al., 2022; Scao et al., 2022; Zeng et al., 2023), have revolutionized NLP. Research has demonstrated that fine-tuning LLMs with instruction prompts can improve their capacity to perform unseen/novel tasks (Wei et al., 2022; Sanh et al., 2022; Ouyang et al., 2022; Chung et al., 2022; Muennighoff et al., 2022). Recently, Wang et al. (2022) and Taori et al. (2023) showed that machine-generated instructions can be used for instruction tuning. Wu et al. (2023) created a large-scale dataset with 2.6M instructions, and demonstrated that relatively small language models also benefit from the instructions. Prior work has predominantly been on English, and instruction-tuning in languages beyond English remains limited. The closest work to ours is BLOOMZ (Muennighoff et al., 2022), which finetunes BLOOM (Scao et al., 2022) and mT5 (Xue et al., 2021) on the xP3 and xP3mt multilingual instruction datasets. However, both xP3 and xP3mt are based on human-written templates, and lack the variability of an organic multilingual dataset. Our work, instead, constructs a parallel general instruction dataset by translating English instructions into 51 languages and generating responses via ChatGPT (Ouyang et al., 2022). To the best of our knowledge, our Bactrian-X instruction dataset is the largest general-purpose multilingual instruction dataset to date.

Parameter Efficient Fine-Tuning (PEFT)

Fine-tuning all parameters of an LLM (e.g. Alpaca Taori et al. (2023), Vicuna Chiang et al. (2023) and LaMini-LM Wu et al. (2023)) is computationally expensive, and adapters Houlsby et al. (2019) offer a more cost-effective alternative. PEFT updates a small number of parameters during fine-tuning, and achieves comparable performance to fully fine-tuned counterparts Houlsby et al. (2019); Guo et al. (2021); Lester et al. (2021); Ben Zaken et al. (2022). Hu et al. (2022) introduced Low-Rank Adaptation (LoRA), which incorporates trainable rank decomposition matrices into transformer layers Vaswani et al. (2017) during fine-tuning without introducing additional latency during inference. They demonstrate that by fine-tuning with less than 1% of the model parameters, LoRA outperforms several fully fine-tuned LLMs, including GPT-3 Brown et al. (2020), on various tasks.
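To make the parameter saving concrete, the following minimal sketch counts trainable parameters for a single weight matrix under a low-rank update of the form W + BA, where W is d × k and stays frozen, B is d × r, and A is r × k. The dimensions and rank below are illustrative assumptions, not the configuration used in this work.

```python
from typing import Tuple


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k        # full fine-tuning updates every entry of W
    lora = r * (d + k)  # LoRA trains only B (d x r) and A (r x k)
    return full, lora


# Illustrative values only (assumed, not taken from the paper).
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  fraction trained: {lora / full:.4%}")
```

Because the product BA has the same shape as W, it can be added into W after training, which is why the adapter need not add inference latency.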

In recent work, Taori et al. (2023) used the LoRA trick to fine-tune LLaMA (Touvron et al., 2023), resulting in the Alpaca model, but did not carry out a comprehensive evaluation. In this work, we also leverage the LoRA technique to develop a range of monolingual and multilingual adapters, with a much larger instruction–response dataset, across 52 languages. We provide empirical analysis based on automatic and human evaluation to demonstrate the effectiveness of our method.

Bactrian-X Dataset

In this section, we detail the dataset creation process and provide an overview of the resulting data, focusing on the quality of translated instructions and generated responses.

We construct the Bactrian-X dataset in two steps: instruction translation, and response generation (see Figure 1).

We use English instructions developed for Alpaca (52K) and Dolly (15K), and use the Google Translate API to translate them into 51 different languages, based on the languages used for mBART-50 (Tang et al., 2020). The Alpaca instructions were automatically generated by GPT-3.5 (Ouyang et al., 2022) via the self-instruct technique (Wang et al., 2022), while the Dolly dataset was manually curated by thousands of Databricks company employees. Prior to the translation process, we identify instructions containing programming-related content based on a keyword-matching method and exclude them from the translation process. The total cost for translating the instructions was approximately USD 10,000.
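A keyword-matching filter of the kind described above might look like the following sketch. The keyword list is hypothetical (the exact keywords are not given here), and the whole-word matching rule is our own assumption.

```python
import re

# Hypothetical keyword list; the actual keywords used are not specified.
CODE_KEYWORDS = ("python", "javascript", "java", "c++", "html", "sql", "code", "function")

# Match a keyword only when it is not embedded in a longer alphabetic word.
_PATTERN = re.compile(
    "|".join(r"(?<![A-Za-z])" + re.escape(k) + r"(?![A-Za-z])" for k in CODE_KEYWORDS),
    re.IGNORECASE,
)


def is_programming_related(instruction: str) -> bool:
    """Return True if the instruction mentions any of the keywords."""
    return _PATTERN.search(instruction) is not None


def split_for_translation(instructions):
    """Return (to_translate, excluded) lists, preserving input order."""
    to_translate = [i for i in instructions if not is_programming_related(i)]
    excluded = [i for i in instructions if is_programming_related(i)]
    return to_translate, excluded


keep, skip = split_for_translation(["Explain SQL joins.", "Name three fruits."])
print(keep, skip)
```

Whole-word matching avoids false positives such as "decode" matching "code", at the cost of missing inflected forms that are not in the list.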

Response Generation

For each translated instruction, we use ChatGPT (gpt-3.5-turbo) to obtain a response; the response generation was conducted during April 16–21, 2023. For English, we pair the instruction with the original response. We do not translate the original English responses, because translating responses into the 51 languages is costly. Moreover, potential issues such as “translationese” and non-native answer styles may arise from relying solely on translated responses. The total cost for generating responses amounts to around USD 3,000. We leave the comparison between the translated responses and the ChatGPT-generated responses to future work.
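The two-step construction can be sketched as follows. Here `translate` and `generate` are hypothetical callables standing in for the Google Translate API and gpt-3.5-turbo respectively; no real client signatures are assumed, and each instance is simplified to a single instruction string.

```python
from typing import Callable, Dict, List


def build_pairs(
    instructions: List[str],
    english_responses: List[str],
    languages: List[str],
    translate: Callable[[str, str], str],  # (text, target_lang) -> translated text
    generate: Callable[[str], str],        # prompt -> model response
) -> Dict[str, List[dict]]:
    """Build instruction-response pairs per language."""
    # English keeps the original responses.
    data = {
        "en": [
            {"instruction": i, "response": r}
            for i, r in zip(instructions, english_responses)
        ]
    }
    for lang in languages:
        rows = []
        for inst in instructions:
            translated = translate(inst, lang)
            # The response is generated from the translated instruction,
            # not translated from the English response.
            rows.append({"instruction": translated, "response": generate(translated)})
        data[lang] = rows
    return data


# Stub callables so the sketch runs offline.
demo = build_pairs(
    ["Name a fruit."],
    ["Apple."],
    ["id"],
    translate=lambda text, lang: f"[{lang}] {text}",
    generate=lambda prompt: f"response to: {prompt}",
)
print(demo)
```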

Exploratory Data Analysis

We analyzed the tokenized texts in the 52 languages using the mBART-50, LLaMA, and BLOOM tokenizers, and present the statistics in Table 1. Since the mBART-50 tokenizer is trained on all 52 languages, its average number of tokens is relatively smaller than for LLaMA and BLOOM. However, for languages unseen by BLOOM and LLaMA, the tokenized texts are 2 to 3 times longer compared to mBART-50. This suggests that for these unseen languages, both BLOOM and LLaMA models require a larger sequence length for semantically similar input texts, posing a challenge for effective adaptation with the LoRA adapter.

Instruction Quality

To test the quality of the translated instructions, we verified the quality of 100 randomly-sampled instances for each language by performing back-translation into English using the Google Translate API. We evaluate the quality of the back-translated instructions relative to the originals based on BLEU (Papineni et al., 2002; Post, 2018; signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1), chrF++ (Popović, 2017; signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.3.1), and the trained metric COMET (Rei et al., 2020; model: Unbabel/wmt22-comet-da). The worst BLEU score of 28 is for Mongolian–English translation, but as seen in Table 2, most language pairs achieved BLEU scores above 40, indicating high quality and reliability of the Bactrian-X instructions.

Response Quality

To evaluate response quality, we conducted human evaluations in three high-resource languages — Arabic (ar), Indonesian (id), Chinese (zh) — and three low-resource languages — Burmese (my), Tamil (ta), and Tagalog (tl). For each language, two native-speaker annotators are asked to assess the fluency and informativeness of the responses given the question, except Tagalog, which had only one annotator. The quality assessment guideline is provided in Appendix A, and the results are shown in Figure 2, with an inter-annotator agreement (IAA) averaged by language of 0.70 and 0.69 for fluency and informativeness, respectively. The results showed that high-resource languages consistently achieved over 80% satisfactory ratings (A and B), while some low-resource languages like Tamil and Burmese had a significant proportion of lower ratings (C and D). This suggests that the outputs generated by ChatGPT are lacking for some low-resource languages. We leave the improvement of data quality for low-resource languages to future work.

Bactrian-X Models

Given limitations of computation resources, we use base LLMs with 7B and 13B parameters only. First, we trained three multilingual Bactrian models (BX) over the parallel dataset in 52 languages: BX ${}_{\text{LLaMA}}$ (7B, 13B), and BX ${}_{\text{BLOOM}}$ (7B). We do not train BX ${}_{\text{BLOOM}}$ (13B) because BLOOM (13B) is not available. While our primary results are based on the BX models, we additionally train some 7B monolingual Bactrian models (BM) for analysis in Section 5: 14 BM ${}_{\text{LLaMA}}$ and 18 BM ${}_{\text{BLOOM}}$ . All models will be made publicly available in our model repository.

We train our LoRA adapters (Hu et al., 2022) using PyTorch with the HuggingFace PEFT implementation (Mangrulkar et al., 2022; Wolf et al., 2020). Hyperparameters used for training the different models can be found in Appendix C (Table 7). In our evaluation, we compare each multilingual BX model with: (1) the corresponding vanilla models, and (2) the instruction-tuned models Alpaca Taori et al. (2023) and BLOOMZ Muennighoff et al. (2022). Details of these models are provided in Appendix B.

Evaluation on NLP Benchmarks

In order to thoroughly evaluate our Bactrian-X models, we conducted experiments on various multilingual downstream NLP tasks. We first introduce the benchmark datasets we used, and then present the evaluation results in two categories: language understanding tasks (Section 5.2) and knowledge-intensive tasks (Section 5.3).

To probe the zero-shot language understanding capability of the different models, we evaluate on the following test sets:

XCOPA Ponti et al. (2020): a multilingual resource designed for causal commonsense reasoning, encompassing 11 languages. The task involves predicting the correct next sentence from two options based on cause and effect question types.

XStoryCloze Lin et al. (2022): a translation of the English story cloze dataset Mostafazadeh et al. (2016) into 10 languages. The objective is to select one sentence as a plausible ending (closure) from two options, given a four-sentence story as the premise.

XWinograd Tikhonov and Ryabinin (2021); Muennighoff et al. (2022): a multilingual benchmark for commonsense reasoning, made up of Winograd Schema Challenge problems (https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) in six languages. The task involves selecting the most plausible sentence from options that differ slightly.

SentimentX: a sentiment classification dataset comprising 3-way sentiment labels collected from various sources, in the following languages: Arabic (ar) Alturayeif et al. (2022), Spanish (es; http://tass.sepln.org/2020/), Japanese (jp) Hayashibe (2020), Russian (ru; https://github.com/antongolubev5/Russian-Sentiment-Analysis-Evaluation-Datasets), Indonesian (id) Koto et al. (2020), Javanese (jav) Winata et al. (2023), Sundanese (sun) Winata et al. (2023), and Swahili (sw) Muhammad et al. (2023).

We also measure how much knowledge the model encodes using the EXAMS benchmark:

EXAMS Hardalov et al. (2020): a multilingual question-answering dataset made up of multiple-choice questions from high school examinations in 16 languages. It covers subjects from natural science (e.g., physics), social science (e.g., history), to humanities (e.g., philosophy). Given that all our experiments are zero-shot, we merge the train, validation, and test sets into a single evaluation dataset, and exclude questions without four multiple choice options, resulting in a total of 20,559 questions.
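The EXAMS preprocessing described above (merging the splits and keeping only four-option questions) can be sketched as below. The record layout, with a `choices` list per question, is a hypothetical simplification and not the dataset's actual schema.

```python
from typing import Dict, List


def build_eval_set(splits: Dict[str, List[dict]], num_options: int = 4) -> List[dict]:
    """Merge train/validation/test and keep questions with exactly `num_options` choices."""
    merged = [q for name in ("train", "validation", "test") for q in splits.get(name, [])]
    return [q for q in merged if len(q["choices"]) == num_options]


# Toy data with a hypothetical schema.
toy_splits = {
    "train": [{"id": 1, "choices": ["a", "b", "c", "d"]}],
    "validation": [{"id": 2, "choices": ["a", "b", "c"]}],
    "test": [{"id": 3, "choices": ["a", "b", "c", "d"]}],
}
eval_set = build_eval_set(toy_splits)
print([q["id"] for q in eval_set])
```

Merging all splits is only reasonable because the evaluation is zero-shot, so no split is used for training.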

Language Understanding Tasks

The average performance across all languages for XCOPA, XStoryCloze, XWinograd, and SentimentX is presented in Table 3. During inference, we use translated prompts and sentiment labels in the respective languages, obtained from the Google Translate API. We observe that integrating LoRA with the base models of LLaMA and BLOOM, and training over the multilingual instruction datasets, consistently improves performance over the base models. Improvements can also be observed over existing instruction-tuned models such as Alpaca-LoRA, on most tasks. For the larger models, we observe further enhancements again, as seen for BX ${}_{\text{LLaMA}}$ (13B) over LLaMA (13B).

From the third block, we observe that BX ${}_{\text{BLOOM}}$ performs better than the full fine-tuned BLOOMZ model on three out of five tasks. Although the performance difference is relatively small, it is worth noting that BX ${}_{\text{BLOOM}}$ is fine-tuned only using the LoRA adapter on a smaller multilingual dataset (2.5M samples), whereas BLOOMZ is fully fine-tuned using a larger dataset of 78M samples. Additionally, BLOOMZ is fine-tuned on xP3, which is designed to handle NLP downstream tasks, while Bactrian-X is more general purpose.

In Figure 3, we present the average performance of the 7B models over languages that the base models were not exposed to in pre-training. For XCOPA, XStoryCloze, XWinograd, and SentimentX, the LLaMA model is not exposed to 10, 8, 2, and 5 languages, respectively, while the BLOOM model is not exposed to 7, 2, 2, and 4 languages. We observe that our proposed models improve on the zero-shot performance of the base models across all tasks, and also surpass the performance of existing instruction-tuned models, with the exception of BLOOM over XStoryCloze. A notable improvement can be seen in the SentimentX dataset, implying that our models are more suited to non-English instructions and non-English sentiment labels.

Monolingual vs. Multilingual Fine-tuning

For each of the 52 languages in Section 3.2, we compared the performance of monolingual BM models against the multilingual BX models. To ensure a fair benchmark, we exclude unseen languages in calculating the average score. Table 4 presents the average performance for each dataset, revealing that the monolingual BM models consistently outperform the multilingual model for both LLaMA and BLOOM. Particularly notable improvements are observed for XWinograd and SentimentX. For example, the monolingual BM ${}_{\text{BLOOM}}$ achieves an impressive overall increase of $+10.3$ compared to the multilingual model for SentimentX.

Knowledge-intensive Task

The last column of Table 3 shows the results on EXAMS, averaged across languages. We find that the BX ${}_{\text{LLaMA}}$ models (7B and 13B) outperform their corresponding base models, while BLOOMZ outperforms our BX ${}_{\text{BLOOM}}$ . We observe that multilingual instruction tuning seems to be more promising on larger models, as seen in BX ${}_{\text{LLaMA}}$ (13B) improving substantially over LLaMA by 5.5% on average, while the margin for BX ${}_{\text{LLaMA}}$ (7B) is only 0.9%. It is noteworthy that BX ${}_{\text{LLaMA}}$ (13B) also outperforms LLaMA (30B) on the EXAMS benchmark in Table 12 in Appendix D, underlining the effectiveness of multilingual instruction tuning.

The EXAMS dataset comprises a range of subject areas, such as natural science and social science. We present a breakdown of the results across subject areas for the 13B models in Table 5. It is evident that there are substantial performance improvements over the social sciences and other subject areas during fine-tuning, but comparatively lesser gains for natural science. This could be attributed to our dataset containing fewer instructions and questions related to natural sciences, or the inherent difficulty of learning natural science concepts or reasoning abilities through instruction fine-tuning.

Evaluation on Open-ended Questions

As LLMs continue to develop, existing NLP benchmarks may not be up to the task of evaluating model capabilities. To address this, we use GPT-4 OpenAI (2023) as an evaluator to compare model outputs, supplemented by human evaluations.

We adopt a challenging set of 80 questions covering 8 categories from Chiang et al. (2023) for open-ended question evaluation. These questions are translated into 51 languages, and we use different models to generate responses (see Appendix E for examples). Following Chiang et al. (2023), we provide two answers from different models in a single prompt, and ask GPT-4 to rate the answers over a scale of 0 to 10 from various aspects including helpfulness, relevance, accuracy, and the level of detail (see Figure 4 for an example prompt for GPT-4 evaluation). To ensure fairness, we interchange the order of the provided answers, and assign scores twice for each question. We exclude vanilla BLOOM and LLaMA from open-ended question evaluation, and instead compare BX ${}_{\text{BLOOM}}$ against BLOOMZ, BX ${}_{\text{LLaMA}}$ against Alpaca, and BX ${}_{\text{BLOOM}}$ against BX ${}_{\text{LLaMA}}$ , given the superiority of instruction-tuned models in previous studies (Chiang et al., 2023; Muennighoff et al., 2022). We select 5 questions from each category, resulting in 40 questions per language. Given cost restrictions and availability of human annotators, we conducted GPT-4 evaluation over 12 languages and human evaluation over 6 languages.
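The order-swapping step can be sketched as follows. `judge` is a hypothetical callable standing in for a GPT-4 call that returns a 0 to 10 score for each of the two answers it is shown. Averaging the two scores per answer is our assumption about how the two passes are combined.

```python
from statistics import mean
from typing import Callable, Tuple

# judge(question, first_answer, second_answer) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def compare_with_swap(question: str, answer_a: str, answer_b: str, judge: Judge) -> Tuple[float, float]:
    """Score both answers twice, once in each presentation order, and average."""
    a_first, b_second = judge(question, answer_a, answer_b)
    b_first, a_second = judge(question, answer_b, answer_a)
    return mean([a_first, a_second]), mean([b_first, b_second])


def biased_stub_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Stub judge that gives whichever answer is shown first a +1 bonus."""
    quality = {"answer from model A": 7, "answer from model B": 5}
    return min(10, quality[first] + 1), quality[second]


scores = compare_with_swap("q", "answer from model A", "answer from model B", biased_stub_judge)
print(scores)  # (7.5, 5.5)
```

With the stub judge, the position bonus is applied once to each answer, so the gap between the averaged scores equals the gap in underlying quality.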

Figure 5 shows the results for the three model pairs, which clearly indicate that GPT-4 has a preference for BX ${}_{\text{LLaMA}}$ over Alpaca and similarly favors BX ${}_{\text{BLOOM}}$ over BLOOMZ. Regarding the comparison between the two BX models, BX ${}_{\text{LLaMA}}$ performs better overall.

Since GPT-4 assigns a quantitative score to each response on a scale of 0–10, we calculate the average score for each model from all comparison pairs and present a breakdown of results separately for each language group (see Figure 6) and question type (see Figure 7).

Analyzing the results by language group (see Figure 6), we can make several observations. First, multilingual pre-training plays a critical role for multilingual instruction-following models. In groups 1 and 3, BX ${}_{\text{LLaMA}}$ outperforms BX ${}_{\text{BLOOM}}$ , while in group 2, BX ${}_{\text{BLOOM}}$ performs substantially better. This difference can be attributed to variations in language coverage during pre-training, as both models are fine-tuned on the same dataset. Second, multilingual instruction-tuning is critical. BX ${}_{\text{LLaMA}}$ , fine-tuned on our multilingual dataset, outperforms Alpaca, which is only fine-tuned on English instructions, across all evaluated languages. From group 4, we observe that if a language is not included in pre-training, multilingual instruction-tuning alone is insufficient to achieve strong performance. Additionally, both BX ${}_{\text{BLOOM}}$ and BLOOMZ are initialized by BLOOM but fine-tuned on different instruction datasets. BLOOMZ is fine-tuned on xP3, a multilingual instruction dataset based on hand-written templates and downstream NLP tasks. In this free generation evaluation, BX ${}_{\text{BLOOM}}$ performs much better than BLOOMZ, highlighting the limitations of human-written instructions in terms of diversity. Overall, multilinguality in both pre-training and instruction-tuning is vital for the effectiveness of multilingual instruction-following models. These findings reinforce our contributions in this work.

Question Type

When considering different question types (see Figure 7), the Bactrian-X models consistently outperform all base models. A noteworthy observation is that “fermi” and “math” questions, which require strong reasoning capabilities, prove to be challenging for all multilingual LLMs. This observation underlines the fact that numerical reasoning in a multilingual setup remains an under-explored area, requiring further research.

Human Evaluation

We conducted human evaluation of the outputs of four models (LLaMA, BX ${}_{\text{LLaMA}}$ , BLOOMZ, and BX ${}_{\text{BLOOM}}$ ) for the six languages as before, namely three high-resource languages — Arabic (ar), Indonesian (id), Chinese (zh) — and three low-resource languages — Burmese (my), Tamil (ta), and Tagalog (tl). Native-speaker annotators were asked to rank the outputs of these models based on their overall quality, from 1 (best) to 4 (worst). Prior to annotation, models are shuffled and their identities are not visible to the annotators.

The average Spearman rank correlation between annotators is $\rho=0.78$ across languages, indicating high inter-annotator agreement.
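For reference, Spearman's rank correlation between two annotators' rankings of the same items can be computed as in the sketch below, which uses the closed-form expression valid when there are no tied ranks. How the per-item correlations were aggregated across questions and languages is not specified here, so the example only covers a single pair of rankings with made-up values.

```python
from typing import Sequence


def spearman_rho(rank_x: Sequence[int], rank_y: Sequence[int]) -> float:
    """Spearman's rho for two rankings of the same n items, assuming no ties."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("rankings must have the same length, at least 2")
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Two hypothetical annotators ranking four model outputs from 1 (best) to 4 (worst).
annotator_1 = [1, 2, 3, 4]
annotator_2 = [1, 3, 2, 4]
rho = spearman_rho(annotator_1, annotator_2)
print(round(rho, 2))  # 0.8
```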

The human evaluation results, averaged across languages and models, are presented in Table 6. Overall, we observe that our models BX ${}_{\text{BLOOM}}$ and BX ${}_{\text{LLaMA}}$ are better than their instruction-tuned counterparts BLOOMZ and Alpaca, once again emphasizing the effectiveness of our multilingual dataset and language adaptation technique. In particular, BX ${}_{\text{BLOOM}}$ achieves superior performance for ar, id, zh, and ta, which are languages included in the pre-training of BLOOM. On the other hand, BX ${}_{\text{LLaMA}}$ performs the best over my and tl, which are unseen languages for both base models.

Conclusion

In this paper, we have introduced Bactrian-X, a comprehensive multilingual parallel dataset comprising 3.4 million instruction–response pairs across 52 languages. To enhance the multilingual capabilities of base LLMs, we also introduced a collection of lightweight adapters trained on Bactrian-X. Experiments on various multilingual NLP tasks demonstrate that models fine-tuned on the Bactrian-X dataset outperform both their corresponding vanilla models and also models fine-tuned on other monolingual/multilingual instruction datasets. By making our dataset and models available, we hope to expedite the advancement of LLMs for multilingual purposes, promoting progress in natural language processing across a broader set of languages.

Limitations

Our work is subject to several limitations that should be addressed in future research: (1) Our focus was limited to 7B and 13B models, without exploring scaling rules or other base models such as mT5 (Xue et al., 2021). Further investigation into different model variations could provide valuable insights. (2) In our experiments, the maximum sequence length for multilingual models was set to 768 sub-word units. This smaller context size, compared to models with lengths of 1024 or 2048, may restrict the model’s ability to effectively leverage long-range context. Additionally, certain languages that were not well supported by the model tokenizers could face challenges with such a small context size. (3) We did not thoroughly investigate the presence of hallucination, toxicity, and fairness in our models or the base models due to the unavailability of an appropriate evaluation suite. Nonetheless, it is important to acknowledge that our models, as well as the base models, are likely to be susceptible to these concerns. Future research should address these issues to ensure responsible and unbiased model behavior. We acknowledge these limitations and propose that future work should focus on addressing them to advance the utility and deployment-safety of the models.

Ethical Considerations

While our instruction-tuning datasets and models offer several advantages, it is essential to recognize their limitations. Despite efforts made by ChatGPT to alleviate ethical concerns, it is still possible for the model to generate responses that are discriminatory, biased, or contain false information, particularly in multilingual settings. Hence, our models, when fine-tuned on the dataset, may inadvertently learn or propagate these problematic patterns.

To address these concerns and minimize potential harm, we are dedicated to mitigating the risks associated with the use of our models in future research. We strongly advocate for the responsible use of our models to prevent any unintended negative consequences.

References

Appendix A Annotation guidelines for response quality checking

We asked the human experts to rate fluency and informativeness separately, following the guidelines in Figure 8 and Figure 9, respectively.

Appendix B Base models

LLaMA (Touvron et al., 2023): a series of base models proposed by Meta, encompassing a parameter range of 7B to 65B. The models were primarily trained on English, but include around 4.5% of text from 20 different languages in the training data, enabling some level of support for multilingual tasks.

Alpaca (Taori et al., 2023): a fine-tuned variant of the LLaMA model on 52K English instruction-following data instances generated through self-instruct techniques (Wang et al., 2022). In initial human evaluation, the 7B Alpaca model was observed to attain similar behavior to the text-davinci-003 model (130B) on the self-instruct instruction-following evaluation suite (Wang et al., 2022).

BLOOM (Scao et al., 2022): a collection of pretrained multilingual language models created by BigScience, trained on the ROOTS corpus, which encompasses data from 46 languages.

BLOOMZ (Muennighoff et al., 2022): derived from BLOOM and fine-tuned using the crosslingual task mixture (xP3) dataset, and capable of zero-shot instruction-following in dozens of languages.

Appendix C Hyperparameters for Bactrian-X models

The hyperparameters for the Bactrian-X models are shown in Table 7. It is important to note that during the fine-tuning process, the instructions are masked, and the loss is computed only for the responses. This approach effectively prevents the models from learning “translationese” and allows them to focus on distilling ChatGPT’s responses.
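Instruction masking of this kind is commonly implemented by setting the labels of the instruction tokens to an ignore value, so that they contribute nothing to the loss. The sketch below illustrates the idea on plain token-id lists. It is an assumption about the mechanism and not the training code used here; the token ids are made up.

```python
from typing import List

# -100 is the default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100


def mask_instruction_labels(input_ids: List[int], num_instruction_tokens: int) -> List[int]:
    """Copy input_ids to labels, replacing the instruction prefix with IGNORE_INDEX."""
    labels = list(input_ids)
    n = min(num_instruction_tokens, len(labels))
    labels[:n] = [IGNORE_INDEX] * n
    return labels


# Made-up ids: three instruction tokens followed by two response tokens.
labels = mask_instruction_labels([11, 12, 13, 21, 22], num_instruction_tokens=3)
print(labels)  # [-100, -100, -100, 21, 22]
```

The model still attends to the instruction tokens as context. Only the gradient signal from predicting them is removed.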

Appendix D Complete results for the multilingual benchmark

We present the full zero-shot results for the multilingual benchmark in Table 8 (XCOPA), Table 9 (XStoryCloze), Table 10 (XWinograd), and Table 11 (SentimentX). Please refer to Table 13, Table 14, Table 15, Table 16 for details of the data distributions used for evaluation.

Appendix E Model output examples in 9 different languages

Figure 10, Figure 11, Figure 12 show responses from different models to questions in non-English languages. We randomly selected one example for each of Spanish, French, Portuguese, Arabic, Indonesian, Chinese, German, Italian, and Russian.