Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen

Introduction

Pre-trained on massive data, large language models (LLMs) with hundreds of billions of parameters can unlock new emergent abilities that cannot be achieved with smaller models Wei et al. (2022). Large generative models such as GPT-3 Rae et al. (2021) and OPT-175B Zhang et al. (2022) represent some of the most recent advances in natural language processing (NLP), introducing a new learning paradigm to prompt LLMs to successfully solve a range of challenging tasks in zero-shot and few-shot fashions Kung et al. (2022); Choi et al. (2023); Jiao et al. (2023); Guo et al. (2023). However, as LLMs are trained with the autoregressive learning objective, they might exhibit unintended behaviours from human expectations Tamkin et al. (2021); Weidinger et al. (2021); Kenton et al. (2021); Bommasani et al. (2021). To overcome this issue, instruction fine-tuning has been proposed as a prominent approach to align LLMs with human intentions in instructions and conversations Christiano et al. (2017); Stiennon et al. (2020); Sanh et al. (2021); Wei et al. (2021); Ouyang et al. (2022). Instruction-tuned LLMs can demonstrate significantly improved capabilities in following human instructions and avoiding the production of toxic, biased, or inaccurate texts. As such, two major techniques for instruction tuning feature supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) that are leveraged by the best commercial LLMs such as ChatGPThttps://openai.com/blog/chatgpt/ and GPT-4https://openai.com/research/gpt-4 to deliver outstanding dialog performance.

Another issue with LLMs pertains to the massive scales and closed-source nature of the commercial LLMs that greatly restrict accessibility and the extent of interactions with the technology. To this end, there have been growing efforts from the open-source community to create more accessible LLMs with affordable scales while securing competitive performance as the proprietary LLMs, e.g., LLaMA Touvron et al. (2023), StableLM StabilityAI (2023), Falcon Almazrouei et al. (2023), and MTP MosaicML (2023). Instruction fine-tuning has also been applied to these open-source language models to improve their abilities to engage with human, and different instruction datasets have been collected either from human annotation or outputs from commercial LLMs to facilitate the tuning process, e.g., Alpaca Taori et al. (2023), Vicuna Chiang et al. (2023), LaMini-LM Wu et al. (2023), and Dolly Conover et al. (2023).

However, the instruction-following abilities of existing open-source LLMs have been developed mainly for English and some popular languages (i.e., using instruction datasets for those languages), failing to support many other languages of the world to democratize the technologies to a broader population Taori et al. (2023); Chiang et al. (2023); Wu et al. (2023). To overcome this challenge, a few contemporary work has explored instruction tuning of multilingual LLMs for multiple languages, i.e., Phoenix Chen et al. (2023) and Bactrian-X Li et al. (2023). However, their multilingual instruction tuning efforts are limited to only supervised fine-tuning (SFT) techniques, which is unable to examine reinforcement learning with human feedback (RLHF) to further boost the performance for multilingual LLMs.

To fill in this gap, our work aims to develop Okapi, a open-source framework with RLHF-based instruction-tuned LLMs for multiple languages to shed light on their performance compared to the SFT methods in the multilingual settings. Okapi will emphasize on less studied languages and open-source LLMs to better democratize the benefits of instruction-tuned LLMs and provide resources for future research in this area. In particular, an example in the instruction datasets involves an instruction, an input text, and a desired response output/demonstration. In SFT, the pre-trained LLMs are fine-tuned over the instruction triples (instruction, input, output) via supervised learning to promote their alignment with human expectations. In RLHF, generated outputs from the SFT-tuned LLMs are first ranked to provide training signals for reward functions. Afterward, the SFT-tuned models will be further optimized via reinforcement learning utilizing rewards from the trained reward models. As such, RLHF has been successfully employed to create effective commercial LLMs (e.g., InstructGPT, ChatGPT), owning to its ability to learn beyond positive examples associated with only desired demonstrations. By leveraging the reward models, RLHF can observe lower ranking scores for less accurate demonstrations to obtain richer training signals for LLMs. To our knowledge, Okapi is the first work to perform instruction tuning with RLHF for open-source LLMs over multiple languages.

To develop Okapi, we need to overcome the scarcity of necessary instruction datasets in multiple languages to train and evaluate RLHF models. Motivated by the 52K instructions from Alpaca Taori et al. (2023), we leverage Self-Instruct Wang et al. (2023) to generate 106K additional instructions in English, introducing a larger dataset to facilitate RLHF evaluation. Afterward, we utilize ChatGPT to translate the instructions into a diverse set of 26 languages, which can handle instruction examples with programming code via appropriate prompts to enhance translation quality. In addition, we introduce a translation-based prompt for ChatGPT to produce rankings for multiple responses of the same instructions from the LLMs, which will be used to train the reward models for RLHF experiments. Finally, to measure the performance of the fine-tuned LLMs in different languages, we translate three benchmark datasets for LLMs in the widely-used HuggingFace Open LLM Leaderboard HuggingFace (2023); Gao et al. (2021) into 26 languages, i.e., ARC Clark et al. (2018), HellaSwag Zellers et al. (2019), and MMLU Hendrycks et al. (2021), using ChatGPT. These datasets challenge LLMs on diverse aspects, e.g., science reasoning, commonsense inference, world knowledge, and problem-solving, thus providing comprehensive evaluations for our models. To summarize, our contribution in this work is as follows:

Developing RLHF-tuned LLMs in multiple languages: We present Okapi, the first instruction-tuned LLM framework, which are RLHF-based and open-source for multiple languages. Our framework covers 26 diverse languages, including some under-studied and low-resource languages for NLP, e.g., Telugu, Ukrainian, Nepali, and Kannada. Using BLOOM Scao et al. (2022) and LLaMA Touvron et al. (2023) as the base pre-trained LLMs, our experiments illustrate that RLHF generally performs better than SFT for multilingual instruction tuning. Our experiments also highlight the greater challenges of low-resource languages for multilingual instruction-tuning of LLMs that should be better focused in future research.

Resource creation for instruction-tuned LLMs in multiple languages: To cater to our experiments with multilingual RLHF, we create instruction resources for 26 different languages, including ChatGPT prompts, instruction datasets, response ranking data, benchmark datasets, and fine-tuned LLMs. We release our data, resources, and models to contribute to the development and research of multilingual instruction-tuned LLMs in the future. The resources for our Okapi framework can be found at: https://github.com/nlp-uoregon/Okapi.

Data Preparation

A key requirement for our development of instruction-tuned LLMs with RLHF involves instruction, ranking, and evaluation datasets in multiple languages, especially for low-resource languages. To this end, we perform a comprehensive data collection process to prepare necessary data for our multilingual framework Okapi in 26 languages, divided into four major steps: English instruction generation, instruction translation, ranking data production, and evaluation data creation.

An instruction example to tune LLMs often has three components: an instruction to specify the task, an input text, and an associated output text (i.e., demonstration or label) Ouyang et al. (2022). As such, current public instruction datasets for LLMs mainly cover English or some popular languages, which are not suitable for our experiments. Also, we note that a few recent instruction datasets such as xP3 Muennighoff et al. (2022) and Flan Chung et al. (2022); Longpre et al. (2023) include multilingual data; however, their instructions are still written in English. Additionally, these datasets tend to be converted from NLP task datasets with template instructions, which cannot reflect the flexibility of human-written prompts to encourage effective instruction following in different languages Wang et al. (2023). Consequently, our goal is to develop instruction datasets with instructions, inputs, and output texts in multiple languages to better realize general prompts from human.

To achieve this goal, our strategy is to first obtain English instructions and then translate them into other languages. The benefits of our approach concern consistent instruction content across languages to facilitate performance comparison while taking advantages of translation systems to enable examination for more languages. As such, there have been several English instruction datasets collected by the open-source community to support instruction tuning of LLMs with different approaches, e.g., Alpaca Taori et al. (2023), Dolly Conover et al. (2023), and LaMini-LM Wu et al. (2023). However, to conveniently scale our data and introduce variations of general instructions, we follow the instruction generation method in Alpaca, which in turn employs the Self-Instruct procedure in Wang et al. (2023), to produce our English dataset.

Starting with a pool of 175 human-written seed instructions in English over different topics, at each time, Alpaca samples several instructions from the seeds to form an in-context example to prompt the text-davinci-003 model of OpenAI for new instruction generation. The generated instructions are then compared with previous instructions using the ROUGE score, and instructions whose scores are greater than a threshold will be retained. Overall, Alpaca releases 52K instructions for tuning LLMs. In this work, we apply the same Self-Instruct procedure as Alpaca to extend its 52K instructions to a larger dataset for our RLHF-based models in Okapi. In particular, we generate 106K additional English instructions from Alpaca with two notable extensions. First, we introduce 30 new human-created instructions into the seed set from Alpaca to increase its diversity and coverage. Among others, our new instructions involve prompts for relation extraction, event extraction, event summarization, and logical questions that are not recognized in Alpaca. Second, instead of generating the new instructions from scratch, we condition our generation process on the 52K instructions from Alpaca so a new instruction is only saved if it is different enough from Alpaca’s and previous instructions per the ROUGE score criteria. Figure 1 shows the top 10 most common root verbs and their top direct noun objects in the 106K generated instructions. These verbs and nouns represent 11.4% of the entire set, which exhibits diverse intents and patterns in our instructions for Okapi.

2 Instruction Translation

Given the 158K English instructions from Alpaca and our generation process, we aim to translate them into multiple other languages to obtain data for our multilingual models in Okapi. Table 1 presents 26 selected languages in our framework. Using the data ratios $r$ of the languages in CommonCrawlhttp://commoncrawl.org to classify languages as in previous work Bang et al. (2023); Lai et al. (2023), our study encompasses a diverse set of languages, including 8 high-resource languages ( $r>1.0$ ), 11 medium-resource languages ( $r>0.1)$ , and 7 low-resource languages ( $r<0.1)$ . Notably, several of our languages, such as Marathi, Gujarati, and Kannada, have received limited attention in NLP.

We utilize ChatGPT to translate the 158K English instructions into 26 target languages for Okapi. Compared to traditional machine translation systems, an advantage of ChatGPT for translation is the ability to use prompts to specify different expectations for the translated texts to facilitate diverse types of instructions. For example, we can instruct ChatGPT to preserve code in the instruction examples about programming as we expect code to be the same in the instructions of different natural languages. In addition, as ChatGPT has been fine-tuned on instruction-style data, we expect that it can capture the context to better translate our instructions. Figure 2 shows our prompt to translate English instruction data with ChatGPT.

It is important to note that we directly translate the instruction, input text, and associated output in each English instruction example of our data. This is in contrast to the other multilingual instruction-tuning approaches Li et al. (2023) that only translate instructions and input texts into a target language (using Google Translate); ChatGPT is then prompted to generate response outputs in the target language for the instructions and input texts. The intuition for our approach concerns various potential issues of ChatGPT, e.g., hallucination, bias, mathematical reasoning, and toxic content Bang et al. (2023); Borji (2023), that can be exaggerated if ChatGPT is used to produce responses in non-English languages for different types of tasks/instructions Lai et al. (2023). The diverse nature of the possible tasks/instructions will also make it more challenging to devise appropriate solutions for these problems in multilingual settings. By generating the instructions and response outputs in English, we aim to capitalize on the greater performance of LLMs for different NLP tasks in English to avoid the exaggeration issues and achieve higher quality instructions in various dimensions. By transitioning to other languages only via the translation task with ChatGPT, we can also dedicate our effort to overcome diverse multilingual challenges for instruction tuning to the translation task, which can allow convenient and effective solutions for further improvement. Table 2 presents the average lengths of translated prompts and response outputs for each language in our data. Translations from Alpaca’s original instructions and our new generated data are shown separately for convenient comparison.

3 Ranking Data Production

To perform RLHF for a LLM in Okapi, we need to obtain ranked response outputs from the model for the same instruction and input to train a reward model. Concretely, given a LLM $M$ and a dataset $S=\{{inst_{k},input_{k}}\}_{k=1}^{N}$ with $N$ pairs of instructions $inst_{k}$ and input texts $input_{k}$ for a target language, we first prompt $M$ to generate $T$ output responses $output_{k}=\{output_{k}^{1},\ldots,output_{k}^{T}\}$ for each pair of instruction and input text $(inst_{k},input_{k})$ ( $T>1$ ). Afterward, the responses in $output_{k}$ are ranked according to their fitness and quality for the instruction $inst_{k}$ and input text $input_{k}$ . This ranking data $\{{inst_{k},input_{k},output_{k}}\}$ can then be leveraged to train a reward model to compute a score for each triple of an instruction, an input text, and a potential response output using contrastive learning Ouyang et al. (2022).

In this work, we also employ ChatGPT to rank the response outputs for multilingual LLMs. Similar to the motivation for our translation-based approach to obtain instruction data in multiple languages, our ranking strategy first asks ChatGPT to translate the instructions and responses $\{{inst_{k},input_{k},output_{k}}\}$ in a target language into English. The ranking of the responses is then done over the translated English data to exploit the greater quality of ChatGPT for English and limit different challenges associated with multilingual ranking to the translation task. To this end, we engage with ChatGPT in a two-turn dialog to obtain ranking for each example $\{{inst_{k},input_{k},output_{k}}\}$ in the target language. The first turn is to translate the example into English using the prompt in Figure 3 while the second turn follows up with the first turn to instruct ChatGPT to rank the English translated responses using the ranking prompt in Figure 4. Our two-turn approach allows ChatGPT to condition on the translated English data in the first turn for ranking while ensuring the same format for the ranking output in the second turn for convenient parsing. Overall, we obtain ranked response outputs for 42K instructions sampled from the 106K generated instructions for each language in Okapi.

4 Evaluation Data Creation

The HuggingFace Open LLM Leaderboard HuggingFace (2023) recently adopts a suite of tasks and datasets in the Eleuther AI Language Model Evaluation Harness framework Gao et al. (2021) to facilitate performance assessment and tracking of newly developed LLMs. We employ three datasets in this leaderboard i.e., AI2 Reasoning Challenge (ARC) Clark et al. (2018), HellaSwag Zellers et al. (2019), and MMLU Hendrycks et al. (2021), to evaluate the model performance for our Okapi framework. All the datasets are organized as multiple-choice question-answering tasks although they focus on different types of knowledge and reasoning aspects. ARC involves 1170 grade-school science questions; HellaSwag provides 9162 commonsense inference questions that are easy for humans, but difficult for many state-of-the-art models; and MMLU assesses accuracy for 13062 questions over various branches of knowledge (STEM, humanities, social sciences, and more). Nevertheless, although the LLM community has widely adopted the leaderboard for performance examination, the datasets are only provided for English, thus unable to evaluate LLMs for the languages in our work. To this end, we translate the examples of the three datasets into 26 selected languages using ChatGPT and the translation prompt in Figure 2. The translated datasets are then reserved to evaluate the LLMs in our Okapi framework.

Reinforcement Learning with Human Feedback

We follow three steps to develop a fine-tuned LLM with RLHF for each target language in our Okapi framework: supervised fine-tuning, reward model training, and reinforcement learning.

Supervised Fine-tuning (SFT): Starting with a multilingual pre-trained LLM as the base, e.g., BLOOM Scao et al. (2022), we fine-tune the base model with our instruction dataset for the target language using supervised learning. In Okapi, the base model is fine-tuned for three epochs via the autoregressive objective. Our training process uses a cosine learning rate schedule with $200$ warm-up steps, an initial learning rate of 2 $e$ -5, a batch size of $128$ , and a weight decay of $0.05$ . Finally, instead of leveraging approximation techniques for efficient fine-tuning, we fine-tune the entire base LLM for all of its parameters with SFT to accurately understand the model performance for multilingual settings.

During the RL training phase, we keep the entire LLM frozen and solely train the top four layers for five epochs. We employ the AdamW optimizer with $\beta_{1}=0,9$ , $\beta_{2}=0.95$ , and $eps=1e-8$ . The KL coefficient $\beta$ is set to $0.05$ , while the weight decay is $0.1$ , and the learning rate is $1e-6$ . In each PPO iteration, we work with a batch size of $32$ and a clip threshold of $0.2$ in Okapi.

Experiments

Our Okapi framework utilizes two multilingual LLMs: BLOOM Scao et al. (2022) and LLaMA Touvron et al. (2023) as the base models for the fine-tuning processes. We focus on their 7B-parameter versions to facilitate the computing resources and achieve fairer comparison. For each base model and target language, we carry out both SFT-based and RLHF-based instruction-tuning for the model in the following manners:

SFT: The base model is fine-tuned over the 158K translated instructions (i.e., 52K from Alpaca and 106K from our generation) in the supervised manner.

RLHF: The base model is first fine-tuned with supervised training over 52K translated instructions from Alpaca. Afterward, a reward model is trained to score generated responses for input prompts using contrastive learning over the ranked responses for the 42K translated instructions in Section 2.3. Note that the ranked responses are sampled from the SFT-tuned base model over 52K translated Alpaca instructions from previous step. Finally, given the reward model, the SFT-tuned base model is further optimized via reinforcement learning over 64K remaining translated instructions from our generation set Ouyang et al. (2022).

The translated datasets ARC, HellaSwag, and MMLU are exploited to evaluate the performance of the models in Okapi. Following the HuggingFace Open LLM Leaderboard, the Eleuther AI Language Model Evaluation Harness framework Gao et al. (2021) is used to compute the model performance over the datasets for each language in our framework. As a reference, we also report the performance of the base models BLOOM and LLaMA in the experiments. Finally, for BLOOM, we further compare with BLOOMZ Muennighoff et al. (2022), which is the fine-tuned version of BLOOM over the cross-lingual task mixture dataset xP3 with millions of multilingual instructions to achieve instruction-following ability.

Evaluation: Tables 3, 4, and 5 present the performance of the models on the ARC, HellaSwag, and MMLU datasets (respectively) when BLOOM is used as the base model. Similarly, Tables 6, 7, and 8 report the performance with the base model LLaMA over the three datasets. In the tables, in addition to the average scores over all languages for the models, we also include the average scores for each group of languages (i.e., rows “Ave Group” for high-, medium-, and low-resource languages) to facilitate the comparisons. As some of our selected languages (especially the low-resource ones) are not supported by LLaMA, our tables for the experiments with LLaMA will omit those languages (see Table 1).

The first observation from the tables is that RLHF is generally better than SFT for multilingual fine-tuning of LLMs over different tasks, base models, and language groups. The improvement of average performance over all languages can go up to 2.5% on the HellaSwag dataset with LLaMA, thus demonstrating the advantages of RLHF over SFT for fine-tuning multilingual LLMs. It is also evident from the tables that the RLHF-tuned models can significantly improve the performance of the original base models (i.e., BLOOM and LLaMa) for almost all the language groups and tasks, which further highlights the quality of the generated instruction data and the effectiveness of RLHF.

Additionally, we observe that the average performance improvement achieved through RLHF is more substantial for the ARC and HellaSwag datasets, while it is less pronounced for the MMLU dataset. Based on the nature of the datasets, we attribute this phenomenon to the better alignment between our instruction data for fine-tuning with the necessary knowledge and reasoning skills in ARC and HellaSwag than those in MMLU. In particular, ARC and HellaSwag mainly test the abilities of the models on basic knowledge (i.e., from 3rd grade to 9th) and commonsense inference while MMLU focuses on professional knowledge in different areas (e.g., STEM, social sciences, humanities). As our instructions are generated with the seeds similar to Alpaca’s styles Taori et al. (2023), they tend to emphasize on general knowledge and basic inference skills, thus more aligning with the ARC and HellaSwag datasets. To this end, the generated instructions cannot well activate/complement the language and knowledge skills related to MMLU from the LLMs to attain meaningful improvement from instruction tuning.

Comparing the performance of the models across language groups, we find that the models tend to achieve the highest performance for the high-resource languages, followed by the medium-resource and low-resource languages across different base models. The performance improvement of RLHF for low-resource languages is also the least (based on the base model BLOOM), promoting it a challenging area for further research. Interestingly, our fine-tuned BLOOM models with 158K generated instructions can significantly outperform BLOOMZ over almost all the languages for the ARC, HellaSwag, and MMLU datasets using either SFT or RLHF. For example, the average performance of RLHF is 4.8% better than those for BLOOMZ over HellaSwag. As BLOOMZ has fine-tuned BLOOM over more than 78M multilingual instructions converted from NLP datasets Muennighoff et al. (2022), it demonstrates the higher quality of our generated instructions for multilingual instruction tuning of LLMs.

Related Work

We consider two dimensions of related work in this study, i.e., multilingual tuning and multilingual evaluation.

Multilingual Tuning: With the introduction of the Transformer architecture Vaswani et al. (2017), various language models have been explored to boost performance for NLP tasks, including the encoder models BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), the decoder models GPT Radford et al. (2019); Brown et al. (2020), and the encoder-decoder models BART Lewis et al. (2020) and T5 Raffel et al. (2020). These language models are often trained first over English data, and then extended to other languages in two main approaches: monolingual and multilingual models. In the monolingual approach, a language model is trained specifically for a particular language, e.g., for Spanish MMG (2021), Japanese Wongso (2021), French Martin et al. (2020); Kamal Eddine et al. (2021), Polish Resources and Technology Infrastructure (2021), Sweddish Moell (2021), and Hindi Parmar (2021). In contrast, the multilingual approach explores a single language model that is trained on multilingual texts to serve multiple languages and achieve knowledge transfer for lower-resource languages, e.g., the encoder-only models mBERT Devlin et al. (2019), XLM-RoBERTa Conneau et al. (2020), the decoder-only models mBART Liu et al. (2020) and mT5 Xue et al. (2021), and the decoder-only models BLOOM Scao et al. (2022) and LLaMA Touvron et al. (2023).

Based on the pre-trained language models (PLMs), the most advanced methods for NLP in different languages involve fine-tuning the PLMs on training data of the downstream tasks Min et al. (2023), leading to state-of-the-art performance for multilingual Sentence Splitting Nguyen et al. (2021a), Dependency Parsing Kondratyuk and Straka (2019), Question Answering Huang et al. (2019), and Named Entity Recognition Pires et al. (2019) (among others). Additionally, fine-tuning multilingual PLMs (such as XLM-RoBERTa) has proven to be an effective technique to enable zero-shot cross-lingual transfer learning across languages for various NLP tasks. This convenient approach allows for a seamless extension of NLP models to encompass larger sets of languages Wu and Dredze (2019); Karthikeyan et al. (2020); Wu et al. (2022); Nguyen et al. (2021b); Guzman-Nateras et al. (2022).

Instruction tuning can be considered as a special type of fine-tuning techniques for PLMs where generative PLMs (e.g., GPT) are further trained with instruction data to accomplish instruction following and response alignment with human expectations. Supervised fine-tuning (SFT) is the most common instruction tuning approach that is leveraged by all of the existing LLMs, including ChatGPT, Apaca Taori et al. (2023), Vicuna Chiang et al. (2023), and LaMini-LM Wu et al. (2023). Reinforcement learning from human feedback can also be used to further improve the instruction following abilities of LLMs Wei et al. (2021); Ouyang et al. (2022) although this technique has been less explored by current open-source LLMs due to the challenges in obtaining ranking data for the reward models. For multilingual learning, instruction tuning is only applied in the form of SFT for non-English languages using multilingual LLMs, e.g., BLOOM and LLaMA, in a few contemporary work Chen et al. (2023); Li et al. (2023); Muennighoff et al. (2022). RLHF has not been studied for instruction tuning for non-English languages.

Multilingual Evaluation: A major hurdle for research in multilingual learning pertains to the scarcity of evaluation datasets for NLP tasks in various languages that hinders model development and measurement. As such, prior research has invested substantial efforts to tackle this challenge, introducing multilingual datasets for a diversity of NLP tasks. These tasks include Universal Dependencies Nivre et al. (2016), Named Entity Recognition with CoNLL 2002 and 2003 Sang and Meulder (2002, 2003), Natural Language Inference with XNLI Conneau et al. (2018), Information Retrieval with TyDi Zhang et al. (2021), Question Answering with XQuAD Artetxe et al. (2020), Summarization with XWikis Perez-Beltrachini and Lapata (2021), Event Extraction with MEE Pouran Ben Veyseh et al. (2022), and many other tasks with XGLUE Liang et al. (2020) and XTREME Hu et al. (2020). However, these multilingual datasets are not specifically designed for evaluation of generative LLMs as our focus in this work.

To this end, the Eleuther AI Language Model Evaluation Harness Gao et al. (2021) provides an unified framework to evaluate generative language models over different knowledge and reasoning skills. The HuggingFace Open LLM Leaderboard HuggingFace (2023) leverages four key benchmarks from this framework, i.e., ARC Clark et al. (2018), HellaSwag Zellers et al. (2019), and MMLU Hendrycks et al. (2021), and TruthfullQA Lin et al. (2022), which have been widely adopted to measure progress of LLMs. However, these datasets are not usable for our multilingual framework as they only support the evaluation for English.

Conclusion

We present the first framework, called Okapi, on instruction tuning for LLMs in multiple language using reinforcement learning from human feedback (RLHF). To address the scarcity of necessary data for multilingual instruction tuning, we introduce instruction and response-ranked data in 26 diverse languages to enable the training of supervised fine-tuning models, reward models, and reinforcement learning frameworks for multilingual LLMs. Our experiments reveal the benefits of RLHF for multilingual fine-tuning of LLMs and the challenging problems of low-resource languages in this area for future research.

Limitations

Despite our efforts to develop and evaluate instruction-tuned LLMs in multiple languages using reinforcement learning from human feedback, our work suffers from several limitations that can be improved in future work. First, although we have attempted to cover a diverse set of 26 languages, there are still many other languages in the world that are not considered in our work. Future work can extend our system to include more languages, especially for low-resource languages, to gain a more comprehensive understanding for the language generalization of the instruction tuning methods and better democratize the technologies. Second, our system only leverages the base models BLOOM and LLaMA with 7B parameters. While this approach can facilitate the computing infrastructure of a larger group of institutions for further research, it will be beneficial to support other types of multilingual base models, e.g., the encoder-decoder model mT5 Xue et al. (2021), and other model scales (e.g., the 13B and 65B models) to strengthen the system. Third, to obtain instruction and evaluation data for the development, we automatically generate instructions in English and translate them into multiple languages using ChatGPT. We also rely on ChatGPT to obtain response-ranked data for the reward models in RLHF. Although our approach enables the extension to multiple languages with affordable development costs, the generated and translated data might involve unexpected noise. Additionally, they might not perfectly reflect human-provided instruction data in different languages. To this end, future work can improve our system with human-generated instruction and evaluation data to further examine instruction tuning for multilingual LLMs. Finally, our evaluations only investigate the performance of the models on benchmark datasets for generative LLMs, which focus on testing diverse knowledge, reasoning skills, and truthful generation. Other important concerns of generative models such as hallucination, toxicity, and biases are not evaluated explicitly in our experiments. Future work can study these issues to better characterize instruction tuning methods in the multilingual settings.