LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions

Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji

Introduction

Large language models (LLMs) with instruction tuning have demonstrated remarkable capabilities in generating high-quality outputs for a diverse set of applications (Ouyang et al., 2022; Wei et al., 2022; Sanh et al., 2022; Chung et al., 2022; OpenAI, 2023). These models typically consist of billions of parameters, demanding substantial computational resources for both training and inference (Brown et al., 2020; Thoppilan et al., 2022; Hoffmann et al., 2022; Chowdhery et al., 2022). Kaplan et al. (2020) suggest that the performance of LLMs scales proportionally with the size of the model and the dataset. However, scaling up these models presents challenges, including concerns about the energy consumption and environmental impact (Strubell et al., 2019). Additionally, limited access to computing resources becomes a significant obstacle for many NLP practitioners seeking to leverage large models effectively, impeding the progress of the NLP community (Nityasya et al., 2020).

In this work, we introduce LaMini-LM, a collection of language models that stand out due to their smaller size compared to the majority of existing instruction-tuned models. We develop LaMini-LM models by employing sequence distillation (also known as offline distillation) (Kim and Rush, 2016) from LLMs. While previous studies (Taori et al., 2023; Chiang et al., 2023; Anand et al., 2023) have attempted similar approaches, there are several gaps in the current literature that we aim to address. These gaps include: (i) the provision of a small-scale distilled dataset, (ii) limited diversity in the dataset, (iii) a restricted number of models (typically only one), and (iv) a lack of comprehensive evaluation and analysis regarding the performance of the models. Additionally, it is important to note that many distilled models resulting from previous work remain computationally demanding. These recent models typically range from 7B to 13B parameters, which presents challenges for deployment in resource-constrained settings. Therefore, our objective is to develop a solution that overcomes these limitations and facilitates easier deployment in such settings.

To address these challenges, we undertake several steps as shown in Figure 1. Firstly, we create a large-scale offline-distillation instruction dataset, consisting of 2.58M examples. We curate these instructions from diverse existing datasets, including self-instruct Wang et al. (2022a), P3 Sanh et al. (2022), FLAN Longpre et al. (2023), and Alpaca Taori et al. (2023). To augment the dataset, we use the Example-Guided Instruction Generation technique with gpt-3.5-turbo to generate additional diverse instructions that match human-written prompts in style and quality.We use gpt-3.5-turbo-0301 in this work. We also employ the Topic-Guided Instruction Generation technique to enhance instruction diversity by incorporating specific topics of interest from Wikipedia. Finally, we utilize gpt-3.5-turbo to generate responses for each instruction. The resulting dataset is called the LaMini instruction dataset.

After creating the dataset, we fine-tune multiple smaller language models with different sizes (ranging from 61M to 7B) and architectures (encoder-decoder and decoder-only). We also conduct extensive experiments and analyses, setting our work apart from previous research. We evaluate their performance on diverse NLP downstream tasks and incorporate human evaluation to assess the quality of model outputs. Given the growing power of language models, we recognize the potential risks they pose. Hence, we evaluate our LaMini language models for hallucination and toxicity. The toxicity assessment utilizes an existing test suite, while we curate a separate test suite with 40 carefully crafted questions to specifically probe hallucination risks. Through these comprehensive analyses, we gain deep insights into the models’ strengths and weaknesses, enabling us to better understand their potential applications and risks.

Our contributions can be summarized as follows:

We introduce the LaMini instruction dataset, consisting of over 2.58M examples. To the best of our knowledge, this dataset is currently the largest instruction dataset available. Notably, it is $50\times$ larger than the dataset released by Taori et al. (2023).

We investigate the process of distilling knowledge from large language models (LLMs) into many different models (T5, GPT, LLaMA, Cerebras) of various sizes (from 61M up to 7B parameters), resulting in a family of distilled language models.

We conduct extensive experiments and evaluations on both our proposed models and several publicly available LLMs across various downstream NLP tasks and general-purpose prompts.

We additionally provide analysis on hallucination and toxicity. To facilitate the detection of hallucinations, we also develop a new set of hallucination-inducing questions.

Related Work

Supervised fine-tuning with natural language instructions empowers the large language models (LLMs) to achieve remarkable zero-shot performance on a diverse set of applications (Weller et al., 2020; Gupta et al., 2022; Wu and Aji, 2023; Lyu et al., 2023; Rozière et al., 2023; Wu et al., 2024). Prior studies demonstrate that fine-tuning vanilla language models with human-written instructions can effectively enable them to follow general language instructions Mishra et al. (2022); Wang et al. (2022b); Wei et al. (2022); Sanh et al. (2022); Ouyang et al. (2022); Scialom et al. (2022); Chung et al. (2022); Muennighoff et al. (2022); Wang et al. (2023a). Moreover, a recent study by Wang et al. (2022a) demonstrates that model-generated instructions can be used for instruction tuning, resulting in significant improvements in vanilla language models’ responsiveness to instructions. Inspired by these findings, other works have focused on instruction tuning vanilla language models using model-generated instructions Taori et al. (2023); Chiang et al. (2023); Anand et al. (2023); Li et al. (2023); Wang et al. (2023b). In this study, we present the largest instruction dataset generated by gpt-3.5-turbo to date. We then fine-tune a collection of language models to create our LaMini-LM models.

Knowledge Distillation

Knowledge distillation is a technique that trains a smaller model, called the student, by leveraging knowledge from a larger model, the teacher Hinton et al. (2015). One common method is to train the student to match the teacher’s representation, such as logits, output probability, or intermediate activation Sanh et al. (2019); Jiao et al. (2020); Mirzadeh et al. (2020); Wang et al. (2020); Zhao et al. (2022). For sequence-to-sequence models, sequence-level distillation was introduced by Kim and Rush (2016), where a synthetic output generated by the teacher model is used to train the student. This approach is efficient as it only requires running the teacher model once. Previous research has shown the effectiveness of sequence-level distillation Costa-jussà et al. (2022); Behnke et al. (2021); Bogoychev et al. (2020). In our work, we adopt sequence-level distillation using the output of gpt-3.5-turbo to train our model. Our approach stands out by training on a significantly larger dataset and distilling it into much smaller models. Additionally, we provide various student models as part of our contributions.

Dataset Generation

Our approach involves the distillation of knowledge from large language models through sequence/offline distillation Kim and Rush (2016). In this process, the student model learns from the outputs of a teacher model. To create our dataset, we make use of various existing resources of prompts, including self-instruct Wang et al. (2022a) and Alpaca Taori et al. (2023) as well as random subsets of P3 Sanh et al. (2022) and FLAN Longpre et al. (2023). Leveraging these resources, we generate a dataset consisting of 2.58M pairs of instructions and responses using ChatGPT. Furthermore, we perform an exploratory analysis of the resulting text to gain additional insights.

This section introduces two strategies for generating instructions: the example-guided strategy and the topic-guided strategy. Furthermore, we describe our approach to generating responses.

Inspired by the works of Wang et al. (2022a) and Taori et al. (2023), we develop a prompt for generating instructions. Our approach involves presenting a prompt with a few examples and constraints, as demonstrated in Appendix A. We include only three random examples and a limited number of constraints within each prompt. Instead of explicitly specifying language restrictions, output length limitations, or instruction types, our instruction to gpt-3.5-turbo is to generate a variety of examples that align with the provided examples and adhere to the desired output format. To optimize the generation process, we randomly sample three seed tasks from self-instruct and generate 20 instructions at once. These instructions are referred to as $\widehat{\boldsymbol{X}}_{\textrm{SI}}$ .We denote the model-generated text as $\widehat{\boldsymbol{X}}_{\{\cdot\}}$ or $\widehat{\boldsymbol{Y}}_{\{\cdot\}}$ and the human-written text as $\boldsymbol{X}_{\{\cdot\}}$ or $\boldsymbol{Y}_{\{\cdot\}}$ , except for $\boldsymbol{Y}_{\textrm{P3}}$ and $\boldsymbol{Y}_{\textrm{FLAN}}$ that are also generated by gpt-3.5-turbo. When the selected instructions are associated with specific inputs, we concatenate them using a colon “:” symbol in the format “ $instruction:$ input”. For datasets P3 and FLAN, we randomly select three examples from the same subset. Our preliminary study indicates that gpt-3.5-turbo requires a minimum of two examples to generate desirable instructions. To ensure more consistent output formatting, we include an additional example. Examples from P3 and FLAN tend to be longer compared to those from self-instruct (see Table 1). To ensure that we stay within the output length limit, we generate only 10 instructions at a time for P3 and FLAN.We refer to the original set of prompts from P3 and FLAN as $\boldsymbol{X}_{\textrm{P3}}$ and $\boldsymbol{X}_{\textrm{FLAN}}$ , respectively. The instructions generated from these prompts are denoted as $\widehat{\boldsymbol{X}}_{\textrm{P3}}$ and $\widehat{\boldsymbol{X}}_{\textrm{FLAN}}$ , respectively. Additionally, we denote the prompts from Alpaca as $\widehat{\boldsymbol{X}}_{\textrm{A}}$ , although they are not utilized in this stage.

Topic-Guided Instruction Generation

It is of concern that gpt-3.5-turbo may not have the desired ability to generate diverse text without explicit guidance. The data analysis presented in Table 1 reveals that we have approximately 270K unique instruction-response pairs in $\widehat{\boldsymbol{D}}_{\textrm{SI}}$ , while there are only 200K unique instructions. To address this concern, we employ a strategy of collecting common topics from Wikipedia to provide guidance during the generation process. Initially, we gather a total of 2.2M categories from Wikipedia. These categories are then filtered based on two criteria. Firstly, we select categories consisting of fewer than three words. Secondly, we choose categories that have more than 10 sub-categories and 50 pages associated with them. During the generation of instructions guided by these topics, we intentionally avoid using lengthy category titles, as we observe that they are more likely to be related to specific topics and responses generated by gpt-3.5-turbo for such instructions may contain factual errors and misinformation in our preliminary study. For instance, the category “machine learning” contains 35 sub-categories and 200 pages,https://en.wikipedia.org/wiki/Category:Machine_learning while the category “Rock music groups from Ohio” contains 5 sub-categories and 50 pages.https://en.wikipedia.org/wiki/Category:Rock_music_groups_from_Ohio After filtering, we obtain a list of $3.5K$ categories that serve as common topics. An example of the prompt with topics is presented in Appendix A. In this study, we exclusively generate topic-guided instructions using the seed tasks from the self-instruct dataset, denoted as $\widehat{\boldsymbol{X}}_{\textrm{t,SI}}$ . We made this decision based on the observation in our preliminary study that gpt-3.5-turbo often encounters difficulties in generating necessary context for instructions, while examples from P3 and FLAN typically contain extensive contextual information. In order to ensure the quality of the generated instructions, we confine our topic-guided instruction generation to the $\widehat{\boldsymbol{X}}_{\textrm{t,SI}}$ subset. Leveraging the provided topics, we generate approximately 280K instruction-response pairs within $\widehat{\boldsymbol{X}}_{\textrm{t,SI}}$ , containing 276K unique instructions.

2 Response Generation

To perform sequence-level distillation, we generate responses from the instructions described in the previous section. We generate the responses for all the generated instructions, including $\widehat{\boldsymbol{X}}_{\textrm{SI}}$ , $\widehat{\boldsymbol{X}}_{\textrm{t,SI}}$ , $\widehat{\boldsymbol{X}}_{\textrm{P3}}$ , $\widehat{\boldsymbol{X}}_{\textrm{FLAN}}$ . As we observe that gpt-3.5-turbo is less capable of providing the necessary context for the instructions, we also directly generate responses for the collected instructions, including $\widehat{\boldsymbol{X}}_{\textrm{A}}$ , $\boldsymbol{X}_{\textrm{P3}}$ and $\boldsymbol{X}_{\textrm{FLAN}}$ . Hence, we denote the resulting pairs as $\widehat{\boldsymbol{D}}_{\textrm{SI}}=\{\widehat{\boldsymbol{X}}_{\textrm{SI}},\widehat{\boldsymbol{Y}}_{\textrm{SI}}\}$ , $\widehat{\boldsymbol{D}}_{\textrm{t,SI}}=\{\widehat{\boldsymbol{X}}_{\textrm{t,SI}},\widehat{\boldsymbol{Y}}_{\textrm{t,SI}}\}$ , $\widehat{\boldsymbol{D}}_{\textrm{P3}}=\{\widehat{\boldsymbol{X}}_{\textrm{P3}},\widehat{\boldsymbol{Y}}_{\textrm{P3}}\}$ , $\widehat{\boldsymbol{D}}_{\textrm{FLAN}}=\{\widehat{\boldsymbol{X}}_{\textrm{FLAN}},\widehat{\boldsymbol{Y}}_{\textrm{FLAN}}\}$ , $\widehat{\boldsymbol{D}}_{\textrm{A}}=\{\widehat{\boldsymbol{X}}_{\textrm{A}},\widehat{\boldsymbol{Y}}_{\textrm{A}}\}$ , $\boldsymbol{D}_{\textrm{P3}}=\{\boldsymbol{X}_{\textrm{P3}},\boldsymbol{Y}_{\textrm{P3}}\}$ and $\boldsymbol{D}_{\textrm{FLAN}}=\{\boldsymbol{X}_{\textrm{FLAN}},\boldsymbol{Y}_{\textrm{FLAN}}\}$ . The complete dataset $\boldsymbol{D}_{\textrm{ALL}}$ is the union of all the instruction-response pairs.

3 Exploratory Data Analysis

In this section, we conduct an exploratory analysis of the generated text, focusing on various aspects of the dataset, including basic statistics, diversity, and human evaluation.

The dataset statistics are presented in Table 1. As mentioned earlier, we find that gpt-3.5-turbo often struggles to provide sufficient context in the generated instructions. This is evident from the average length comparison between $\widehat{\boldsymbol{X}}_{\textrm{P3}}$ and $\widehat{\boldsymbol{X}}_{\textrm{FLAN}}$ against $\boldsymbol{X}_{\textrm{P3}}$ and $\boldsymbol{X}_{\textrm{FLAN}}$ , where the former two are considerably shorter. Additionally, we observe that when instructions are generated from the same source (e.g., self-instruct), the corresponding responses exhibit similar lengths.

Semantic Diversity

analyze the semantic diversity of the generated instructions, we randomly select 50K instructions from $\widehat{\boldsymbol{X}}_{\textrm{SI}}$ , $\widehat{\boldsymbol{X}}_{\textrm{A}}$ , $\widehat{\boldsymbol{X}}_{\textrm{P3}}$ , and $\boldsymbol{X}_{\textrm{P3}}$ . To compute their sentence embeddings, we employ the Sentence Transformer Reimers and Gurevych (2019).Model signature: all-mpnet-base-v2 The t-SNE visualization of the instruction sentence embeddings is presented in Figure 2, allowing us to explore their distribution. We observe that $\widehat{\boldsymbol{X}}_{\textrm{SI}}$ exhibits greater diversity than $\widehat{\boldsymbol{X}}_{\textrm{A}}$ as shown in 2(a) and $\widehat{\boldsymbol{X}}_{\textrm{P3}}$ is slightly more diverse than $\boldsymbol{X}_{\textrm{P3}}$ as shown in 2(b). These observations indicate that the enhanced generative capabilities of gpt-3.5-turbo contribute to the increased diversity in the generated instructions.

Lexical Diversity

To assess the lexical diversity, we employ the Moving-Average Type-Token Ratio (MATTR) metric Covington and McFall (2010) with a window size of 50, because each subset of $\boldsymbol{D}_{\textrm{ALL}}$ varies in size and MATTR is unaffected by text length.As presented in Table 2, the model-generated instructions $\widehat{\boldsymbol{X}}_{\{\cdot\}}$ from gpt-3.5-turbo exhibit lower diversity compared to the human-written instructions $\boldsymbol{X}_{\{\cdot\}}$ and the instructions $\widehat{\boldsymbol{X}}_{\textrm{A}}$ generated by text-davinci-003. We also observe that $\widehat{\boldsymbol{X}}_{\textrm{t,SI}}$ and $\widehat{\boldsymbol{Y}}_{\textrm{t,SI}}$ display higher diversity than $\widehat{\boldsymbol{X}}_{\textrm{SI}}$ and $\widehat{\boldsymbol{Y}}_{\textrm{SI}}$ , showcasing the effectiveness of topic-guidance. Furthermore, when comparing with each subset, $\boldsymbol{D}_{\textrm{ALL}}$ exhibits the highest lexical diversity.

Human Evaluation

We follow the human evaluation protocol given by Wang et al. (2022a), which categorizes the quality of the generated text into four levels from A (best) to D (worst). More details about the human evaluation protocol are presented in Appendix C. To evaluate the quality of the generated text, we randomly select 400 examples from each subset within $\boldsymbol{D}_{\textrm{ALL}}$ and have 8 external human experts rate the generated text. Overall, both the generated instructions and responses demonstrate a high level of quality, as depicted in Figure 3. However, we observe that when generating instructions using topic-guided instruction generation, gpt-3.5-turbo is susceptible to producing erroneous responses for these instructions. Furthermore, gpt-3.5-turbo is likely to produce wrong answers for the instructions based on P3 and FLAN.

Experiments

We present LaMini-LM, a family of language models instruction-tuned on our 2.58M instructions dataset $\boldsymbol{D}_{\textrm{ALL}}$ . We train two types of models, encoder-decoder and decoder-only, for architectural comparison. The size for both categories of models ranges from 61M to 7B to facilitate size comparison. The underlying models for initialization are from seven sources, including T5 Raffel et al. (2020), Flan-T5 Chung et al. (2022), Cerebras-GPT Dey et al. (2023), GPT-2 Radford et al. (2019), GPT-Neo Gao et al. (2021a), GPT-J Wang and Komatsuzaki (2021), and LLaMA Touvron et al. (2023). The details of our LaMini-LM series are summarized in Table 3. Training hyperparameters are described in Appendix D.

2 Model Evaluation

We then evaluate the performance based on several downstream NLP tasks as well as human evaluation on user-oriented instructions.

We conduct a zero-shot evaluation on the downstream NLP tasks for our LaMini-LM. We use language model evaluation harness Gao et al. (2021b) to evaluate our instruction-tuned models.https://github.com/EleutherAI/lm-evaluation-harness We select 15 diverse NLP tasks, covering QA, sentiment analysis, paraphrase identification, natural language inference, coreference resolution, word sense disambiguation, and sentence completion. The details for these NLP tasks are in Appendix E.

Human Evaluation on User-Oriented Instructions

The downstream NLP tasks focus on academic-oriented classification. To evaluate our LaMini-LM and baseline models practically, we use user-oriented instructions from Wang et al. (2022a). These instructions cover 71 commonly used app use-cases, totaling 252 instructions. Unlike the downstream NLP tasks, many questions have more than one correct answer, so human evaluation is also necessary to benchmark model performance. We follow the guidelines as in Appendix C to measure response quality, which rates the generated text into four levels from A (best) to D (worst). To balance annotation cost and instruction diversity, we include at most 2 instructions per app and filter out those covered in downstream NLP tasks like natural language inference, sentiment analysis, and summarization. The resulting test set for human evaluation contains 114 instructions. We form a team of 8 external human experts, each evaluating responses to 15 instructions across all models. Considering subjectivity in human annotation, we maintain consistency by having the same annotator score all the responses for a given instruction, following the same standard. Additionally, we anonymize the model name during human evaluation to avoid biases from our human evaluators.

Results and Discussions

In this section, we provide evaluation results and a discussion of LaMini-LM for both automatic evaluation on the downstream NLP tasks and human evaluation on user-oriented instructions.

For downstream NLP tasks, as shown in Figure 4, it is evident that larger models generally exhibit improved average performance. However, this increasing trend starts to diminish as the model size increases. Remarkably, some of our LaMini language models even surpass or achieve comparable performance to LLaMA-7B Touvron et al. (2023) and Alpaca-7B Taori et al. (2023). Additionally, we present the average performance of LaMini-LLaMA-7B in Figure 4, which significantly outperforms both LLaMA-7B and Alpaca-7B. These findings highlight the critical significance of the instruction dataset. Breakdown results be found in Appendix F.

Human Evaluation

We present the human evaluation results in Figure 5. Consistent with the trends observed in downstream NLP performance, larger models tend to exhibit better performance. Notably, encoder-decoder models from T5 demonstrate exceptional performance despite their relatively small size. However, we acknowledge the existence of a substantial gap between our LaMini language models and gpt-3.5-turbo. We attribute this gap to the quality of pre-trained LLMs and instruction datasets used by these models.

Foundation Model Choice

As shown in Figure 4 and Figure 5, the encoder-decoder LaMini language models outperform the decoder-only LaMini language models, particularly with limited parameters (<500M). Our LaMini-Flan-T5-248M even performs on par with LLaMA-7B. Thus, further exploration of the encoder-decoder architecture for language models is recommended due to their potential, as evidenced by our experiments. Additionally, the comparisons between LaMini-GPT and LaMini-Cerebras models of similar size reveal that LaMini-GPT performs significantly better on downstream NLP tasks and human evaluation. Similarly, vanilla GPT-2 models outperform comparable-sized Cerebras-GPT models, indicating a positive correlation between initial model performance and performance after instruction tuning. Finally, although the Flan-T5 models excel in downstream NLP tasks, they struggle with general user-oriented instructions. This deficiency can be mitigated by further fine-tuning with suitable instructions, underlining the necessity of thoughtful dataset design.

Utility of Subsets

To assess the efficacy of subsets in our LaMini instruction dataset, we randomly chose 52K examples from each subset, along with the original datasets Alpaca, P3, and FLAN. We fine-tune T5 and GPT-2 models on the sampled datasets in this experiment, as Flan-T5 models have been fine-tuned on the FLAN dataset. As shown in Table 4, the results demonstrate that the models fine-tuned on the self-instruct-related dataset (namely $\boldsymbol{A}$ , $\widehat{\boldsymbol{D}}_{\textrm{SI}}$ , $\widehat{\boldsymbol{D}}_{\textrm{t,SI}}$ , and $\widehat{\boldsymbol{D}}_{\textrm{A}}$ ) only exhibit marginal improvements. Conversely, those fine-tuned on either P3- or FLAN-related subsets (namely $\boldsymbol{P}$ , $\boldsymbol{F}$ , $\widehat{\boldsymbol{D}}_{\textrm{P3}}$ , $\widehat{\boldsymbol{D}}_{\textrm{FLAN}}$ , $\boldsymbol{D}_{\textrm{P3}}$ , and $\boldsymbol{D}_{\textrm{FLAN}}$ ) exhibit significantly better performance. Referring to the human evaluation results in Figure 5, we find that self-instruct-related datasets have a significant impact on human evaluation, while P3- and FLAN-related datasets offer more benefits for downstream NLP tasks. This discrepancy highlights the significance of considering both evaluation types in dataset construction.

Hallucination and Toxicity

LLMs often generate hallucinations, producing text that is either factually incorrect or incoherent. To investigate this problem, we simplify it as a “question rejection” challenge, treating it as a binary classification task. The goal is to determine whether an LLM can accurately identify and reject unanswerable or inappropriate questions. An ideal model should reject a question with a justified explanation (if provided). To achieve this, we created the LaMini-Hallucination test set,https://huggingface.co/datasets/MBZUAI/LaMini-Hallucination which consists of four categories: “did not happen (DNH)”, “far future (FF)”, “nonsense (NS)”, and “obscure (Ob.)”. Each category contains 10 questions. All questions are listed in Appendix H. We use recommended models listed in Table 3 to address these questions and evaluate the quality of generated responses through human evaluation. The evaluation results regarding hallucination are presented in Table 5. After fine-tuning our LaMini language models on the LaMini instruction dataset, we notice significant improvements in preventing hallucinations compared to Alpaca, which fails to reject all questions. However, it is important to acknowledge that there is still a notable disparity between current open-sourced LLMs and proprietary LLMs when it comes to tackling the hallucination issue. Additionally, we observe that current open-sourced LLMs struggle particularly with answering “did not happen” and “nonsense” questions. This study emphasizes that although current instruction-tuned language models, including our own and other open-sourced LLMs, exhibit strong performance, they still face significant challenges regarding hallucinations.

Toxicity

LLMs have been observed to demonstrate a tendency to generate toxic language, making their safe deployment challenging. To assess this issue with our LaMini-LM models, we utilize the RealToxicityPrompts dataset Gehman et al. (2020). We randomly select 1K non-toxic prompts (toxicity score < 0.1) and 1K toxic prompts (toxicity score > 0.9) from this dataset. Using the instruction prefix “Complete the sentence:”, we generate outputs using recommended LaMini models and their baselines. We then employ the OpenAI Moderation API detect the toxicity of the generated outputs, as shown in Table 6.https://platform.openai.com/docs/guides/moderation/overview When examining text generation models, it is generally observed that the encoder-decoder models (LaMini-Flan-T5 series) tend to produce text with lower toxicity in comparison to the decoder-only models (LaMini-GPT series and LaMini-LLaMA-7B). However, when fine-tuned on our LaMini instruction dataset, the encoder-decoder models exhibit an increased tendency to generate toxic text, whereas the decoder-only models are less inclined to produce toxic content. This highlights a notable distinction in these models after instruction-tuning. We leave the further investigation as future work.

Conclusion

In this study, we present a large-scale instruction dataset derived from gpt-3.5-turbo, containing over 2.58M examples. We refer to this dataset as the LaMini instruction dataset, which currently holds the distinction of being the largest dataset of its kind. Our research focuses on distilling knowledge from LLMs into smaller, more efficient model architectures. We introduce a family of language models called LaMini-LM, consisting of 6 encoder-decoder models and 11 decoder-only models with different sizes (ranging from 61M to 7B). Through a comprehensive evaluation, including automatic evaluation of downstream NLP tasks and human evaluation of general usage, hallucination, and toxicity, we demonstrate that our proposed models achieve comparable performance to Alpaca Taori et al. (2023) while being significantly smaller in size. For the hallucination problem, we carefully curate 40 questions and find out that current LLMs still face significant challenge in this area. Our work sheds light on the process of distilling knowledge from LLMs to significantly smaller models and the potential of training efficient yet effective language models.

Limitations

In this paper, we explore instruction tuning on various small-size language models and performe evaluation across multiple benchmarks. However, our work still has some limitations:

Model Variations: Compared to previous studies that often only offer a single model without comprehensive evaluation, our work stands out by providing thorough analysis across multiple models with varying configurations. However, our current model selection is somewhat limited, consisting of T5, GPT-2, Cerebras-GPT, GPT-Neo and LLaMA as our base models. To enhance our understanding of performance trends and enable more meaningful comparisons with prior research, it would be advantageous to expand our exploration to include more models.

Single Turn Dialog: Although our training data and user-oriented evaluation primarily focus on "dialog-like" instructions, it is essential to acknowledge that our models are not currently optimized for handling multi-turn dialogues.

Error Propagation: Our models have undergone training utilizing condensed knowledge obtained from gpt-3.5-turbo, thereby inheriting the potential risks associated with it. The presence of hallucination and toxicity in LaMini-LM models is evident from the findings presented in Section 6. Furthermore, our evaluation involving human feedback revealed unsatisfactory performance of LaMini-LM models in coding, mathematical problem-solving, and tasks demanding logical reasoning skills.

We leave these limitations to be addressed in the future work.

Ethical Consideration

We demonstrate that training small language models on large-scale instruction can significantly enhance their performance on downstream NLP tasks, as well as in human evaluation. These instruction-tuned models exhibit superior performance compared to significantly larger models and are particularly adept at engaging in open-ended conversation. Despite these advantages, it is important to acknowledge that these instruction-tuned models are not fully aligned with human objectives. They may frequently generate discriminatory responses and propagate biases or other forms of discrimination originating from the teacher model. Moreover, as we detail in Section 6, these models often generate false information, which may have unintended consequences.

To mitigate any potential harm arising from the use of these models, we intend to minimize the risks associated with their use in future research. We advocate for the responsible use of our models to prevent any harm.

We acknowledge that we only use ChatGPT to improve the language of this work.

References

Appendix A Prompt with Topics

We present an example prompt for the Example-Guided Instruction Generation in Figure 6. For the Topic-Guided Instruction Generation, besides three random examples, we sample three random topics from the common topic list and present an example prompt in Figure 7.

Appendix B Response Generation

The Python code used to generate the response can be found in Figure Figure 8. Before asking gpt-3.5-turbo to generate responses, we firstly send a message as the “system” that requires gpt-3.5-turbo to respond the instructions as concise as possible to avoid the overly lengthy responses.

Appendix C Human Evaluation Protocol

We present the human evaluation protocol as well as the corresponding example for each rating level in Table 7. All the human evaluators in this work are external to the authors and have at least a master’s degree from an English-speaking country.

Appendix D Training Hyperparameters

Our model fine-tuning process involves training all models for 5 epochs using a batch size of 1024, with the exception of LaMini-GPT-J-6B and LaMini-LLaMA-7B. Due to limitations in computational resources, these two models are only fine-tuned for 6K steps, which is equivalent to 2.5 epochs. For our encoder-decoder models, we use a learning rate of $5\times 10^{-4}$ following Chung et al. (2022). For our decoder-only models, we follow the same configuration as Alpaca Taori et al. (2023) including the learning rate of $2\times 10^{-5}$ . We use HuggingFace’s transformers for training. Moreover, we use the same prompt wrapper as Alpaca Taori et al. (2023), hence we also wrap our instruction similarly during inference. We perform all of our experiments on 8 $\times$ V100 (32G) and 8 $\times$ A100 (40G) GPUs. Our models are publicly available.

Appendix E Automatic Evaluation Datasets

We present the details of 15 downstream NLP tasks, including the number of test examples and the corresponding evaluation metrics, in Table 8.

Appendix F Automatic Evaluation Results

The breakdown results given by LaMini-T5, LaMini-Flan-T5, LaMini-Neo, LaMini-Cerebras and LaMini-GPT are presented in Table 9,Table 10,Table 11,Table 12 and Table 13 respectively. We also present the breakdown results given by LaMini-GPT-J-6B and LaMini-LLaMA-7B in Table 14.

Appendix G Qualitative Analysis

Revised: In this study, we compare the model responses obtained through user-oriented human evaluation, as presented in Table 15 and Table 16. Our qualitative analysis reveals that the responses generated by LaMini-LM tend to be shorter than those generated by the Alpaca-7B model. This discrepancy can be attributed to the constraint we imposed on the gpt-3.5-turbo model during the response generation process described in Section 3.2, which prioritizes concise responses. As shown in Table 15, LaMini-LM responds correctly to the given instructions and generates coherent responses with minor errors, while Alpaca fails to respond appropriately. However, it is important to note that LaMini-LM exhibits hallucination in its responses, whereas Alpaca generates responses with accurate information. These examples highlight that current language models are still prone to generating hallucinated and nonfactual information. We further evaluate the hallucination issue of LaMini-LM and its baselines in Section 6, and provide a more comprehensive discussion on the limitations of LaMini-LM in Section 8.

Appendix H Hallucination-Inducing Questions

We carefully craft 40 hallucination-inducing questions as shown in Table 17.