Large Language Models Can Self-Improve

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han

Introduction

Scaling has enabled Large Language Models (LLMs) to achieve state-of-the-art performance on a range of Natural Language Processing (NLP) tasks (Wang et al., 2018; 2019; Rajpurkar et al., 2016). More importantly, new capabilities have emerged from LLMs as they are scaled to hundreds of billions of parameters (Wei et al., 2022a): in-context few-shot learning (Brown et al., 2020) makes it possible for an LLM to perform well on a task it never trained on with only a handful of examples; Chain-of-Thought (CoT) prompting (Wei et al., 2022b; Kojima et al., 2022) demonstrates strong reasoning ability of LLMs across diverse tasks with or without few-shot examples; self-consistency (Wang et al., 2022b) further improves the performance via self-evaluating multiple reasoning paths.

Despite these impressive capabilities of models trained on large text corpora (Brown et al., 2020; Chowdhery et al., 2022), fundamentally improving model performance beyond few-shot baselines still requires fine-tuning on an extensive amount of high-quality supervised data. FLAN (Wei et al., 2021; Chung et al., 2022) and T0 (Sanh et al., 2022) curated tens of benchmark NLP datasets to boost zero-shot performance on unseen tasks; InstructGPT (Ouyang et al., 2022) crowd-sourced many human answers for diverse sets of text instructions to better align the model with human instructions. While significant effort has been committed to collecting high-quality supervised datasets, the human brain, in contrast, is capable of metacognition (Dunlosky & Metcalfe, 2008), through which we can refine our own reasoning ability without external inputs.

In this paper, we study how an LLM is able to self-improve its reasoning ability without supervised data. We show that using only input sequences (without ground truth output sequences) from multiple NLP task datasets, a pre-trained LLM is able to improve its performance on both in-domain and out-of-domain tasks. Our method is shown in Figure 1: we first sample multiple predictions using few-shot Chain-of-Thought (CoT) (Wei et al., 2022b) as prompts, filter “high-confidence” predictions using majority voting (Wang et al., 2022b), and finally fine-tune the LLM on these high-confidence predictions. The resulting model shows improved reasoning in both greedy and multi-path evaluations. We call a model fine-tuned in this way Language Model Self-Improved (LMSI). This is similar to how a human brain sometimes learns: given a question, think multiple times to derive different possible results, conclude how the question should be solved, and then learn from or memorize its own solution. We empirically verify our method using a pre-trained PaLM-540B LLM, where our method not only improves training task performance ($74.4\% \rightarrow 82.1\%$ on GSM8K, $78.2\% \rightarrow 83.0\%$ on DROP, $90.0\% \rightarrow 94.4\%$ on OpenBookQA, and $63.4\% \rightarrow 67.9\%$ on ANLI-A3), but also enhances performance on out-of-domain (OOD) test tasks (AQUA, StrategyQA, MNLI), achieving state-of-the-art performance on many tasks without relying on supervised ground truth answers. Lastly, we conduct preliminary studies on self-generating additional input questions and few-shot CoT prompts, which could further reduce the amount of human effort required for model self-improvement, as well as ablation studies on important hyperparameters of our approach. We hope our simple approach and strong empirical results will encourage more future work by the community to investigate the optimal performance of pre-trained LLMs without additional human supervision.

Our contributions are summarized as follows:

We demonstrate that a large language model can self-improve by taking datasets without ground truth outputs, by leveraging CoT reasoning (Wei et al., 2022b) and self-consistency (Wang et al., 2022b), achieving competitive in-domain multi-task performances as well as out-of-domain generalization. We achieve state-of-the-art-level results on ARC, OpenBookQA, and ANLI datasets.

We provide detailed ablation studies on training sample formatting and sampling temperature after fine-tuning, and identify critical design choices for most successful self-improvement by LLMs.

We study two other approaches for self-improvements, where the model generates additional questions from finite input questions and generates few-shot CoT prompt templates itself. The latter achieves 74.2% on GSM8K, which is the state-of-the-art zero-shot performance, against 43.0% by Kojima et al. (2022) or 70.1% through its naive extension with Wang et al. (2022b).

Related Work

Augmenting a machine learning model with explanations has been studied in existing literature extensively. For example, in the supervised learning setting, a model can be fine-tuned using human-annotated rationales (Zaidan et al., 2007; Ling et al., 2017b; Narang et al., 2020; Camburu et al., 2018; Cobbe et al., 2021; Chung et al., 2022). A few works have also looked at how explanations can help the models in various settings, e.g., in-context learning (Lampinen et al., 2022) and in distillation (Pruthi et al., 2022). In this paper, we focus more on the unsupervised learning setting, where we do not assume we have a rationale-augmented training dataset available, since human-annotated rationales can be expensive.

Few-shot explanations improve reasoning in LLMs.

Recently, a lot of progress has been made towards improving LLMs’ reasoning abilities via prompting or in-context learning. Wei et al. (2022b) propose Chain-of-Thought prompting, which prompts the language model to generate a series of natural-language-based intermediate steps, and show it can help language models better solve complex and multi-step reasoning tasks. Wang et al. (2022b) improve Chain-of-Thought prompting by sampling multiple diverse reasoning paths and finding the most consistent answers via majority voting. Kojima et al. (2022) propose to prompt the language model with “Let’s think step by step” to generate reasoning in a zero-shot fashion. Zhou et al. (2022a) further decompose the questions into multiple sub-questions, and ask the language model to solve each sub-question sequentially.

Refining explanations.

More recent work proposes to further refine the generated reasoning paths, as some of them could be unreliable. For example, Ye & Durrett (2022) calibrate model predictions based on the reliability of the explanations, and Jung et al. (2022) show that inducing a tree of explanations and inferring the satisfiability of each explanation can further help judge the correctness of explanations. Li et al. (2022b) show that sampling a diverse set of prompts from the training data, combined with a voting verifier, can improve a model’s reasoning performance. Zelikman et al. (2022) propose better rationale generation by adding ground truth answers as hints when predicted answers are incorrect. Our work is orthogonal to these lines of work, as we utilize refined explanations from Wang et al. (2022b) for fine-tuning the model for self-improvement, and could readily incorporate these other refinement techniques for generating higher-quality self-training data. Our work is similar to Zelikman et al. (2022) in that both propose to fine-tune a model on self-generated CoT data, but our method does not require ground truth labels and shows stronger empirical results with multi-task generalization.

Self-training models.

One related line of work is self-training (see a survey from Amini et al. (2022)). The key idea is to assign pseudo labels from a learned classifier to unlabeled data, and use these pseudo-labeled examples to further improve the original model training, e.g., (RoyChowdhury et al., 2019; Xie et al., 2020; He et al., 2020; Chen et al., 2021). Different from such prior work, our proposed self-improvement framework uses CoT prompting plus self-consistency to obtain high-confidence solutions on a large set of unlabeled data to augment the fine-tuning process.

Distillation and dark knowledge.

Our method also tangentially relates to the rich literature on distillation (Ba & Caruana, 2014; Hinton et al., 2015), where a student network imitates a teacher network’s classifier predictions on input examples. A key detail is to learn from soft targets instead of hard predicted labels, as softmax outputs with a high temperature reveal more detailed relative class likelihoods, colloquially known as dark knowledge (Hinton et al., 2015; Korattikara Balan et al., 2015). Recent studies (Zelikman et al., 2022; Snell et al., 2022; Eisenstein et al., 2022) show that dark knowledge within LLMs can be retrieved with more computation at inference time, such as adding informative instructions to the input sequence and generating CoT output (Wei et al., 2022b; Kojima et al., 2022). In our work, we explicitly show that imperfect CoT reasoning (which may lead to an incorrect answer) can be used directly for self-improving language models, as evidenced by our experiments in Sections 5.2 and 5.3.

Method

The overview of our method is illustrated in Fig. 1: We are given a pre-trained Large Language Model (LLM) $M$ and a question-only training dataset $\mathcal{D}^{\mathtt{train}}=\{x_{i}\}_{i=1}^{D}$ with few-shot Chain-of-Thought (CoT) examples (Wei et al., 2022b). We apply multiple path decoding with a sampling temperature $T>0$ for generating $m$ reasoning paths and answers $\{r_{i_{1}},r_{i_{2}},\dots,r_{i_{m}}\}$ for each question $x_{i}$ in $\mathcal{D}^{\mathtt{train}}$, and use majority voting (self-consistency) to select the most consistent, highest confidence answer (Wang et al., 2022b). We then keep all reasoning paths that lead to the most consistent answer, apply mixed formats of prompts and answers for augmentation, and fine-tune the model on these self-generated reasoning-answer data. We consider our approach as making the model self-improve. In the following sections, we detail important designs within our method, along with additional approaches for the model to self-improve without supervised data.

Self-consistency (Wang et al., 2022b) brings large improvements on reasoning tasks (e.g., $56.5\%\rightarrow 74.4\%$ on GSM8K test set), and the gap between greedy decoding and diverse decoding shows there is a potential for further improving the reasoning ability of $M$, using the self-selected high-confidence reasoning paths as training data.

For each training question $x_{i}$, we sample $m$ CoT reasoning paths, denoted as $\{r_{i_{1}},r_{i_{2}},\dots,r_{i_{m}}\}$ (see Table 1 for examples). Since $M$ is prompted with the CoT examples from Wei et al. (2022b), we apply the same output parsing with “The answer is” to generate their predicted answers $\{y_{i_{1}},y_{i_{2}},\dots,y_{i_{m}}\}$. The most consistent answer, which is not necessarily a correct answer, is selected by majority voting, denoted as $\tilde{y}_{i}=\operatorname*{arg\,max}_{y_{i_{j}}}\sum_{k=1}^{m}\mathbb{I}(y_{i_{j}}=y_{i_{k}})$. For all the training questions, we filter the CoT reasoning paths that reach $\tilde{y}$ as the final answer to be put into the self-training data, denoted as $\mathcal{D}^{\mathtt{self-consistent}}=\{x_{i},\tilde{\bm{r}_{i}}\}$, where $\tilde{\bm{r}_{i}}=\{r_{i_{j}} \mid 1\leq j\leq m,\ y_{i_{j}}=\tilde{y}_{i}\}$.
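The voting and filtering step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `parse_answer` and `self_consistent_paths` and the exact regular expression are assumptions made here, and paths without the “The answer is” marker are simply ignored.

```python
import re
from collections import Counter

# Assumed parsing rule: take whatever follows "The answer is" up to the end
# of the path, dropping one trailing period.
ANSWER_PATTERN = re.compile(r"The answer is\s*(.+?)\.?\s*$")


def parse_answer(path):
    """Return the text after 'The answer is', or None if the marker is absent."""
    match = ANSWER_PATTERN.search(path.strip())
    return match.group(1).strip() if match else None


def self_consistent_paths(paths):
    """Majority-vote over parsed answers and keep the paths that reach the winner.

    Returns (majority_answer, kept_paths). Ties are resolved by Counter's
    ordering, which is an arbitrary choice for this sketch.
    """
    parsed = [(path, parse_answer(path)) for path in paths]
    votes = Counter(answer for _, answer in parsed if answer is not None)
    if not votes:
        return None, []
    majority, _ = votes.most_common(1)[0]
    kept = [path for path, answer in parsed if answer == majority]
    return majority, kept
```

For example, given three sampled paths of which two end in “The answer is 4.” and one ends in “The answer is 5.”, the function returns `"4"` together with the two agreeing paths; no ground truth label is consulted at any point.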

Since we do not use any ground truth labels to filter out cases where $\tilde{y}_{i}\neq y_{i}$, it is important that the self-generated CoT reasoning paths are mostly reliable and that incorrect answers do not hurt the self-improvement of the model. We plot the relation between the accuracy and confidence of self-generated CoT paths for each question in the GSM8K training set in Fig. 2. The confidence is the number of CoT paths leading to $\tilde{y}$ divided by the total number of paths $m$. The y-axis shows the accuracy of $\tilde{y}$ at a given confidence. The circle area and the color darkness show the number of questions at a given confidence. We can observe that confident answers are more likely to be correct: when a question has many consistent CoT paths, the corresponding $\tilde{y}$ is more likely to be correct. Conversely, when $\tilde{y}$ is wrong, it is likely to be supported by fewer CoT paths, and so contributes little noise to the training samples.
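The confidence measure and the accuracy-versus-confidence analysis can be sketched as follows. This is an illustrative sketch only: the helper names and the equal-width binning are assumptions introduced here, and the gold labels are used purely for the diagnostic analysis, never for filtering the training data.

```python
from collections import Counter, defaultdict


def confidence(answers):
    """Return (majority answer, fraction of sampled paths that agree with it)."""
    votes = Counter(answers)
    majority, count = votes.most_common(1)[0]
    return majority, count / len(answers)


def accuracy_by_confidence(records, bins=4):
    """Group questions by confidence and report majority-answer accuracy per bin.

    records: iterable of (answers, gold) pairs, where `answers` is the list of
    parsed answers for one question and `gold` is its label (analysis only).
    Returns {bin_index: accuracy}; a confidence of 1.0 falls in the last bin.
    """
    hits = defaultdict(list)
    for answers, gold in records:
        majority, conf = confidence(answers)
        index = min(int(conf * bins), bins - 1)
        hits[index].append(majority == gold)
    return {index: sum(flags) / len(flags) for index, flags in sorted(hits.items())}
```

A question whose four sampled answers are `["4", "4", "4", "5"]` has confidence 0.75; one whose answers are all different has confidence 0.25 and would contribute only a single path to the training data.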

3.2 Training with Mixed Formats

To prevent the language model from overfitting to specific prompts or answer styles, we create four different formats for each reasoning path to be mixed in the self-training data, shown in Table 2. In the first format, a few Chain-of-Thought examples (questions followed by reasoning paths leading to the correct final answers) are prepended to the new question, while the language model output is trained to be the same as the filtered CoT reasoning path. In the second format, we use examples of questions and their direct answers as standard prompting, and the language model output is supposed to contain only the direct answer. The third and fourth formats are similar to the first and second formats, except that no example question-answer pairs are given, so that the model learns to think on its own in an in-context zero-shot manner. In the third format, where we want the model to output CoT reasoning without prepending examples containing CoT reasoning, we append “Let’s think step by step.” at the end of the input sequence, to guide the language model to generate step-by-step CoT reasoning paths (Kojima et al., 2022). The mixed formats of training samples are then used to fine-tune the pre-trained language model $M$.
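As a rough illustration of the four formats, the sketch below turns one filtered reasoning path into four (input, target) pairs. The `Q:`/`A:` layout, the function name, and the way demonstrations are joined are assumptions made for this sketch; the exact templates are those of Table 2.

```python
def mixed_format_samples(question, path, answer, cot_demos, direct_demos):
    """Return four (input, target) pairs for one filtered reasoning path.

    cot_demos: few-shot examples whose answers contain reasoning (assumed
               to be a ready-made string).
    direct_demos: few-shot examples with direct answers only.
    """
    query = f"Q: {question}\nA:"
    return [
        # Format 1: few-shot CoT examples, target is the reasoning path.
        (f"{cot_demos}\n\n{query}", path),
        # Format 2: few-shot standard prompting, target is the direct answer.
        (f"{direct_demos}\n\n{query}", answer),
        # Format 3: no examples, step-by-step trigger, target is the path.
        (f"{query} Let's think step by step.", path),
        # Format 4: no examples, target is the direct answer.
        (query, answer),
    ]
```

Each retained path therefore yields four training samples, two of which teach the model to reason explicitly and two of which teach it to answer directly.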

3.3 Generating Questions and Prompts

Given a set of training questions and a few human-written Chain-of-Thought (CoT) examples as prompts, our proposed approach enables model self-improvement. However, when the amount of training questions or CoT examples is limited, our method may not generate sufficient training samples for language model self-training. Collecting questions from the web requires human engineering. To further reduce human effort, we investigate how to self-generate more training questions as well as example prompts.

Previous work (Yoo et al., 2021; Meng et al., 2022) discusses few-shot data augmentation by generating diverse training samples using LLMs. However, those methods are designed for classification tasks and require a ground truth label for each few-shot example. We use a simple yet effective approach to generate diverse in-domain questions (without ground truth answers). Specifically, we randomly select several existing questions, concatenate them in a random order as the input prompt, and let the language model generate the continuation as new questions. We repeat the process to obtain a large set of new questions, then use self-consistency (Wang et al., 2022b) to keep only the questions that have a highly confident answer. Those questions are then used as self-generated training questions.
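The question-generation loop described above might look like the following sketch. The callables `generate` and `sample_answers` are hypothetical stand-ins for calls to the language model, and the `Q:` prompt layout, the default `k=3`, and the confidence `threshold` are assumptions introduced here; the text does not specify them.

```python
import random


def build_question_prompt(questions, k, rng):
    """Concatenate k randomly chosen questions, in random order, as a prompt."""
    chosen = rng.sample(questions, k)  # sampling already yields a random order
    return "\n\n".join(f"Q: {q}" for q in chosen) + "\n\nQ:"


def generate_questions(questions, generate, n_rounds, k=3, seed=0):
    """Repeatedly prompt the model (hypothetical `generate`) for new questions."""
    rng = random.Random(seed)
    new_questions = []
    for _ in range(n_rounds):
        text = generate(build_question_prompt(questions, k, rng)).strip()
        if text:
            new_questions.append(text)
    return new_questions


def keep_confident(questions, sample_answers, threshold):
    """Keep questions whose majority answer reaches the assumed threshold.

    sample_answers: hypothetical callable returning the parsed answers of
    several sampled reasoning paths for one question.
    """
    kept = []
    for question in questions:
        answers = sample_answers(question)
        top = max(set(answers), key=answers.count)
        if answers.count(top) / len(answers) >= threshold:
            kept.append(question)
    return kept
```

In use, `generate_questions` would be called with the real training questions as seeds, and `keep_confident` would then discard generated questions on which the sampled reasoning paths disagree.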

Prompt Generation

Given a set of questions, humans can write CoT examples as reasoning paths leading to the final answer. In the zero-shot setting without manual prompts, we can generate these CoT paths using the model itself. Following Kojima et al. (2022), we start the answer with “A: Let’s think step by step.” and let the language model generate the subsequent reasoning paths. We then use those generated reasoning paths as examples for few-shot CoT prompting.
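A minimal sketch of this prompt-generation procedure is shown below. The callable `generate` is a hypothetical stand-in for the language model, and both the `Q:`/`A:` layout and the decision to keep the trigger phrase inside each generated example are assumptions of this sketch.

```python
def zero_shot_cot_input(question):
    """Build the zero-shot input that starts the answer with the trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."


def build_few_shot_prompt(demo_questions, generate, new_question):
    """Use model-written reasoning as few-shot CoT examples for a new question.

    generate: hypothetical callable mapping an input string to the model's
    continuation (the generated reasoning path).
    """
    demos = []
    for question in demo_questions:
        prefix = zero_shot_cot_input(question)
        reasoning = generate(prefix).strip()
        demos.append(f"{prefix} {reasoning}")
    return "\n\n".join(demos) + f"\n\nQ: {new_question}\nA:"
```

Because the examples are produced by the model itself, some of them may contain wrong reasoning; the resulting prompt is still usable for multi-path decoding with majority voting.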

Experimental Setup

We demonstrate the effectiveness of our method on three types of tasks, evaluating on the test sets of GSM8K, ARC, OpenBookQA, and ANLI, and on the dev set of DROP (ground truth labels of the test set are not publicly available):

Arithmetic reasoning: We use the math problem set GSM8K (Cobbe et al., 2021), and a reading comprehension benchmark DROP (Dua et al., 2019) which requires numerical reasoning. We follow Zhou et al. (2022a) to partition the DROP dataset into football related and non-football related subsets for training.

Commonsense reasoning: We use the OpenBookQA (Mihaylov et al., 2018) dataset, and the AI2 Reasoning Challenge (ARC) (Clark et al., 2018) dataset. Note that for ARC, we only use the Challenge sub-set (ARC-c) in our experiments. Both datasets contain multiple-choice questions.

Natural Language Inference: We use the Adversarial NLI (ANLI) (Nie et al., 2020) subsets, ANLI-A2 and ANLI-A3, which are the more challenging subsets compared to ANLI-A1. These datasets contain pairs of sentences with relations of entailment, neutral, or contradiction.

Models, Training settings and Hyperparameters

We follow previous studies (Wei et al., 2022b; Wang et al., 2022b) and conduct our experiments on an autoregressive Transformer-based language model with 540 billion parameters. The CoT examples for each dataset are listed in Appendix A.2. We generate $m=32$ reasoning paths for each question in a training set. Since each reasoning path is augmented into four formats in Sec. 3.2, the final training samples are up to the size of $128\times|\mathcal{D}^{\mathtt{train}}|$, with $|\mathcal{D}^{\mathtt{train}}|$ being the size of the corresponding training set. For all datasets except DROP, we use the whole training set; to reduce the training burden, we sample 5k examples from the non-football and football partition of the DROP dataset, and sample 5k examples from ANLI-A2 and ANLI-A3. For each dataset, we fine-tune the model for 10k steps with a learning rate of 5e-5 and a batch size of 32. For multiple path decoding, we use a sampling temperature of $T=0.7$ with the pre-trained model as suggested by Wang et al. (2022b). We use $T=1.2$ for the language model after self-improvement (LMSI). We set the maximum number of decoded steps to 256 for all experiments.
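The stated upper bound on the number of training samples follows directly from the sampling and formatting choices. Writing $\mathcal{D}^{\mathtt{self}}$ for the final set of fine-tuning samples (a symbol introduced here only for this check), with $m=32$ paths per question and four formats per retained path:

```latex
\[
  |\mathcal{D}^{\mathtt{self}}|
  \;\le\; 4 \cdot m \cdot |\mathcal{D}^{\mathtt{train}}|
  \;=\; 4 \cdot 32 \cdot |\mathcal{D}^{\mathtt{train}}|
  \;=\; 128 \times |\mathcal{D}^{\mathtt{train}}| .
\]
```

Equality would hold only if all 32 sampled paths agreed with the majority answer for every question; since paths that disagree are filtered out, the actual number of samples is typically smaller.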

Results

We conduct a series of experiments to demonstrate the effectiveness of our proposed self-improving method. First, we apply our method on each individual dataset (task) and report the results. We then merge the generated data from all datasets and train one model to study the generalization ability of the model on unseen datasets as in (Wei et al., 2021). In addition to the results of using generated CoT reasoning paths, we show studies on generating input questions and few-shot prompts. We end with ablation studies on model sizes and hyperparameters.

We list the results of using the PaLM-540B model before and after LMSI in Table 3. For each model, during test time, we apply three separate prompting methods on all six datasets: Standard-Prompting, CoT-Prompting, and Self-Consistency. We observe that after LMSI, the performance of all three prompting methods increases by a large margin. Comparing self-consistency with and without LMSI, we observe significant improvements: $+7.7\%$ on GSM8K, $+4.8\%$ on DROP, $+4.4\%$ on OpenBookQA, and $+4.5\%$ on ANLI-A3. This shows that our proposed method is quite effective. Furthermore, the single-path CoT-Prompting performance of LMSI is close to or even better than the multiple-path Self-Consistency performance of the model without LMSI, showing that LMSI truly helps the language model learn from the multiple consistent reasoning paths. We also compare our results with previous SOTA, achieved by different methods on different datasets, listed in Table 3. On ARC-c, OpenBookQA, ANLI-A2 and ANLI-A3, LMSI outperforms previous SOTA. On the GSM8K dataset, LMSI is close to the DiVeRSe approach (Li et al., 2022a), which uses diverse prompts and a voting verifier to ensemble 100 output paths. In contrast, we use only 32 output paths for self-generating training samples and for self-consistency with LMSI. On the DROP dataset, LMSI is close to the OPERA approach (Zhou et al., 2022b), which uses ground truth labels for training, whereas our method leverages only the questions in the training set, without using ground truth labels.

To demonstrate the generalization ability of LMSI, we conduct experiments of self-training on a mixture of the training-set questions from the above six datasets (denoted as In-Domain tasks), then use the same model checkpoint for the evaluation on six Out-Of-Domain (OOD) tasks, as shown in Table 4. Of all the OOD tasks: (1) AQUA (Ling et al., 2017a) and SVAMP (Patel et al., 2021) are arithmetic reasoning tasks; (2) StrategyQA (Geva et al., 2021) is a commonsense reasoning task; (3) ANLI-A1 (Nie et al., 2020), RTE (Dagan et al., 2005) and MNLI-M/MM (Williams et al., 2018) are natural language inference tasks. (We evaluate on the test sets of SVAMP and ANLI and on the dev sets of MNLI and RTE, since ground truth labels of the test sets are not publicly available. For StrategyQA we use the question-only set from bench collaboration (2022).) Among these tasks, AQUA, StrategyQA, and RTE are significantly different from any In-Domain task. These three tasks have their own few-shot prompts. From Table 4, we can observe that LMSI achieves higher accuracy on all OOD tasks, showing that the overall reasoning ability of the language model is improved.

Importance of training with Chain-of-Thought formats

We demonstrate the importance of training language models with Chain-of-Thoughts compared to training with only direct answers. In Table 5, we list the results of LMSI with all four formats, and the results of LMSI with only direct answer formats. The results clearly show that without the CoT formats, the language model can still self-improve, but the performance gain drops by a large amount compared to using all four formats.

5.2 Pushing the limit of self-improvements

We further explore the few-shot setting where there are only limited training questions in the target domain. On GSM8K, we sample 10 real questions as few-shot samples, and use the language model to generate more training questions using the method in Section 3.3. We then self-train the language model with these generated questions and list the results in Table 6. The results show that using self-generated questions still improves the reasoning ability of language models, but using the real training-set questions leads to better results.

Self-Generating Few-Shot CoT Prompts

We explore the situation where no in-domain CoT examples are provided for a task. We apply the Step-by-Step method (Kojima et al., 2022) to generate CoT examples using the language model as described in Section 3.3, and show the results in Figure 3. We observe that few-shot prompting with self-generated Step-by-Step CoT examples substantially outperforms the Step-by-Step (Kojima et al., 2022) baseline (66.2% vs 53.8% at 10 paths, 74.2% vs 70.1% at 40 paths), and nearly matches the performance of human-written few-shot CoT (Wei et al., 2022b) (74.4% at 40 paths (Wang et al., 2022b)). The strong performance of “Few-Shot w/ Step-by-Step” despite the limited accuracy of the prompt examples (43.0% for greedy Step-by-Step) likely comes from leveraging more diverse CoT prompts for multi-path decoding (Li et al., 2022a): at 40 paths it uses 20 generated prompt templates, each with 4-shot CoT examples, i.e. a total of 80 generated CoT examples compared to the 8 human-written examples used in Wei et al. (2022b). Since we did not use training questions or few-shot CoT examples, 74.2% also marks the new state-of-the-art zero-shot performance on GSM8K.

5.3 Distillation to smaller models

We also explore whether the knowledge can be distilled to smaller models, such as in distillation (Hinton et al., 2015) and in Zelikman et al. (2022). We use the same set of training samples generated by the PaLM-540B model, but fine-tune on models with smaller sizes (PaLM-8B and PaLM-62B respectively), and show the results of CoT-prompting in Table 7. It is interesting to point out that after distillation from LMSI, the 62 billion model can outperform the pre-trained 540 billion model, and the 8 billion model can outperform the pre-trained 62 billion model. This implies that for downstream applications with limited computing resources, the reasoning knowledge from large models can be used to largely enhance small models to achieve competitive performance.

5.4 Hyperparameter Study

We study the effect of varying the temperature $T$ for multiple path decoding after LMSI is applied. Specifically, we vary $T$ over $[0.7, 1.0, 1.2, 1.5]$ and show the results on the GSM8K and DROP datasets respectively in Fig. 4(a). As shown in the figure, $T=1.2$ benefits both datasets the most, and is used in the Self-Consistency method for LMSI on all datasets. We notice that the optimal $T$ after model self-improvement is larger than the optimal $T=0.7$ (Wang et al., 2022b) before self-improvement. We believe the reason is that after training the model, the entropy of the output distribution is reduced.
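The entropy argument can be illustrated with a toy next-token distribution. The sketch below is not a measurement on any model: the logit values are hypothetical, and the assumption that fine-tuning simply sharpens the logits by a constant factor is made only to show why a higher sampling temperature can restore diversity.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits before fine-tuning, and a sharper version afterwards
# (every logit doubled, which is an assumption of this toy example).
base_logits = [2.0, 1.0, 0.0]
sharp_logits = [4.0, 2.0, 0.0]

# At the same temperature the sharper distribution has lower entropy ...
low = entropy(softmax(sharp_logits, 0.7))
high = entropy(softmax(base_logits, 0.7))

# ... and doubling the temperature exactly undoes the doubling of the logits.
restored = softmax(sharp_logits, 1.4)
```

In this toy setting, sampling the sharpened distribution at twice the temperature recovers the original sampling distribution, which is consistent with a larger optimal $T$ after self-improvement.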

Number of Sampled Reasoning Paths

We study whether the number of sampled reasoning paths $m$ for Self-Consistency largely affects the accuracy after LMSI is applied. We show the accuracy on the GSM8K test set for models both with and without LMSI in Fig. 4(b). In both cases, setting $m=15$ already achieves reasonably good accuracy, and using a larger $m$ brings only marginal improvements. We also notice that after Self-Improvement, using 5 paths for Self-Consistency can already surpass the performance of using 32 paths for the model without Self-Improvement. Thus, with a well-improved model, substantial computing resources can be saved in real applications.

Conclusions

We demonstrated that a Large Language Model (LLM) is capable of improving its performance on reasoning datasets by training on its own generated labels, given input questions only. Experiments using an LLM with 540 billion parameters show that our approach improves the accuracy scores on the six datasets by 1.1% to 7.7%, achieving new state-of-the-art results on ARC, OpenBookQA, and ANLI, without training on ground truth labels. Furthermore, we show that it is possible for the LLM to self-improve even on its own generated questions and few-shot Chain-of-Thought prompts. As part of our future work, we plan to combine large-scale generated data from our approach and existing supervised data, to further improve the performance of LLMs.

Acknowledgements

We thank Jason Wei (Google research), Hyung Won Chung (Google research), and Denny Zhou (Google research) for their advice and feedback on our work. We thank Yi Tay (Google research) for the guidance on UL2 fine-tuning. We also thank Jingjin Li (Cornell University) for the discussion about metacognition.

References

Appendix A

We also apply LMSI on a recently proposed public language model, UL2 (Tay et al., 2022), using the pre-trained model at step 2,650,000 (UL2: https://github.com/google-research/google-research/tree/master/ul2). We use a fixed set of hyperparameters for fine-tuning on each dataset. Specifically, we generate $m=40$ reasoning paths for each question in a training set for majority voting. We fine-tune the model for 10k steps with a learning rate of 5e-5 and a batch size of 32. For multiple path decoding, we use a sampling temperature of $T=0.5$ with the pre-trained UL2 model following Tay et al. (2022), and set $T=0.7$ for the language model after LMSI. We set the maximum number of decode steps to 256 for all experiments.

The results are shown in Table 8. For arithmetic reasoning datasets, we follow Tay et al. (2022) and provide both exact-match accuracy scores and accuracy scores after an equation-correction postprocessing step. We observe that for most datasets, LMSI still improves the reasoning accuracy, but the improvement on UL2 is not as large as that on PaLM-540B. We think the reason is that, since LMSI exploits the implicit rationale of language models, and the capacity of a language model is determined by its size, larger models can capture more high-order semantics and are more likely to benefit from LMSI.

A.2 Chain-of-Thought Prompts for Each Dataset

We list the Chain-of-Thought Prompts for each dataset for “CoT-Prompting” experiments and self-generated training samples.