STaR: Bootstrapping Reasoning With Reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman

Introduction

Human decision-making is often the result of extended chains of thought. Recent work has shown that explicit intermediate reasoning (“rationales”) can improve large language model (LLM) performance as well. For example, nye2021show demonstrated that LLMs explicitly trained to use “scratchpads” for intermediate steps can attain perfect in-distribution performance on arithmetic, and strong out-of-distribution generalization, while models trained to predict answers directly fail to do either. These works suggest that generating explicit rationales before giving a final answer (“rationale generation”) is valuable for LLMs across diverse tasks including mathematical reasoning, commonsense reasoning, code evaluation, social bias inference, and natural language inference. However, the two primary methods for inducing rationale generation both have serious drawbacks.

One approach to rationale generation is the construction of a fine-tuning dataset of rationales, either manually by human annotators or automatically with hand-crafted templates. Manual methods are expensive, and it is infeasible to construct such a dataset for each interesting problem. Meanwhile, template-based methods rely on automatically generated rationales but only work when a general solution is already known or reasonable hard-coded heuristics can be made.

An alternative is to leverage in-context learning by including only a few rationale examples in the language model prompt. This has been shown to improve accuracy on mathematical and symbolic reasoning tasks relative to prompting without rationales (“direct” prompting). Yet, while few-shot techniques with rationales tend to outperform their non-reasoning counterparts, they generally substantially underperform models fine-tuned to directly predict answers using larger datasets.

In this paper, we adopt a different approach: by leveraging the LLM’s pre-existing reasoning ability, we iteratively bootstrap the ability to generate high-quality rationales. Specifically, we few-shot prompt a large language model to self-generate rationales and refine the model’s ability further by fine-tuning on those rationales that lead to correct answers. We repeat this procedure, using the improved model to generate the next training set each time. This is a synergistic process, where improvements in rationale generation improve the training data, and improvements in training data further improve rationale generation.

However, we find this loop eventually fails to solve any new problems in the training set because it receives no direct training signal for problems it fails to solve. To overcome this issue, we propose rationalization: for each problem that the model fails to answer correctly, we generate a new rationale by providing the model with the correct answer. This lets the model reason backward—given the correct answer, the model can more easily generate a useful rationale. These rationales are then collected as part of the training data, which often improves overall accuracy.

We thus develop the Self-Taught Reasoner (STaR, Fig. 1) method, a scalable bootstrapping method allowing models to learn to generate their own rationales, while also learning to solve increasingly difficult problems. In our method, we repeat the following process: in each iteration, first construct a finetuning dataset by attempting to solve the dataset using the current model’s rationale generation ability; then, augment this dataset using rationalization, justifying ground-truth answers to problems the model failed to solve; finally, finetune the large language model on the combined dataset.

Applying STaR on arithmetic, math word problems, and commonsense reasoning, we observe it is able to effectively translate a small number of few-shot prompts into a large rationale dataset, yielding dramatic performance improvements. On CommonsenseQA, we find STaR improves over both a few-shot baseline (+35.9%) and a baseline fine-tuned to directly predict answers (+12.5%), and performs comparably to a fine-tuned model that is 30 $\times$ larger (72.5% vs. 73.0%).

Thus, we make the following contributions:

We propose a bootstrapping mechanism to iteratively generate a rationale dataset from a few initial examples with rationales—without needing to check new rationales’ correctness.

We complement rationale generation with rationalization, where a model is tasked with justifying an answer and then fine-tuned as if it had come up with the rationale without any hint. We show rationalization accelerates and improves the bootstrapping process.

We evaluate these techniques with a variety of ablations in both mathematical and commonsense reasoning domains.

We propose what is, to our knowledge, the first technique to allow a pre-trained large language model to iteratively use its language modeling capacity to improve itself.

Background and Related Work

Recently, a collection of works has emerged exploring the capacity for large language models to perform in-context learning . In essence, in-context learning treats few-shot learning as a language modeling problem, by showing a few examples in the context (i.e. prompt), and allowing the model to learn and identify the pattern to apply to new examples. Some have studied in-context learning based on the language modeling objective in terms of Bayesian inference xie2021explanation while others have attempted to describe the process more mechanistically in terms of “induction heads” . Moreover, differences in prompt configurations have been known to have dramatic effects on few-shot performance. Some have even found that replacing few-shot prompts with a “soft prompt” which can be optimized in embedding space results in noticeable gains . Instead of emphasizing the representation of the question, we focus on the model output; in particular, we focus on the model’s ability to reason through a problem before coming to a conclusion.

Rationales

One of the initial works on the impact of rationales on language model performance was rajani2019explain, showing that training a language model on a dataset with explicit rationales preceding the answer could improve a model’s ability to generate the final answer. However, this required many thousands of training examples to be manually annotated with human reasoning. Recently, nye2021show demonstrated that step-by-step “scratchpads” can improve fine-tuned LLM performance and generalization on tasks such as arithmetic, polynomial evaluation, and program evaluation. Similarly, wei2022chain used a single few-shot “chain-of-thought” reasoning prompt in order to improve model performance on a collection of tasks, without fine-tuning. Finally, polu2022formal showed that a curriculum learning approach could help solve formal math problems, as long as 1) they were translated into Lean (a theorem-proving language), 2) one could directly evaluate the validity of the proofs, 3) one could sample numerous potential solutions for each problem, 4) one had trained a separate value function model, and 5) one started with GPT-f (a model already fine-tuned on a large math dataset). We note that there are many domains where these conditions do not all apply. In addition, works have aimed to explain why rationales have this beneficial effect: some have analyzed their impact from the perspective of latent variable models while others have provided formal proofs of the benefit of intermediate task supervision.

Iterated Learning

A variety of iterated learning algorithms have been proposed, where solutions or successful methods which are found are in turn used to find additional solutions . anthony2017thinking introduced Expert Iteration (ExIt), a reinforcement learning technique serving as an inspiration for our approach. Essentially, it consists of a loop of self-play by an “apprentice,” followed by imitation learning with feedback from a slower “expert” and then the replacement of the expert with the now-improved apprentice. polu2022formal builds off of ExIt for formal reasoning, while vani2021iterated applies iterated learning to visual question answering using modular networks which can be combined compositionally. There are further similarities between STaR and expert iteration methods . For example, filtering generated examples based on whether their ultimate answer matches the target can be seen as expert feedback. However, we have a fixed “expert” and do not train a separate value function.

Natural Language Explanations

Natural language explanations have also been discussed from the perspective of explainable machine learning, focusing on justification rather than reasoning. The motivation for this line of work is largely grounded in explainable decision making, and similarly to rajani2019explain, generally does not find that requiring post-hoc explanations improves model performance.

Method

We are given a pretrained LLM $M$ and an initial dataset of problems $x$ with answers $y$ : $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{D}$ . Our technique starts with a small prompt set $\mathcal{P}$ of examples with intermediate rationales $r$ : $\mathcal{P}=\{(x^{p}_{i},r^{p}_{i},y^{p}_{i})\}_{i=1}^{P}$ , where $P\ll D$ (e.g. $P=10$ ). Like standard few-shot prompting, we concatenate this prompt set to each example in $\mathcal{D}$ , i.e. $x_{i}=(x^{p}_{1},r^{p}_{1},y^{p}_{1},\dots,x^{p}_{P},r^{p}_{P},y^{p}_{P},x_{i})$ , which encourages the model to produce a rationale $\hat{r}_{i}$ for $x_{i}$ followed by an answer $\hat{y}_{i}$ . We assume that rationales that lead to correct answers are of better quality than those that lead to incorrect answers. Therefore, we filter the generated rationales to include only the ones which result in the correct answer ( $\hat{y}_{i}=y_{i}$ ). We fine-tune the base model $M$ on this filtered dataset, and then restart this process by generating the new rationales with the newly fine-tuned model. We keep repeating this process until the performance plateaus. Note that during this process, once we collect a new dataset, we train from the original pre-trained model $M$ instead of continually training one model to avoid overfitting. We provide an outline of this algorithm in Algorithm 1.

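STaR can be seen as an approximation to an RL-style policy gradient objective. To see this, note that $M$ can be viewed as a discrete latent variable model $p_{M}(y\mid x)=\sum_{r}p(r\mid x)\,p(y\mid x,r)$ ; in other words, $M$ first samples a latent rationale $r$ before predicting $y$ . Given the indicator reward function $\mathbb{1}(\hat{y}=y)$ , the total expected reward across the dataset and its gradient are

$$J(M,X,Y)=\sum_{i}\mathbb{E}_{\hat{r}_{i},\hat{y}_{i}\sim p_{M}(\cdot\mid x_{i})}\,\mathbb{1}(\hat{y}_{i}=y_{i}), \tag{1}$$

$$\nabla J(M,X,Y)=\sum_{i}\mathbb{E}_{\hat{r}_{i},\hat{y}_{i}\sim p_{M}(\cdot\mid x_{i})}\left[\mathbb{1}(\hat{y}_{i}=y_{i})\cdot\nabla\log p_{M}(\hat{y}_{i},\hat{r}_{i}\mid x_{i})\right], \tag{2}$$

where the gradient is obtained via the standard log-derivative trick for policy gradients. Note that the indicator function discards the gradient for all sampled rationales that do not lead to the correct answer $y_{i}$ : this is the filtering process in STaR (Line 5). Thus, STaR approximates $J$ by (1) greedily decoding samples of $(\hat{r}_{i},\hat{y}_{i})$ to reduce variance of this estimate (at the cost of potentially biased exploration of rationales), and (2) taking multiple gradient steps on the same batch of data (similar to some policy gradient algorithms). These approximations make STaR a simple and broadly applicable method that can be implemented with standard LLM training machinery; future work should more closely investigate the link between STaR and the RL objective above.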

Rationalization

The rationale generation bootstrapping algorithm carries a limitation. Since the model is only trained on the examples which it answers correctly, improvement ends when the model fails to solve new problems in the training set. This is fundamentally due to the fact that the algorithm cannot obtain any training signal from failed examples. Inspired by rajani2019explain , we propose a technique we call “rationalization”. Specifically, we provide the answer as a hint to the model and ask it to generate rationales in the same style as in the previous rationale generation step. Given the answer, the model is able to reason backwards, and hence more easily generate a rationale that leads to the correct answer. For example, in Figure 2, we provide the hint that ”(b) grocery cart” is the correct answer in the prompt to generate the rationale. We apply rationalization to the problems which the model failed to solve with rationale generation. When adding a rationalization-generated rationale to our dataset, we do not include the hint in its corresponding prompt, as if the model had come up with the rationale without the hint. After filtering, we fine-tune on the previously generated dataset combined with the rationalization-generated dataset.
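The asymmetry between the rationalization prompt (which contains the hint) and the stored training example (which does not) can be sketched as follows. This is only an illustration: the prompt layout, the `(CORRECT)` marker, the function names, and the toy question are all assumptions made for the example, and only the choice “(b) grocery cart” comes from the text above.

```python
from typing import Dict, Optional, Tuple


def format_question(question: str, choices: Dict[str, str],
                    hint: Optional[str] = None) -> str:
    """Render a multiple-choice question; optionally mark the correct choice.

    The "(CORRECT)" marker is a hypothetical hint format for illustration.
    """
    lines = [f"Q: {question}", "Answer Choices:"]
    for label, text in choices.items():
        marker = " (CORRECT)" if label == hint else ""
        lines.append(f"({label}) {text}{marker}")
    lines.append("A:")
    return "\n".join(lines)


def rationalized_training_example(question: str, choices: Dict[str, str],
                                  answer: str, rationale: str) -> Tuple[str, str]:
    """Build the (prompt, target) pair that is stored for fine-tuning.

    The prompt is rendered WITHOUT the hint, as if the model had produced
    the rationale unaided.
    """
    prompt = format_question(question, choices, hint=None)
    target = f" {rationale} Therefore, the answer is ({answer})."
    return prompt, target


# Toy question, invented for this example.
toy_choices = {"a": "mouth", "b": "grocery cart"}
hinted_prompt = format_question("Where do shoppers put items?", toy_choices, hint="b")
stored_prompt, stored_target = rationalized_training_example(
    "Where do shoppers put items?", toy_choices, "b",
    "Items are carried in a cart while shopping.")
```

The hinted prompt is used only at sampling time; the stored pair is what the model would be fine-tuned on.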

Algorithm 1 describes the full algorithm, with the parts in blue corresponding to rationalization. Without those parts, Algorithm 1 corresponds to STaR without rationalization. Figure 1 provides an overview diagram. Fine-tuning on the dataset generated by rationalization has a crucial benefit of exposing the model to difficult problems which otherwise would not have appeared in its finetuning dataset. This can be understood as challenging the model to “think outside the box” about problems on which it was unsuccessful. A secondary benefit of rationalization is an increase in dataset size.
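A minimal sketch of the outer loop described in this section is given below. It is not the authors' implementation: `star`, `make_toy_model`, and `toy_finetune` are hypothetical names, a "model" is reduced to a callable that maps a question (and an optional hint) to a rationale and an answer, and the toy fine-tuning step simply memorizes its training set. The sketch does preserve two details stated above: only rationales that reach the correct answer are kept, and each iteration fine-tunes from the original base model rather than from the previous iteration's model.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, str]        # (question, answer)
Triple = Tuple[str, str, str]    # (question, rationale, answer)
Model = Callable[[str, Optional[str]], Tuple[str, str]]


def star(base_model: Model,
         finetune: Callable[[Model, List[Triple]], Model],
         dataset: List[Example],
         n_iterations: int,
         rationalize: bool = True) -> Model:
    """Run a STaR-style outer loop (illustrative sketch)."""
    model = base_model
    for _ in range(n_iterations):
        train_set: List[Triple] = []
        for x, y in dataset:
            r_hat, y_hat = model(x, None)           # rationale generation
            if y_hat == y:                          # filter on correctness
                train_set.append((x, r_hat, y))
            elif rationalize:
                r_rat, y_rat = model(x, y)          # hint = correct answer
                if y_rat == y:
                    train_set.append((x, r_rat, y)) # hint is not stored
        model = finetune(base_model, train_set)     # restart from base model
    return model


def make_toy_model(known: Dict[str, str]) -> Model:
    """Toy stand-in: answers known questions, or echoes the hint if given."""
    def model(question: str, hint: Optional[str] = None) -> Tuple[str, str]:
        if hint is not None:
            return f"reasoning toward {hint}", hint
        if question in known:
            return "recalled reasoning", known[question]
        return "guess", "?"
    return model


def toy_finetune(base_model: Model, train_set: List[Triple]) -> Model:
    """Toy stand-in for fine-tuning: memorize the training triples."""
    return make_toy_model({x: y for x, _, y in train_set})


data = [("q1", "a1"), ("q2", "a2")]
base = make_toy_model({"q1": "a1"})
without_rat = star(base, toy_finetune, data, n_iterations=3, rationalize=False)
with_rat = star(base, toy_finetune, data, n_iterations=3, rationalize=True)
```

In this toy setting the loop without rationalization never receives a training signal for `q2`, while the loop with rationalization does, which mirrors the motivation given above.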

Experiments

For our experiments, we focus on arithmetic, commonsense reasoning, and grade school math to demonstrate STaR’s breadth. In particular, for arithmetic, we follow a setup inspired by nye2021show . For commonsense question-answering we follow xie2021explanation and use CommonsenseQA (CQA), a widely used multiple-choice dataset for this domain . For grade school math, we use GSM8K from cobbe2021training .

We used GPT-J as our base language model, and the fine-tuning script from the GPT-J repository . We chose GPT-J, a 6B-parameter model, because the checkpoint and fine-tuning code are publicly available , and the model is large enough to generate rationales of non-trivial quality to be bootstrapped from. More hyperparameter details about GPT-J and our fine-tuning are included in Appendix H. Following the default setting of mesh-transformer-jax , we perform a 100-step learning rate warmup, from which point we use a constant learning rate. Unless stated otherwise, we start with $40$ training steps at the first outer loop, and increase the number of fine-tuning training steps by $20\%$ with each outer loop. In general, we found that training more slowly at the beginning ultimately benefits model performance. We expect that further improvement is possible via a thorough hyperparameter search—we leave this to future work due to computational constraints.
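The training-step schedule can be sketched as a geometric progression. The text only specifies the starting value (40 steps) and the 20% growth per outer loop; the closed-form expression, the truncation to an integer, and the `cap_iteration` parameter below are assumptions made for illustration.

```python
from typing import Optional


def finetune_steps(outer_iteration: int,
                   initial_steps: int = 40,
                   growth: float = 1.2,
                   cap_iteration: Optional[int] = None) -> int:
    """Number of fine-tuning steps for a 1-indexed outer-loop iteration.

    Assumes steps = initial_steps * growth ** (iteration - 1), truncated
    toward zero. If cap_iteration is given, later iterations reuse the
    step count of that iteration.
    """
    if outer_iteration < 1:
        raise ValueError("outer_iteration is 1-indexed")
    k = outer_iteration
    if cap_iteration is not None:
        k = min(k, cap_iteration)
    return int(initial_steps * growth ** (k - 1))


schedule = [finetune_steps(i) for i in range(1, 31)]
```

Under these assumptions the 30th iteration comes out to 7912 steps, which agrees with the cap reported for GSM8K later in the paper.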

For arithmetic problems, we first generate a dataset of 50,000 randomly sampled questions (uniformly over the digit lengths) in the format introduced by nye2021show. For each outer loop iteration on arithmetic, we sample 10,000 problems from the dataset. We use 10 random few-shot rationale examples for each digit length for its corresponding few-shot prompt. For each of the $9,741$ questions in the training set of CommonsenseQA, we add the question to the few-shot rationale prompt, and prompt the model to generate the rationale and answer for that question. For few-shot prompting on CQA, we start with the same 10 questions as used in wei2022chain, with the rationales modified slightly to fix an incorrect answer and to more explicitly reference relevant knowledge. We include these modified prompts in Appendix B. (Based on min2022rethinking, this modification is unlikely to meaningfully affect few-shot performance.) These prompts serve as our complete set of explanations. We run STaR until we see performance saturate, and we report the best results.

When performing rationalization, we find that the choice to include or omit few-shot prompts on outer-loop iterations after the first iteration does not have a substantial impact on the method’s ultimate performance. However, there are some nuances which we discuss further in Section 5, leading us to use few-shot prompts unless stated otherwise.

Datasets

The arithmetic task is to calculate the sum of two $n$ -digit integers. We generate the dataset based on the descriptions in nye2021show and visualize an example scratchpad in Figure 3. Everything up to and including “Target:” is given as part of a prompt, and the model is asked to generate the scratchpad (with its start and end marked by delimiter tags) and the final answer, as in nye2021show. Each line of the scratchpad corresponds to the summation of each pair of digits from the final digit to the first digit, the accumulating final digits of the answer, and a carry digit corresponding to whether the previous pair summed to at least 10. We include few-shot prompts for 1 to 5 digits. When performing rationalization, we include the correct answer after “Target” and query the model to produce the scratchpad and then reproduce the correct answer following the scratchpad.
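The information carried by each scratchpad line can be sketched as below. This is not the exact textual format of nye2021show or of Figure 3; the function name and the tuple layout are assumptions chosen only to show the digit pair, the accumulated answer digits, and the carry at each step.

```python
from typing import List, Tuple

Step = Tuple[int, int, Tuple[int, ...], int]


def addition_scratchpad(a: int, b: int) -> Tuple[List[Step], int]:
    """Add two non-negative integers digit by digit, last digit first.

    Each step records (digit_a, digit_b, answer_digits_so_far, carry_out),
    where carry_out is 1 if this pair (plus any incoming carry) summed to
    at least 10. That carry is what the next line would consume.
    """
    xs, ys = str(a), str(b)
    width = max(len(xs), len(ys))
    xs, ys = xs.zfill(width), ys.zfill(width)

    carry = 0
    digits: List[int] = []   # most significant digit first
    steps: List[Step] = []
    for dx, dy in zip(reversed(xs), reversed(ys)):
        total = int(dx) + int(dy) + carry
        carry = 1 if total >= 10 else 0
        digits.insert(0, total % 10)
        steps.append((int(dx), int(dy), tuple(digits), carry))

    if carry:
        digits.insert(0, carry)  # final carry becomes the leading digit
    answer = int("".join(str(d) for d in digits))
    return steps, answer


steps_29_57, sum_29_57 = addition_scratchpad(29, 57)
```

For `29 + 57`, the first step handles the pair (9, 7) and produces a carry, and the second step handles (2, 5) together with that carry.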

CommonsenseQA

The multiple-choice commonsense reasoning task, CommonsenseQA (CQA), is constructed from ConceptNet, a semantic graph of concepts and their relationships with over a million nodes. The dataset’s creators identified a set of “target” concepts in ConceptNet for each question, where the target concepts share a semantic relationship to one “source” concept. Then each question is crowdsourced to allow a reader to identify one target concept, while mentioning the source concept. In addition, two distractor answers are added. The dataset has 12,247 questions, each with five choices, with 9,741 in the train set, 1,221 in the dev set, and 1,285 in the (withheld) test set.

Corresponding to the broad variety of ConceptNet, CQA contains a diverse set of questions which require commonsense reasoning ability building off of standard world knowledge, where human performance is 89%. Many have pointed out that CQA contains a number of biases, along several dimensions including gender. We discuss how this may impact our method in Appendix G. There are also many typos and questions which are fundamentally ambiguous. For example, “Billy bought coffee and waited for his wife to arrive from France. Where might he have been?” includes airport and train station as options. The correct answer, perhaps surprisingly, is train station. We use it despite these issues as it is a general question-answering dataset relying on both common world knowledge and simple reasoning, which serves as a good test-bed for our method.

Grade School Math (GSM8K)

We also evaluate on the Grade School Math (GSM8K) dataset, which contains 7,473 train and 1,319 test examples of grade-school-level word problems . These math problems are posed in natural language and require two to eight calculation steps to arrive at a final answer. This dataset combines the skills needed for arithmetic and commonsense reasoning.

Symbolic Reasoning: Results on Arithmetic

The accuracies of the model across digits $1$ - $5$ over each iteration of the outer loop are plotted in Figure 4. After running STaR for 16 iterations, the overall accuracy is $89.5\%$ . For reference, a baseline trained on 10,000 examples without rationales for 5,000 steps attains $76.3\%$ accuracy. Notably, few-shot accuracy on arithmetic problems is very low, even with rationales: accuracy on 2-digit addition is less than $1\%$ , and accuracy on more digits close to zero.

With rationalization, the accuracy is able to improve especially quickly. After one fine-tuning iteration on the model’s generated scratchpads, 2-digit addition improves to $32\%$ from less than 1%. Without rationalization, the performance improvement is stage-wise: the model generally has poor performance on the $n$ -digit sum until it has good performance on the $(n-1)$ -digit sum. With rationalization, the model can learn many lengths at once, though not with equal accuracy. Rationalization allows many problems to be solved few-shot, so we start STaR training with 300 steps (note, doing so without rationalization causes overfitting on $1$ -digit addition), and increase training by 20 steps per iteration.

We also perform an experiment where we continue pre-training STaR with rationalization with additional digits, starting before the 20th iteration, while keeping the total number of training examples fixed at each iteration. We find that not only does this appear to quickly improve performance on the initial set of digits, but when evaluated on 9 and 10 digit examples, never seen during training, the model successfully solves many of these out-of-distribution problems. As visualized in Figure 5, the introduction of these digits appears to make the training less stable, but the exact cause is unclear.

Natural Language Reasoning: Commonsense Question Answering

The CommonsenseQA (CQA) setting introduces several new challenges. In the arithmetic task, an incorrect scratchpad in the reasoning step, and to a lesser degree in the rationalization step, was extremely likely to result in an incorrect answer. On the other hand, CQA problems are 5-way multiple choice questions. Thus, one will get the right answer at random approximately 20% of the time, regardless of the quality of reasoning. Moreover, as prior work has shown, some simple heuristics (e.g. semantic similarity) can meaningfully improve this to $\approx$ 30% without any reasoning.

We evaluate this dataset as described in the experimental protocol and compare to several baselines. The first baseline is to finetune GPT-J to directly output the final answer, which we call “GPT-J Finetuned”. We also compare to GPT-3 finetuned to directly predict the final answer from xu2021human, and a 137B parameter LaMDA model few-shot prompted with chain-of-thought (CoT) rationales from wei2022chain.

We found that, as shown in Table 1, STaR without rationalization outperformed GPT-J fine-tuned directly on the final answer for the entire dataset, despite training on less of the data. The inclusion of rationalization improved this performance to $72.5\%$ , far closer to the $73\%$ of the 30 $\times$ larger GPT-3. As expected, we also see STaR surpassed the few-shot baselines, including the much-larger 137B LaMDA model . We expect accuracy would be further improved if we applied STaR to a model with higher few-shot performance.

Note that it is harder to judge the rationale quality: for arithmetic, one can compare them to the ground truth rationales, but for CQA the evaluation is necessarily qualitative. For this reason, we include a case study in Figure 7. We observe that the rationales provided are generally coherent and of a similar structure to the few-shot rationales. We make the following two observations:

After training with STaR, we see the model was able to generate reasonable rationales that solve new problems, which explains part of the observed performance gain.

We also see that there were many instances in which STaR improved the quality of rationales over those generated in a few-shot manner.

Human Evaluation

Based on the observation that STaR may improve reasoning quality for problems even when they were initially answered correctly via few-shot prompting, we performed a preliminary qualitative analysis. We randomly selected 50 rationales generated from few-shot CoT and STaR-generated rationales on questions which they both answered correctly, as well as human-generated rationales for these problems from an existing crowdsourced explanations dataset. We then presented a random subset of 10 questions and rationales to each of 20 crowdworkers on Prolific with the rationales in a randomized order, asking them to rank the rationales based on which they felt best justified the answer. The participants were 30% more likely to rank the STaR-generated rationales higher than the few-shot rationales ( $p=.039$ ). This indicates that, as mentioned in the case study, STaR can improve the quality of rationale generation.

We also found that the participants were 74% more likely to prefer the STaR-generated rationales over the human-generated rationales ( $p<.001$ ). To be clear, we do not believe that this indicates human-level rationale-generation performance. Instead, we feel that it speaks to the difficulty of eliciting high-quality rationales. We reproduce the test prompts in Appendix C and elaborate on the limitations of the crowdsourced explanations dataset.

Failure Cases

Finally, we found a variety of interesting failure cases, many of which corresponded to standard logical fallacies. For example, the model often made statements related to the topic of the question but which were not actually arguments for why the answer should be true. Sometimes, the model claimed the question implied the answer as an argument, without explaining why. Other times, especially early in training, the model answered as if it had knowledge about a particular individual, instead of making a general statement, e.g. “the king’s castle is a place where he feels safe” instead of “castles are places where kings feel safe.” We provide examples and analyze errors in Appendix A.

Few-shot Prompt Training

Including few-shot prompts during fine-tuning appears to have a meaningful performance benefit (60.9% to 68.8% without rationalization, 69.9% to 72.5% with rationalization). Thus, we generally suggest its use for at least some portion of the training, though we discuss some caveats in Section 5.

Mathematical Reasoning in Language: Grade School Math

We again find on GSM8K that STaR substantially improves performance beyond few-shot with rationales or training to directly predict the answers (without rationales), as shown in Table 2; we include the few-shot prompt in Appendix I. We observe that on this task, the use of rationalization does not substantially improve performance. Note that, in training, it was necessary to cap the number of training steps at the 30th iteration (after 7912 steps), to prevent the training process from becoming prohibitively long. The results were reached after 36 iterations for STaR without rationalization and an additional 10 iterations with rationalization.

Most often, the number of calculation steps generated by the model matches the number of steps taken by humans (generally between 53% and 57% agreement across all iterations). We visualize this explicitly in Figure 6. We see that when the ground truth and model disagree on the number of calculation steps, the model typically uses fewer. Sometimes this is because the model skips steps, but occasionally it finds different solutions. We show an example in Appendix J, where the model disregards redundant information and solves a 7-step problem in a single step.

Discussion and Challenges

An essential question is exactly what role rationalization plays. Intuitively, rationalization allows a model to reverse-engineer a solution, or provides a heuristic for identifying whether each step makes the conclusion more likely. This parallels real-world problems where the final result is known, but challenging to derive a good justification. From a mathematical perspective, while rationale generation samples rationales from the distribution $p(r\mid x)$ provided by our model $M$ , rationalization conditions on the answer, letting us access an alternative distribution $p(r\mid x,y)$ which may be a better search space for rationales. Then rationalization could be framed as an off-policy estimate of the objective in Equation 1, sampling from the hint-augmented model as a proposal distribution. Future work should establish more connections between rationalization and these RL objectives, and examine more generally when and why rationalization improves learning.
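One hedged way to make the off-policy reading concrete is the standard importance-sampling identity. Here $q(r\mid x,y)$ is our own notation for the hint-augmented proposal distribution, $f(r)$ stands for any per-rationale quantity (such as the reward-weighted term inside the expectation of the objective), and we assume $q(r\mid x,y)>0$ wherever $p(r\mid x)f(r)\neq 0$ :

$$\mathbb{E}_{r\sim p(r\mid x)}\left[f(r)\right]=\mathbb{E}_{r\sim q(r\mid x,y)}\left[\frac{p(r\mid x)}{q(r\mid x,y)}\,f(r)\right].$$

STaR as described here applies no such weights: rationalized examples are added to the training set as if they had been sampled without the hint. The identity is therefore a lens for thinking about the connection, not a description of the procedure.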

In addition, due to the low sampling temperature, the outputs without rationalization correspond to the examples where the model is most confident in its answer. This results in these examples providing a weaker gradient signal than the rationalization examples, at least in the first iteration. Since we retrain from the initial pre-trained model every time we run a fine-tuning iteration, the degree of this effect is also difficult to measure directly. Finally, we must point out that the method to add the “hint” does not follow immediately from the question and answer and in some contexts providing it may be nontrivial. An exploration of the various impacts of different hinting techniques and their generality is an avenue for future work.

Temperature

One intuitive alternative to rationalization, if one seeks to expand the training dataset, is more and higher-temperature sampling. However, in practice, we found that this is counterproductive. In general, it substantially increases the likelihood of a correct answer despite incorrect reasoning, and training on bad or irrelevant reasoning prevents generalization. This is particularly clear in more structured tasks, like arithmetic, where the scratchpads that the model learns to produce with a higher-temperature sampling approach diverge into meaninglessness and cause the model to stagnate. Overall, we found that higher temperatures as an alternative to rationalization (e.g. $0.5$ or $0.7$ ) consistently led to models worse than models with reasoning alone. In addition, as text generation by large language models is sequential (i.e. one cannot produce a token without producing the preceding token), generating text is a bottleneck and this is computationally far less efficient than rationalization. For example, generating 10 sample outputs is approximately 10 times slower than generating one sample output. However, one potentially valuable way to leverage multiple samples would be to use the method proposed in wang2022selfconsist , using the majority-vote result of multiple high-temperature scratchpads as a ground truth against which we compare a low-temperature scratchpad. This may allow one to apply STaR to a dataset of only questions, without answers.
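The majority-vote idea in the last two sentences could be sketched as follows. This is a speculative illustration of a direction the text only proposes: the function names and the `min_agreement` threshold are hypothetical, and the answers are assumed to have already been extracted from the sampled scratchpads.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_label(sampled_answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples agreeing."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    return answer, votes / len(sampled_answers)


def keep_rationale(low_temp_answer: str,
                   sampled_answers: List[str],
                   min_agreement: float = 0.5) -> bool:
    """Keep a low-temperature rationale if its answer matches the vote.

    min_agreement is a hypothetical guard against weak majorities.
    """
    label, agreement = majority_vote_label(sampled_answers)
    return agreement >= min_agreement and low_temp_answer == label


high_temp_answers = ["b", "a", "b", "c"]
pseudo_label, agreement = majority_vote_label(high_temp_answers)
```

The pseudo-label would then play the role of $y_{i}$ in the correctness filter, at the cost of the extra sampling discussed above.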

Few-shot Prompting

A noteworthy phenomenon is that the inclusion of few-shot prompting during sampling seems to dramatically reduce “drift,” where later rationales become increasingly dissimilar from the initial few-shot set of rationales. Omitting the prompts, and thereby permitting drift, has one benefit: the model may be less constrained by the quality and difficulty of the initial rationales, theoretically allowing it to generalize more. One potentially negative consequence is that the style of the rationales may less closely match the original prompting style. Another benefit is computational: a shorter prompt allows for a shorter sequence length when sampling. Technically, the point in training at which we “disable” few-shot prompts is another hyperparameter which we could tune, but we leave this to future work. In addition, when prompts are left out after the initial outer-loop iteration, the model tends to perform gradually worse at rationalization as it trains for longer periods of time. As a result, it may be necessary to include some hints during training for long periods of time with this approach.

Ultimately, the choice to include few-shot prompts in later iterations of training appears to depend on the use-case: when the goal is consistent adherence to a particular prompt style, which may benefit explainability, include few-shot prompts in sampling; when the goal is a faster training loop, one may remove them. Moreover, it is possible that with other datasets or larger models there is an impact on performance, so we encourage this to be generally treated as a hyperparameter.

Conclusion

We present the Self-Taught Reasoner (STaR), which iteratively improves a model’s ability to generate rationales to solve problems. We few-shot prompt a model to solve many problems in a step-by-step manner by generating rationales, and then prompt it to rationalize the correct answer for problems it gets wrong. We finetune on both the initially correct solutions and rationalized correct solutions, and repeat the process. We find that this technique significantly improves the model’s generalization performance on both symbolic reasoning and natural language reasoning.

There are several important limitations on STaR as presented. In order for the first iteration of STaR to succeed, few-shot performance must be above chance, implying that the initial model must be big enough to have some reasoning capabilities. For instance, we found that GPT-2 was not able to bootstrap from few-shot reasoning even in the arithmetic domain. A further limitation is that settings with a high level of chance performance (e.g. binary decisions) yield many poor rationales, confounding the STaR approach. An open problem is how to filter bad reasoning in these settings.

Nonetheless, we believe using examples without reasoning to bootstrap reasoning is a very general approach, and that STaR can serve as the basis of more sophisticated techniques across many domains.

Acknowledgements

We thank Imanol Schlag for his detailed feedback about this work, as well as Rose E Wang, Markus Rabe, Aitor Lewkowycz, Rishi Bommasani, Allen Nie, Alex Tamkin, and Qian Huang. We thank Cem Anil for his very helpful insight that rationale finetuning performance can be improved if the training includes the few-shot rationales. We also thank Ben Prystawski for his suggestions on survey creation. We thank Google TPU Research Cloud for TPU access.

References

Appendix

Appendix A CommonsenseQA Error Patterns

Throughout our experiments, we came across a variety of interesting failure cases for commonsense reasoning. Note that all the final answers are correct – however, we take issue with the reasoning used in order to arrive at those answers.

One key failure case was answers in the form of “the answer must be something that is <question property>. <answer> is <question property>. Therefore, the correct answer is <answer>.” In these cases, the model fails to explain why the answer that it has chosen satisfies the question property.

These rationales, while perhaps useful to the model, read to us as opaque and unexplanatory.

A.2 Begging the Question

A related but stronger version of the previous failure case, while less common, is particularly uninsightful. Sometimes the model will imply the answer that it has chosen in its question.

A.3 Exercise to the Reader

A rare failure case is when the model finds it unnecessary to justify its answer. For example:

A.4 World State Assertions

Sometimes, the model will assume that it knows something about a subject or a person whose name was used as a variable. This leads to somewhat comical examples of reasoning. Part of the reason for this is that generally, there is an expectation that good rationales will leverage understanding of more general classes of objects and appeal to the relationship between those general classes and the particular instance. For example, the argument that “a person would typically feel exhilaration from heights” is generally more compelling than the argument that “James would feel exhilaration from heights.”

A.5 Red Herrings

Some errors in reasoning corresponded to the model making a statement which, while technically true, is not useful in demonstrating the claim.

A.6 Hint Short-cutting

In the experiments where the model was shown some examples of “hints” during training, in order to prevent it from losing the ability to perform rationalization over time, the model appeared to pick up on the fact that the final answer would always correspond to the hinted answer. This led to answers such as

Appendix B Modified CQA Prompts

For reference, we include our modified prompts based closely on those in wei2022chain.

Appendix C Human-evaluated Test Prompts

We also selected a random sampling of 50 questions which were correctly answered both few-shot and by a STaR-trained model (without rationalization), as discussed in Section 4.4. With the rationales presented in a random order, twenty crowdworkers preferred the STaR-generated answers. We reproduce the examples here with the few-shot rationale first, the STaR-trained rationale second, and the human rationale third, though these were shuffled when presented to participants. We selected human answers from ’s original split rationales where possible, finding that duplicate rationales were much more common in the new split rationales. For example, the explanation “Rivers flow trough valleys,” appeared over 400 times verbatim in the new split dataset, and “This word was most relevant” appeared over 150 times. ’s dataset also includes explanations like “The only answer that makes sense” or “BOB WILL NOT POKEMON CARDS WERE COMMON AND WORTHLESS BUT WRONG ABOUT THEM SO FEEL REALLY RARE TO DELAY” or restatements of the answer. We append the phrase “Therefore, the answer is ANSWERTEXT (ANSWERLETTER)” with ANSWERTEXT replaced by the correct answer’s text and ANSWERLETTER replaced by the correct answer letter. This is done 1) to make it less obvious that one of the answers is generated by a different source and 2) to prioritize differences in rationales, not the answer format.
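The suffix normalization described above can be sketched as follows. The helper name is hypothetical, and the trailing period and whitespace handling are assumptions of this sketch rather than details specified above.

```python
def append_answer_suffix(rationale: str, answer_text: str, answer_letter: str) -> str:
    """Append the shared answer phrase to a rationale (hypothetical helper).

    The template follows the phrase quoted in the text; the final period is
    an assumption, chosen to match the style of the generated rationales.
    """
    suffix = f"Therefore, the answer is {answer_text} ({answer_letter})."
    return f"{rationale.rstrip()} {suffix}"
```

Applying the same suffix to every rationale means that all candidates end identically, so participants compare the reasoning rather than the answer format.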

Before the questions and after the consent form, we presented the following directions:

The examples were subsampled and presented to the crowdworkers:

Appendix D Example Rationalizations on CQA

We include a randomly sampled set of rationalizations which the model is able to produce before fine-tuning. We observe that sometimes, the model constructs an argument roughly of the form “the answer must have a set of properties. correct answer has those properties. therefore, the answer is correct answer.” This structure of argument is fairly standard, but given that the model originally answered those questions incorrectly, it resembles template-matching more than reasoning. The technique of rephrasing the question in a simpler way before answering it directly is also often effective. In general, the rationales from the pretrained model’s rationalization appear to be of similar quality to those from the pretrained model’s rationale generation. We reference the pretrained model here rather than the STaR-trained model, as the final iteration’s rationales are the ones which the model struggled to learn from – that is, if the rationales were good, then we might expect that the model would have already incorporated them into its rationale generation.

Appendix E STaR Without Rationalization (Rationale Generation Bootstrapping)

For convenience, we include the rationale generation bootstrapping algorithm alone here. However, it is simply Algorithm 1, with the blue components removed.
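As a rough illustration, the outer loop without rationalization can be sketched as below. This is a minimal sketch and not a reproduction of Algorithm 1: `generate` and `finetune` are hypothetical callables standing in for few-shot sampling and fine-tuning, and restarting fine-tuning from the base model at each iteration is an assumption of this sketch.

```python
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[str, str]      # (question, ground-truth answer)
Sample = Tuple[str, str, str]  # (question, rationale, answer)


def star_without_rationalization(
    base_model: Any,
    dataset: Sequence[Example],
    generate: Callable[[Any, str], Tuple[str, str]],
    finetune: Callable[[Any, List[Sample]], Any],
    num_iterations: int,
) -> Any:
    """Bootstrap rationale generation by filtering on answer correctness."""
    model = base_model
    for _ in range(num_iterations):
        kept: List[Sample] = []
        for question, answer in dataset:
            # Sample a rationale and a final answer from the current model.
            rationale, predicted = generate(model, question)
            # Keep only rationales that lead to the ground-truth answer.
            if predicted == answer:
                kept.append((question, rationale, answer))
        # Assumption: fine-tune from the base model on the filtered samples.
        model = finetune(base_model, kept)
    return model
```

Note that a problem the model never answers correctly contributes no training signal in this variant, which is the gap that rationalization is meant to address.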

Appendix F Case Study

Appendix G Bias and Faithfulness

It is important to note that STaR is designed to amplify the reasoning that leads to correct solutions on a given dataset. The implication of this with a dataset like CommonsenseQA is that if biases are “useful” in solving the dataset then they will be amplified. This is made worse by rationalization, as biased answers that the model may not naturally arrive at are, in a sense, pulled out of the model. The exact interaction between the bias in a dataset and the pre-existing bias learned by the model is unclear, and is something to consider before real-world deployment of large language models in general, and of STaR-trained models in particular.

We find some encouraging initial results on this, however: for questions where gender is not relevant, the model appears to disregard it in its explanation. For example: “Q: Where is a good place for a woman to store her sunglasses? $\rightarrow$ A: The answer must be a place where sunglasses are stored. Sunglasses are stored in purses. Therefore, the answer is purse (e).” or “Q: The only baggage the woman checked was a drawstring bag, where was she heading with it? $\rightarrow$ A: The answer must be a place where a drawstring bag is checked. The answer is airport (e).” We believe this question warrants a much more comprehensive study.

One important challenge with models which seek to be interpretable or provide explanations for their reasoning is that of faithfulness. While our primary emphasis is not on the explainability benefits that STaR may bring, evaluating whether explanations and rationales are faithful remains a fundamental challenge. Faithful explanations have been described as those which “accurately [represent] the reasoning process behind the model’s prediction.” While STaR encourages the use of reasoning in rationales which leads the model to correct answers, it is difficult, if not impossible, to ensure that the rationales reflect the model’s internal processing. For example, it is straightforward to imagine the model implicitly selecting a particular answer immediately and then generating a rationale to justify that selected answer. This would allow a model to generate unbiased rationales while selecting answers in a biased way.

The fact that our model outperforms one fine-tuned to directly predict the answers, together with ablation studies from papers such as wei2022chain, makes it clear that generating a rationale before producing an answer non-trivially improves the model’s answer quality. However, it is difficult to evaluate the degree to which any particular answer’s rationale is faithful. We note that these problems are not unique to STaR, but are symptomatic of the difficulty of understanding large language models and, in particular, the rationales they generate.

Appendix H Hyperparameters

GPT-J is a 28-layer decoder-only transformer, with an embedding size of 4096, 16 attention heads of dimension 256, and an FFN hidden layer of size 16384. It was pre-trained on the Pile, with a vocabulary size of 50.4K.

In general, unless otherwise stated, we use a batch size of 8 sequences, each of length 1024. We also use packing, namely, packing the shorter examples to form longer sequences (up to length 1024) to improve TPU utilization. We do not use weight decay, and we train and sample on a single TPU-v3 node. We performed a hyperparameter search over learning rates from $10^{-7}$ to $10^{-4}$ using the Adam optimizer. We found that $10^{-6}$ was consistently the best-performing learning rate.
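For readers unfamiliar with packing, the idea can be sketched with a simple greedy routine over tokenized examples. This is an illustrative sketch only: the function name, the separator token id, and the truncation of over-long examples are assumptions, and padding of the final partial sequence is omitted.

```python
from typing import List, Sequence


def pack_sequences(
    examples: Sequence[Sequence[int]],
    max_len: int = 1024,
    sep_token: int = 0,
) -> List[List[int]]:
    """Greedily concatenate tokenized examples into sequences of at most max_len.

    Each example is followed by sep_token (an assumed separator id). Examples
    longer than max_len - 1 tokens are truncated so that they still fit.
    """
    packed: List[List[int]] = []
    current: List[int] = []
    for tokens in examples:
        piece = list(tokens[: max_len - 1]) + [sep_token]
        # Start a new sequence when the next example would overflow this one.
        if current and len(current) + len(piece) > max_len:
            packed.append(current)
            current = []
        current.extend(piece)
    if current:
        packed.append(current)
    return packed
```

Because short examples share a sequence instead of each occupying a mostly padded one, fewer sequences are needed per epoch for the same data.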

Appendix I GSM8K Few-shot Prompt

We include the following few-shot prompts for GSM8K, based on the examples in cobbe2021training.

Appendix J STaR GSM8K Solutions

We observe some interesting patterns with the GSM8K solutions proposed by the STaR-trained model. Typically, when the solution takes substantially fewer calculation steps than the ground truth, it corresponds to an instance where the model accidentally answered the question correctly despite mistakes in its reasoning. In some cases, however, the model produces simpler solutions than those in the ground truth. One example is shown in Figure 8.
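One way to surface such cases for manual inspection is to compare the number of calculation steps in a model solution against the ground truth. The sketch below counts GSM8K-style calculator annotations of the form `<<48/2=24>>`. It assumes that model solutions use the same annotation format, and the helper names and the `ratio` threshold for “substantially fewer” are hypothetical choices.

```python
import re

# Matches GSM8K-style calculator annotations such as <<48/2=24>>.
_CALC_ANNOTATION = re.compile(r"<<[^<>]*>>")


def count_calc_steps(solution: str) -> int:
    """Count calculator annotations in a solution string."""
    return len(_CALC_ANNOTATION.findall(solution))


def has_far_fewer_steps(
    model_solution: str, reference_solution: str, ratio: float = 0.5
) -> bool:
    """Flag solutions with at most ratio times the reference step count.

    The threshold is an assumed heuristic; flagged solutions still need to be
    read by hand to tell a lucky answer from a genuinely simpler solution.
    """
    reference_steps = count_calc_steps(reference_solution)
    if reference_steps == 0:
        return False
    return count_calc_steps(model_solution) <= ratio * reference_steps
```

A flag from this heuristic only indicates where to look: it cannot distinguish a flawed shortcut from a valid simplification.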