Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Sam McCandlish, Sheer El Showk, Tamera Lanham, Tim Maxwell, Venkatesa Chandrasekaran, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez

cs.CL cs.AI cs.LG

Introduction

Large language models (LLMs) are operating in increasingly challenging domains, ranging from programming assistance (Chen et al., 2021) to open-ended internet research (Nakano et al., 2021) and scientific writing (Taylor et al., 2022). However, verifying model behavior for safety and correctness becomes increasingly difficult as the difficulty of tasks increases. To make model behavior easier to check, one promising approach is to prompt LLMs to produce step-by-step “Chain-of-Thought” (CoT) reasoning explaining the process by which they produce their final output (Wei et al., 2022); the process used to produce an output is often easier to evaluate than the output itself (Lightman et al., 2023).

This approach relies on the assumption that the model’s CoT reasoning faithfully explains the model’s actual process for producing its output, which has recently been called into question (Turpin et al., 2023; Lanham et al., 2023). Turpin et al. (2023) find that LLMs generate CoT reasoning to justify answers that are biased against certain demographic groups, without explicitly mentioning such biases in the stated reasoning (“biased reasoning”). Lanham et al. (2023) find that LLM answers to questions often remain unchanged despite truncating or adding mistakes to the CoT reasoning (“ignored reasoning”). Such results cast doubt on our ability to verify the correctness and safety of a model’s process for solving tasks.

Here, we aim to explore whether there are more effective methods than CoT for eliciting faithful reasoning from LLMs. We focus on two alternative methods, which prompt LLMs to answer questions by decomposing them into easier subquestions, then using the resulting subanswers to answer the original question (Geva et al., 2021; Patel et al., 2022). We show these methods in Figure 2. Factored decomposition uses multiple contexts to answer subquestions independently, before recomposing the resulting subanswers into a final answer. Factored decomposition may improve faithfulness by reducing biased reasoning (how much LLMs rely on unverbalized biases); each subquestion is answered in a separate context and will not be impacted by potential sources of biases from the original question-answering context (e.g., demographic information in the question). Factored decomposition may reduce the amount of ignored reasoning, e.g., because it often clearly specifies the relationship between the answers to subquestions and the follow-up subquestions, as well as the final answer. Chain-of-Thought decomposition (CoT decomposition) is an intermediate between CoT and factored decomposition. It enforces a subquestion and subanswer format for the model-generated reasoning (like factored decomposition) but uses one context to generate subquestions, answer subquestions, and answer the original question (like CoT). CoT decomposition may obtain some of the faithfulness benefits of factored decomposition by producing answers in a similar way, while including more context to the model when it answers subquestions (improving performance).

As shown in Fig. 1, decomposition-based methods obtain good performance on the question-answering tasks we evaluate, while improving over the faithfulness of CoT according to metrics from Turpin et al. (2023) and Lanham et al. (2023). Factored decomposition shows a large improvement in faithfulness relative to CoT, at some cost to performance, while CoT decomposition achieves some faithfulness improvement over CoT while maintaining similar performance. We measure the amount of unfaithful, ignored reasoning following Lanham et al. (2023), evaluating how often the model’s final answer changes when we perturb the model’s stated reasoning by truncating it or adding LLM-generated mistakes; as shown in Table 1, decomposition-based methods tend to change their answer more often, suggesting they condition more on the stated reasoning when predicting their final answer. We measure the amount of unfaithful, biased reasoning following Turpin et al. (2023), testing the extent to which methods are influenced by biasing features in the input (such as suggested answers from the user), while not verbalizing the use of those biases; as shown in Table 1, factored decomposition greatly reduces the amount of unfaithful, biased reasoning from LLMs. Our results indicate that decomposing questions into subquestions is helpful for eliciting faithful reasoning from LLMs. More generally, our findings suggest that it is possible to make progress on improving the faithfulness of step-by-step reasoning. We hope that further progress leads to LLM-generated reasoning that accurately represents an LLM’s process for solving a task, enabling us to be confident in the trustworthiness of the answers provided by LLMs.

Methods

We evaluate ways to prompt LLMs to answer questions by using model-generated reasoning. We assume access to an instruction-following LLM that we can autoregressively sample from. Our goal is to assess whether we can prompt our model to provide the correct answer $a$ to a question $q$ after generating a faithful reasoning sample $x$ . The reasoning sample can be broken down into discrete steps (e.g., sentences): $x=[x_{1},x_{2},\dots,x_{n}]$ . Each method we study generates a reasoning sample $x$ for a question $q$ . We evaluate both if the answer the model produces after being prompted with $q$ and $x$ is correct and if $x$ is faithful and thus reflective of the model’s actual reasoning. We evaluate the faithfulness of $x$ using metrics that assess the presence of properties we expect faithful reasoning to possess.

We prompt the model with a question $q$ and additionally prompt it to reason step-by-step, using examples combined with a simple instruction (Kojima et al., 2022; Nye et al., 2021; Wei et al., 2022; Reynolds & McDonell, 2021). By sampling from the model, we can extract a reasoning sample $x$ that is comprised of individual steps. We refer to $x$ in this setting as a Chain of Thought or CoT.

LLMs can generate CoT reasoning that is significantly impacted by biasing features in the context (Turpin et al., 2023), such as the user suggesting an incorrect answer to a multiple-choice question. Lanham et al. (2023) show that CoT reasoning can be ignored by the model when producing a final answer, showing that a model may not change its answer if it receives a truncated or corrupted version of its CoT reasoning. These are reasons to suspect that CoT reasoning is much closer to biased reasoning than a faithful externalization of the model’s actual reasoning, at least in some settings.

2 Factored decomposition

There are three stages to this approach: decomposition, subquestion-answering, and recomposition. During the decomposition stage, we prompt the model with a question $q$ and instruct it to generate an initial list of subquestions to answer. We call this initial list $l_{1}=[q_{1,1},q_{1,2},\dots]$ . Each subquestion in $l_{1}$ may contain a reference to the answers of other subquestions in $l_{1}$ . We next use the model to answer all subquestions which do not reference any other subquestions, as part of the subquestion-answering stage. We do this by prompting the model with each subquestion $q_{1,i}$ in an isolated context and asking it to generate a subanswer $a_{1,i}$ . We then pass those subanswers to the model in the form of a list $a_{1}=[a_{1,1},a_{1,2}\dots]$ , which the model can now condition on. Then, the model updates the running list of unanswered subquestions with a new set of unanswered subquestions $l_{2}=[q_{2,1},q_{2,2},\dots]$ . The model produces $l_{2}$ by copying, removing, or editing (by replacing references with subanswers) subquestions from $l_{1}$ . The model alternates updating the running list of subquestions (decomposition) and answering subquestions (subquestion-answering) until the model generates a predetermined output to indicate that it has the information it needs to answer the original question. At this point, we collect all answered subquestions and their respective subanswers into a reasoning sample $x$ , where each $x_{i}$ is a tuple of subquestion and subanswer $(q_{i},a_{i})$ . The final stage, recomposition, happens when we prompt the model to answer the question using $x$ .
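The three stages described above can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the callables `update_list`, `answer_in_isolation`, `recompose`, and `has_reference` are hypothetical stand-ins for prompted model calls, returning `None` is a hypothetical stand-in for the "predetermined output", and the `[answer 1]` reference syntax and the toy arithmetic question are invented for the example.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (subquestion, subanswer)


def factored_decomposition(
    question: str,
    update_list: Callable[[str, List[Pair], List[str]], Optional[List[str]]],
    answer_in_isolation: Callable[[str], str],
    recompose: Callable[[str, List[Pair]], str],
    has_reference: Callable[[str], bool],
    max_rounds: int = 10,
) -> Tuple[str, List[Pair]]:
    answered: List[Pair] = []
    # Decomposition: ask for the initial list l_1 of subquestions.
    pending = update_list(question, answered, [])
    for _ in range(max_rounds):
        if pending is None:  # hypothetical "done" signal
            break
        ready = [q for q in pending if not has_reference(q)]
        if not ready:  # nothing answerable; stop instead of looping forever
            break
        # Subquestion-answering: each call sees only the subquestion itself.
        answered.extend((q, answer_in_isolation(q)) for q in ready)
        waiting = [q for q in pending if has_reference(q)]
        # Decomposition again: the model copies, removes, or edits subquestions.
        pending = update_list(question, answered, waiting)
    # Recomposition: answer the original question from the collected pairs x.
    return recompose(question, answered), answered


# Toy stand-ins for the model calls (hypothetical, for illustration only).
FACTS = {"What is 3 + 4?": "7", "Is 7 greater than 5?": "yes"}


def toy_update(question, answered, waiting):
    if not answered:
        return ["What is 3 + 4?", "Is [answer 1] greater than 5?"]
    if waiting:
        return [q.replace("[answer 1]", answered[0][1]) for q in waiting]
    return None


final, x = factored_decomposition(
    "Is 3 + 4 greater than 5?",
    toy_update,
    FACTS.__getitem__,
    lambda q, pairs: pairs[-1][1],
    lambda q: "[answer" in q,
)
```

Because `answer_in_isolation` receives only the subquestion string, nothing from the original question's context can reach the subanswer unless the subquestion itself carries it, which is the property the method relies on.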

Our hypothesis is that factored decomposition partially mitigates the lack of faithfulness observed in CoT reasoning. We expect a reduction in biased reasoning because each subquestion $q_{i}$ is answered in an independent context from all other subquestions and the original question $q$ . As a result, biasing features in the input are less influential on the generated reasoning, so long as the subquestions do not contain the biasing features. We also expect a reduction in ignored reasoning, because the answers to earlier subquestions often have a clearly specified relationship to later subquestions that get asked (e.g., if those subquestions explicitly copy from the answers from earlier subquestions). Similarly, the answers to all subquestions may have a clearly specified or implied relationship to the final answer. At the final step, where the model uses the collected reasoning sample to answer the question, the model can potentially still ignore subquestions and subanswers that do not fit its biases, but we expect this effect to be more limited than if the reasoning sample itself contained biased reasoning.

3 CoT decomposition

We prompt the model with a question $q$ and instruct it to decompose the question into subquestions and answer the subquestions iteratively. The model generates one subquestion at a time, immediately generates a subanswer for that subquestion, and then continues generating until the model generates a predetermined output indicating that it is done decomposing $q$ . Sampling from the model thus allows us to extract a reasoning sample $x$ that is comprised of individual subquestion and subanswer pairs, meaning each $x_{i}\in x$ is a tuple $(q_{i},a_{i})$ . We refer to $x$ in this setting as a Chain-of-Thought decomposition (CoT decomposition).
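To make the structure of $x$ concrete, the sketch below splits a single sampled completion into $(q_{i},a_{i})$ tuples. The `Subquestion:` and `Subanswer:` labels and the `<DONE>` marker are hypothetical; the text above only says that the model emits a predetermined output when it is done, so the real format may differ.

```python
import re
from typing import List, Tuple

DONE_MARKER = "<DONE>"  # hypothetical stand-in for the predetermined output

# One (subquestion, subanswer) pair, ending at the next pair or end of text.
PAIR_PATTERN = re.compile(
    r"Subquestion:\s*(.*?)\s*Subanswer:\s*(.*?)\s*(?=Subquestion:|\Z)",
    re.DOTALL,
)


def parse_cot_decomposition(sample: str) -> List[Tuple[str, str]]:
    # Keep only the reasoning that precedes the "done" marker.
    body = sample.split(DONE_MARKER, 1)[0]
    return [(q, a) for q, a in PAIR_PATTERN.findall(body)]


sample = (
    "Subquestion: What is 3 + 4?\n"
    "Subanswer: 7\n"
    "Subquestion: Is 7 greater than 5?\n"
    "Subanswer: yes\n"
    "<DONE>\n"
    "Final answer: yes"
)
pairs = parse_cot_decomposition(sample)
```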

CoT decomposition is an intermediate method between CoT prompting and factored decomposition. The reasoning sample $x$ is still generated from the model in one autoregressive sampling call, like CoT and unlike factored decomposition. However, $x$ is structured as a sequence of subquestion and subanswer pairs, like factored decomposition and unlike CoT. CoT decomposition may mitigate biased reasoning, because it may be harder for the model to generate a biased set of subquestions and subanswers than a biased sequence of reasoning steps. CoT decomposition may also answer subquestions in a similar, less biased way, as in factored decomposition, if the subquestions do not contain biasing features. CoT decomposition may mitigate ignored reasoning for similar reasons to factored decomposition, i.e., since there is often a clear relationship between answers to earlier subquestions and later subquestions, as well as the final answer.

4 Implementation

For all experiments, we use a pretrained LLM that has been fine-tuned for helpfulness with reinforcement learning from human feedback (RLHF; Bai et al., 2022), using the same base model as Claude 1.3 (Anthropic, 2023). We use nucleus sampling (Holtzman et al., 2020) with top $p=0.95$ and temperature $0.8$ , following Lanham et al. (2023). We also use best-of-N (Nakano et al., 2021; Lightman et al., 2023) sampling with $N=5$ , using the same preference model (PM) that was used for RLHF training of the LLM to score samples.
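As a rough illustration of the two sampling settings above, the sketch below shows nucleus filtering over a toy next-token distribution and best-of-N selection with a scoring function. It is a simplified sketch, not the sampling code used in the experiments: applying temperature before the top-p cutoff is an assumption (implementations differ on the order), and `sample_fn` and `score_fn` are hypothetical stand-ins for the LLM sampler and the preference model.

```python
import math
from typing import Callable, Dict


def nucleus_filter(
    logits: Dict[str, float], top_p: float = 0.95, temperature: float = 0.8
) -> Dict[str, float]:
    """Renormalized probabilities over the smallest set of tokens whose
    cumulative probability reaches top_p."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    kept_mass = sum(prob for _, prob in kept)
    return {tok: prob / kept_mass for tok, prob in kept}


def best_of_n(
    sample_fn: Callable[[], str], score_fn: Callable[[str], float], n: int = 5
) -> str:
    """Draw n samples and keep the one the scorer rates highest."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=score_fn)


# Toy usage: one dominant token survives the cutoff on its own.
peaked = nucleus_filter({"a": 5.0, "b": 1.0, "c": -10.0})
# Four equally likely tokens with top_p=0.5 keep exactly two of them.
flat = nucleus_filter({"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0}, top_p=0.5)
# Length stands in for a preference-model score here.
drafts = iter(["a", "bbb", "cc", "d", "ee"])
chosen = best_of_n(lambda: next(drafts), score_fn=len, n=5)
```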

We evaluate all prompting strategies for performance and faithfulness on four different multiple-choice question-answering tasks:

HotpotQA (Yang et al., 2018): Multi-hop questions, or questions that require multiple steps of reasoning to answer, e.g. “Did LostAlone and Guster have the same number of members?” We filtered this to only questions with binary (yes/no) answers since the remaining questions would not be easily amenable to a multiple-choice format.

StrategyQA (Geva et al., 2021): Open-domain questions where reasoning steps can be inferred from the structure of the question and are thus decomposable.

OpenBookQA (Mihaylov et al., 2018): Elementary-level science questions.

TruthfulQA (Lin et al., 2022): Questions that humans will often answer incorrectly because of common misconceptions. We use a version of TruthfulQA that has been formatted for multiple-choice evaluation.

We evaluate our methods on HotpotQA and StrategyQA because these tasks are well-suited to step-by-step reasoning or question decomposition. We additionally chose OpenBookQA and TruthfulQA to assess our methods on other kinds of questions. We evaluate the prompting strategies using 300 randomly sampled questions from each task’s test set, for a total of 1200 questions.

We evaluate five prompting strategies: zero-shot prompting, few-shot prompting, CoT prompting, CoT decomposition, and factored decomposition (Tables 2 and 3). Each dialog begins with an token and includes two newlines before each dialog turn. For all prompts involving few-shot examples, we format the few-shot examples identically to the format that we expect the model to follow when generating reasoning and providing a final answer. The questions we use for all of the few-shot examples are initially chosen for the factored decomposition few-shot prompt. We use the same set of 14 questions for all other prompting methods that require few-shot examples (all methods except zero-shot). We construct the prompt iteratively, starting from an initial set of simple, hand-crafted examples. We gradually expand the set of questions, pulling in questions from the training sets of the tasks we evaluate, trying to ensure question diversity, and patching various failure modes observed qualitatively in the generated reasoning samples, e.g., the model failing to phrase subquestions such that they can be answered in an isolated context. For prompting strategies that elicit reasoning samples from the model, we include high-quality reasoning samples as part of the few-shot examples, either resampling from a model multiple times until the reasoning is valid or manually editing intermediate steps. We share the instructions and the first several few-shot examples for each prompt in Appendix C; the complete prompts can be viewed at this supplementary repository.

Results

Having introduced the three model-generated reasoning methods we study, CoT prompting, CoT decomposition, and factored decomposition, we now evaluate the three methods on question-answering performance and a battery of reasoning faithfulness metrics, adapting evaluations proposed in Lanham et al. (2023) and Turpin et al. (2023).

Table 4 compares the accuracy of various methods on the evaluations we study. We view few-shot prompting (rather than zero-shot prompting) as the most relevant baseline for reasoning-generating methods since all reasoning-generating methods contain few-shot examples with high-quality reasoning demonstrations. CoT prompting outperforms both decomposition methods in terms of question-answering performance. CoT decomposition is overall competitive with CoT prompting, only underperforming it by 0.4% (absolute) on average, and factored decomposition outperforms the few-shot and zero-shot prompting baselines by 2.1% and 9.0% on average, respectively. We observe the largest gains for all reasoning-generating methods over baselines on HotpotQA and StrategyQA, the two tasks most suited to step-by-step reasoning or question decomposition. For example, on HotpotQA we observe zero-shot and few-shot performance at 77.0% accuracy, whereas factored decomposition achieves 83.0%, CoT decomposition achieves 86.7%, and CoT achieves 87.3%. Ranking methods by per-task accuracies, we find a fairly consistent ordering: CoT, CoT decomposition, factored decomposition, few-shot prompting, and zero-shot prompting.

2 Faithfulness Measured via Reasoning Perturbation

A method to assess reasoning faithfulness is to perturb the reasoning that the model conditions on before producing a final answer. If the model gives a different answer with the altered form of the reasoning, the change in the final answer indicates that the model is not ignoring the reasoning when answering the question, suggesting greater faithfulness. We study two kinds of perturbation, truncation and corruption, on model-generated reasoning by adapting two metrics from Lanham et al. (2023).

In this set of experiments, we truncate reasoning samples and evaluate how much of an average reasoning sample a model needs to reach the final answer it would give with the full reasoning sample. We compare the different prompting methods by this metric, plotting the percentage of final answers that a model is able to reach across the average percentage of reasoning provided. We expect methods that generate more faithful reasoning to require larger amounts of reasoning to reach the same final answer since this indicates that the model is relying more on the reasoning for its final answer.

We take a completed reasoning sample $x$ and truncate it at each intermediate step, generating the empty sample $[]$ , then $[x_{1}]$ , and so on. For each truncated reasoning sample, the truncated reasoning replaces the original reasoning, with no additional sampling, in the prompting templates shown above. The model is then prompted to answer the question as before and we evaluate whether the model reaches the same final answer it did with the original reasoning. We analyze how the answer the model reaches varies across different truncations of the reasoning, where truncations that include greater percentages of reasoning should be more likely to result in the same final answer as the original reasoning.
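The truncation procedure above can be sketched as follows. This is a minimal sketch under stated assumptions: `answer_fn` is a hypothetical stand-in for prompting the model with the question and a (possibly truncated) reasoning sample, and the full sample is included as a trivial final point for reference.

```python
from typing import Callable, List, Sequence, Tuple


def truncations(x: Sequence[str]) -> List[List[str]]:
    """[], [x_1], [x_1, x_2], ..., up to the full sample."""
    return [list(x[:k]) for k in range(len(x) + 1)]


def same_answer_curve(
    question: str,
    x: Sequence[str],
    answer_fn: Callable[[str, List[str]], str],
) -> List[Tuple[float, bool]]:
    """For each truncation, pair the fraction of reasoning provided with
    whether the final answer matches the answer for the full reasoning."""
    original = answer_fn(question, list(x))
    curve = []
    for k, truncated in enumerate(truncations(x)):
        fraction = k / len(x) if x else 1.0
        curve.append((fraction, answer_fn(question, truncated) == original))
    return curve


# Toy stand-in model: answers "yes" only once step "B" is visible.
def toy_answer(question: str, reasoning: List[str]) -> str:
    return "yes" if "B" in reasoning else "no"


curve = same_answer_curve("toy question", ["A", "B", "C"], toy_answer)
matches = [same for _, same in curve]
```

Averaging the boolean column over many reasoning samples at each fraction gives a curve of the kind described in the text; a curve that reaches high values only late indicates the answer depends on more of the reasoning.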

Our findings are summarized in Figure 3. For CoT prompting and CoT decomposition, we observe that the curves have fairly gentle slopes and reach high values early in an average reasoning sample. This suggests the model requires relatively little of a CoT or CoT decomposition reasoning sample to reach its final answer and thus may not be fully relying on those reasoning samples. For factored decomposition, we observe the model requires a larger amount of its reasoning to consistently reach the same answer, indicating the model relies on more of its reasoning when answering the question. Our results are presented in a different form than the analogous results from Lanham et al. (2023), since we average our results across all reasoning samples, even if they differ in length or task. We show more detailed results, broken down by task, in Appendix A.1.

2.2 Adding Mistakes

In this set of experiments, we corrupt reasoning samples and evaluate how much this causes the model to change its final answers. We compare the different prompting methods by this metric, plotting the percentage of final answers that are changed if a model’s reasoning sample is corrupted. We expect methods that generate more faithful reasoning to have more final answers changed since this indicates that the reasoning is playing a causal role in the model’s final answer and is thus more likely to be reflective of the model’s actual reasoning.

We take a completed reasoning sample $x$ and prompt the same language model in a different context to modify step $x_{i}$ by adding a mistake to it and creating the corrupted step $x_{i}^{\prime}$ . The prompts for this are included in Appendix E. We prompt the model to regenerate the rest of the reasoning from that point onward, i.e. we prompt the model with $[x_{1},x_{2},\dots,x_{i}^{\prime}]$ and ask it to generate the corrupted reasoning $[x_{1},x_{2},x_{3},\dots,x_{i}^{\prime},x_{i+1}^{\prime},\dots,x_{n}^{\prime}]$ . We manually replace the original reasoning with the corrupted reasoning before prompting the model to answer the original question. We repeat this for three random and distinct selections of $x_{i}$ for each reasoning sample. We evaluate whether the model reaches the same final answer it did with the original reasoning. Examples of corrupted reasoning are also presented in Appendix E.
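The corruption procedure above can be sketched as follows. It is an illustrative sketch only: `add_mistake`, `regenerate_rest`, and `answer_fn` are hypothetical stand-ins for the separate model calls, and the toy "counting" reasoning is invented so that the example runs without a model.

```python
import random
from typing import Callable, List, Sequence


def corrupt_reasoning(
    x: Sequence[str],
    i: int,
    add_mistake: Callable[[str], str],
    regenerate_rest: Callable[[List[str]], List[str]],
) -> List[str]:
    """Keep x_1..x_{i-1}, corrupt x_i, then regenerate the remaining steps."""
    prefix = list(x[:i]) + [add_mistake(x[i])]
    return prefix + regenerate_rest(prefix)


def fraction_changed(
    question: str,
    x: Sequence[str],
    answer_fn: Callable[[str, List[str]], str],
    add_mistake: Callable[[str], str],
    regenerate_rest: Callable[[List[str]], List[str]],
    rng: random.Random,
    k: int = 3,
) -> float:
    """Fraction of k distinct, randomly chosen corruption points at which
    the final answer differs from the answer for the original reasoning."""
    original = answer_fn(question, list(x))
    positions = rng.sample(range(len(x)), min(k, len(x)))
    changed = [
        answer_fn(question, corrupt_reasoning(x, i, add_mistake, regenerate_rest))
        != original
        for i in positions
    ]
    return sum(changed) / len(changed)


# Toy stand-ins: the reasoning counts upward and the answer is the last step.
steps = ["1", "2", "3", "4"]
bump = lambda step: str(int(step) + 10)  # the injected "mistake"


def keep_counting(prefix: List[str]) -> List[str]:
    last = int(prefix[-1])
    return [str(last + j) for j in range(1, len(steps) - len(prefix) + 1)]


reads_reasoning = lambda q, r: r[-1]   # answer depends on the reasoning
ignores_reasoning = lambda q, r: "4"   # answer never looks at the reasoning

rng = random.Random(0)
relied = fraction_changed("toy", steps, reads_reasoning, bump, keep_counting, rng)
ignored = fraction_changed("toy", steps, ignores_reasoning, bump, keep_counting, rng)
```

The two toy answerers mark the extremes of the metric: an answerer that reads the reasoning changes its answer at every corruption point, while one that ignores the reasoning never does.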

Our findings in Figure 4 show that corrupting CoT decompositions and factored decompositions often alters the answers the model gives, providing evidence for the claim that models rely more on decomposition-based reasoning samples than CoT reasoning samples. Corrupted CoT reasoning can also change the model’s final answer, but this happens far less often than it does for decomposition-based reasoning; a corrupted CoT reasoning sample changes the model’s final answer for only 9.6% of the questions, compared to 28.7% of the answers changing for CoT decomposition and 33.6% of the answers changing for factored decomposition. Our results are presented in a different form than the analogous results from Lanham et al. (2023), since we average the percentage of times the answer is changed across all reasoning samples, even if they differ in length or task, and across all possible locations of the mistaken step. We show more detailed results, broken down by task, in Appendix A.2.

2.3 Conclusions

Overall, our results from the reasoning perturbation experiments suggest that question decomposition leads to more faithful model-generated reasoning. Factored decomposition generates the most faithful reasoning, whereas CoT decomposition generates less faithful reasoning than factored decomposition but more faithful reasoning than CoT prompting. This is shown by the early answering experiments, which find comparable faithfulness between CoT decomposition and CoT prompting, and the adding mistakes experiments, which find CoT decomposition has intermediate faithfulness.

3 Faithfulness Measured via Biasing Contexts

Another way to test for reasoning faithfulness is to measure how much the model’s predictions change due to biasing features in the model’s context, for features which the model is unlikely to explicitly mention in its reasoning (Turpin et al., 2023). An example of such a biasing feature, which we test here, is to make all of the few-shot examples in the model’s context have the same, correct answer choice “A” following Turpin et al. (2023). We then measure unfaithfulness using the performance drop observed when we introduce this bias. Suppose the model answers in a bias-consistent way, e.g., incorrectly answers “A” if all of its few-shot examples have the answer “A” but would answer the question correctly otherwise; this finding would indicate that the model is not wholly relying upon its stated reasoning for its final answer, assuming the model never states that it is using the biasing feature (which we and Turpin et al. confirm in essentially all reasoning samples that we scanned). Here, we introduce the biasing feature by making the correct answer “A” for each of the few-shot examples in the model’s context, by changing what answer text corresponds to which multiple-choice answer, as needed. We also alter the reasoning samples in the few-shot prompt to accord with the change in answer order, e.g. if the model asks subquestions by going through each answer choice in order, we adjust the subquestion order along with the answer choices. We then prompt the model to generate reasoning and answer the question, or to directly answer the question in the few-shot condition.

We evaluate our methods on different tasks than Turpin et al. (2023). As a result, the few-shot examples we use in our prompts differ from their few-shot examples, since we use the same examples for each method as we did for our earlier experiments. Our few-shot examples also consist of two-sided conversations between the Human and Assistant, where the Human asks a question and the Assistant answers a question, perhaps after generating reasoning; Turpin et al. instead place all few-shot examples and context on the Human side of the conversation, before prompting the Assistant to answer the question (perhaps after generating reasoning). Following Turpin et al. (2023), we filter our results by excluding questions where the correct answer is “A”, to specifically look at the results for questions where the bias could lead the model toward an incorrect answer.
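The biasing and filtering steps described above can be sketched as follows. The helper names are hypothetical, and swapping the correct choice into position "A" is only one way of "changing what answer text corresponds to which multiple-choice answer"; the exact reordering used is not specified in the text.

```python
from typing import List, Sequence, Tuple


def move_correct_to_a(choices: Sequence[str], correct_idx: int) -> Tuple[List[str], int]:
    """Swap the correct choice into the first slot (assumed reordering)."""
    reordered = list(choices)
    reordered[0], reordered[correct_idx] = reordered[correct_idx], reordered[0]
    return reordered, 0


def accuracy_drop(
    unbiased_preds: Sequence[str],
    biased_preds: Sequence[str],
    golds: Sequence[str],
    exclude: str = "A",
) -> float:
    """Accuracy without the bias minus accuracy with it, measured only on
    questions whose correct answer is not the excluded letter."""
    kept = [i for i, gold in enumerate(golds) if gold != exclude]
    if not kept:
        raise ValueError("no questions left after filtering")

    def accuracy(preds: Sequence[str]) -> float:
        return sum(preds[i] == golds[i] for i in kept) / len(kept)

    return accuracy(unbiased_preds) - accuracy(biased_preds)


# Toy usage with made-up predictions (not results from the experiments).
biased_example = move_correct_to_a(["cat", "dog", "bird"], correct_idx=2)
golds = ["A", "B", "B", "C"]
drop = accuracy_drop(["A", "B", "B", "C"], ["A", "A", "B", "C"], golds)
```

In the toy data, one of the three retained questions flips to the bias-consistent answer "A", so the measured drop is one third.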

Figure 5 (right) shows the results. We find that CoT prompting, CoT decomposition, and factored decomposition are all similarly unaffected by the biasing feature in the few-shot examples. We observe for CoT prompting a 1.2% (absolute) drop in accuracy, for CoT decomposition a 2.8% drop, and for factored decomposition a 2.1% gain (we suspect that the gain is likely due to noise). This is in contrast to a more significant 7.1% (absolute) drop in performance for the few-shot condition. Overall, our results in this setting do not reveal significant differences in the reasoning faithfulness of different methods. We present more detailed results, broken out by task, in Appendix A.3.

Turpin et al. (2023) found that CoT prompting showed a larger drop in performance with the biased contexts relative to unbiased contexts for this bias (-4.7% absolute). There are several possible explanations for our differing results. It may be due to a difference in prompt formatting (whether the few-shot examples are given in the human or assistant side of the dialog), evaluation tasks, and/or models (Turpin et al. use Claude 1.0 for experiments). Most significantly, two of our evaluation tasks, our filtered version of HotpotQA and StrategyQA, have only two answer choices for all of their questions. By filtering out questions where the correct answer is indeed “A”, we remove half of the examples for each of these tasks. Comparing all prompting methods against each other in this setting is still valid, but comparing to other results, such as those in Turpin et al. (2023), is likely inappropriate.

3.2 Biased Reasoning from Suggested Answer

Next, we explore introducing a different biasing feature from Turpin et al. (2023). We choose an incorrect answer for each question and add (to the human’s utterance) “I think the correct answer is ([incorrect answer choice]), but I’m curious to hear what you think.” This tests for the model’s propensity to answer questions in ways that are in line with its human dialog partner’s preferences or beliefs (Perez et al., 2022, “sycophancy”); we present additional results related to sycophancy in Appendix B. We then prompt the model to generate reasoning and answer the question, or to directly answer the question in the zero-shot and few-shot conditions. Suppose the model’s accuracy significantly decreases and it fails to mention that it is choosing the suggested answer because the human suggested it; such a finding would suggest a lack of reasoning faithfulness, for similar reasons as in §3.3.1. We again measure the drop in performance caused by adding the biasing feature, while verifying that reasoning samples do not reference the bias. We use the implementation details from §3.3.1 here as well, except for the filtering of results. Here, since we always suggest an incorrect answer to the model, we do not need to filter the results to select questions where the model may answer the question incorrectly as a result of the bias. This is a slight departure from the setup of Turpin et al. (2023), who instead always suggest a random answer choice and then filter for examples where the suggestion is an incorrect answer choice; ultimately, both analyses should lead to similar findings.
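A minimal sketch of how the suggested-answer bias might be attached to a question is shown below. The suggestion sentence is quoted from the text above; the helper name, the lettering scheme, and joining the suggestion to the question with a blank line are assumptions made for illustration.

```python
import random
from typing import Tuple

LETTERS = "ABCDEFGHIJ"


def add_suggested_answer(
    question_text: str, num_choices: int, correct_idx: int, rng: random.Random
) -> Tuple[str, int]:
    """Append a suggestion of a randomly chosen incorrect answer choice."""
    wrong = rng.choice([i for i in range(num_choices) if i != correct_idx])
    suggestion = (
        f"I think the correct answer is ({LETTERS[wrong]}), "
        "but I’m curious to hear what you think."
    )
    return f"{question_text}\n\n{suggestion}", wrong


# With two choices and (A) correct, the only possible suggestion is (B).
prompt, suggested = add_suggested_answer(
    "Toy yes/no question?\n(A) yes\n(B) no",
    num_choices=2,
    correct_idx=0,
    rng=random.Random(0),
)
```

Because the suggested choice is always incorrect, every question stays in the evaluation set, which is why no filtering step is needed in this setting.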

Figure 5 (left) shows our results. We find a sizable drop in performance for all methods. For CoT prompting, we observe a 21.3% (absolute) drop in accuracy, for CoT decomposition a 29.1% drop, and for factored decomposition a 9.2% drop, by far the least across all prompting methods. This finding suggests that factored decomposition mitigates some but not all of the lack of faithfulness observed in the other methods in this context. It is also notable that CoT reasoning is more faithful than CoT decomposition reasoning in this context, though both methods observe a greater decrease in performance than the few-shot prompting condition (16.6% absolute drop). We present more detailed results, broken out by task, in Appendix A.3.

3.3 Conclusions

Our findings studying the faithfulness of model-generated reasoning via biased contexts suggest that factored decomposition leads to more faithful reasoning than CoT or CoT decomposition. CoT decomposition reasoning looks less faithful than CoT reasoning via these metrics, but our measurements from the reasoning perturbation experiments suggest otherwise. We do not make any claims about any ordering of the methods in terms of their importance to overall faithfulness, so by simple averaging (after normalizing to a 0–1 scale), we assess CoT decomposition reasoning as more faithful than CoT reasoning.

4 Qualitative Findings

We show reasoning samples for CoT decomposition and factored decomposition in Table 5 and Appendix D. The model-generated decompositions, for both CoT decomposition and factored decomposition, are generally sensible. The model often generates subquestions for each answer choice in order to perform process-of-elimination, which reflects the few-shot examples in its context. Additionally, the model often asks an introductory (sub)question about the general topic behind the question; this helps gather context that sometimes gets used in future subquestions.

Sometimes the model fails to phrase a subquestion such that it can be answered without additional context. It may also regenerate previous subquestions that were not able to be answered and still fail to receive answers to them, instead of reliably correcting the subquestions so that they can be answered. Occasionally, the subquestions and subanswers end up supporting multiple answer choices. The model can still end up answering the question correctly, but from the perspective of faithfulness, the model would ideally explicitly discuss which of the multiple supported answers is correct.

5 Discussion and Limitations

Our findings indicate that using question decomposition over CoT prompting provides faithfulness gains at the cost of question-answering performance. Factored decomposition generates the most faithful reasoning but leads to the worst question-answering performance. CoT decomposition provides intermediately faithful reasoning and performance. We are uncertain how this observed trade-off might be affected by other improvements such as further training, especially training geared towards improving a model’s ability to answer questions via decomposition. Such training or other techniques may lead to Pareto-dominating methods for highly faithful and performant model-generated reasoning, which we believe to be an exciting goal for future work.

Our work leans heavily on the methods we use to assess the faithfulness of model-generated reasoning. These methods are limited by our inability to access the ground truth for the model’s reasoning. Our claim that question decomposition improves reasoning faithfulness is one based on multiple, fairly independent, lines of evidence, but we are open to future tools for assessing reasoning faithfulness, perhaps those based on a mechanistic understanding of the internal computations of our models (Olah, 2023), changing our conclusions. Additionally, we evaluate our methods on only four question-answering tasks and on only one model (an RLHF-finetuned LLM); pretrained LLMs may be more or less prone to generating ignored or biased reasoning, which may increase or reduce the faithfulness benefit obtained via decomposition. Expanding the diversity of the tasks and models evaluated could lead to more robust conclusions about the relative performance and reasoning faithfulness of CoT prompting and question decomposition approaches.

Related Work

Task decomposition has been shown to achieve strong performance in a wide variety of settings. Several methods for prompting language models for reasoning share similarities to the question decomposition approaches we study, e.g., Least-to-Most Prompting (Zhou et al., 2023), Plan-and-Solve Prompting (Wang et al., 2023), Selection-Inference (Creswell et al., 2023), and Successive Prompting (a less flexible version of factored decomposition; Dua et al., 2022). These methods incorporate decomposition-style reasoning (Least-to-Most Prompting, Plan-and-Solve Prompting, and Successive Prompting) and/or restrict the amount of context used when generating reasoning steps (Least-to-Most Prompting, Successive Prompting, and Selection-Inference). Ferrucci et al. (2010), Min et al. (2019), Perez et al. (2020), Fu et al. (2021), and Guo et al. (2022) explore using supervision, heuristics, or language models to decompose hard, multi-hop questions into easy single-hop subquestions that can be answered independently. Reppert et al. (2023) study the process of Iterated Decomposition, where a human helps decompose tasks for LLMs to perform. AlKhamissi et al. (2022) find that decomposing the hate speech detection task into several subtasks greatly improves accuracy and out-of-distribution generalization. Christiano et al. (2018) and Snell et al. (2022) improve task performance by answering questions via decomposition, then learning to predict or distill those improved answers back into the original model. More broadly, Stuhlmüller (2018) presents the factored cognition hypothesis, the claim that tasks can be decomposed or factored into small and mostly independent subtasks. Stuhlmüller et al. (2022) present a software library for implementing factored cognition programs with LLMs. Our work complements existing literature by suggesting that decomposition-based methods may have additional benefits beyond performance, namely, improvements to the faithfulness of the reasoning generated.

Prior work also proposes metrics for and evaluates the faithfulness of model-generated reasoning. We adopt the definition of faithful reasoning from Jacovi & Goldberg (2020), where reasoning is faithful to the extent that it reflects the model’s actual reasoning. One type of faithfulness is the extent to which explanations lead to simulatability of model behavior, where the goal is for model behavior to match human expectations, perhaps after analysis of the model’s reasoning (Doshi-Velez & Kim, 2017; Hase et al., 2020; Wiegreffe et al., 2021). Gao (2023) finds that LLMs can ignore parts of their CoT reasoning, as assessed by perturbing the CoT reasoning samples, corroborating our results and the results of Lanham et al. (2023). Creswell et al. (2023) and Lyu et al. (2023) explore methods for prompting models to generate explanations that are more likely to be faithful by construction, though they do not explicitly measure faithfulness. Other work evaluates the plausibility of CoT reasoning, with mixed findings; some find CoT reasoning to contain contradictions and logical errors (Uesato et al., 2022; Jung et al., 2022; Ye & Durrett, 2022; Golovneva et al., 2023), but others find CoT explanations to be both plausible and helpful, even to smaller models (Madaan & Yazdanbakhsh, 2022; Li et al., 2022).

Conclusion

We explore three prompting strategies for improving question-answering performance while eliciting faithful reasoning from LLMs: Chain-of-Thought (CoT) prompting, CoT decomposition, and factored decomposition. Our work shows it is possible to greatly improve the faithfulness of model-generated reasoning by prompting models to perform question decomposition while maintaining similar levels of question-answering accuracy, suggesting that there is even more headroom for progress using other techniques.

We expect auditing the reasoning process of models to be a powerful lever for improving their safety when supervising models in high-stakes settings (Rudin, 2019); if models provide faithful reasoning for their outputs, we can discard their outputs in situations where their reasoning surfaces undesirable behavior such as reward hacking or sycophancy. We find several promising avenues for building upon our results. First, training models to generate more effective and faithful reasoning may lead to further gains, e.g., by training models to solve problems via decomposition or to generate consistent reasoning across logically related inputs (to mitigate unfaithful, biased reasoning; Turpin et al., 2023). Second, improvements to the faithfulness of models’ stated reasoning may improve the effectiveness of methods that train models on the basis of their stated reasoning process (Uesato et al., 2022; Lightman et al., 2023). Lastly, it is important to validate that faithful stated reasoning enables us to detect undesirable model behaviors, especially ones that would be otherwise hard to catch by only looking at a model’s final output. With further research, we hope that faithful, model-generated reasoning will enable us to reliably understand and train the way LLMs perform tasks via process-based oversight, even as those tasks become more and more challenging.

Author Contributions

Ansh Radhakrishnan led the project, drafted the paper, and conducted all experimental work except for the sycophancy experiments, which were conducted by Karina Nguyen. Karina Nguyen, Jan Brauner, Samuel R. Bowman, and Ethan Perez helped to revise the paper and figures. Jared Kaplan, Samuel R. Bowman, and Ethan Perez provided feedback throughout the course of the project, and Ethan Perez scoped out the project direction. All other listed authors contributed to the development of otherwise-unpublished models, infrastructure, or otherwise provided support that made our experiments possible.

Acknowledgements

We thank Amanda Askell, Buck Shlegeris, Daniel Ziegler, Kshitij Sachan, Leo Gao, Miles Turpin, Ryan Greenblatt, and Saurav Kadavath for helpful feedback and discussion.

References

Appendix A More Detailed Results

A.1 Further Early Answering Results

We present more detailed results for the early answering experiments, which we discuss in §3.2.1, in Figure 6(a). Overall, we find that the curves for each prompting strategy generally match up with the curves averaged across all tasks (shown in Figure 3), suggesting that the model’s sensitivity to reasoning sample truncation is fairly similar across the tasks we evaluate. TruthfulQA is perhaps a slight exception, with all of the prompting strategies having noticeably more similar trends to each other, but the model still appears to be most faithful to factored decomposition reasoning samples by this metric.
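As a rough illustration of the kind of truncation-sensitivity measurement described above, the sketch below truncates a reasoning sample at several fractions and records whether the final answer matches the answer obtained with the complete reasoning. The `answer_fn` callable is a hypothetical stand-in for a model query, and the step-level truncation, the chosen fractions, and all names are assumptions for illustration rather than the exact experimental setup.

```python
from typing import Callable, List, Sequence


def same_answer_by_truncation(
    question: str,
    reasoning_steps: Sequence[str],
    answer_fn: Callable[[str, Sequence[str]], str],
    fractions: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> List[bool]:
    """For each fraction, keep that share of the reasoning steps and check
    whether the answer matches the answer given the complete reasoning."""
    full_answer = answer_fn(question, reasoning_steps)
    matches = []
    for frac in fractions:
        keep = int(frac * len(reasoning_steps))  # number of leading steps retained
        truncated = reasoning_steps[:keep]
        matches.append(answer_fn(question, truncated) == full_answer)
    return matches


def toy_answer_fn(question: str, steps: Sequence[str]) -> str:
    """Toy stand-in for a model: answers "B" only once it has seen two steps."""
    return "B" if len(steps) >= 2 else "A"


result = same_answer_by_truncation("toy question", ["s1", "s2", "s3", "s4"], toy_answer_fn)
```

Under this sketch, a model whose answer rarely matches the full-reasoning answer at small fractions is more sensitive to truncation; averaging the match indicators over many questions at each fraction would give one curve per prompting strategy.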

A.2 Further Adding Mistakes Results

We present more detailed results for the adding mistakes experiments, which we discuss in §3.2.2, in Figure 6(b). We find that the relative ordering of the methods’ reasoning faithfulness is maintained across tasks. For each task, the model changes its answer most frequently when it is prompted with a corrupted factored decomposition reasoning sample and least frequently when it is prompted with a corrupted CoT; a corrupted CoT decomposition reasoning sample leads to intermediate results. OpenBookQA exhibits the smallest effect sizes for final answer sensitivity to reasoning corruption, across all prompting methods, with all other tasks generally showing very similar effect sizes.
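The quantity compared across methods here is how often the final answer changes once the reasoning sample is corrupted. A minimal sketch of that computation follows; the function name and the paired-list input format are assumptions for illustration, not the code used in our experiments.

```python
from typing import Sequence


def answer_change_rate(
    original_answers: Sequence[str],
    corrupted_answers: Sequence[str],
) -> float:
    """Fraction of questions whose final answer differs after the reasoning
    sample is replaced with a corrupted version (paired by index)."""
    if len(original_answers) != len(corrupted_answers):
        raise ValueError("answer lists must be paired one-to-one")
    if not original_answers:
        raise ValueError("need at least one question")
    changed = sum(o != c for o, c in zip(original_answers, corrupted_answers))
    return changed / len(original_answers)


# Toy inputs: two of the four answers change after corruption.
rate = answer_change_rate(["A", "B", "C", "D"], ["A", "C", "C", "A"])
```

A higher rate indicates that the final answer depends more strongly on the stated reasoning, which is the sense in which this metric is read as evidence of faithfulness.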

A.3 Further Biasing Context Results

We present more detailed results for the experiments measuring reasoning faithfulness via biasing contexts, which we discuss in §3.3.1 and §3.3.2, in Figures 7 and 8. The results for HotpotQA and StrategyQA, especially the effect of the suggested answer bias, are likely skewed by the fact that the questions for those tasks only contain two answer choices. The results for the answer is always A experiments for OpenBookQA, specifically for factored decomposition, are fairly unexpected but are likely due to some form of noise.

Appendix B Biased Reasoning from Sycophancy

Here, we test for biased reasoning using other biasing features related to sycophancy, inspired by (but different from) the suggested answer bias that Turpin et al. (2023) study and we adapt in §3.3.2. We use three LLM-written evaluations designed to test LLM sycophancy from Perez et al. (2022), in the context of philosophy questions, questions about Natural Language Processing (NLP), and political questions. We evaluate on 200 randomly chosen questions from each evaluation. The evaluations consist of examples where the user introduces themselves as holding a certain belief or opinion, before asking a question related to that topic; an answer in accordance with the user’s preferences indicates sycophancy towards the user. We assess the percentage of answers the model gives that are non-sycophantic as a way of measuring reasoning faithfulness; we expect 50% of the model’s answers to be non-sycophantic if it were not sycophantic at all. The type of sycophancy we study here is less direct than the kind of sycophancy the suggested answer experiments test for, since the model has to infer something about a user rather than simply answer a question in line with the user’s explicit suggestion, which requires no inference.
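The metric described above can be sketched as follows: each example carries the answer that would agree with the user’s stated view, and we count how often the model’s answer differs from it. The function name and the input format (parallel lists of answer labels) are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def percent_non_sycophantic(
    model_answers: Sequence[str],
    sycophantic_answers: Sequence[str],
) -> float:
    """Percentage of model answers that do NOT match the answer aligned
    with the user's stated view (lists are paired by index)."""
    if len(model_answers) != len(sycophantic_answers):
        raise ValueError("answer lists must be paired one-to-one")
    if not model_answers:
        raise ValueError("need at least one example")
    non_syc = sum(m != s for m, s in zip(model_answers, sycophantic_answers))
    return 100.0 * non_syc / len(model_answers)


# Toy inputs: the model agrees with the user's view on two of four examples.
pct = percent_non_sycophantic(["(A)", "(B)", "(A)", "(B)"], ["(A)", "(A)", "(A)", "(A)"])
```

Values well below 50% indicate that the model sides with the user’s stated view more often than the roughly even split expected from a model that is not sycophantic.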

We display the percentage of answers that are not sycophantic for each method in Figure 9. The results indicate that factored decomposition mitigates LLM sycophancy on the evaluations from Perez et al. (2022); factored decomposition leads to 14.7% of answers being non-sycophantic, as opposed to 4.7% for CoT prompting or 5.2% for CoT decomposition, which both lead to more sycophancy than the zero-shot (9.2%) and few-shot (8.3%) baselines.

A key assumption that our biasing context experiments rely on is the lack of explicit references to the biasing features in the model’s reasoning samples. We qualitatively verify this for both the answer is always A and suggested answer experiments, but find that this assumption does not hold when we attempt to evaluate the model for sycophancy; the model explicitly reasons about the user and tries to answer the question based on their views. Furthermore, the lack of sycophancy observed with factored decomposition is likely due to the model failing to phrase questions appropriately so that it can infer the user’s views, rather than the model actually attempting to not be sycophantic. We tentatively conclude that the reduction in sycophancy we observe when prompting models to perform factored decomposition is not a clear indication of greater reasoning faithfulness, or evidence that factored decomposition is a viable mitigation for sycophancy.

Appendix C Few-Shot Examples and Instructions

Tables 6, 7, 8, 9, 10, 11, and 12 contain the instructions and the first three few-shot examples (for each method) we use to prompt our model, including reasoning sample demonstrations. We share the full prompts, including the remaining few-shot examples and reasoning demonstrations, at this supplementary repository.

Appendix D Reasoning Samples

Tables 13 and 14 contain reasoning samples for CoT decomposition and factored decomposition. As we note in §3.4, the question decompositions for both strategies are quite similar and often exhibit a process-of-elimination structure.

Appendix E Adding Mistakes Prompts and Corrupted Reasoning Samples

Tables 15, 16, and 17 show how we prompt our model to add a mistake to a step in a reasoning sample to generate a corrupted reasoning sample, for each prompting strategy; we discuss the relevant experimental setup in §3.2.2. We show examples of corrupted reasoning samples generated using these prompts in Tables 18, 19, and 20. Qualitatively, we find that over two-thirds of corrupted reasoning samples contain errors that should almost certainly result in different final answers. Because the remaining samples contain mistakes that may not warrant a different answer, our results likely underestimate the true sensitivity of the model’s final answers to corrupted reasoning.