Let's reward step by step: Step-Level reward model as the Navigators for Reasoning

Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, Hongxia Yang

Introduction

In the exciting evolution of Large Language Models (LLMs) such as GPT (OpenAI, 2023; Brown et al., 2020), LLaMA (Touvron et al., 2023a; b), OPT (Zhang et al., 2022a), Falcon (Penedo et al., 2023), and PaLM (Anil et al., 2023; Chowdhery et al., 2022), a consistent ability to handle tasks from conversation to text generation has been evident. However, when it comes to reasoning, especially multi-step reasoning, current LLMs, even with sophisticated prompting techniques like the Chain of Thought (CoT)(Wei et al., 2023), are still prone to a cascade of errors in their generation processes. As the number of reasoning steps increases, these LLMs face challenges in providing and integrating effective feedback, resulting in one error leading to another.

Achieving a refined multi-step reasoning capability for LLMs can unlock their potential across an even broader array of applications, ranging from complex problem-solving to high-level intellectual tasks. Furthermore, as these models become foundational tools in numerous domains, ensuring their reasoning accuracy is paramount to prevent compounding errors and instill trust in their outputs.

To address the challenges of multi-step reasoning, prior research has focused on feedback mechanisms and supervisory techniques. Works have incorporated self-reflection (Madaan et al., 2023; Xie et al., 2023) or external observation (Yao et al., 2023b; Gou et al., 2023) during reasoning, aiming to correct errors and determine correct reasoning steps. Additionally, novel methods like Tree of Thought (ToT)(Yao et al., 2023a) and RAP(Hao et al., 2023) have integrated traditional search algorithms to convert multi-step reasoning into a path search problem. However, while some of these solutions have led to improvements, they introduce new sets of problems. For instance, the risk of LLMs falling into repetitive thought loops or their self-reflection process exceeding the model’s context window constraints. Additionally, direct search processes, such as BFS or DFS, often result in redundancy and substantial reasoning overhead, which, in extreme cases, can degrade to exhaustive search complexity.

Building upon previous studies, it’s found that enhancing the reasoning capabilities of large language models involves a few essential aspects. These include step-by-step reasoning, which helps in breaking down complex problems incrementally and guiding the model toward the solution. Another crucial element is the incorporation of feedback or self-reflection. Such feedback can be from the model itself, other models, or even external sources like the results of code execution. Additionally, implementing heuristic search algorithms can be beneficial. Especially when trying to facilitate the model’s reasoning path search, it becomes important to compress the search space effectively.

Uesato et al. (2022) and Lightman et al. (2023) introduced the process-supervised reward model (PRM). PRM can provide step-level feedback for the multi-step reasoning result generated by the language model, but it has traditionally been used in RLHF (Ouyang et al., 2022; Luo et al., 2023) via proximal policy optimization (PPO) (Schulman et al., 2017) or reject sampling (Yuan et al., 2023; Dong et al., 2023; Bai et al., 2022). Thus, we hypothesized that the fine-grained feedback from PRM, which enhances RLHF (Wu et al., 2023), could similarly improve reasoning path searching. PRM was used as a reward model for the PPO process in Luo et al. (2023). Ideally, PRM would not introduce additional training overhead and could be further utilized in the decoding phase after the PPO stage to enhance the model’s reasoning capabilities. On the other hand, PRM only requires evaluation for each step, making its computational cost much lower than model self-reflection.

Consequently, we propose a heuristic greedy search algorithm for large language model reasoning tasks that harnesses the process-supervised reward model for step-level feedback(HGS-PRM). Similar to the Lightman et al. (2023), we trained a process-supervised reward model based on the open-source model. With our method, each new reasoning step generated by the large language model undergoes evaluation by the reward model. This determines whether to accept the given step or continue generating a new one. If a valid new step is unattainable, a backtrack occurs until a complete reasoning path is identified. Our experimental results on GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) both surpassed those achieved with the CoT method. For instance, with the WizardMath-13B (Luo et al., 2023)model, our method achieved accuracies of 65.4% and 13.7% on the GSM8K and MATH datasets, respectively, surpassing CoT’s 63.2% and 10.4%.

Beyond mathematical problems, we aimed to demonstrate the efficacy of HGS-PRM in code generation and even more intricate reasoning tasks. By leveraging technologies like Abstract Syntax Trees (AST) and Mutation Testing, we automatically generated PRM data for code generation challenges on the MBPP (Austin et al., 2021) dataset. Following this, we trained a process-supervised reward model specifically tailored for code. Moreover, our approach also yielded improved results on the HumanEval (Chen et al., 2021) benchmark. For instance, the results for Code-LLaMA-Python-13B increased from 41.5% to 44.5%.

In summary, our contributions are as follows:

We explored the feasibility of utilizing PRM during the decoding phase. Training and deploying PRM is a challenging task. By demonstrating the viability of PRM-assisted decoding on open-source models, we’ve achieved a significant advancement.

We evaluated the strengths and weaknesses of prior methods for enhancing large language model reasoning capabilities. Innovatively, we combined PRM with path search, introducing a heuristic greedy search algorithm that leverages PRM for step-level feedback.

Ours is the pioneering effort to apply PRM in the coding domain. We have released a PRM dataset specifically for code and validated that PRM is effective across a broader spectrum of reasoning tasks.

Method

In this section, we delve into the technical methodologies we have adopted. First and foremost, we discussed how to use step-level data to train our process-supervised reward model(PRM). Moreover, we provide an in-depth examination of the heuristic greedy search algorithm with PRM (HGS-PRM), discussing its operational mechanics and core concepts. To demonstrate the versatility of this method for other reasoning tasks, We will also introduce how we generated our code-specific PRM data by employing Abstract Syntax Tree (AST) techniques to automate the generation of PRM data tailored for code-related tasks.

Following Taori et al. (2023) and Lightman et al. (2023), we trained our reward model, with the entire process divided into two phases:

To equip the foundational LLaMA model with mathematical instruction capabilities and strengthen its mathematical proficiency, we initially used the training portion of the MATH dataset. In line with Taori et al. (2023) approach, we mapped ’question’ and ’solution’ from the MATH(Hendrycks et al., 2021) training dataset to ’instruction’ and ’response’, respectively. We then performed instruction tuning(Wang et al., 2023c) on the LLaMA-7B(Touvron et al., 2023a) model and got LLaMA-7B-SFT.

Building on the instruction-tuning basis, we trained a reward model using PRM800k(Lightman et al., 2023) based on LLaMA-7B-SFT. This model is designed to score each step of the Solution on a step-by-step dimension. The labels fall into three categories: Positive, Negative, and Neutral.

2 Heuristic Greedy Search with PRM

Heuristic search is an algorithmic strategy that utilizes heuristic information in problem-solving to guide the search direction, thereby finding solutions more rapidly. A classic example of heuristic greedy search is the A* algorithm, which combines the benefits of best-first search and Dijkstra’s algorithm. It efficiently finds the shortest path by using a heuristic to estimate the cost to the goal from a given node.

During the mathematical problem-solving process with large language models, each chain of reasoning can be broken down step by step. This can be translated into a search problem for the optimal reasoning path within a vast space. Assessing the reward of each step, or the value of a specific state, is crucial for enhancing the inference capabilities of large language models through search algorithms.

Utilizing the language model’s self-evaluation to judge the value of a step can lead to significant computational overhead. For autoregressive models like GPT(Radford et al., 2019) that rely solely on decoding, the computational time complexity is $O(N^{2})$ . The introduction of self-assessment lengthens the sequence, causing the time complexity to grow exponentially. Additionally, traditional search methods like BFS and DFS have an expansive search space and involve many redundant nodes.

In comparison, using only a process-supervised reward model for assessment significantly reduces overhead compared to self-assessment. Moreover, heuristic greedy search algorithms can effectively minimize the search space, yielding faster results.

Consequently, we propose a Heuristic Greedy search algorithm for large language models based on feedback from the process-supervised reward model(Figure 2) as detailed in Appendix A. Figure 3 shows an example of HGS-PRM for a GSM8K problem.

The Algorithm in the AppendixA illustrates the process of our heuristic greedy search algorithm. Given an input question $x$ , our algorithm expands to the next potential node and then evaluates it using a reward model. If the reward is positive, the algorithm advances using this new node. If not, it continues to explore the next potential node, assessing it with the reward model until it reaches the maximum bandwidth $B$ . If a node expands to the maximum bandwidth without finding a positive step, two scenarios are considered: If neutral sub-nodes exist, one is selected. If all sub-nodes are negative, this indicates an erroneous step. The large language model is unable to determine the correct subsequent step, prompting a backtrack. To constrain our search space efficiently and balance exploration with efficiency, we’ve set max bandwidth $B$ and max iterations $T$ parameters. Once the algorithm reaches the max iteration count, it will generate an answer from the current step. In our actual implementation, we also tried BFS and MCTS. However, we found that they introduced a more redundant search space and did not effectively improve the results. Our ultimate goal is to enhance the model’s mathematical reasoning ability, not to increase algorithmic complexity. Therefore, the heuristic greedy search achieves a better trade-off between exploration and efficiency and is a more optimal choice.

3 Process-supervised Data for Code

To extend our approach to the field of code, a step-level reward dataset for code is necessary. As opposed to maths problems where it is difficult to assess the correctness of a given reasoning step without human intervention, coding problems with unit-tests are uniquely suited for automated labeling of step-level reward data. In our work, we use MBPP as seed dataset from which our PRM for code dataset is created.

Figure 4 provides an illustration of our method for generating code PRM data. Similar to PRM data where the validity of each reasoning step is evaluated in the context of previous steps, the format of our PRM-Code data code also includes components such as (PREVIOUS DATA, STEP and LABEL). As Python programs are naturally divided into lines separated by the new-line token, we first collect positive results from the ground truth data. For example, if we select the n-th step as the current annotated step, the PREVIOUS DATA includes the prompt and steps 0 to n-1, the STEP is the n-th step, and the LABEL is positive.

The creation of step-level reward data with neutral and negative reward for code can the be accomplished with a combination of code mutation and unit-testing. Automatic code mutation (as in Mutation Testing) is a standard practice in code development where operators (such as addition) in a given code block is automatically mutated to similar operators (such as division). Such mutation requires first transforming a given code block into an abstract syntax tree (AST) representation, and defining rules of how different nodes on the AST are to be mutated. This yields N mutated code variants for each line of code where each mutated code is executable, and each variant undergoes only one mutation. These mutated codes are then executed with unit tests. If a mutated code passes the unit tests, its LABEL for the corresponding step is marked as neutral. If it fails to pass the unit tests, the LABEL is marked as negative. The PREVIOUS DATA includes the prompt and steps 0 to the previous step of the mutation step, the STEP corresponds to the mutation step generated.

We note that mutation testing primarily focus on atomic operators such as arithmetic operations and array slicing, which limits the generality of our PRM-Code dataset. Natural extensions of our current approach involves defining more elaborate mutation rules that involve, for example, mutating for-loops into while-loops that correspond to changes of multiple lines of code. However, despite this limitation, our PRM-Code dataset has proven effectively in improving the code generation performance as we demonstrate in subsequent sections.

Experiment

In this section, we will delve into the intricacies of training our processed reward model and present the evaluation results of our heuristic greedy search algorithm, based on this processed reward model, in mathematical and code generation tasks.

For mathematical tasks, we trained our process-supervised reward model(PRM) based on LLaMA-7B(Touvron et al., 2023a). As previously mentioned, our model training method first involved directive fine-tuning using the MATH training set, followed by reward model training. However, it should be noted that we also directly trained our reward model on LLaMA-7B. Our experimental results indicate that models fine-tuned with mathematical directives perform superiorly in all aspects compared to the base model.

From our result shown in Figure 5, the Instruction-tuning model with math data before the training reward model results in improved performance. We also find positive labels exhibit the highest precision and recall. Therefore, we adopted a greedy strategy when designing our search algorithm, directly selecting the step when a positive step is identified. We also find The ability to distinguish neutral labels is inadequate, so give them less consideration during decoding. They can be treated as either positive or negative.

Penalty miss rate refers to the proportion of classifying positives as negatives and negatives as positives in the mis-classification, which we consider to be the most serious classification error of the reward model. For neutral samples, We can avoid misjudgments by adjusting the action of our search algorithm for neutral samples.

We also trained a process-supervised reward model specifically for code. Drawing from our experiments with math, we directly use Code-LLaMA-7B(Rozière et al., 2023), which had been fine-tuned with code instructions, to train our process-supervised reward model for code. Our code step-level data was automatically generated on MBPP(Austin et al., 2021) with the method in Figure 4.

From our result in Figure 5, the process-supervised reward model(PRM) demonstrated a significantly better ability to discern negative cases in code compared to math. This suggests that there’s greater potential in applying the PRM within the coding domain.

2 Math result

For mathematical problems, we first conducted evaluation experiments on the general models LLaMA-7B and LLaMA-13B using GSM8K and MATH datasets. We achieved results on both GSM8K and MATH datasets that were superior to those of CoT. For the MATH task, we follow the setting of (Lightman et al., 2023) and evaluate our models only on the 500 MATH test problems because 4.5K MATH test problems are in the PRM800K training set, which we used to train the reward model. Apart from general models, we also conducted experiments on math-specific models WizardMath-7B and WizardMath-13B(Luo et al., 2023). We similarly achieved superior results to CoT on both the GSM8K and MATH datasets. Moreover, we observed that math-specific models benefit more from the reward model. If the language model’s intrinsic capability is too weak, even with the aid of a reward model, it remains challenging to sample the correct reasoning path. On the other hand, if the linguistic capacity of the model significantly surpasses that of the reward model, the benefits might not be pronounced. Therefore, aligning the capabilities of the reward model and the language model is of paramount importance.

3 Code result

To avoid data leakage, our code PRM’s step-level training data is created based on MBPP dataset(Austin et al., 2021), and the resulting reward model is evaluated on the HumanEval(Chen et al., 2021) benchmark. HumanEval is a widely recognized benchmark for testing large language models on coding tasks. It consists of 164 original programming questions, with an average of 9.6 test cases assigned to each question.

For our evaluation, we selected models Code-LLaMA-Python-7B and Code-LLaMA-Python-13B. On the HumanEval benchmark, our approach improved pass@1 result by 4.9%(36.6% VS 41.5%) for Code-LLaMA-Python-7B model and 3.0% (41.5% VS 44.5%) for Code-LLaMA-Python-13B model over CoT, respectively. This performance also surpassed the greedy decoding results presented in the Code-LLaMA paper(Rozière et al., 2023), which is 3.1%(38.4% VS 41.5%) and 1.2%(43.3% VS 44.5%). The accuracy improvement we observed in code is generally higher than in mathematics. This aligns with our prior evaluation results during the PRM training reward phase where the precision and recall for the Code PRM negative label were notably higher than for math. We hypothesize that this might be because both HumanEval and MBPP involve relatively simple programming challenges, whereas MATH presents more complex mathematical problems which are intrinsically more challenging for both PRM and the language models themselves to learn.

4 Further Analysis

Is the result obtained through HGS-PRM reliable? We further analyzed the relationship between PRM feedback and the correctness of the results. The actual process is often not so ideal. As indicated in 5, the actual precision of PRM is lower than 90%. If the accuracy of PRM is not up to par, HGS-PRM might reduce the accuracy of the results. Therefore, besides the greedy search method, we can also use the approach of scoring after sampling to verify whether PRM is effective.

For the field of mathematics, we sampled 10 times from the MATH dataset with LLaMA2-13B and then scored the sampled reasoning paths at the step-level using PRM. According to our HGS-PRM algorithm, the correct reasoning path should not have any negative steps, so we filtering all samples with negative steps. For code, we sampled 100 times from the HumanEval dataset using Star-Coder(Li et al., 2023) and scored in the same manner.

From our results in Table 3, the “Accuracy after filtering”(14.4%, 36.66%) is much higher than the original accuracy(4.25%, 30.13%), demonstrating the correlation between the step-level feedback provided by PRM and the correctness of the result. The computational cost of filtering after repeated sampling is higher than that of HGS-PRM, because, for complex reasoning problems, it’s challenging for large language models to sample reasoning paths that don’t contain negative steps. This is also why we use search instead of sampling.

5 Case Study

Appendix B displays some correct reasoning paths and erroneous paths obtained by HGS-PRM, further detailing what mistakes PRM specifically identified such as parsing errors, computational errors, and semantic errors.

related Works

Large Language Models For reasoning Large language models (LLMs) have faced challenges in complex reasoning tasks, covering mathematical reasoning (Lu et al., 2023; Frieder et al., 2023) and code generation (Chen et al., 2021; Li et al., 2023; Rozière et al., 2023). Key datasets for mathematical reasoning are GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), AddSub (Hosseini et al., 2014), MultiArith (Roy & Roth, 2015), and SingleEQ (Koncel-Kedziorski et al., 2015) while code generation evaluations employ HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and DS1000 (Lai et al., 2022).. Wei et al. (2023) proposed ”Chain-of-Thought Prompting” (CoT) to improve LLMs’ performance by prompting step-by-step decompositions. The Zero-shot-CoT by Kojima et al. (2023) enhances this by adding ’Let’s think step by step’ to responses, detailing the model’s reasoning. Gao et al. (2023) introduced the Program-aided Language Model (PAL) to assist LLMs in reasoning, while Wang et al. (2023b) presented a self-consistent method building upon CoT for consistent responses. Zhao et al. (2023) combined both CoT and PAL for improved reasoning, allowing concurrent use with the self-consistent approach. Lastly, Zhang et al. (2022b)’s Auto-CoT leverages LLMs to produce diverse reasoning chains autonomously.

Reasoning with feedback Beyond prompt-based enhancements, Pan et al. (2023) reviews methods combining self-feedback, search, and tools (Gou et al., 2023) to amplify large language model reasoning. Research by Madaan et al. (2023), Gero et al. (2023), Chen et al. (2023), Weng et al. (2023) and Shinn et al. (2023) demonstrate that these models can serve as feedback sources to bolster reasoning. Xie et al. (2023) offers a prompting method integrating self-evaluation via stochastic beam search to refine multi-step inference. Yao et al. (2023a)’s ”ToT” (Tree of Thoughts) advances CoT, enabling models to explore varied reasoning paths, evaluate decisions, and adjust strategies. Hao et al. (2023) transforms the LLM to act as both a world model and logical agent, using a planning method inspired by Monto Carlo Tree Search. Wang et al. (2023a)’s LeTI uses textual feedback from code errors for better code generation. Yang et al. (2023) takes a retrieval approach, allowing LLMs to extract data from Lean and engage with proof environments. Finally, Yao et al. (2023b) presents a framework blending advances in reasoning and action for a broader spectrum of linguistic challenges.

While much attention has been given to other areas, the process-supervised reward model (PRM) and outcome-supervised reward model (ORM) have seen less exploration. Uesato et al. (2022) first introduced PRM, highlighting its advantages over ORM in several applications, from few-shot prompting to reward modeling. Expanding on this, Lightman et al. (2023) released PRM800K, a dataset based on MATH annotations, showcasing the reliability of process supervision over outcome supervision. This high-quality dataset has been invaluable to our research. Luo et al. (2023) introduced ”Reinforcement Learning from Evol-Instruct Feedback (RLEIF)”, using PRM as a reward model within the PPO framework (Schulman et al., 2017). While these studies have focused on PRM for math, there’s a noticeable gap in PRM research for coding, pointing to a ripe area for further investigation.

Conclusions and Future Work

In this study, we introduce an approach that integrates the process-supervised reward Model (PRM) into heuristic greedy search, demonstrating the value of incorporating PRM feedback in multi-step inference tasks within large language models. This offers fresh insights into utilizing the PRM framework rather than RLHF or reject sampling. Our experimental results on the MATH and GSM8K datasets validate the effectiveness of our proposed HGS-PRM method and explore the potential of PRM in the decoding phase experiment.

Moreover, we take advantage of the Mutating Testing technique for code PRM data generation, enabling rapid scaling of step-level reward data for code applications. We release a PRM dataset tailored for coding, which holds pivotal significance for further research on PRM’s application in code-related tasks. We believe that the potential of the process-supervised reward model in multi-step reasoning within expansive models remains under-explored, especially in coding tasks. Enhancing code-specific PRM data and reward models could significantly bolster the capabilities of large language models in code generation.

Acknowledgement

Yang You’s research group is being sponsored by NUS startup grant (Presidential Young Professorship), Singapore MOE Tier-1 grant, ByteDance grant, ARCTIC grant, SMI grant Alibaba grant, and Google grant for TPU usage.

References

Appendix A Heuristic Greedy Search with PRM(HGS-PRM)

Node definition Each node of the search tree possesses the following elements.

State $S=[x,s_{1}...s_{i}]$ , where the input prompt and question $Q$ represented as $x$ , each step $i$ represented as $s_{i}$

Step $s_{i}$ , the reasoning step for this node

Reward $r$ , the reward for the step $i$ ( $s_{i}$ ) evaluated from the PRM

Value $V$ , the cumulative reward for each step leading to that node.

Children $children$ , the children node of the node

Expand Expand the possible next steps from the current node, and create a new child node: $P(s_{i+1}|S)=P(s_{i+1}|x,s_{1}\textperiodcentered\textperiodcentered\textperiodcentered s_{i})$ . the sampling diversity is influenced by the sample hyper-parameter like temperature, top- $p$ , top- $k$ .

Get Reward: after expanding a new node, the process reward model will evaluate the new step and feedback reward range from $$, which means negative, neutral, positive step.

Go ahead: when the reward of the step is positive(r = 1), we will choose this step and go ahead to this new child node.

Backup: When this node cannot expand into different steps or the number of child nodes reaches the maximum bandwidth, it will backtrack to its parent node.