WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, Dongmei Zhang

cs.CL cs.AI cs.LG

Introduction

Recently, large language models (LLMs) have garnered significant attention and become the go-to approach for numerous natural language processing (NLP) tasks, including open-domain conversation, coding, and math. A conspicuous example is ChatGPT, developed by OpenAI. This model uses extensive pre-training on large-scale internet data and further fine-tuning with specific instruction data and methods. As a result, it achieves state-of-the-art zero-shot performance on various benchmarks. Subsequently, Anthropic, Google, and Meta launched competitive products one after another. Notably, Meta’s series of Llama models has sparked an open-source revolution and quickly narrowed the gap with closed-source LLMs. This trend has also stimulated the release of MPT, Falcon, StarCoder, Alpaca, Vicuna, WizardLM, and others. However, these open models still struggle with scenarios that require complex multi-step quantitative reasoning, such as solving mathematical and science challenges.

Chain-of-thought (CoT) prompting designs better prompts to elicit step-by-step solutions, which can improve performance. Self-Consistency also achieves remarkable performance on many reasoning benchmarks: it samples several candidate answers from the model and selects the final one by majority vote. Recent work finds that process supervision with reinforcement learning significantly outperforms outcome supervision on challenging MATH problems.
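The majority-vote selection used by Self-Consistency can be illustrated with a minimal sketch. The function name `majority_vote`, the tie-breaking rule, and the sample answers are assumptions made for illustration; they are not taken from this paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled solutions.

    Assumption of this sketch: ties are broken in favor of the answer
    that was sampled first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common keeps first-encountered order among equal counts
    return counts.most_common(1)[0][0]


# Final answers extracted from five sampled reasoning paths (made-up data)
sampled = ["18", "20", "18", "18", "17"]
print(majority_vote(sampled))
```

Only the final answers are compared here; the reasoning paths that produced them may differ.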

Inspired by Evol-Instruct and process-supervised reinforcement learning, this work aims to enhance the mathematical reasoning abilities of the SOTA open-source LLM, Llama-2. As shown in Figure 1, we propose a new method named Reinforcement Learning from Evol-Instruct Feedback (RLEIF). It first generates diverse math instruction data with a math-specific Evol-Instruct; we then train an instruction reward model (IRM) and a process-supervised reward model (PRM). The former indicates the quality of the evolved instruction, and the latter scores each step in the solution. The new Evol-Instruct method comprises two processes, downward evolution and upward evolution, which produce grade-school math and challenging math questions respectively. Initially, we re-generate, filter, and fine-tune on the original math instruction data from GSM8k and MATH. We then train the Llama-2 models to obtain the reward models and our WizardMath.

We perform experiments on two mathematical reasoning benchmarks, GSM8k and MATH. The results demonstrate that WizardMath outperforms all other open-source LLMs, achieving state-of-the-art performance. Specifically, WizardMath shows a substantial improvement in pass@1 over Llama-2 70B, with an increase of +24.8 (81.6 vs. 56.8) on GSM8k and +9.2 (22.7 vs. 13.5) on MATH. Notably, our model also surpasses OpenAI’s ChatGPT-3.5, Anthropic’s Claude Instant-1, and Google’s PaLM-2 in terms of pass@1 on GSM8k.

The main contributions of this work are as follows:

We introduce the WizardMath model, which enhances the mathematical reasoning abilities of the open-source pretrained large language model Llama-2.

We propose a new method, Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which combines Evol-Instruct and reinforcement learning to improve LLM reasoning performance.

WizardMath surpasses all other open-source LLMs by a substantial margin in terms of mathematical reasoning, including Llama-2 70B, Llama-1 65B, Falcon-40B, MPT-30B, Baichuan-13B Chat, and ChatGLM2 12B, on both GSM8k and MATH.

WizardMath outperforms several major closed-source LLMs, such as ChatGPT, GPT-3.5, Claude Instant, PaLM-2, PaLM-1, and Minerva, on GSM8k.

Method

In this section, we elaborate on the details of WizardMath. Following WizardLM and PRMs, we propose Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which integrates Evol-Instruct and reinforced process supervision to evolve GSM8k and MATH, and fine-tunes the pre-trained Llama-2 with the evolved data and reward models.

As shown in Figure 1, our method consists of three steps:

Supervised fine-tuning.

Training an instruction reward model and a process-supervised reward model.

Evol-Instruct and PPO training.

Following InstructGPT, we first fine-tune the base model with supervised instruction-response pairs, which comprise:

To make the parsing of each step easier, we use few-shot prompting with an alpha version of the WizardLM 70B model to re-generate 15k answers for GSM8k and MATH in a step-by-step format, keep those with a correct final answer, and use this data to fine-tune the base Llama model.

To enhance the model’s ability to adhere to natural and diverse instructions, we also sample 1.5k open-domain conversations from WizardLM’s training data, then merge them with the above math corpus to form the final SFT training data.
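The data construction described above (keep re-generated solutions whose final answer is correct, then mix in open-domain samples) might look roughly like the following sketch. The record keys (`id`, `final_answer`, and so on), the exact-string answer comparison, and the function name `build_sft_data` are hypothetical; the paper does not specify how answers are matched.

```python
import random


def build_sft_data(regenerated, gold_answers, open_domain, n_open=1500, seed=0):
    """Filter re-generated math solutions and merge them with open-domain data.

    regenerated:  list of dicts with hypothetical keys
                  "id", "question", "solution", "final_answer"
    gold_answers: dict mapping id -> reference final answer
    open_domain:  list of open-domain conversation records
    """
    # Keep only solutions whose final answer matches the reference
    # (assumption: exact match after stripping whitespace)
    correct = [
        ex for ex in regenerated
        if ex["final_answer"].strip() == gold_answers[ex["id"]].strip()
    ]
    # Sample open-domain conversations without replacement
    rng = random.Random(seed)
    sampled = rng.sample(open_domain, min(n_open, len(open_domain)))
    return correct + sampled


regenerated = [
    {"id": "q1", "question": "What is 2 + 3?",
     "solution": "Step 1: 2 + 3 = 5.", "final_answer": "5"},
    {"id": "q2", "question": "What is 4 * 6?",
     "solution": "Step 1: 4 * 6 = 26.", "final_answer": "26"},
]
gold_answers = {"q1": "5", "q2": "24"}
open_domain = [{"conversation": "placeholder open-domain sample"}]

sft_data = build_sft_data(regenerated, gold_answers, open_domain)
print(len(sft_data))  # one correct math example plus one open-domain sample
```

The default `n_open=1500` mirrors the 1.5k conversations mentioned in the text; everything else is illustrative.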

2 Evol-Instruct principles for math

Motivated by the Evol-Instruct method proposed by WizardLM and its effective application in WizardCoder, this work attempts to create math instructions of varied complexity and diversity to enhance the pre-trained LLMs. Specifically, we adapt Evol-Instruct to a new paradigm that includes two evolution lines:

Downward evolution: It enhances instructions by making the questions easier. For example: i) revising high-difficulty questions to lower difficulty, or ii) producing a new and easier question on a different topic.

Upward evolution: Derived from the original Evol-Instruct method, it deepens existing questions and generates new, harder ones by i) adding more constraints, ii) concretizing, and iii) increasing reasoning.
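As a rough illustration of the two evolution lines above, the sketch below builds a rewriting prompt for either direction. The template wording, the names `DOWNWARD_TEMPLATE`, `UPWARD_TEMPLATE`, and `build_evol_prompt` are hypothetical; they are not the prompts used to train WizardMath.

```python
# Hypothetical prompt templates; the actual Evol-Instruct prompts are not given here.
DOWNWARD_TEMPLATE = (
    "Rewrite the following math question so that it is easier to solve, "
    "for example by lowering its difficulty or by posing a simpler question "
    "on a different topic.\n\nQuestion: {question}"
)
UPWARD_TEMPLATE = (
    "Rewrite the following math question so that it is harder to solve, "
    "for example by adding constraints, making it more concrete, or "
    "requiring more reasoning steps.\n\nQuestion: {question}"
)


def build_evol_prompt(question, direction):
    """Return an evolution prompt for direction 'down' or 'up'."""
    templates = {"down": DOWNWARD_TEMPLATE, "up": UPWARD_TEMPLATE}
    if direction not in templates:
        raise ValueError("direction must be 'down' or 'up'")
    return templates[direction].format(question=question)


print(build_evol_prompt("A train travels 120 km in 2 hours. What is its speed?", "up"))
```

The resulting prompt would then be sent to an instruction-evolving model; that call is omitted because it depends on the model being used.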

3 Reinforcement Learning from Evol-Instruct Feedback (RLEIF)

Inspired by InstructGPT and PRMs, we train two reward models to predict the quality of the instructions and the correctness of each step in the answer respectively:

Instruction Reward Model (IRM): This model aims to judge the quality of the evolved instructions on three aspects: i) Definition, ii) Precision, and iii) Integrity. To produce the ranking-list training data for the IRM, for each instruction we first use ChatGPT and Wizard-E (Wizard-Evol-Generator, an alpha-version fine-tuned Llama model used specifically to execute Evol-Instruct without APIs) to generate 2~4 evolved instructions each. We then use Wizard-E to rank the quality of those 4~8 instructions.

Process-supervised Reward Model (PRM): Because no powerful open-source math reasoning LLM existed before this work, there is no simple way to obtain highly precise process supervision without professional human labelers or the closed-source ChatGPT. Therefore, we rely on ChatGPT to provide process supervision, asking it to assess the correctness of each step in the solutions generated by our model.

PPO training. We evolve the original math (GSM8k + MATH) instructions for 8 turns, increasing the data size from 15k to 96k. We use the IRM and PRM to generate the instruction reward ( ${r^{I}}$ ) and the answer reward ( ${r^{A}}$ ), then take their product as the final reward: $r={r^{I}}\cdot{r^{A}}$ .
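The final reward $r={r^{I}}\cdot{r^{A}}$ can be sketched as follows. How the per-step PRM feedback is reduced to a single answer reward is not stated in the text, so the `aggregate=min` default below is purely an assumption of this sketch, as are the function name and the example values.

```python
def final_reward(instruction_reward, step_scores, aggregate=min):
    """Combine the instruction reward r^I and the answer reward r^A as r = r^I * r^A.

    instruction_reward: scalar score from the IRM (r^I)
    step_scores:        per-step scores from the PRM
    aggregate:          ASSUMPTION: how step scores become r^A (default: minimum)
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    answer_reward = aggregate(step_scores)  # r^A
    return instruction_reward * answer_reward


# Made-up scores: a good instruction whose solution has one doubtful step
r = final_reward(0.8, [1.0, 0.5, 1.0])
print(r)
```

With scores in [0, 1], a multiplicative reward stays low whenever either the instruction or the answer is judged poor, which is one plausible motivation for using a product rather than a sum.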

Experiment

This section first gives an overview of the baseline models in our experiments, then reports the performance of our models on two prevalent mathematical benchmarks: GSM8k and MATH.

Numerous technology companies have built highly proficient Large Language Models (LLMs) but have opted against making them publicly available; these are referred to as closed-source models. In our research, we include a large number of closed-source models as baselines. Specifically, our baselines encompass the following models: (i) OpenAI’s GPT-3, GPT-3.5, ChatGPT (https://openai.com/), and GPT-4; (ii) Google’s PaLM 2, PaLM, and Minerva; (iii) Anthropic’s Claude Instant, Claude 1.3 (https://www.anthropic.com/index/introducing-claude), and Claude 2 (https://www.anthropic.com/index/claude-2); and (iv) DeepMind’s Chinchilla.

Open-Source Models.

A large number of open-source LLMs are accessible to the AI community. Nonetheless, their performance tends to lag significantly behind the closed-source models. As part of our research, we include many of these open-source models as baselines, mainly the following: Llama 1 & Llama 2, GAL, GPT-J, GPT-Neo, Vicuna, MPT (https://github.com/mosaicml/llm-foundry/), Falcon, Baichuan (https://github.com/baichuan-inc/Baichuan-13B), ChatGLM, Qwen (https://github.com/QwenLM/Qwen-7B/), and RFT.

2 Evaluation Benchmarks

We mainly evaluate WizardMath on two benchmarks, GSM8k and MATH. The GSM8k dataset contains approximately 7,500 training problems and 1,319 test problems, mainly grade-school-level math problems, each of which involves basic arithmetic operations (addition, subtraction, multiplication, and division) and generally requires 2 to 8 steps to solve. The MATH dataset collects math problems from prestigious math competitions such as AMC 10, AMC 12, and AIME. It contains 7,500 training problems and 5,000 challenging test problems in seven academic areas: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus. Furthermore, these problems are divided into five levels of difficulty, with ‘1’ denoting the lowest difficulty level and ‘5’ the highest.

3 Training and Evaluation Prompts

The Llama 2 base serves as our foundation model.

We train WizardMath using the prompt from Alpaca:

Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:

We evaluate on the GSM8k and MATH benchmarks using a CoT prompt.
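The Alpaca training template quoted above can be filled in programmatically. The sketch below reproduces that template string; the names `ALPACA_TEMPLATE` and `build_prompt` and the sample question are illustrative assumptions.

```python
# Template text as quoted in this section; "{instruction}" is the only placeholder.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(instruction):
    """Insert a single instruction (here, a math question) into the template."""
    return ALPACA_TEMPLATE.format(instruction=instruction)


print(build_prompt("Tom has 3 apples and buys 4 more. How many apples does he have?"))
```

The model's generated solution would follow the trailing `### Response:` marker.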

4 Evaluation on GSM8k and MATH

Notably, several baseline scores in Figure 2 and Table 1 are cited from previously published sources: this covers GPT-4 and GPT-3.5; ChatGPT; Claude Instant, Claude 1.3, and Claude 2; PaLM 1, PaLM 2, and Minerva; and Text-davinci-002, GPT-3, and GPT-2. For the open-source models, most scores are taken from the Llama 2 paper or from the models’ self-reports. Additionally, we evaluate Baichuan-chat and Vicuna v1.3 ourselves. Table 2 shows the detailed results of our WizardMath 70B model on the MATH subtopics.

In Table 1, our WizardMath 70B slightly outperforms some closed-source LLMs on GSM8k, including ChatGPT, Claude Instant, and PaLM 2 540B. As shown in Figure 2, our model currently ranks in the top five among all models. WizardMath 70B also surpasses Text-davinci-002 on MATH. The detailed results are as follows:

WizardMath 13B outperforms PaLM 1 540B (63.9 vs. 56.5), Minerva 540B (63.9 vs. 58.8), and GPT-3.5 (63.9 vs. 57.1) on GSM8k. Meanwhile, it surpasses PaLM 1 540B (14.0 vs. 8.8) and GPT-3 175B (14.0 vs. 5.2) on MATH.

WizardMath 70B, our largest model, achieves superior or comparable performance to Claude Instant (81.6 vs. 80.9), ChatGPT (81.6 vs. 80.8), and PaLM 2 (81.6 vs. 80.7) on GSM8k. WizardMath 70B also exceeds Text-davinci-002 (22.7 vs. 19.1) by 3.6 points on the MATH benchmark.

Comparing with the Open-Source Models.

The results in Table 1 show that WizardMath 70B has a substantial performance advantage over all the open-source models on both the GSM8k and MATH benchmarks. The detailed results are as follows:

WizardMath 7B surpasses most open-source models with parameter counts ranging approximately from 7B to 40B, including MPT, Falcon, Baichuan-chat, Vicuna v1.3, ChatGLM 2, Qwen, Llama 1, and Llama 2, on the GSM8k and MATH benchmarks, even though its parameter count is significantly lower than that of the largest of them.

WizardMath 13B is significantly superior to Llama 1 65B (63.9 vs. 50.9) and Llama 2 70B (63.9 vs. 56.8) on GSM8k. Additionally, it outperforms both Llama 1 65B (14.0 vs. 10.6) and Llama 2 70B (14.0 vs. 13.5) on MATH.

WizardMath 70B, our largest model, delivers a substantial improvement, surpassing Llama 2 70B (81.6 vs. 56.8) by 24.8 points on GSM8k. It also outperforms Llama 2 70B (22.7 vs. 13.5) by 9.2 points on MATH.
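The margins quoted in the comparisons above are absolute differences in pass@1 (percentage points), which can be checked with a few lines. The scores are copied from this section; the dictionary layout and the function name `margin` are illustrative assumptions.

```python
# pass@1 scores (in %) as reported in this section
SCORES = {
    "WizardMath 70B": {"GSM8k": 81.6, "MATH": 22.7},
    "WizardMath 13B": {"GSM8k": 63.9, "MATH": 14.0},
    "Llama 2 70B": {"GSM8k": 56.8, "MATH": 13.5},
    "Llama 1 65B": {"GSM8k": 50.9, "MATH": 10.6},
}


def margin(model, baseline, benchmark):
    """Absolute pass@1 gap in percentage points, rounded to one decimal."""
    return round(SCORES[model][benchmark] - SCORES[baseline][benchmark], 1)


for bench in ("GSM8k", "MATH"):
    print(bench, margin("WizardMath 70B", "Llama 2 70B", bench))
```

Rounding to one decimal avoids floating-point noise in the subtraction.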

5 Case Study

Appendix A shows some examples generated by WizardMath. The examples demonstrate that our model consistently generates accurate answers accompanied by clear explanations.

Related Work

LLMs have achieved substantial advancements within the realm of Natural Language Processing (NLP), providing a valuable and task-agnostic foundation for widespread applications. These models typically have parameter counts reaching into the hundreds of billions and are trained on large-scale text corpora. Prominent instances include OpenAI’s GPT3&4, Anthropic’s Claude, Google’s PaLM and Bard (https://bard.google.com/), and DeepMind’s Chinchilla and Gopher. However, none of them has been open-sourced so far, and some are accessible only through APIs.

Recently, the AI landscape has witnessed the emergence of numerous open-source LLMs, characterized by publicly accessible model code and weights. EleutherAI has contributed GPT-NeoX-20B and GPT-J-6B. BigScience has introduced BLOOM. Similarly, Meta has released OPT, Llama 1, Llama 2, and GAL. Tsinghua University has unveiled GLM-130B and ChatGLM. TII has released Falcon. Additionally, LLMs such as Baichuan and Qwen have also surfaced. Presently, Llama plays a pivotal role as the foundation model for supervised fine-tuning, giving rise to several remarkable models, including Alpaca, Vicuna, Guanaco, WizardLM, Orca, and RFT.

Large Language Models For Mathematical reasoning.

It’s well known that complex reasoning problems, which include mathematical reasoning, common-sense reasoning, and logical reasoning, are challenging for NLP models. A substantial body of current research centers on the intricate task of reasoning over Mathematical Word Problems (MWP), which requires the ability to understand mathematical concepts, computation, and multi-step reasoning. Additionally, models are evaluated across different levels of MWP benchmarks on mathematical reasoning datasets such as AddSub, MultiArith, SingleEQ, SVAMP, GSM8K, AQuA, and MATH.

To enhance the reasoning ability of LLMs, Chain-of-Thought Prompting was proposed, which attaches multiple reasoning steps before obtaining the answer to a question. By employing this simple few-shot reasoning strategy, LLMs are able to perform better on complex reasoning problems. Least-to-Most prompting decomposes the problem into sub-problems that are then solved incrementally. Additionally, each step has a more detailed reasoning process. Similarly, Complex CoT underscores the pivotal role of prompt complexity by strategically choosing the most intricate problems and their corresponding solutions to function as prompts. To alleviate the burden of manual effort, Auto-CoT was introduced, an approach that automates the process of acquiring k samples by applying clustering techniques to a provided dataset. With the objective of further reducing manual intervention, Zero-shot-CoT was proposed, which entails the straightforward practice of appending the phrase "Let’s think step by step" after each question, eliciting the inference steps without examples. Moreover, subsequent work expanded upon this notion by exploring diverse inference paths throughout the reasoning process; the final outcome is then determined either by aggregating answers through majority voting or by leveraging a validation mechanism. Another line of work employs a straightforward approach for generating augmented samples, focusing on probing the correlation between LLMs and math reasoning ability.
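The Zero-shot-CoT idea described above amounts to a one-line prompt transformation. In the sketch below, the `Q:`/`A:` layout, the constant name `COT_TRIGGER`, and the function name are assumptions made for illustration; only the trigger phrase itself comes from the text.

```python
COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question):
    """Build a prompt that elicits step-by-step reasoning without any examples.

    Assumption: a simple "Q: ... A: ..." layout with the trigger phrase
    placed where the model's answer begins.
    """
    return f"Q: {question.strip()}\nA: {COT_TRIGGER}"


print(zero_shot_cot_prompt("If 3 pens cost 6 dollars, how much do 5 pens cost?"))
```

No worked examples are included in the prompt, which is what distinguishes this from few-shot Chain-of-Thought prompting.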

Large Language Models For Reinforcement Learning.

Nevertheless, even state-of-the-art models frequently exhibit logical errors and a range of hallucinations. These anomalies become especially challenging within domains necessitating multi-step reasoning, where a single logical misstep may unravel an entire solution. An effective strategy involves training reward models to discriminate between favorable and unfavorable outputs. Early outcome-based approaches were mainly performed on algorithmic tasks. Prior work demonstrated the significant benefits of reward models or validators and proposed a heuristic-based step-size-aware RM. Other work proposed the use of reward models in a reinforcement learning pipeline, or employed rejection sampling for searching to align LLMs with human preferences.

The differences between outcome-based and process-based reward modelling have been further discussed in prior work. Outcome-supervised reward models (ORMs) are trained exclusively on the final outcomes of the model’s chain-of-thought process. Conversely, process-supervised reward models (PRMs) are designed to solicit feedback for each individual step within the chain of thought. In the domain of logical reasoning, models supervised by ORMs frequently follow incorrect reasoning pathways yet yield the correct final answer. Notably, PRMs have been demonstrated to effectively alleviate this inconsistent behavior. Other work amassed an expansive corpus of process-based supervision signals through meticulous manual annotation, which verified that PRMs and supervision with manual annotation yield more pronounced advantages for LLMs than ORMs.
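The contrast between outcome and process supervision can be made concrete with a toy sketch. The functions, the hand-written step labels, and the example solution are all hypothetical; in practice the step labels would come from an annotator or a model rather than being hard-coded.

```python
def outcome_reward(final_answer, gold_answer):
    """Outcome supervision: one signal based only on the final answer."""
    return 1.0 if final_answer == gold_answer else 0.0


def process_rewards(step_labels):
    """Process supervision: one signal per reasoning step."""
    return [1.0 if is_correct else 0.0 for is_correct in step_labels]


# Toy solution to "simplify 16/64": invalid reasoning, correct final answer
steps = [
    "Write the fraction as 16/64.",
    "Cancel the digit 6 from top and bottom to get 1/4.",
]
step_labels = [True, False]  # hand-labeled for illustration only

print(outcome_reward("1/4", "1/4"))   # the outcome signal sees nothing wrong
print(process_rewards(step_labels))   # the step-level signal flags the second step
```

The outcome signal rewards this solution fully, while the per-step signal exposes the flawed step, which is the inconsistency that process supervision is meant to catch.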

Large Language Models For Instruction Fine-Tuning.

The initial endeavors in instruction-following training primarily focused on enhancing the language model’s capacity for generalization across diverse tasks. This often involves fine-tuning on a large number of available Natural Language Processing datasets and evaluating on different NLP tasks. T5 made the earliest attempts to train on a range of NLP tasks, including Question Answering, Document Summarization, and Sentiment Classification, by employing a consistent prompt format across all the data. Subsequently, instruction fine-tuning work such as FLAN, ExT5, T0, UnifiedQA, ZeroPrompt, and FLAN-T5 emerged to adapt to a large number of downstream tasks. To address the challenge of misalignment between model outputs and human requirements, OpenAI manually annotates the instruction library to construct a diverse range of tasks. Simultaneously, Reinforcement Learning from Human Feedback is employed, which facilitates the rapid development of LLMs such as InstructGPT, ChatGPT, and GPT-4. To reduce manual involvement, self-instruct improves instruction-following through self-generated instructions. Alpaca used a dataset of 50k instructions generated from a limited (e.g., 175 samples) seed set of manually written instructions. Vicuna used 70k user-shared conversations with ChatGPT collected from ShareGPT.com. Meanwhile, WizardLM introduces the Evol-Instruct approach, which seeks to refine the existing instruction data by enhancing both its complexity and diversity.

Conclusion and Future Work

This paper introduces WizardMath, a mathematics model fine-tuned with RLEIF. The experimental results demonstrate that WizardMath achieves SOTA performance, surpassing all existing open-source LLMs on two widely recognized mathematical reasoning benchmarks: GSM8k and MATH. Furthermore, WizardMath outperforms some of the largest closed-source LLMs, including ChatGPT, GPT-3.5, Claude Instant, PaLM-2, PaLM-1, and Minerva, on the GSM8k benchmark.

Although WizardMath achieves impressive mathematics performance, as depicted in Figure 2, our model still falls significantly behind the SOTA LLMs GPT-4 and Claude-2. Therefore, future work will prioritize improving RLEIF or developing better methods to further raise the performance of our model.

Broader Impact.

Like other LLMs, WizardMath may sometimes generate unethical, harmful, or misleading information. Therefore, future research is needed to address the ethical and societal implications.