WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang

Introduction

Recently, Large Language Models (LLMs) have garnered significant attention and demonstrated impressive success. Notably, OpenAI’s ChatGPT stands out as a prominent example. Leveraging extensive pre-training on vast amounts of internet data and further fine-tuning with detailed instruction data , these models have achieved state-of-the-art (SOTA) zero-shot performance across diverse tasks. This trend is also observed in the domain of code understanding and generation. Numerous Code LLMs have been proposed to tackle the challenges associated with code-related tasks. These Code LLMs undergo pre-training using substantial amounts of code data, enabling them to excel in various code-related tasks, showcasing impressive performance.

In contrast to most previous Code LLMs that primarily emphasize the pre-training process, there has been limited exploration of fine-grained instruction tuning in the Code domain. The introduction of instruction tuning initially aimed to enhance the generalization capabilities of LMs across different tasks . OpenAI’s InstructGPT , for instance, involved soliciting human annotators to provide explicit instructions to ensure alignment with users’ intentions. Similarly, recent works such as Alpaca employed the self-instruct method, where ChatGPT generated the instruction data. Vicuna utilized user-shared conversations collected from ShareGPT.com. WizardLM introduced the Evol-Instruct method, which involved evolving existing instruction data to generate more complex and diverse datasets. However, it is worth noting that all these approaches primarily focused on the general domain and lacked specific design considerations for the code domain.

Motivated by the Evol-Instruct method, this study aims to enhance the capabilities of the SOTA open-source Code LLM, StarCoder , by generating intricate code instruction data through code-specific Evol-Instruct. To achieve this, we have made several adaptations to the evolutionary prompt process tailored specifically for code-related tasks. These modifications include refining the evolutionary instructions, simplifying the form of evolutionary prompts, and incorporating code debugging and time-space complexity constraints. Initially, our method is applied to evolve the basic code instruction data, Code Alpaca . Subsequently, we conduct fine-tuning of StarCoder using our newly created code instruction-following training set and obtain our WizardCoder.

The experimental results obtained from four code generation benchmarks, namely HumanEval , HumanEval+ , MBPP , and DS-100 , demonstrate that our WizardCoder outperforms all other open-source Code LLMs, achieving state-of-the-art (SOTA) performance. Specifically, we observe a substantial improvement in pass@1 scores, with an increase of +22.3 (57.3 vs. 35.0) in HumanEval and +8.2 (51.8 vs. 43.6) in MBPP. Remarkably, despite its much smaller size, our WizardCoder even surpasses Anthropic’s Claude and Google’s Bard in terms of pass rates on HumanEval and HumanEval+.

The contributions of this work can be summarized as follows:

We introduce WizardCoder, which enhances the performance of the open-source Code LLM, StarCoder, through the application of Code Evol-Instruct.

WizardCoder surpasses all other open-source Code LLMs by a substantial margin in terms of code generation, including StarCoder, CodeGen, CodeGee, CodeT5+, InstructCodeT5+, StarCoder-GPTeacher, and Instruct-Codegen-16B.

WizardCoder achieves superior results in code generation compared to the largest closed-source LLMs, such as Claude, Bard, PaLM, PaLM-2, and LaMDA, despite being considerably smaller in size.

Related Work

Recently, LLMs have demonstrated remarkable achievements across a broad spectrum of tasks. Prominent tech companies have made significant strides in developing highly proficient LLMs. These include OpenAI’s GPT3&4 , Google’s PaLM , and Bard111https://bard.google.com/, DeepMind’s Chinchilla , and Gopher , as well as Anthropic’s Claude222https://www.anthropic.com/index/introducing-claude. However, it is important to note that these models are closed-source and can only be accessed through specific APIs or may not be accessible at all.

The AI community has witnessed the release of several open-source LLMs, where the model weights are made publicly available. EleutherAI has contributed GPT-NeoX-20B and GPT-J-6B . Google has released UL2-20B . Tsinghua University has introduced GLM-130B . Meta has released OPT and LLaMA . It is worth noting that while these open-source models have made valuable contributions, they generally do not exhibit the same level of performance as their closed-source counterparts.

Large Language Models for Code.

Recent studies have introduced a significant number of LLMs for code-related tasks to address the challenges of code understanding and generation. OpenAI has unveiled Codex and Code-Davinci . Google has proposed PaLM-Coder . They perform outstandingly on the popular code completion benchmarks, like HumanEval and MBPP . However, these models are closed-source.

On the other hand, there are several open-source Code LLMs available. Salesforce has introduced CodeGen , CodeT5 , and CodeT5+ . Tsinghua University has contributed CodeGeeX , and the BigCode Project has developed StarCoder . These models have demonstrated notable advancements in code-related tasks. However, when compared to the SOTA closed-source models, they still lag behind significantly. In contrast to the aforementioned models without instruction fine-tuning, our work demonstrates that further training Code LLMs with Code Evol-Instruct can substantially enhance performance.

Instruction Fine-Tuning.

The primary objective of instruction fine-tuning in its early stages was to enhance the cross-task generalization capabilities of LMs. This was achieved by fine-tuning LMs with a substantial corpus of public NLP tasks. T5 was among the first models to explore this approach, training on a multitude of supervised text-to-text tasks. Subsequent works such as FLAN , ExT5 , T0 , and UnifiedQA further expanded the range of tasks to bolster the overall generalization ability of LMs. Notably, ZeroPrompt and FLAN-T5 pushed the envelope by incorporating thousands of tasks in their training pipelines. Across these studies, a consistent finding emerges: fine-tuning LMs with diverse NLP task instructions yields significant performance improvements when applied to new tasks.

While fine-tuning LMs with diverse NLP tasks has shown promising results, it often falls short in aligning with the intentions of real-world users. OpenAI has pursued a different approach by soliciting human annotators to provide a large corpus of human instructions, encompassing diverse forms and a wide range of task types. Building upon this dataset, OpenAI trained its GPT3 model to create InstructGPT , which better aligns with users’ inputs. This line of development has even led to the impressive work known as ChatGPT. However, it is important to note that the dataset and model weights associated with these advancements are not publicly available. Alpaca takes a different route by adopting the self-instruct method , leveraging ChatGPT to generate data for training. Vicuna utilizes user-shared conversations collected from ShareGPT.com to train its models. WizardLM introduces the Evol-Instruct method, which involves evolving existing instruction data to generate more complex and diverse datasets. In contrast to these general instruction fine-tuning approaches, our WizardCoder successfully applies the Evol-Instruct method specifically in the domain of Code LLMs.

Approach

In this section, we elaborate on the methodological details of WizardCoder. Following WizardLM, we apply the Evol-Instruct method to evolve Code Alpaca generated using self-instruct and fine-tune the pre-trained Code LLM StarCoder with the evolved data.

Inspired by the Evol-Instruct method proposed by WizardLM, this work also attempts to make code instructions more complex to enhance the fine-tuning effectiveness of code pre-trained large models. To adapt Evol-Instruct to the realm of code, we made the following modifications to the evolutionary prompt:

Streamlined the evolutionary instructions by removing deepening, complicating input, and In-Breadth Evolving.

Simplified the form of evolutionary prompts by unifying the evolutionary prompt template.

Addressing the specific characteristics of the code domain, we added two evolutionary instructions: code debugging and code time-space complexity constraints.

The unified code evolutionary prompt template is as follows:

Please increase the difficulty of the given programming test question a bit. You can increase the difficulty using, but not limited to, the following methods: {method} {question} Here, $\{$ question $\}$ represents the current code instruction awaiting evolution, and $\{$ method $\}$ is the type of evolution. The five types we used are listed as follows:

2 Training WizardCoder

We employ the following procedure to train WizardCoder. Initially, we utilize StarCoder 15B as the foundation and proceed to fine-tune it using the code instruction-following training set, which was evolved through Evol-Instruct. The prompt format for fine-tuning is outlined as follows:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: {instruction} ### Response: To construct the training dataset, we initialized it with the 20K instruction-following dataset called Code Alpaca333https://github.com/sahil280114/codealpaca. We iteratively employ the Evol-Instruct technique on this dataset consisting of 20,000 samples to produce evolved data. After each round of data evolution, we merge the evolved data from all previous rounds with the original dataset to finetune StarCoder and assess the pass@1 metric on HumanEval . Once we observe a decline in the pass@1 metric, we will discontinue the usage of Evol-Instruct and choose the model with the highest pass@1 as the ultimate model.

Experiment

This section begins by providing a comprehensive overview of the baseline models in our experiments. Subsequently, we present the performance of our models on four code generation benchmarks: HumanEval , HumanEval+ , MBPP , and DS-1000 .

Multiple technology companies have successfully developed highly proficient LLMs while choosing not to publicly release them. These models are referred to as closed-source models. For our research, we incorporate a substantial number of these models as our baselines. Specifically, our baselines encompass the following: (i) OpenAI’s GPT3.5&4 , Code-Davinci-002 , Code-Cushman-001 , and Codex ; (ii) Google’s Bard, PaLM 2 , PaLM , and LaMDA ; (iii) Google DeepMind’s AlphaCode ; and (iv) Anthropic’s Claude.

Open-Source Models.

Several open-source LLMs have been made available to the AI community, although their performance generally lags behind the closed-source models a lot. As part of our research, we incorporate a significant number of these open-source models as our baselines. Our baselines encompass the following models: StarCoder , LLaMa , CodeGen , CodeGeeX , CodeT5+, and InCoder. In addition, we also include several models with instructions fine-tuning, including StarCoder-GPTeacher,444https://huggingface.co/GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct Instruct-Codegen-16B,555https://huggingface.co/sahil2801/instruct-codegen-16B Guanaco-65B,666https://huggingface.co/TheBloke/guanaco-65B-HF and Falcon-40B-Instruct.777https://huggingface.co/tiiuae/falcon-40b-instruct

2 Implementation Details

The StarCoder serves as our basic foundation model. The evolved dataset consists of approximately 78k samples. To fine-tune the basic models, we employ specific configurations, including a batch size of 512, a sequence length of 2048, 200 fine-tuning steps, 30 warmup steps, a learning rate of 2e-5, a Cosine learning rate scheduler, and fp16 mixed precision.

3 Evaluation on HumanEval, HumanEval+, and MBPP

HumanEval , HumanEval+ and MBPP are extensively utilized benchmarks within the field of Code LLMs. These benchmarks encompass a vast collection of Python programming problems, employing test cases to validate the code generated by Code LLMs. HumanEval consists of 164 original programming problems, with an average of 9.6 test cases allocated to each problem. To ensure a thorough assessment of the functional correctness of LLM-synthesized code, HumanEval+ extends the number of test cases significantly, averaging at 774.8 test cases per problem. On the other hand, MBPP offers a set of 500 test programming problems, accompanied by three automated test cases per problem. The prompt format for these tasks is as follows:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Create a Python script for this problem: {Question} ### Response: Comparing with the Closed-Source Models. The SOTA LLMs for code generation, such as GPT4, Claude, and Bard, are predominantly closed-source. Acquiring access to the APIs of these models proves challenging. In this study, we adopt an alternative approach by retrieving the scores for HumanEval and HumanEval+ from the LLM-Humaneval-Benchmarks . Notably, all the mentioned models generate code solutions for each problem utilizing a single attempt, and the resulting pass rate percentage is reported. To maintain consistency, we employ the same experimental setup by generating answers using greedy decoding and evaluate our WizardCoder using the provided evaluation codes. By adhering to these standardized procedures, we aim to ensure fair and comparable evaluations of our model against existing benchmarks.

As depicted in Figure 1, our WizardCoder attains the third position in this benchmark, surpassing Claude-Plus (59.8 vs. 53.0) and Bard (59.8 vs. 44.5). Notably, our model exhibits a substantially smaller size compared to these models. Furthermore, our WizardCoder demonstrates a remarkable superiority over other open-source LLMs that undergo instruction fine-tuning, showcasing a significant performance margin.

Comparing with the Open-Source Models.

In Table 1, we conduct a comprehensive comparison of our WizardCoder with other open-source models on the HumanEval and MBPP benchmarks. In contrast to the results presented in Figure 1, we adhere to the approach outlined in previous studies by generating n samples for each problem to estimate the pass@1 score. The findings presented in Table 1 clearly demonstrate that our WizardCoder exhibits a substantial performance advantage over all the open-source models.

From the experimental results in Figure 1 and Table 1, we have the following conclusions:

WizardCoder outperforms the largest closed-source LLMs, including Claude, Bard, PaLM, PaLM-2, and LaMDA, despite being significantly smaller.

WizardCoder outperforms all the open-source Code LLMs by a large margin (+22.3 on HumanEval), including StarCoder, CodeGen, CodeGee, and CodeT5+.

WizardCoder significantly outperforms all the open-source Code LLMs with instructions fine-tuning, including InstructCodeT5+, StarCoder-GPTeacher, and Instruct-Codegen-16B.

4 Evaluation on DS-1000

The DS-1000 benchmark comprises 1,000 distinct data science workflows spanning seven libraries. It assesses the performance of code generations against test cases and supports two evaluation modes: completion and insertion. In our experiments, we only report insertion scores for models that support. The DS-1000 benchmark further classifies problems based on the libraries employed, including Matplotlib (plt), NumPy (np), Pandas (pd), SciPy (scp), Scikit-Learn (sk), PyTorch (py), and TensorFlow (tf). We follow the same prompt format as StarCoder. In Table 2, we present pass@1 (n=40) results for each library, along with an overall score. Based on these results, our conclusion is that WizardCoder demonstrates a significant superiority over all other models when tackling data science problems on the DS-1000 benchmark. This observation holds true across nearly all data science libraries.

5 Ablation Study

Figure 2 presents an ablation study investigating the impact of the number of data evolution rounds. The first round of evolved data contains 38k samples. The second round contains 58k. The third round contains 78k. The fourth round contains 98k. For consistency, all models undergo fine-tuning with 200 steps. The results reveal that the highest pass@1 score on humaneval is achieved after three rounds of data evolution. Based on this observation, we select the data that evolved during the third round as the ultimate dataset.

6 Examples

Table 3 showcases examples of interactions with our WizardCoder. The examples demonstrate that our model consistently generates accurate responses accompanied by clear explanations.

Conclusion and Future Work

This paper introduces WizardCoder, a Code Evol-Instruct fine-tuned Code LLM. The experimental results demonstrate that WizardCoder achieves SOTA performance surpassing all existing open-source Code LLMs on four widely recognized code generation benchmarks: HumanEval, HumanEval+, MBPP, and DS-1000. Furthermore, WizardCoder exhibits superior performance compared to the largest closed LLMs, including Anthropic’s Claude and Google’s Bard.

Although our WizardCoder demonstrates impressive coding performance, as depicted in Figure 1, our model still falls significantly behind the SOTA LLM, GPT4. Therefore, future work will prioritize the enhancement of the Code Evol-Instruct method to further augment the performance of our model.

Broader Impact.

Similar to the other LLMs, our WizardCoder could also generate unethical, harmful, or misleading information. Therefore, future research to address the ethical and societal implications is needed.