OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue

Introduction

Code generation has been a pivotal challenge within computer science for several decades. Recently, the landscape of code generation has been revolutionized by the advent of large language models (LLMs) pre-trained on extensive code corpora (Nijkamp et al., 2022; Christopoulou et al., 2022; Zheng et al., 2023; Li et al., 2023a; Wang et al., 2023c; Roziere et al., 2023; Guo et al., 2024). These models have showcased remarkable capabilities in generating code that accurately aligns with user intents, thus providing substantial support for software development (GitHub, 2023).

To unleash the capabilities of pre-trained code models, instruction-tuning methods have been developed. For instance, CodeAlpaca (Chaudhary, 2023) comprises 20K code instructions automatically generated by applying self-instruct (Wang et al., 2023b) to ChatGPT, utilizing 21 seed tasks as the foundation. To further refine the coding proficiency of LLMs, Luo et al. (2023) introduces Code Evol-Instruct, a method that applies a variety of heuristics to enrich the complexity of initial code instructions, building upon the dataset provided by CodeAlpaca. Meanwhile, MagicCoder (Wei et al., 2023) employs a robust LLM to generate novel coding challenges, sourcing inspiration from a diverse range of open-source code snippets. Additionally, WaveCoder (Yu et al., 2023) implements an LLM generator-discriminator framework for creating code instruction data, offering customization and control over the data generation process.

Despite these advancements, current code models are constrained by their capacity to utilize feedback for refinement. Essentially, feedback can have two forms: (1) execution feedback, which includes execution outputs and diagnostics, and (2) human feedback, comprising follow-up guidance or instructions from users. Execution feedback plays a vital role in enabling models to rectify syntactic and logical errors, and human feedback aids models in better understanding user instructions, facilitating the generation of solutions that more closely align with user expectations.

To address these challenges, we propose OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. OpenCodeInterpreter is trained on our constructed Code-Feedback dataset, which features 68K multi-turn interactions between users, code models, and compilers. OpenCodeInterpreter uniquely integrates both execution and human feedback, employing compiler diagnostics to rectify errors and human insights to refine code generation. This approach allows OpenCodeInterpreter to produce solutions that are both technically sound and closely matched to user requirements, significantly boosting its overall performance.

Our thorough evaluation of OpenCodeInterpreter on widely recognized benchmarks, such as HumanEval Chen et al. (2021), MBPP Austin et al. (2021), and their augmented counterparts from EvalPlus Liu et al. (2023), highlights its superior ability to generate and iteratively refine code, achieving exemplary standards of quality and functionality. Remarkably, OpenCodeInterpreter-33B secures an impressive accuracy of 83.2 (76.4) on the average (and plus versions) of HumanEval and MBPP, showcasing performance on par with GPT-4’s 84.2 (76.2). Furthermore, when augmented with synthesized human feedback from GPT-4, OpenCodeInterpreter’s performance notably increases to 91.6 (84.6). OpenCodeInterpreter thereby establishes a new benchmark in code generation, effectively narrowing the performance gap between open-source models and sophisticated proprietary systems like the GPT-4 Code Interpreter.

Code-Feedback

In this section, we detail the creation of our code instruction tuning dataset, Code-Feedback (Figure 2), designed to train OpenCodeInterpreter. Code-Feedback is crafted to meet specific criteria: 1) Diverse and challenging real-world queries: The dataset should encompass a wide range of queries derived from real-world coding tasks, presenting both diversity and complexity. 2) Multi-turn dialogue structure: Code-Feedback is structured as multi-turn dialogues, incorporating two types of feedback: execution feedback, which includes outputs and diagnostics from compilers, and human feedback, consisting of additional guidance or instructions from users. 3) Interleaved text and code responses: Each response is expected to provide responses that blend natural language explanations with code snippets, offering a holistic approach to solving coding queries.

To assemble a dataset that fulfills these desiderata, we have employed five distinct methods. Examples of these five categories can be found in Appendix E. The sources of our queries fall into two main categories: a variety of open-source datasets and coding challenges from LeetCode. In the next subsections, we will discuss how we develop data construction methods to meet the three aforementioned criteria from the two data sources.

We have aggregated 287k queries from four distinguished open-source code instruction tuning datasets: Magicoder-OSS-Instruct111hf.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K, Python code subset of ShareGPT222hf.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT, Magicoder-Evol-Instruct333hf.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K, and Evol-Instruct-Code444hf.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1. To refine this extensive collection and isolate the most intricate and informative instructions, we employ a very capable open-source chat model, Qwen-72B-Chat Bai et al. (2023), for a selective filtering process. This involves the LLM assessing each code query and its corresponding response within the compiled datasets on a complexity score from 1 to 5. Only the most challenging queries, with ratings of 4 or 5, were retained for our seed set, ensuring a focus on the most difficult instructions. To guarantee the robustness of our selection, this filtering operation is repeated with two distinct prompts (detailed in Appendix A), thereby solidifying the complexity of our final query selection. This meticulous process resulted in 156k high-quality single-turn code instructions as the challenging query pool. Detailed statistics of this data compilation are provided in Appendix A.

Subsequently, we describe three methods employed to transform this curated single-turn data into multi-turn dialogues enriched with both execution and human feedback.

Singe-turn Packing. A direct approach to crafting multi-turn data is to group single-turn query-response pairs into multi-turn formats. Inspired by in-context pre-training techniques Shi et al. (2023), which consolidate similar sequences to foster model learning of dependencies among related documents, we merge similar single-turn query-response pairs to form multi-turn dialogues.

Utilizing the BERT-base embedding Devlin et al. (2019), we convert queries into vectorized representations. For each query, the kk-nearest neighbors algorithm is employed to identify its four closest counterparts. From these, we randomly select two or three to assemble multi-turn sequences. To maintain data uniqueness, once a query is chosen as a neighbor, it is exempt from future selections as a neighboring query, ensuring no single instruction is repeated across the dataset. Should a query’s potential neighbors have been previously utilized, that query is bypassed. This method results in the creation of 16.6K multi-turn instances derived from 105K single-turn instances.

Interaction Simulation. Gathering authentic human interaction data poses significant challenges. To replicate a realistic code interpreter usage scenario, we developed a simulator using GPT-3.5 and GPT-4. For each selected query, GPT-3.5 first generates a preliminary response from which we extract the code snippet and execute it. The outcome of this execution, along with any compiler diagnostics, is then fed into GPT-4 to elicit a follow-up response. This cycle is repeated until GPT-4 delivers what it deems a correct solution or until a maximum of three iterations is reached.

Subsequently, we introduce simulated human feedback into the interaction. We predefine ten common feedback categories, including issues related to syntax and formatting, efficiency, functionality, clarity, bugs, security, compatibility, resource use, scalability, and best practices, with each category detailed in Appendix B. GPT-4 is then prompted to select the most relevant feedback for the scenario and generate appropriate responses within that feedback category. By incorporating this simulated feedback into the dialogue history, GPT-4 is encouraged to refine its solutions further, mimicking intricate user-model exchanges and demonstrating self-correction in response to human input. Through this simulation approach, we have constructed 51K examples, effectively capturing the nuanced dynamics of user interactions and feedback-driven solution refinement.

Code Correction. To boost the model’s error-handling capabilities, we include a focused stage in our data compilation that generates 500 specific error correction interactions. We initiate this by prompting GPT-4 to intentionally produce incorrect code snippets, as outlined in Appendix B. The model then uses the error messages from executing these snippets as cues for corrections. This approach mirrors the real-life coding cycle, where developers continuously debug and refine their code, thus enriching our dataset with a broad spectrum of error correction examples. Following this, we replace the initial prompts that resulted in incorrect code with the ones that encourage the generation of correct code outputs. This method ensures the model learns from both successful code generation and error identification and correction, significantly enhancing its problem-solving skills and understanding of the debugging process.

2 Coding Challenges from LeetCode

LeetCode Similar Problem. Drawing inspiration from the practice among programmers of honing their skills through LeetCode challenges, we gather similar LeetCode questions and their solutions from the TACO dataset Li et al. (2023b). LeetCode555https://leetcode.com/problemset/ categorizes related questions through tags, facilitating the extraction of connected problems. TACO ensures the LeetCode dataset is cleansed to prevent any unintended impact on other task datasets, such as HumanEval and MBPP. By amalgamating associated LeetCode questions, we compile 303 multi-turn instances, enriching the dataset with varied coding challenges.

LeetCode Follow-up Question. We further delve into the LeetCode dataset to isolate solutions to identical questions that differ in time or space complexity or are implemented in various programming languages. This process of aggregating diverse solutions to the same LeetCode questions yields 200 multi-round instances, showcasing alternative problem-solving approaches.

Given the original LeetCode solutions often lack comprehensive natural language explanations, we engage GPT-4 to enhance these solutions with integrated text explanations and code snippets, standardizing all instances into a consistent format. The specific prompts used to guide GPT-4 in this enrichment process are detailed in Appendix C, ensuring clarity and educational value in the responses.

Experimental Setup

Training Setup. We select two capable base models CodeLlama Roziere et al. (2023) and DeepSeekCoder Guo et al. (2024) varying capacities to illustrate the dataset’s universal applicability and benefits across different scales (7B, 13B, 34B, 70B). We maintain uniform hyperparameter configurations across all models. We fine-tune the base models for 3 epochs. The learning rate is set as 2e-5 with a 0.05 warm-up ratio and a cosine scheduler. We impose a token cutoff length of 4096 to maintain consistency in the input size.

To optimize the fine-tuning process, we strategically combine high-quality single-turn data from the WizardCoder 110k dataset with our Code-Feedback at a ratio of 2:1. Blending with single-turn high-quality data may further boost the coding ability. This blend is carefully selected and more details are discussed in Table 2.

Evaluation Setup. Our evaluation framework primarily leverages HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), two benchmarks renowned for their rigorous testing of code generation capabilities. Acknowledging the limitations of their original test suites in covering all edge cases (Liu et al., 2023), we further incorporate their extended versions, HumanEval+ and MBPP+, utilizing the EvalPlus framework (Liu et al., 2023) for a more comprehensive assessment.

In line with best practices outlined in recent studies (Liu et al., 2023; Chen et al., 2023), OpenCodeInterpreter’s solutions are generated via greedy decoding. For comparisons involving GPT-3.5 Turbo (OpenAI, 2022) and GPT-4 Turbo (OpenAI, 2023), we maintain a temperature setting of 0. EvalPlus’s unified sanitizer tool post-processes these solutions, which are then evaluated across the four benchmarks using EvalPlus’s toolset.

For single-turn code generation, we craft a simple instruction to encapsulate the original prompt, forming a new input for the model. The exact prompts are detailed in Appendix D, and we assess the model’s performance using the pass@1 metric, as per EvalPlus’s guidelines.

Our analysis extends to multi-turn pass rates to explore OpenCodeInterpreter’s proficiency in refining code through iterative feedback. This aspect of the evaluation draws on execution results and synthetic human feedback, generated by GPT-4 (OpenAI, 2023), to simulate real-world coding scenarios and interactions. Specifically, the multi-turn evaluation encompasses three scenarios, offering a holistic view of OpenCodeInterpreter’s capabilities in dynamic code refinement:

Execution Feedback: Here, OpenCodeInterpreter independently leverages execution outcomes and compiler diagnostics to pinpoint and correct errors, mirroring a developer’s process of refining code based on direct execution feedback.

Synthetic Human Feedback: In this scenario, GPT-4 generates feedback that mimics human input by considering the task description, initial model response, and any execution feedback. This tests OpenCodeInterpreter’s adaptability to nuanced, human-like feedback, reflecting real-world developer or user interactions.

Synthetic Human Feedback (Oracle): Building on the previous scenario, GPT-4 also accesses the ground-truth solution, offering insight into OpenCodeInterpreter’s optimal performance in code refinement when guided by precise feedback.

For each task, the code generation and evaluation process concludes either when the model’s solution successfully passes the evaluation or when it reaches the set maximum of two rounds. If a code sample fails the evaluation, both the solution and the test results are reincorporated into the prompt for refinement. The evaluation identifies three principal scenarios for non-passing outcomes: 1) Exception Handling: Captures and relays any exceptions or errors encountered during execution as error messages, providing direct feedback for correction. 2) Not-Expected: In instances where outputs deviate from expected results, the model receives feedback including test inputs, expected outputs, and actual outputs, highlighting the discrepancy. 3) Timeout Handling: Implements a timeout threshold to prevent evaluation delays from solutions with excessive or infinite runtimes. Exceeding this threshold triggers an "Execution timed out" notification.

Main Results

This section reports OpenCodeInterpreter and baselines in single-turn and multi-turn code generation settings. The results are in Table 1.

We compare OpenCodeInterpreter’s single-turn code generation performance against premier models such as GPT-3.5/4-Turbo OpenAI (2022, 2023), CodeLlama-Python (Roziere et al., 2023), WizardCoder (Luo et al., 2023), Deepseek-Coder Guo et al. (2024), CodeT5+ (Wang et al., 2023c) across different scales. Leveraging data from the EvalPlus leaderboard as of February 10th, 2024, we examine OpenCodeInterpreter’s achievements on the HumanEval and MBPP benchmarks, as well as their advanced versions, HumanEval+ and MBPP+. For straightforward comparisons, we consolidate results across different model scales into one table, facilitating direct performance comparisons between each model scale and the respective variants of OpenCodeInterpreter.

Our experimental analysis reveals OpenCodeInterpreter’s strong performance, with several configurations matching or surpassing leading benchmarks. The OpenCodeInterpreter-DS 33B variant achieves the highest scores among open-source models. This accomplishment is remarkable, especially considering the significant presence of low-quality or incorrect data in the initial training set.

2 Results of Multi-turn Code Generation

This section evaluates the proficiency of OpenCodeInterpreter in multi-turn interactions through iterative refinement, leveraging interpreter diagnostics and human insights.

Our experimental evaluation imposes a two-round limit on iterations to maintain fairness and consistency across tasks. While some issues may benefit from multiple refinements, others require fewer. This limitation offers clear insights into the model’s iterative capabilities. In the execution feedback scenario, our models across all scales exhibited superiority over state-of-the-art (SOTA) benchmarks, with the OpenCodeInterpreter 33B model achieving parity with GPT-4 Turbo’s single-round score, thus establishing a new SOTA benchmark among the evaluated code models.

Due to budget constraints, our Human Feedback and Human Feedback (Oracle) assessments concentrate on the OpenCodeInterpreter 6.7B and OpenCodeInterpreter 33B models. The outcomes reveal that with Human Feedback, the OpenCodeInterpreter 6.7B model significantly outperformed GPT-4 Turbo’s single-round score, while in the Human Feedback (Oracle) scenario, the OpenCodeInterpreter 33B model’s average score notably exceeded the 90 benchmark in the HumanEval/MBPP benchmarks. These results highlight the significant role of iterative feedback and refinement in advancing code generation models, establishing OpenCodeInterpreter as a leader in software development tools. Through this refined approach, OpenCodeInterpreter not only demonstrates its remarkable adaptability and code refinement based on diverse feedback but also sets a new benchmark for future code generation technologies.

3 Ablations of Data Sources

This section systematically explores the impact of various data sources on the performance of OpenCodeInterpreter. We conduct a series of ablation studies to evaluate the influence of high-quality single-turn data and diverse multi-turn feedback mechanisms on the model’s code generation, debugging, and refinement capabilities.

Impact of High-Quality Single-Turn Data. To evaluate the effect of high-quality single-turn data on OpenCodeInterpreter’s efficacy, we incorporate the WizardCoder 110K3 dataset, renowned for its syntactic accuracy and logical coherence, into our extensive multi-turn dataset. This integration seeks to identify the optimal mix of precise, single-turn code generation and the advanced, iterative refinement enabled by multi-turn interactions.

Our experiments employ a soft-target fine-tuning strategy across six configurations, varying the proportion of WizardCoder 110K data in our multi-turn dataset. These configurations span from full incorporation to total exclusion of the WizardCoder dataset, assessing the performance of the model in two versions: DeepSeekCoder-Base-6.7B and DeepSeekCoder-Base-33B.

Our findings are illustrated in Table 2. It shows that incorporating high-quality single-turn data (e.g., WizardCoder dataset) significantly improves our model’s multi-turn performance. This strategic incorporation ensures that the model benefits from the syntactic accuracy and logical coherence inherent in single-turn tasks, thereby enriching its capacity for nuanced, iterative refinement in subsequent turns. It reveals the critical role of high-quality single-turn inputs in setting the stage for more effective multi-turn code generation and refinement.

Benefits of Diverse Multi-Turn Data Sources. Following the enhanced baseline established by fully integrating the WizardCoder dataset, this subsection investigates the advantages of different data sources on the model’s refinement and debugging efficacy. We add diverse data sources to our training regimen, including Single-turn Packing, Interaction Simulation, and Code Correction Data, both individually and in combination.

The use of these multi-turn data sources, including Single-turn Packing, Interaction Simulation, and Code Correction Data, individually and in combination, demonstrably enhances OpenCodeInterpreter ’s debugging and refinement functions. Notably, the inclusion of Code Correction Data significantly elevates the model’s efficiency in correcting errors. This underscores the profound impact of a varied and targeted training approach on advancing the capabilities of sophisticated code generation models. Such an approach enables these models to more effectively address complex coding challenges, correct errors, and refine outputs via extensive feedback mechanisms.

4 Case Study: Coding Queries in the Wild

This section delves into three distinct case studies to demonstrate OpenCodeInterpreter’s operational dynamics when faced with “wild” user queries. The motivation behind these case studies is to showcase the practical applications of OpenCodeInterpreter.

In a notable success story (Figure A8), we tasked OpenCodeInterpreter with developing a function to calculate all prime numbers within the 1-100 range, later extending the solution to any arbitrary range x-y. Another commendable instance (Figure A9) involved OpenCodeInterpreter implementing a Python function to validate IPv6 addresses using regular expressions. Demonstrating its capability to iteratively refine its approach, OpenCodeInterpreter not only identified and corrected errors but also enhanced the solution based on human feedback. These two cases exemplify OpenCodeInterpreter’s strength in understanding mathematical logic and dynamically adjusting algorithms to meet specified criteria.

A challenging case (Figure A10) arose when OpenCodeInterpreter was asked to design a function identifying the intersection of two input lists, returning tuples of distinct elements present in both lists alongside their occurrence frequencies. Despite OpenCodeInterpreter’s attempts at correction, it addressed errors incrementally, ultimately exceeding the maximum number of attempts (three). This case sheds light on OpenCodeInterpreter’s limitations in simultaneously tackling multiple challenging errors.

Through these case studies, we gain invaluable insights into OpenCodeInterpreter’s capabilities and limitations. These insights are crucial for guiding future enhancements to OpenCodeInterpreter.

Related Work

LLMs for Code. It becomes a common practice to include code data for pre-training LLMs. For example, 5% of PaLM’s (Chowdhery et al., 2023) pre-training data is code, and this ratio for LaMDA (Thoppilan et al., 2022), Galactica (Taylor et al., 2022), LLaMA (Touvron et al., 2023), Gopher (Rae et al., 2021), GPT-NeoX (Black et al., 2022) is 13%, 7%, 5%, 3%, and 8%, respectively.

Additionally, specialized LLMs have been pre-trained for generating code, e.g., CodeGen (Nijkamp et al., 2022), PanGu-Coder (Christopoulou et al., 2022), CodeGeeX (Zheng et al., 2023), CodeFuse (Di et al., 2023), CodeT5+ (Wang et al., 2023d), AlphaCode (Li et al., 2022), InCoder (Fried et al., 2022), StarCoder (Li et al., 2023a), DeepSeek-Coder (Guo et al., 2024). On the other hand, code LLMs can be fine-tuned from general-purpose LLMs, e.g., CodeLlama (Roziere et al., 2023), WizardCoder (Luo et al., 2023), which is the approach we take here. Compared to specialized LLMs, the fine-tuning paradigm enables us to explore ways to improve code generation capabilities by leveraging pre-trained general-purpose LLMs, especially because these LLMs have already been trained on an extensive amount of code data.

Iterative Code Generation and Refinement. For many sequence generation tasks, iterative approaches are often taken to improve the generation quality, e.g., script generation (Tandon et al., 2021), summarization (Scheurer et al., 2022), and other tasks as shown in (Madaan et al., 2022; Saunders et al., 2022). Notably, in Self-Refine (Madaan et al., 2023), an LLM generates feedback after generating initial outputs, and the LLM iteratively updates the outputs with the feedback. Whereas it focuses on a general-purpose LLM setting, we focus on code generation tasks. As for code generation with LLMs, DebugBench (Tian et al., 2024) observes that incorporating runtime feedback improves code LLMs’ debugging performance. A most recent and relevant work is StepCoder (Dou et al., 2024), where, following the paradigm of relying on reinforcement learning with compiler feedback (Le et al., 2022; Shojaee et al., 2023), the authors further divide the original exploration problems into a sequence of easier sub-tasks. However, our approach does not rely on reinforcement learning and has access to the intermediate generation, which makes the training easier and more stable.

Conclusion

In conclusion, OpenCodeInterpreter represents a significant leap forward in the field of code generation, bridging the previously identified gap between open-source models and the advanced capabilities of proprietary systems like the GPT-4 Code Interpreter. By integrating compiler diagnostics and human feedback into an iterative refinement process, OpenCodeInterpreter not only surpasses traditional one-off generation approaches but also introduces a level of adaptability and precision previously unseen in open-source models. The introduction of Code-Feedback, with its extensive multi-turn interactions, further empowers OpenCodeInterpreter to dynamically refine code in response to evolving user intents and complex coding tasks.

Ethics Statement

The development and deployment of OpenCodeInterpreter, alongside the use of Code-Feedback, take ethical considerations to ensure responsible usage. We have made efforts to ensure that the dataset represents a diverse range of coding styles, problem domains, and user scenarios to prevent the propagation of biased or unfair outcomes. Given that OpenCodeInterpreter can generate and refine code based on user inputs, we strictly check out the dataset to ensure that it does not expose sensitive information or create security vulnerabilities. OpenCodeInterpreter has the potential to democratize coding by lowering the barrier to entry for non-experts and developers. We open-source all our code, models, and datasets to maximize accessibility.

Limitations

While OpenCodeInterpreter introduces significant advancements in automated code generation, it is important to acknowledge the limitations inherent in the system and the Code-Feedback that supports it. Although OpenCodeInterpreter is designed to support multi-language code generation and understand a wide range of programming contexts, its performance may vary across different languages and specific domains. While OpenCodeInterpreter excels at interpreting and responding to a variety of coding tasks, it may struggle with extremely complex or ambiguous user intents. The ability to accurately capture and address such intents is limited by the model’s current understanding and the specificity of the data in Code-Feedback.

References

Appendix A Source Data Filtering

Here, we outline the prompts used for source data filtering.

Query Filtering Prompt 1 Rate the following code queries on a scale of 1 to 5 based on their complexity, where 1 is the easiest and 5 is the most difficult. Consider the complexity of the query Query: [{query}] You are obliged to choose only from the following list. Scoring Criteria: 1 Point - Very Basic: The query involves simple operations or common issues 2 Points - Basic: The query involves fundamental programming concepts or commonly used functions 3 Points - Intermediate: The query requires some programming experience, possibly involving multiple steps 4 Points - Difficult: The query involves advanced programming skills, including complex logic, algorithms, or data structures 5 Points - Very Difficult: The query requires extensive expertise, potentially involving innovative problem-solving approaches or unique algorithm design Please give the score first then explain why Query Filtering Prompt 2 Rate the following code queries on a scale of 1 to 5 based on their complexity, where 1 is the easiest and 5 is the most difficult. Consider the complexity of the query Query: [{query}] You are obliged to choose only from the following list. Scoring Criteria: 1 Point - Moderately Difficult: Involves understanding specific programming concepts or libraries, and may include medium complexity algorithms or data structures like basic sorting algorithms or tree structures. 2 Points - Challenging: Requires handling more complex logic or algorithms such as advanced sorting algorithms, recursive logic, or intermediate data structures like hash tables and heaps. 3 Points - Highly Challenging: Demands deeper knowledge in algorithms and data structures, potentially including graph algorithms, dynamic programming, or complex string manipulation techniques. 4 Points - Advanced: Focuses on proficiency in programming and algorithm design, dealing with complex system architecture issues, performance optimization, or solving advanced algorithmic challenges like NP-hard problems. 5 Points - Expert Level: The highest difficulty level, requiring innovative problem-solving approaches or unique algorithm design, possibly involving interdisciplinary knowledge or the application of cutting-edge technologies. Please give the score first then explain why Below is an overview of the data filtering process applied to the initial seed dataset, with Figure A1 summarizing the data quantity after each filtering stage.

The pie chart in Figure A2 illustrates the distribution of programming languages in our dataset after filtering.

Appendix B Simulating Interactions for Data Collection

We illustrate the prompts used in multi-turn execution feedback and multi-turn human feedback respectively.

Appendix C Natural Language Explanations Generation

We use the following prompt to generate explanations for code using GPT-4.

Appendix D Model Evaluation Prompts

For different benchmarks, distinct prompts were employed during the initial turn of solution generation: identical prompts were utilized for HUMANEVAL and HUMANEVAL+, while MBPP and MBPP+ shared a similar prompt. The prompts are illustrated in the below.

Prompt for HumanEval and HumanEval+ You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions. @@ Instruction Here is the given code to do completion: “‘{language} {original prompt} “‘ Please continue to complete the function with {language} programming language. You are not allowed to modify the given code and do the completion only. Please return all completed codes in one code block. This code block should be in the following format: “‘{language} # Your codes here “‘ @@ Response Prompt for MBPP and MBPP+ You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions. @@ Instruction Here is the given problem and test examples: {original prompt} Please use the {language} programming language to solve this problem. Please make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples. Please return all completed codes in one code block. This code block should be in the following format: “‘{language} # Your codes here “‘ @@ Response We employ GPT models to emulate human behavior in generating feedback. The prompts provided to the GPT models are presented as follows.

Appendix E Examples of Methods used in Data Collection

Here we listed examples of each method in data collection process, including similar query packing, human feedback simulation and code correction for coding queries from open-source data (Section 2.1), and similar problem packing and follow-up Q&A for coding challenges from LeetCode (Section 2.2).

Appendix F Case Study