Execution-Based Evaluation for Open-Domain Code Generation

Zhiruo Wang, Shuyan Zhou, Daniel Fried, Graham Neubig

Introduction

Evaluations of NL-to-code generation systems, especially for general-purpose programming languages such as Python, have put an increasing emphasis on methods that execute code to verify the results. The predominant approach for creating such test sets is to manually write test cases for canonical code solutions (Chen et al., 2021; Austin et al., 2021; Lai et al., 2022; Huang et al., 2022). The correctness of model predictions is then evaluated by seeing if generated code passes the test cases (Chen et al., 2021). Compared to execution-free metrics such as text match against reference solutions, execution-based methods more rigorously assess the functional correctness of code (Hendrycks et al., 2021; Chen et al., 2021).

However, most resources with execution support only apply to closed-domain code that uses only Python built-in functions (Chen et al., 2021; Hendrycks et al., 2021; Austin et al., 2021; Li et al., 2022; Haluptzok et al., 2022) or specific libraries in data science domains (Lai et al., 2022; Huang et al., 2022). This focus on closed-domain problems diverges substantially from natural open-domain program usage covering a diverse range of libraries and functionalities (Yin et al., 2018; Agashe et al., 2019; Wang et al., 2022). To enable execution-based evaluation for coding queries using libraries, we present ODEX, an Open-Domain EXecution-based dataset (§2). We build ODEX by creating 1,707 test cases for 945 NL-Code pairs from the CoNaLa (Yin et al., 2018) and MCoNaLa (Wang et al., 2022) datasets, both stemming from StackOverflow (https://stackoverflow.com) with broad practical coding queries.

We analyze and highlight three aspects of ODEX (§3). First, ODEX has broad domain coverage of 79 libraries, with $53.4\%$ of the problems employing at least one library. Second, ODEX contains queries in four different languages, with 439, 90, 164, and 252 samples in English, Spanish, Japanese, and Russian, as shown in Figure 1. Third, ODEX addresses three unique challenges in open-domain code execution: irreproducible runs (Figure 1 a), randomized outputs (Figure 1 b), and specialized equivalence checks (Figure 2).

We evaluate two state-of-the-art code LLM families, Codex and CodeGen, on ODEX (§5). Our study shows that larger model sizes and augmented training data improve execution accuracy. Meanwhile, we observe satisfactory multilingual capabilities, even though neither model was specifically designed for multilingual usage. However, we find that models face greater yet varied challenges with open-domain queries compared to closed-domain queries (§5). Specifically, Codex achieves higher overall results, while CodeGen presents better parameter efficiency and more balanced open-closed domain performance as model size scales up. By comparing the execution-based metric with a series of execution-free metrics (§6), we further confirm the advantage of execution in allowing alternative solutions, but also show the potential of lexical metrics to identify simple bug fixes.

ODEX jointly facilitates practical open-domain code generation and execution-based evaluation. It serves as a comprehensive data benchmark for NL-to-code systems, supporting diverse NL contexts, library usage, and evaluation methods. By addressing the unique challenges of test creation and execution, we hope to lay a foundation for evaluating open-domain code via execution.

The ODEX Dataset

In this section, we describe our four-step process of constructing the ODEX dataset. We first collect resources of natural, open-domain coding queries (§2.1). Next, we establish the annotation standard and procedures for test case creation (§2.2). We then describe the annotator hiring and working processes (§2.3). Finally, we conduct checks to ensure data quality (§2.4).

We take two NL-to-code datasets, CoNaLa (Yin et al., 2018) and MCoNaLa (Wang et al., 2022), as sources for ODEX. We refer to them together as (M)CoNaLa. Their NL-Code pairs are collected from StackOverflow, which contains abundant coding queries that (1) naturally reflect practical program usage, and (2) cover diverse domains as measured by libraries used. These properties align well with our main focus on open-domain queries. (M)CoNaLa further proofs and clarifies its NL intents using human annotators to ensure data quality.

2.2 Annotation Standard and Procedures

Given each source NL-Code pair, our main annotation task is to write test cases to check code execution correctness, as illustrated by the four steps in Figure 2. A qualified test case should verify the main functionality of the canonical code solution. In the case where annotators do not understand the language of the intent, we use translation tools such as the Google Translate API (https://translate.google.com).

Code solutions in (M)CoNaLa are often short snippets (e.g., x = np.zeros(5)) to ensure more precise matches with NL intents, but to be executable they often need additional context such as variable assignments. We therefore wrap code into standalone functions by specifying input and output arguments as contexts. For example, Step 1 in Figure 2 identifies variable a as an input argument.

Due to the open-domain coverage of (M)CoNaLa, some code snippets require extra library imports to execute correctly. Accordingly, our second step is to specify the prerequisite libraries for code solutions.

Next, we write test cases that contain three parts: (1) input: passing values to input arguments, (2) output: stating expected execution outputs, and (3) assertion: checking if execution results match the expected outputs.
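The first three steps can be illustrated with a minimal sketch. It is not an actual ODEX sample: the snippet, the function name `f`, the `check` helper, and the test values are all hypothetical and chosen only for illustration.

```python
# Hypothetical illustration of the annotation steps; not an actual ODEX sample.

# Step 2: prerequisite library import for the snippet.
import re


# Step 1: wrap the snippet `re.findall(r"\d+", s)` into a standalone function,
# with `s` identified as the input argument and the match list as the output.
def f(s):
    return re.findall(r"\d+", s)


# Step 3: a test case made of an input, an expected output, and an assertion.
def check(candidate):
    s = "a1b22c333"                   # (1) input
    expected = ["1", "22", "333"]     # (2) expected output
    assert candidate(s) == expected   # (3) assertion


check(f)
```

Keeping the test inside a `check(candidate)` function is a design choice of this sketch: the same test can then be applied to both the canonical solution and any model prediction.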

However, test case creation for open-domain code faces three challenges. First, safe and reproducible execution can be hard to achieve. As in Figure 1 a, it is impractical to send an HTTP request when evaluating this sample. Instead, we use mock to simulate the output (a success response with status code 200). Second, some code entails randomness (e.g., random.randint(3,5)) and has no definite output value. We instead make bounding assertions, e.g., checking that all elements are integers within the range of [3, 5]. Third, standard equivalence checks by == may be invalid, since library-specific objects often require specialized equality checks. For example, checking the equivalence of two NumPy arrays a and b uses np.array_equal(a,b), while asserting a == b would cause execution errors, because == compares arrays element-wise and the truth value of a multi-element array is ambiguous.
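The first two challenges can be sketched with the Python standard library alone. The functions `fetch_status` and `draw` below are hypothetical stand-ins for code solutions, not ODEX samples, and the mocking layout is only one possible way to simulate a response. The third challenge depends on NumPy and is left out to keep the sketch dependency-free.

```python
import random
import urllib.request
from unittest import mock


def fetch_status(url):
    # Hypothetical code solution: return the HTTP status code of a request.
    with urllib.request.urlopen(url) as response:
        return response.status


def test_fetch_status():
    # Simulate a successful response instead of sending a real request.
    fake = mock.MagicMock()
    fake.status = 200
    fake.__enter__.return_value = fake  # make the fake usable in a `with` block
    with mock.patch("urllib.request.urlopen", return_value=fake) as patched:
        assert fetch_status("https://example.com") == 200
        patched.assert_called_once_with("https://example.com")


def draw(n):
    # Hypothetical code solution with randomized output.
    return [random.randint(3, 5) for _ in range(n)]


def test_draw():
    values = draw(100)
    # Bounding assertions: check type and range rather than exact values.
    assert len(values) == 100
    assert all(isinstance(v, int) and 3 <= v <= 5 for v in values)


test_fetch_status()
test_draw()
```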

In the last step, we perform self-verification to efficiently ensure the annotation quality. We execute the canonical code solution on each newly created test case. Unless the test case enables a successful pass of the solution, it should not be taken as a valid annotation.
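Self-verification of this kind could be automated along the following lines. This is a sketch under assumptions: the helper name `passes`, the `check(candidate)` convention, and the entry point `f` are hypothetical rather than taken from the released verification scripts. Because `exec` runs arbitrary code, it should only be applied to trusted code or inside a sandbox.

```python
def passes(solution_code, test_code, entry_point="f"):
    """Return True if the function defined in `solution_code` passes `test_code`.

    `test_code` is assumed to define `check(candidate)`, which raises
    (for example, an AssertionError) when the candidate is wrong.
    """
    namespace = {}
    try:
        exec(solution_code, namespace)   # define the candidate function
        exec(test_code, namespace)       # define check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        return False
    return True


canonical = "def f(xs):\n    return sorted(xs, reverse=True)\n"
buggy = "def f(xs):\n    return sorted(xs)\n"
test = "def check(candidate):\n    assert candidate([1, 3, 2]) == [3, 2, 1]\n"

# A test case is kept only if the canonical solution passes it.
print(passes(canonical, test))
print(passes(buggy, test))
```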

2.3 Annotator Hiring and Task Fulfillment

As our data involves diverse functionalities from multiple libraries, our annotation task holds a relatively high standard for annotators. A qualified annotator should be proficient in Python and common libraries, and in writing workable test cases.

We chose to hire undergraduate students who have strong computer science backgrounds in Python. For the 20 applicants, we first conducted a resume screening to select candidates with sufficient programming experience. Next, we gave each candidate an annotation test with five randomly selected NL-Code pairs. Since the test mirrors the official annotation process, we provided clear instructions about each step (as in §2.2) and code scripts for self-verification. Candidates were asked to finish their tests in three calendar days. Based on their test performance, we hired four candidates to officially participate in this job.

2.4 Quality Check

We put great effort into ensuring data quality throughout the annotation process. To assist annotators in more efficiently and accurately writing workable test cases, we require them to execute each written test case using the verification code that we provided, and explicitly report whether the canonical code solution can successfully pass all the annotated test cases that they created.

After the annotation, the authors performed post-hoc verification to check if each test case reads reasonably and executes correctly. In our final rounds of automatic quality checks, we confirm that the pass rate for all canonical code solutions over their annotated test cases is 100%.

We collect a total of 945 samples with NLs in four languages, including 439 samples in English, 90 in Spanish, 164 in Japanese, and 252 in Russian.

Dataset Analysis

We analyze ODEX from three aspects: domain diversity (§3.1), sample complexity (§3.2), and execution support (§3.3).

One unique property of ODEX is its broad domain coverage. We categorize code that entails library usage (both built-in and third-party) as being in the open domain and code that uses no library as being in the closed domain. Different libraries often serve specific functions and have unique capabilities. For instance, the datetime library is designed to handle date/time operations, while other libraries focus on various other fields such as data analysis or web requests. Therefore, in this work, we view the diversity in libraries as a representation of distinct domains.
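One way to operationalize this open/closed distinction is to inspect import statements. The sketch below assumes that library usage can be read off the imports of a sample; the helper names `imported_libraries` and `domain` are hypothetical, and the actual categorization procedure may differ.

```python
import ast


def imported_libraries(source):
    """Collect top-level library names imported anywhere in `source`."""
    libs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` contributes the top-level package `a`.
            libs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from a.b import c` contributes `a`; relative imports are skipped.
            libs.add(node.module.split(".")[0])
    return libs


def domain(source):
    # Assumption: any imported library makes the sample open-domain.
    return "open" if imported_libraries(source) else "closed"


print(domain("import datetime\nnow = datetime.datetime.now()"))
print(domain("xs = sorted([3, 1, 2])"))
```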

Table 1 reports domain statistics and Figure 3 shows the library distribution. ODEX covers a diverse set of 79 libraries, which varies per language. A majority of samples, $53.4\%$, use at least one library.

We compare ODEX with eight other code generation datasets that support test case execution: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), APPS (Hendrycks et al., 2021), MTPB (Nijkamp et al., 2022), P3 (Haluptzok et al., 2022), DSP (Chandel et al., 2022), DS-1000 (Lai et al., 2022), and Exe-DS (Huang et al., 2022).

From their distributions in Figure 4, six out of eight datasets focus on the closed domain and most examples use zero libraries. Such examples deviate from realistic programs, which often use APIs of different libraries. DS-1000 and Exe-DS feature some open-domain problems, but their library usage is more homogeneous with a particular focus on data science domains. Moreover, DS-1000 is restricted to code using libraries but covers only seven libraries. In contrast, ODEX is more “colorful”; it covers significantly more open-domain libraries, as well as frequent queries in the closed domain.

To provide a reference on natural domain distribution, we approximate real-world usage by counting GitHub Python files that use each library. As shown in Figure 5, ODEX presents a better alignment with the practical scenario concerning the open domains – it features more diverse domains and preserves the long-tailed pattern in practical scenarios.

The full lists of libraries and their frequencies for ODEX, the eight comparison datasets, and the approximated natural setting are in §A.1.

3.2 Complexity

To measure dataset complexity, we first calculate the lengths of NL intents and code snippets. We tokenize NL intents with the spaCy (https://spacy.io/) tokenizers in respective languages; we follow Yin and Neubig (2018) to tokenize code. For code, we also parse the AST using the Python standard ast library (https://docs.python.org/3/library/ast.html) and count the number of input and output variables to quantify the complexity of execution contexts.
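Counting variables from the AST might look like the following sketch. It assumes that each code solution has been wrapped into a function, that inputs correspond to positional parameters, and that outputs correspond to returned values; the helper name `count_io_variables` and these counting rules are illustrative assumptions, not the exact procedure behind Table 2.

```python
import ast


def count_io_variables(source):
    """Return (number of input arguments, number of returned values)."""
    tree = ast.parse(source)
    # Take the first function definition found in the source.
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    n_inputs = len(func.args.args)
    n_outputs = 0
    for node in ast.walk(func):
        if isinstance(node, ast.Return) and node.value is not None:
            # A returned tuple counts one output per element.
            if isinstance(node.value, ast.Tuple):
                n_outputs = max(n_outputs, len(node.value.elts))
            else:
                n_outputs = max(n_outputs, 1)
    return n_inputs, n_outputs


wrapped = "def f(a, b):\n    return a + b, a - b\n"
print(count_io_variables(wrapped))
```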

In Table 2, we see that code in the Spanish set is longer on average than in other languages. For both the input and output sides, code in the English set has fewer variables, suggesting potentially simpler execution environments, which could stem from the relative simplicity of StackOverflow queries asked in English.

3.3 Execution Support

We systematically compare code generation datasets that concern execution or open-domain code in Table 3. ODEX is the first dataset that supports execution-based evaluation for open-domain code. While ODEX does not have the largest number of test cases, we discuss in §7 how these test cases can still reliably measure code correctness.

Experiment Setup

Code LLMs have achieved strong results on multiple code generation tasks, yet their open-domain proficiency is understudied due to the limited domain settings of past datasets. To examine model capabilities in the open domain, we evaluate two top-performing model families, Codex and CodeGen, on ODEX. We perform evaluations using a prompting setting, without finetuning any model.

We introduce the baseline models, the prompt settings, and lay out the metrics for evaluation.

At the time of this work, Codex had three publicly available models. code-cushman-001 (C1) is a 12B Codex model in Chen et al. (2021). code-davinci-001/002 (D1, D2) are two 175B GPT-3 models (https://beta.openai.com/docs/model-index-for-researchers).

CodeGen (Nijkamp et al., 2022) models are auto-regressive models trained on a combination of NL and code corpora, differing in model sizes (350M, 2.7B, 6.1B, 16.1B) and training data. Models progressively trained on ThePile (Gao et al., 2020), BigQuery (https://cloud.google.com/bigquery), and BigPython datasets are denoted as nl, multi, and mono, respectively. The most powerful model, CodeGen-16.1B-mono, performs similarly to code-cushman-001 on the HumanEval and MTPB datasets.

For fair comparison, we use the same prompt for both model families. While prompting with few-shot in-context examples may improve performance, our experiments do not find this consistently helpful for both models. Therefore, we report zero-shot results as baselines and leave few-shot results to §7. Creating zero-shot prompts only requires content from the test sample. Following Chen et al. (2021), we construct prompts by concatenating function context and a docstring. A docstring includes the NL intent and optional unit tests (compared in §7). Figure 6 shows an example prompt.
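A zero-shot prompt of this shape could be assembled as in the sketch below. The helper name `build_prompt`, the indentation, and the way unit tests are placed inside the docstring are assumptions made for illustration; the exact layout of the prompt in Figure 6 may differ.

```python
def build_prompt(signature, intent, tests=()):
    """Concatenate a function signature with a docstring holding the NL intent."""
    lines = [signature, '    """' + intent]
    # Optionally append unit tests to the docstring as extra hints.
    lines.extend("    " + t for t in tests)
    lines.append('    """')
    return "\n".join(lines) + "\n"


prompt = build_prompt(
    "def f(a):",
    "sort list `a` in descending order",
    tests=["assert f([1, 3, 2]) == [3, 2, 1]"],
)
print(prompt)
```

The model is then expected to continue the prompt with the function body.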

We follow Chen et al. (2021) and measure the execution accuracy using the pass@k metric, by computing the fraction of problems having at least one correct prediction within $k$ samples. We also compare it with a series of execution-free metrics later in §6.
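For reference, pass@k is commonly computed with the unbiased estimator of Chen et al. (2021): given $n \geq k$ samples per problem of which $c$ are correct, the per-problem score is $1 - \binom{n-c}{k} / \binom{n}{k}$, averaged over problems. The sketch below implements that estimator; whether this exact estimator and which value of $n$ are used for the reported numbers is an assumption here, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased per-problem pass@k estimate.

    n: number of sampled predictions, c: number that pass all tests,
    k: number of predictions the metric is allowed to draw (k <= n).
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct prediction.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    # Average the per-problem estimates over the dataset.
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct predictions for three problems, n = 10 each.
print(mean_pass_at_k([0, 5, 10], n=10, k=1))
```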

We follow Chen et al. (2021) and use nucleus sampling (Holtzman et al., 2019) with top-$p$ set to 0.95 and temperature set to 0.8. We set outputs to a maximum of 512 tokens.
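To make these decoding settings concrete, the following toy sketch applies temperature scaling and nucleus (top-$p$) filtering to a small list of logits. It illustrates the technique only and is not the decoding code of either model family; the function name and the example logits are hypothetical.

```python
import math
import random


def nucleus_sample(logits, top_p=0.95, temperature=0.8, rng=random):
    """Sample one token index using temperature scaling and top-p filtering."""
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample within the nucleus; `choices` renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(nucleus_sample([2.0, 1.0, 0.5, -1.0]))
```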

Experiment Results

We first present the overall performance of two model families on ODEX (§5.1). Next, given the unique challenges of open-domain code, we study the variances between open- and closed-domain problems (§5.2), and in individual domains (§5.3).

As shown in Table 4, consistent with existing work and our intuition, the larger davinci 175B models outperform the smaller cushman 12B model, and the 002 version improves over 001. This trend holds for all languages and all sampling sizes. Somewhat surprisingly, all models attain decent results on non-English problems, even though Codex is not designed for multilingual use. This high accuracy on non-English problems suggests the multilingual potential of Codex models.

We report results of mono models in Table 4 given their superior performance over nl and multi variants (Nijkamp et al., 2022). The pass rate increases as CodeGen grows from 350M to 2.7B, and continues to increase in non-English languages when further scaling to 6.1B. CodeGen exhibits multilingual capacity, as its results on non-English subsets are close to those on English, and consistently increase during scaling.

Although Codex and CodeGen have comparable performance on existing datasets such as HumanEval, ODEX effectively unveils the efficacy of CodeGen on open-domain coding queries even with many fewer parameters, i.e., CodeGen 6.1B yields similar pass@1 to the 175B Codex davinci-001 model on some languages. More fine-grained results (pass@k at $1\leq k\leq 10$) for both models are in §B.

5.2 Open Domain versus Closed Domain

Figure 7 (left) shows pass@1 on open-domain (OD) and closed-domain (CD) problems. All Codex models score much lower in OD than in CD. Such large gaps hold across all languages, ranging from 4.34 in Spanish to 38.57 in Japanese. Model upgrades (C1 $\rightarrow$ D1 $\rightarrow$ D2) do not always reduce the gaps. Gaps slightly shrink in Spanish, but continuously increase in English and Japanese. While D2 performs the best, it also exhibits the most severe gaps. These findings suggest that common practices to improve LLMs may not address the complexities inherent in open-domain code generation problems. It is hence imperative that more advanced strategies are employed.

As shown in Figure 7 (right), CodeGen also has substantial gaps between open and closed domains, although these gaps are smaller than those of Codex across all languages, by 6.0 percentage points on average. As model size increases from 2.7B to 6.1B, the gaps shrink by about 6.3 points in English and 1.7 points in Spanish. In contrast, when Codex scales up to davinci-002, its gaps continue to increase by 4.9 points on average, indicating that scaling up CodeGen more effectively catches up on open-domain performance.

5.3 Domain Variance

We now dive deeper into the results within individual domains. We focus on the code-davinci-002 model as it has the best performance across all models. In Figure 8, we plot accuracy with respect to the domain frequency, as approximated in §3.1.

Execution accuracy is not low on all open domains. For example, code-davinci-002 achieves $50\%$ pass@1 for several common libraries such as random and math. But high domain frequency does not ensure model proficiency. For example, on libraries with complex functionalities such as matplotlib and tensorflow, pass@1 can go below $10\%$. See §C for more domain-wise results.

Comparing to Execution-Free Metrics

In this section, we study the alignment between execution-based evaluation and five execution-free metrics, identifying advantages for both types.

We evaluate models with five execution-free metrics based on lexical, syntax, and semantic matches: BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), ChrF (Popović, 2015), and CodeBLEU (Ren et al., 2020). Refer to §D.1 for more descriptions.

We conduct this analysis using Codex, given its better performance. As shown in Figure 9, model rankings by execution-free metrics do not precisely correlate with their rankings by execution accuracy. Even when the rankings align, their differences are largely not proportional. Comparing the metrics, ChrF and METEOR have smaller inter-model variances, while BLEU and ROUGE change more and correlate better with pass rates. Notably, CodeBLEU is low in most settings and might not be suitable for evaluating snippet-style code.

We next evaluate whether execution-free metrics might be used to discriminate between passed and failed samples. We take BLEU as an example since it shows similar ranking patterns to execution. Figure 10 shows negligible variances in BLEU scores of passed and failed groups. The other four metrics exhibit similar patterns, as shown in §D.3.

What Affects Model Performance?

Besides differences in model configurations, we study three factors that might affect performance.

Models might benefit from example NL-Code pairs. We thus explore the few-shot setting by prefixing $N\in\{1,2,3\}$ input-output pairs in prompts. In Figure 11 (left), for cushman-001 and davinci-001, few-shot examples yield a clear improvement over the zero-shot setting; but for the strongest davinci-002, it brings minimal gains in English. See similar results in other languages in §E.1.

Including test cases in inputs adds execution hints of the expected functionality of the solution, and hence may improve execution accuracy. We test this hypothesis by experimenting with prompts that have varying numbers of test cases. Besides the default setting with zero tests, we compare adding one random test case and all annotated test cases.

Figure 11 (right) shows that injecting as few as one exemplar test case significantly improves the execution accuracy, yet adding more cases brings little additional gain. This potentially implies the sufficiency of one test case to show the main functionality.

Execution results could be more reliable if using more test cases for evaluation. However, there is a trade-off between evaluation effectiveness and annotation efficiency, due to the high cost of human effort. To study this tradeoff, we observe how results change with respect to the number of tests. Compared to using all cases in default, we also try using one randomly selected case. For simplicity, we do not include any test cases in prompts.

As shown in Figure 12, evaluating over one random test largely preserves the accuracy of using all tests, indicating that one case is sufficient to test the main functionality for most queries. See §E for analysis on other factors such as function naming.
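The comparison between evaluating on all tests and on one random test can be sketched as follows. The toy problems, the `(args, expected)` test representation, and the function names are hypothetical. Since a prediction that passes all tests also passes any single one of them, the one-test accuracy can only match or overestimate the all-test accuracy.

```python
import random


def passes_all(func, tests):
    """True if `func` returns the expected output for every (args, expected) pair."""
    try:
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def execution_accuracy(problems, one_random_test=False, seed=0):
    """Fraction of problems whose prediction passes the selected tests."""
    rng = random.Random(seed)
    n_passed = 0
    for func, tests in problems:
        chosen = [rng.choice(tests)] if one_random_test else tests
        n_passed += passes_all(func, chosen)
    return n_passed / len(problems)


# Two toy "predictions" for an absolute-value query, each with two tests.
problems = [
    (abs, [((-2,), 2), ((3,), 3)]),          # correct prediction
    (lambda x: x, [((3,), 3), ((-2,), 2)]),  # buggy: only right for x >= 0
]

print(execution_accuracy(problems))
print(execution_accuracy(problems, one_random_test=True))
```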

Related Work

Programs often use APIs from different Python libraries. Some datasets preserve natural coverage from interactive Jupyter Notebooks (Agashe et al., 2019) or StackOverflow posts (Yin et al., 2018; Wang et al., 2022), but face challenges in enabling execution (Lai et al., 2022; Chandel et al., 2022). Our ODEX dataset addresses execution for open-domain code.

Some works stem from coding contest websites (Hendrycks et al., 2021; Li et al., 2022), but GitHub Jupyter Notebooks (Agashe et al., 2019; Huang et al., 2022) and StackOverflow (SO) (Yin et al., 2018; Wang et al., 2022; Lai et al., 2022) provide more natural and practical coding queries. We preserve this naturalness and incorporate various NL settings to assist programmers worldwide.

Evaluation by execution has long been used for SQL (Zhong et al., 2017) or logical forms (Dong and Lapata, 2016). Many datasets have begun to support Python execution via test cases, but focus on built-in functions (Chen et al., 2021; Austin et al., 2021; Hendrycks et al., 2021) or specific domains (Lai et al., 2022; Huang et al., 2022). Our test cases, in contrast, cover diverse libraries in the open domain.

Conclusion

We present ODEX, an open-domain code generation dataset supporting execution-based evaluation via human-written test cases. ODEX not only supports execution-based evaluation of code using test cases, but also extends the task to the open domain, covering 79 diverse Python libraries and four natural languages (English, Spanish, Japanese, and Russian). Comparing two state-of-the-art code generation models, Codex and CodeGen, our dataset effectively unveils their varied behaviors between program domains and language contexts. ODEX serves as a comprehensive NL-to-code benchmark given its open-domain coverage, multi-natural language queries, and multi-metric support. When bringing code execution to open domain scenarios, our explorations also reveal emerging challenges in test creation and reliable execution, which we hope that our dataset will enable future work to tackle.

Acknowledgements

We would like to thank all the annotators for their hard work. We thank Uri Alon, Frank F. Xu, and Tao Yu for their helpful feedback on this work.

Limitations

ODEX aims to serve as a comprehensive testbed, by enabling execution-based evaluation of code in the open domain, with flexible intent inputs in four natural languages. However, we should remain continuously aware of execution security, multilingual support, and evaluation reliability.

First, execution support in ODEX enables more rigorous evaluations than execution-free methods. However, due to the increased complexity of open-domain code, more inspections are required for execution safety, for both code solutions and test cases. We should always stay alert to concealed malicious code (Wallace et al., 2021) or generated code with security vulnerabilities (Verdi et al., 2020; Pearce et al., 2021).

Second, in addition to English inputs, ODEX also encourages intents specified in three other languages. Still, its language coverage is bounded by the available forums in StackOverflow. We hope our initiative can highlight the multilingual nature of program developers, encourage the emergence of similar data resources in other languages, and continuously promote AI programming assistance in languages worldwide.

Third, while ODEX covers wide-ranging code queries in the open domain, it is more suitable for less resource-demanding scenarios such as downstream evaluation or few-shot learning. Although ODEX is larger than many previous datasets with human-written test cases, it is still limited due to the intense human effort required by the curation process. Regarding this, we encourage readers to conduct significance testing (Dror et al., 2018) and report more substantial model improvements.

Ethics Statement

Our work has received IRB approval and is licensed under a Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 International License. The resulting ODEX dataset is built to serve as a benchmark for open-domain code generation, to further facilitate technological advances in AI programming assistance, meanwhile supporting multiple languages to encourage its universal accessibility.

We strive to ensure high data quality and optimize annotation efficiency. We build the ODEX dataset with natural and practical StackOverflow resources and hire annotators with qualified programming proficiency. We provide our annotators with clearly documented instructions, flexible annotation interfaces (Google Sheets, Jupyter Notebooks), and self-verification tools. We (authors) conduct pilot annotation to confirm the clarity of annotation standards and feasibility of the annotation task. We conduct posthoc examinations on the annotation results, both manually and automatically, to obtain assured data quality (100% pass rate).

We respect the contribution and privacy of our annotators. We offer competitive remuneration for their annotation job and treat each one of them fairly. All annotators possess the right to withdraw at any time. We ensure that all their personal information is removed before public release.

We conduct systematic analysis from multiple perspectives in the paper, in an attempt to foster public awareness on generating and evaluating programs in the open domain, both in encouraging more advances in this direction, and raising more concerns about the robustness and security of such unique coding problems.

References

Appendix A ODEX Dataset

Aside from the illustrations in § 3.1, we list out the detailed statistics of libraries in ODEX, the eight comparison datasets, and the approximated natural distribution.

Table 5 lists the number and percentage of occurrences for each library in the ODEX dataset.

Table 6 lists the library frequencies of the eight comparison datasets mentioned in § 3: HumanEval, MBPP, APPS, MTPB, P3, DSP, DS-1000, and Exe-DS.

To approximate the natural distribution of libraries in the open domain, we count the number of Python files on GitHub that import the library of interest. Following the GitHub search syntax (https://docs.github.com/en/search-github/searching-on-github/searching-code), we use the query import ${library_name} to search files that import a certain library, and use NOT import to count files not using any libraries. Their frequencies are shown in Table 7.

A.2 More Annotation Details

Along with the NL-Code pair, we also provide IDs of the source StackOverflow post, using which annotators can trace back to the original post webpage and get a better understanding of the question. If any errors or under-specification are spotted in the given NL or code, we ask the annotators to correct it by making the minimal change possible.

Aligning with how programmers import a library, we require the expressions to be written in one of three forms: (1) import ${LIBRARY}, (2) import ${LIBRARY} as ${ABBR}, or (3) from ${LIBRARY} import ${FUNCTION}, where ${LIBRARY} can also be a submodule such as matplotlib.pyplot.

We encourage the annotators to use the language identical to the given NL intent when creating the test cases, especially if the code involves string-related operations (e.g., writing regular expressions in Japanese). We encourage the annotators to write reasonably more and diverse test cases, by varying the values or types of variables.

Please find the full instructions (https://anonymous.4open.science/r/odex/data/instruction.md) and examples (https://anonymous.4open.science/r/odex/data/sample_annotation.ipynb) for annotation in our code repository.

Appendix B Baseline Results

Complementing the baseline results in § 5.1, we provide more detailed evaluation results on the execution pass rate, ranging from the top-1 to top-10 model predictions. Table 8 and Table 9 show the zero-shot execution accuracy of Codex and CodeGen models, respectively.

Appendix C Domain-Wise Execution Results

We list out detailed results for experiments in §5.

Table 10 and Table 11 show the execution accuracy for Codex and CodeGen on open-domain and closed-domain problems, respectively.

C.2 Domain-wise Execution Accuracy

As introduced in § 5.3, we take code-davinci-002, and report its execution accuracy on each domain in Table 12.

C.3 Qualitative Error Analysis

To provide more intuitive explanations of the aforementioned domain divergence, we conduct error analysis over 60 randomly selected examples from the ODEX dataset (15 for each language). By examining the error patterns from these examples, we aim to answer: what are the common error types on open- and closed-domain problems? What are the main differences between them?

Similar to the previous section, we use code-davinci-002 since it scores the best and presents clear domain gaps, which might give more intuitive variances between domains.

Of the 60 random samples we analyzed, 31 are closed-domain problems, and Codex predicts erroneous code solutions for 22 of them. We identify four main types of errors from these samples: (1) 11 cases ($50.0\%$) use Python built-in functions incorrectly, mostly in string manipulation and number calculation; (2) 7 cases ($31.8\%$) fail at complex functions, which usually require multi-step implementations; (3) 4 cases ($18.2\%$) receive empty predictions, potentially because they involve topics unfamiliar to the model; (4) 2 cases ($9.1\%$) import extra libraries or add redundant implementations.

Note that the number of error cases in these four categories does not add up to 22. Since we analyze all of the error predictions among the model top-10 predictions, one case could present multiple error types in its different predictions.

Of the other 29 problems belonging to the open domain, 26 have erroneous predictions. Errors in the open domain exhibit more diversity than in the closed domain. The major error type, covering 16 cases ($61.5\%$), is the failure to use the prerequisite libraries, or missing some of them when multiple libraries are involved. The next major type is using incorrect functions, which happens in 9 cases ($34.6\%$). Similarly to the closed-domain errors, 5 cases ($19.2\%$) use the correct functions incorrectly, 4 cases ($15.4\%$) struggle with complex multi-step implementations, and 3 cases ($11.5\%$) receive empty predictions.

OD and CD problems share some error categories such as function misuse and complex operations. Nonetheless, open-domain problems introduce extra challenges: correct selection and usage of libraries and functions in the wild.

Appendix D Evaluation Metrics

We describe each of the non-execution metrics (§ D.1) as introduced in § 6, report model performance with each (§ D.2), and visualize their correlations with the execution accuracy (§ D.3).

BLEU (Papineni et al., 2002) is a lexical-based evaluation metric, which calculates the n-gram overlap between text prediction and (multiple) references. Most default calculation processes calculate up to 4-grams and adopt the smoothing function introduced in Lin and Och (2004).

ROUGE (Lin, 2004) is another more recall-oriented lexical-based evaluation metric. It was originally designed for measuring text summarization, mainly by counting the number of overlapping units (n-gram, word sequences, and word pairs) between prediction and references. Among the multiple variants proposed (ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S), we use the most common ROUGE-L in our experiments.

METEOR (Banerjee and Lavie, 2005) is a unigram-based metric originally intended for machine translation. It builds on a generalized unigram concept by involving unigram precision, unigram recall, and word order measures.

ChrF (Popović, 2015) targets lexical match on the character level, by calculating the character-level n-gram F-score between predictions and references. ChrF was also originally proposed for the machine translation task, but was later adopted in some code evaluation work (Evtikhiev et al., 2022).

CodeBLEU (Ren et al., 2020) is specifically designed for code evaluation, by jointly considering the surface-form match, syntactic similarity, and semantic data flows.
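The character-level F-score can be sketched as follows. This is a simplified illustration rather than a reference implementation: the function names are hypothetical, the defaults `max_n=6` and `beta=2.0` are assumptions, and whitespace is treated like any other character, which may differ from the handling in released ChrF tools.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all contiguous character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(prediction, reference, max_n=6, beta=2.0):
    """Average character n-gram precision and recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        pred = char_ngrams(prediction, n)
        ref = char_ngrams(reference, n)
        if not pred or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((pred & ref).values())  # multiset intersection
        precisions.append(overlap / sum(pred.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical toy example: partial credit for a near-miss function name.
partial = chrf("re.match(p, s)", "re.findall(p, s)")
```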

D.2 Evaluating with Non-execution Metrics

Table 13 and Table 14 show the scores of Codex and CodeGen using non-execution metrics.

D.3 Visualizing Metric Correlations

Following the discussion in § 6, we visualize the non-execution metric scores of samples that pass and fail at execution time. All experiments use code-davinci-002 predictions for evaluation. Figure 13, Figure 14, Figure 15, and Figure 16 illustrate the histograms of passed and failed samples using the ROUGE, METEOR, ChrF, and CodeBLEU metrics, respectively.

D.4 Why is Execution Better?

To give more intuitive reasons for the advantages of execution, we randomly sample 15 cases from each language subset and identify two major benefits: execution tolerates alternative solutions and allows execution results as outputs.

Probably the greatest advantage of execution is that it only requires correct execution results, without limitations on alternative methods, as in Figure 17.

Another interesting category is directly generating the code execution results instead of the implementation steps. This often happens with simple coding queries such as basic string manipulation, where predicting the results might cost the model about as much effort as producing the programmatic solution.

In Figure 18, instead of the string decoding program, the model directly outputs the result string “JLK”. While this is somewhat unexpected under the NL-to-Code task, execution effectively handles such cases and would judge them as correct.
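Both benefits can be illustrated with a small sketch. It is not the ODEX evaluation harness: the `passes` helper, the toy task, and all code strings are hypothetical, and a real setup would run untrusted predictions in a sandbox rather than with a bare `exec`.

```python
def passes(code, test):
    """Return True if `code` defines what `test` needs and all assertions hold."""
    namespace = {}
    try:
        exec(code, namespace)   # define the wrapped function
        exec(test, namespace)   # run the assertion-based test case
    except Exception:
        return False
    return True


# Hypothetical task: convert a string to upper case.
test = "assert f('abc') == 'ABC'\n"

reference = "def f(s):\n    return s.upper()\n"
alternative = "def f(s):\n    return ''.join(c.upper() for c in s)\n"
literal_answer = "def f(s):\n    return 'ABC'\n"   # outputs the result directly
wrong = "def f(s):\n    return s.lower()\n"

results = {
    "reference": passes(reference, test),
    "alternative": passes(alternative, test),
    "literal_answer": passes(literal_answer, test),
    "wrong": passes(wrong, test),
}
```

The alternative and the literal answer share little surface form with the reference, yet both pass the test, while the functionally wrong prediction fails.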

D.5 Potential Benefit of Lexical-based Metrics

Lexical-based metrics, although relatively ineffective for functional correctness, are still potentially helpful for debugging and interpretation. They are effective at flagging small errors of two types: (1) a single function misuse and (2) slight variance in complex strings. The high lexical match in such cases indicates less effort for fixing (Deng et al., 2021).

Some code predictions are correct except for a single place where a wrong function is used, or an argument is misplaced.

For example, in Figure 19, the code imports the library and copies all strings correctly, but it uses the wrong function match instead of the correct findall. Although the execution fails, the code is similar to the solution. Given the high BLEU score of $92.5$, we could readily spot such similarities and fix them with simple edits.

Another frequent error concerns string copying, where the code calls the correct functions but copies the string differently.

The example in Figure 20 gets a $100.0$ BLEU score, but the string inside actually misses a single whitespace, which the BLEU tokenization would discard. Such code also resembles the solution and could be easily fixed by even rule-based methods.
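The effect can be reproduced with a small sketch. Plain whitespace splitting stands in for the BLEU tokenizer here, which is an assumption, and the two code strings are hypothetical rather than the ones in Figure 20.

```python
def run(code):
    """Execute code that defines f() and return f()'s result."""
    namespace = {}
    exec(code, namespace)
    return namespace["f"]()


# The reference string contains two spaces after the comma; the
# prediction contains only one.
reference = 'def f():\n    return "a,  b"\n'
prediction = 'def f():\n    return "a, b"\n'

# Whitespace-based tokenization collapses the extra space, so both
# snippets yield the same token sequence and hence a perfect lexical match.
ref_tokens = reference.split()
pred_tokens = prediction.split()
same_tokens = ref_tokens == pred_tokens

# Execution still distinguishes the two returned strings.
same_output = run(reference) == run(prediction)
```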

Appendix E Ablation Studies

This section provides the results tables according to each ablation study section in § 7.

Table 15, Table 16, and Table 17 show the change in execution accuracy with respect to the examples used in in-context learning, for the three Codex variants.

E.1.2 Number of Input Test Cases

Table 18 shows the effects on execution accuracy of adding one or more test cases to prompts. Experiments use code-davinci-002 as an example.

E.1.3 Pre-processing: Trailing Whitespaces

While the input construction process may introduce whitespaces at the start and the end of the text sequence, we find the CodeGen models unexpectedly sensitive to trailing whitespaces. As shown in Table 19, removing whitespaces from the prompt input increases the pass rate of CodeGen models of all sizes by over 20 percent.

We conjecture that the gain from whitespace stripping comes from better distributional alignment with the CodeGen training data. As CodeGen might be pre-trained on whitespace-stripped text sequences, inputs without whitespaces are potentially more aligned with them, hence resulting in better test-time performance. Meanwhile, note that the tokenization processes for text (natural language) and code (programming language) differ in whitespace-style tokens such as \n or \t. These tokens would be removed by text tokenizers by default, but are preserved by code tokenizers since they carry structural information in code pieces.
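The pre-processing step itself is small, as the sketch below shows. The prompt layout, the `build_prompt` helper, the function name `f_123`, and the intent are all hypothetical and only illustrate where leading and trailing whitespace can come from; they are not the exact prompt format used in our experiments.

```python
def build_prompt(function_name, intent):
    """Assemble a function-style prompt (hypothetical layout).

    The leading newline and the trailing indentation are the kind of
    whitespace that prompt construction can leave around the text.
    """
    return f'\ndef {function_name}():\n    """{intent}"""\n    '


raw_prompt = build_prompt("f_123", "find the max value in a list")

# Strip whitespace at both ends before sending the prompt to the model.
clean_prompt = raw_prompt.strip()
```

Only the boundary whitespace is removed; newlines and indentation inside the prompt are left intact.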

E.2 Number of Evaluation Test Cases

Table 20 shows the effect when using different numbers of test cases for execution-based evaluation.

E.3 Semantics of Function Names

Because code is wrapped into functions to enable execution, how functions are named may affect model predictions. By default, we name functions using the post ID (e.g., f_3844801), which expresses little semantics of queries. So we try two other methods: (1) a constant string function; and (2) summary phrases from NL intents, e.g., find_max_value.

To do (2), we conduct a heuristic phrase extraction. We first split the NL intent into words by whitespace, then remove the stop words (‘in’, ‘of’, ‘a’, ‘to’, ‘and’, ‘for’, ‘with’, ‘that’) and meaningless punctuation, and lastly concatenate the first $M=4$ words with ‘_’. For example, given an intent “decode a hex string ‘4a4b4c’ to UTF-8”, the resulting function name would be “decode_a_hex_string”. However, for languages that do not separate words with whitespace, this approach may produce less meaningful strings, hence contributing to the inferior performance as shown below.
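The heuristic can be sketched as below. This is one possible reading of the steps rather than the original script: the function name `intent_to_function_name` is hypothetical, and treating "meaningless punctuation" as leading and trailing ASCII punctuation, as well as lower-casing words for the stop-word check, are assumptions. Applied literally in the order described, the stop-word step removes ‘a’, so the sketch's output for the example intent may differ from the name quoted above.

```python
import string

STOP_WORDS = {"in", "of", "a", "to", "and", "for", "with", "that"}


def intent_to_function_name(intent, max_words=4):
    """Build a function name from the first `max_words` content words."""
    words = []
    for raw in intent.split():                  # split by whitespace
        word = raw.strip(string.punctuation)    # drop surrounding punctuation
        if word and word.lower() not in STOP_WORDS:
            words.append(word)
    return "_".join(words[:max_words])


english_name = intent_to_function_name("find max value of a list")

# An intent written without whitespace is kept as one undivided chunk,
# so no summarization happens (hypothetical Japanese intent).
japanese_name = intent_to_function_name("リストをソートする")
```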

To fairly compare with previous results, we do not add test cases in prompts.

As Figure 21 and Table 21 show, using more semantically meaningful function names barely improves over the default setting. Intuitively, summarizing names from intents adds no extra semantics, but may incur information loss at the curation step, both contributing to the performance drop.

Appendix F Related Work

Code written in general-purpose programming languages often uses classes or functions from external libraries. A few datasets for code generation preserve this open-domain nature. The CONCODE (Iyer et al., 2018) dataset tested generation of Java class methods. Later works target Python generation given the interactive context of Jupyter Notebooks (Agashe et al., 2019) or natural language intents from StackOverflow posts (Yin et al., 2018; Wang et al., 2022). Despite their natural coverage, enabling open-domain code execution has faced great challenges given its diversity and complexity (Lai et al., 2022; Chandel et al., 2022). To address this issue, our ODEX provides test cases as code execution contexts for evaluation.

Execution-based evaluation has long been adopted for domain-specific programming languages such as SQL queries (Zhong et al., 2017) or logical forms (Dong and Lapata, 2016). This execution-based paradigm was not introduced to general-purpose languages until recently, with the HumanEval dataset (Chen et al., 2021), where human-written test cases are provided for code execution. Many works afterward follow this approach, but focus more on closed-domain settings (Austin et al., 2021; Hendrycks et al., 2021) or specific libraries of interest (Lai et al., 2022; Huang et al., 2022). Toward broader execution environments, we provide executable test cases for as many as 79 libraries.

Programs from different sources are organized for various purposes. Coding contest websites such as LeetCode (https://leetcode.com/) and Codeforces (https://codeforces.com/) have been used to build many code generation benchmarks (Hendrycks et al., 2021; Li et al., 2022). However, they do not necessarily align with how humans program in practical scenarios. To build datasets with natural and practical usage of code, many works use GitHub Jupyter Notebooks (Agashe et al., 2019; Huang et al., 2022) and StackOverflow forums (Yin et al., 2018; Wang et al., 2022; Lai et al., 2022) as a source of naturally-occurring code. We retain this naturalness by using StackOverflow posts, but uniquely draw from forums in multiple languages to also assist programmers worldwide.

While most benchmarks use Python test cases annotated by human programmers (Chen et al., 2021; Nijkamp et al., 2022; Lai et al., 2022), challenge-style datasets adopt a more direct approach by crawling from the web (Hendrycks et al., 2021; Li et al., 2022). Another thread of work attempts to generate test cases automatically based on the Python grammar (Lukasczyk and Fraser, 2022), but is largely limited to basic Python functions. Some propose to leverage the power of neural LMs (Tufano et al., 2020; Li et al., 2022), even jointly considering solution and test case generation (Chen et al., 2022). However, the quality and diversity of test cases are not robustly ensured. We hence use high-quality human-written test cases for ODEX evaluation.