Execution-based Evaluation for Data Science Code Generation Models

Junjie Huang, Chenglong Wang, Jipeng Zhang, Cong Yan, Haotian Cui, Jeevana Priya Inala, Colin Clement, Nan Duan, Jianfeng Gao

Introduction

Code generation models Chen et al. (2021a); Tunstall et al. (2022) have shown promising results in improving developer productivity by generating code from natural language specifications Le et al. (2020); Al-Hossami and Shaikh (2022). These results have also sparked interest in code generation for data scientists, who write data analysis scripts in interactive notebook environments such as Jupyter Notebooks Kluyver et al. (2016), where programs are written interactively in loosely organized cells (Figure 1 (1)). These differences in domain and style motivate new modeling resources specific to data science tasks, e.g., new datasets (e.g., JuiCe Agashe et al. (2019)) and models (e.g., JuPyT5 Chandel et al. (2022)).

However, we still lack a good methodology to evaluate data science (DS) code generation models. The JuiCe dataset uses BLEU Papineni et al. (2002) and Exact Match (EM), the prevailing metrics in code generation, to measure similarity between the generated and reference code. These two surface-form metrics have limitations: the former neglects syntactic features of code and the latter is too strict Ren et al. (2020). Execution-based metrics are another widely accepted line of metrics in the general software engineering (SE) domain, where the correctness of generated functions is determined by whether their outputs are consistent with oracle input-output data or unit tests. For DS problems, however, collecting an executable dataset and performing execution-based evaluation are challenging. DS notebooks usually do not come with their own unit tests, and existing datasets like JuiCe do not track the input data (such as tables) needed to run the notebooks. In addition, the outputs of notebook cells are often not "pure" values (e.g., numbers, strings, or lists) like the outputs of functions in SE problems. DS notebook cell outputs are meant for human understanding and hence may contain complex data structures (e.g., data frames, plots) accompanied by text. Simply checking whether outputs are identical is therefore too strict: it misses cases where the generated cell output is semantically correct but formatted differently from the reference (Figure 1 (2)).

In this paper, we provide a dataset for evaluating DS code generation models, dubbed ExeDS, which contains 534 data science problems built on JuiCe Agashe et al. (2019). We collect ExeDS by first crawling data dependencies for the notebooks from their original GitHub repositories and filtering out notebooks with runtime errors; we then curate 534 high-quality problems with sufficient code context and human-written natural language (NL) task descriptions as the test set. With ExeDS, we can evaluate execution correctness by comparing outputs from generated code with the desired outputs.

We experiment with 5 existing code generation models on ExeDS to identify their execution performance. Experiment results show that (1) models with high/low surface-form scores do not necessarily generate execution-correct code – for example, while Codex Chen et al. (2021a) is low in BLEU, it achieves high execution scores. (2) Execution-based metrics can better capture code errors which might be helpful for model improvements.

Related Works

Data science (DS) refers to the practice of analyzing data and acquiring insights with computational methods Donoho (2017). With the goal of improving productivity, there is increasing interest in building systems to solve a variety of DS tasks, including code synthesis Agashe et al. (2019), code synthesis for visualization Chen et al. (2021b), data preparation Yan and He (2020), and documentation Liu et al. (2021); Wang et al. (2021). In our work, we focus on code generation in DS, which generates code from code, NL, and data context.

Code generation benchmarks are predominantly evaluated by matching code surface forms Papineni et al. (2002); Lin (2004); Ren et al. (2020). These datasets evaluate explicit code generation with different input specifications, including natural language Wang et al. (2015); Oda et al. (2015); Zhong et al. (2017); Yin et al. (2018); Yu et al. (2018); Lin et al. (2018), unfinished code Iyer et al. (2018); Lu et al. (2021), and input-output examples Polosukhin and Skidanov (2018); Zavershynskyi et al. (2018). However, surface-form metrics cannot assess code the way programmers do, since programmers in practice focus on functionality and execution correctness.

Consequently, recent works turn to execution-based metrics instead, where code is considered correct if it passes a set of unit tests defined by humans Roziere et al. (2020); Kulal et al. (2019); Austin et al. (2021); Chen et al. (2021a); Hendrycks et al. (2021). However, the complex output data and the scarcity of unit tests in DS limit the application of these metrics to DS code generation. Chandel et al. (2022) explore applying unit tests in DS, but they focus only on educational problems. Table 1 compares ExeDS with various related datasets.

ExeDS for Execution Evaluation

As mentioned in Section 1, the lack of executable environments for notebooks is a key limiting factor for execution-based model evaluation on data science tasks. We therefore first construct an evaluation dataset, ExeDS, and analyze its characteristics. We then describe the methods for execution evaluation.

ExeDS contains 534 problems, each with code context, an NL task description, reference code, and a target execution output. It is built upon JuiCe Agashe et al. (2019), which contains 659K publicly available Python Jupyter notebooks from GitHub. We create ExeDS in the following steps.

Step 1: Crawling Data Context and Execution. Programming problems in DS often deal with data, which is typically stored in files (e.g., .csv) and loaded by code. Executing notebooks requires such data dependencies, which are not provided in JuiCe. Thus, we first crawl the dependent data for notebooks from their GitHub repositories. Notebooks with inaccessible data, or that use libraries not present in the Python standard library and the default DS environment, are removed. With the data dependencies in place, we execute notebooks with a time limit of 1000 seconds per cell. After execution, code cells have three types of outputs: (1) display data, i.e., a figure; (2) execute result, i.e., the textual result of executing the cell; and (3) stream output, i.e., text printed through output streams. Since it is hard to compute figure similarities, in this paper we only evaluate execution correctness on textual outputs and construct ExeDS from execute results and stream outputs.
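To make the three output types concrete, the following is a minimal sketch of how textual outputs could be pulled from an executed code cell. It assumes the cell is a dictionary in the Jupyter notebook JSON format (version 4), where the three types above appear as the `output_type` values `display_data`, `execute_result`, and `stream`. The helper names `textual_outputs` and `_join` are hypothetical and are not taken from the ExeDS pipeline.

```python
def _join(value):
    # Notebook JSON stores multi-line text either as one string
    # or as a list of strings; normalize both to a single string.
    return "".join(value) if isinstance(value, list) else value


def textual_outputs(cell):
    """Return the textual outputs of one executed code cell.

    Figures ("display_data") and error outputs are skipped, since only
    execute results and stream outputs are kept as evaluation targets.
    """
    texts = []
    for out in cell.get("outputs", []):
        kind = out.get("output_type")
        if kind == "stream":
            texts.append(_join(out.get("text", "")))
        elif kind == "execute_result":
            plain = out.get("data", {}).get("text/plain")
            if plain is not None:
                texts.append(_join(plain))
    return texts
```

A cell that prints two lines, returns a tuple, and draws a plot would yield two text entries under this sketch, with the plot ignored.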

Step 2: Dataset Filtering and Intent Curation. As some cells are overly complex for code generation, we remove examples whose target code cells have more than 5 lines or use customized methods. To preserve diversity, we downsample cells with frequent outputs, e.g., df.summary(), df.info(), df.shape, etc. To ensure that sufficient context is provided, we remove target code whose variables are absent from the previous 5 cells.
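The length and context filters in Step 2 can be sketched as follows. This is one possible reading of the rule, not the actual filtering script: it assumes that a target cell is kept only if every name it reads is assigned or imported in the previous 5 cells, defined in the target itself, or a Python builtin, and that all cells parse as plain Python (cells with IPython magics would need extra handling). The function names are hypothetical.

```python
import ast
import builtins

MAX_TARGET_LINES = 5   # from the paper: drop targets with more than 5 lines
CONTEXT_CELLS = 5      # from the paper: look at the previous 5 cells


def loaded_names(code):
    """Names that the code reads."""
    return {n.id for n in ast.walk(ast.parse(code))
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}


def stored_names(code):
    """Names that the code assigns, imports, or defines."""
    names = set()
    for n in ast.walk(ast.parse(code)):
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store):
            names.add(n.id)
        elif isinstance(n, (ast.Import, ast.ImportFrom)):
            for alias in n.names:
                # "import a.b" binds "a"; "import a as b" binds "b"
                names.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(n, (ast.FunctionDef, ast.ClassDef)):
            names.add(n.name)
    return names


def keep_example(previous_cells, target_code):
    """Apply the length filter and the (assumed) variable-context filter."""
    lines = [ln for ln in target_code.splitlines() if ln.strip()]
    if len(lines) > MAX_TARGET_LINES:
        return False
    defined = set(dir(builtins)) | stored_names(target_code)
    for cell in previous_cells[-CONTEXT_CELLS:]:
        defined |= stored_names(cell)
    return loaded_names(target_code) <= defined
```

Using the syntax tree rather than string matching avoids false hits on names that only appear inside string literals or comments.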

Since some cells lack sufficient descriptions of the problems, we recruit two university students with Python and notebook experience to manually write NL descriptions for each example. After viewing the context, target code, and output, they are asked to write descriptions that cover two aspects: (1) the function of the target code; (2) the instructions for printing outputs. We discard examples that the annotators find hard to describe.

Finally, we obtain 534 problems from 278 notebooks for ExeDS, each with code context, NL description, target code, and desired execution output.

Dataset Statistics

Table 2 shows the function types in ExeDS. We find that the majority of target code cells compute statistics (40%), explore data values (19%) or properties (10%), or perform machine learning (16%), all of which are popular DS tasks.

Table 4 presents the types of execution output across all 534 problems. We find that the majority of execution outputs are numbers, which is not surprising given the share of code that computes statistics or explores data values. Comparing numbers is also less complicated than comparing other types of data such as strings or data frames, which makes the evaluation of execution outputs easier.

Table 3 displays the most common libraries used in ExeDS. We find that most problems use data science libraries and all of them use pandas, which reflects our focus on data science code generation.

Evaluation Metrics

In ExeDS, we measure execution correctness by comparing the reference outputs with the outputs of generated code, a metric we call output exact match (OutputEM). Because many examples produce numeric outputs, we convert all numbers that appear as strings to floats with two decimal places so that equivalent numbers match. Similarly, we remove explanation strings from printed outputs to make the comparison fairer.
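The normalization behind OutputEM can be illustrated with a short sketch. The exact rules are not spelled out above, so the details here are assumptions: numbers are found with a regular expression and rewritten with two decimal places, and an "explanation string" is taken to be a leading label of the form `some text:` on a line. The names `normalize_output`, `output_em`, and `output_em_score` are hypothetical.

```python
import re

# A number not glued to a word (so the "64" in "int64" is left alone).
_NUMBER = re.compile(
    r"(?<![\w.])[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?(?!\w)"
)
# Assumption: an explanation string is a leading "label:" prefix.
_LABEL = re.compile(r"^\s*[A-Za-z][\w ]*:\s*")


def normalize_output(text):
    """Strip leading labels, round numbers to two decimals, squeeze spaces."""
    lines = []
    for line in text.strip().splitlines():
        line = _LABEL.sub("", line)
        line = _NUMBER.sub(lambda m: f"{float(m.group()):.2f}", line)
        lines.append(" ".join(line.split()))
    return "\n".join(lines)


def output_em(predicted, reference):
    """True if the two outputs match after normalization."""
    return normalize_output(predicted) == normalize_output(reference)


def output_em_score(pairs):
    """Proportion of (predicted, reference) pairs with matching output."""
    pairs = list(pairs)
    return sum(output_em(p, r) for p, r in pairs) / len(pairs)
```

Under these assumptions, a prediction that prints `Mean age: 29.7` matches a reference output of `29.699`, while a plain string comparison would reject it.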

Evaluating Code Generation on ExeDS

Based on ExeDS, we evaluate the models’ performance on data science code generation and compare both surface-form code and execution output.

We investigate the task of generating a target code cell in a notebook given its context. Figure 1 presents an example of the task. For each target code cell, we prepare a source-target example, conditioned on the prior multimodal context and the natural language intent. The context includes: (1) the three cells closest to and preceding the target cell, whether code or markdown; (2) a code statement that defines the column names of the data, in the format df.columns=['a', 'b'].
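As an illustration of how such a source sequence might be assembled, the sketch below joins the three preceding cells, the column-name statement, and the intent. The serialization is an assumption made for this example: markdown cells and the intent are rendered as Python comments, and the parts are joined with newlines. The cell dictionaries and the function `build_source` are hypothetical; the actual input format of each model may differ.

```python
def build_source(cells, target_index, columns, intent, n_context=3):
    """Assemble the model input for the cell at target_index.

    cells: list of {"cell_type": "code" | "markdown", "source": str}
    columns: column names of the data, rendered as a df.columns statement
    intent: natural language description of the target cell
    """
    context = cells[max(0, target_index - n_context):target_index]
    parts = []
    for cell in context:
        if cell["cell_type"] == "markdown":
            # Assumption: markdown is kept as comment lines.
            parts.append(
                "\n".join("# " + ln for ln in cell["source"].splitlines())
            )
        else:
            parts.append(cell["source"])
    parts.append(f"df.columns = {list(columns)!r}")
    parts.append("# " + intent)
    return "\n".join(parts)
```

The target side of each example is simply the source of the cell at `target_index`.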

Baseline Models

We test five code generation models: (1) PyMT5 Clement et al. (2020) is an encoder-decoder transformer Vaswani et al. (2017) pretrained on a Python corpus. (2) JuPyT5 Chandel et al. (2022) is an encoder-decoder transformer pretrained on Jupyter notebooks with a code-infilling objective. (3) CodeGPT and CodeGPT-adapted Lu et al. (2021) are two GPT-style models Solaiman et al. (2019) pretrained on CodeSearchNet Python functions Husain et al. (2019), where the former is trained from scratch and the latter is initialized from a GPT-2 checkpoint. (4) GPT-neo Black et al. (2021) is a GPT-style model pretrained on The Pile Gao et al. (2021), a dataset with a variety of text sources including 8% GitHub code. We evaluate three GPT-neo models of different sizes: 125M, 1.3B, and 2.7B parameters. (5) Codex Chen et al. (2021a) is the state-of-the-art model, trained on 159G of GitHub Python files starting from GPT-3 Brown et al. (2020). We test its zero-shot performance because its model weights are not accessible.

Finetuning

For training and validation, we filter a set of 123K source-target examples from JuiCe with data dependencies, where the target is any code cell and the source is the prior multimodal context as in ExeDS. We randomly select 4K examples for validation and leave the rest for finetuning. More details can be found in Appendix A.

Metrics

We report results with OutputEM, which is the proportion of examples with correct output, and surface-form metrics, i.e. BLEU, CodeBLEU Ren et al. (2020), and Exact Match (EM).

Evaluation Results

In this section, we present and analyze the evaluation results to demonstrate the advantages of our ExeDS dataset.

Table 5 shows the results of different baseline models in surface form metrics and execution correctness. We have the following main observations.

(1) For all models, the surface-form EM is close to zero while OutputEM falls in a normal range. This suggests that surface-form EM often fails to evaluate code correctness, whereas the execution metric covers more correct cases and captures correctness beyond matching code strings.

(2) Surprisingly, zero-shot Codex achieves results comparable to finetuned JuPyT5 in OutputEM, but it performs poorly on surface-form metrics. This finding suggests that Codex is strong at generating correct code and understanding the multimodal context. In addition, the gap between surface-form scores and OutputEM again shows the advantage of measuring code by execution correctness.

(3) Encoder-decoder models perform better than GPT-style models on all metrics, which indicates their strength in generating code. JuPyT5 also achieves the best performance on all metrics. One possible reason is that JuPyT5 is pretrained on a large corpus of notebooks, from which it learns the knowledge needed to use the notebook context.

Error Analysis

We give two error analyses of execution results to investigate examples with raised execution exceptions and erroneous outputs. The code examples are produced by our top-performing model JuPyT5. Detailed examples can be found in Appendix B.

Table 6 shows five exception types from 154 examples. We find that in 45% of cases, the model fails to capture data flow and uses variables that are undefined in the context. In 16% of cases, the model misuses API methods, which often leads to AttributeError, possibly due to version differences or calling methods without the required import. In 22% of cases, the model misuses the data schema of dataframes, which indicates the need to improve how code generation models use such multimodal context, especially how they incorporate the data schema. Only 8% of cases have syntax problems, suggesting the model's strong ability to generate syntactically correct code.
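A tally of exception types like the one above could be produced along the following lines. This is a minimal sketch rather than the procedure used for Table 6: it runs each generated snippet with `exec` in a fresh copy of a given namespace and records the class name of any raised exception. The function names are hypothetical, and in practice untrusted generated code should be run in an isolated sandbox with a time limit, which this sketch omits.

```python
from collections import Counter


def exception_type(code, env=None):
    """Run code and return the raised exception's class name, or None."""
    namespace = dict(env or {})  # copy so snippets do not affect each other
    try:
        # compile() is inside the try block so SyntaxError is also caught
        exec(compile(code, "<generated>", "exec"), namespace)
    except Exception as err:
        return type(err).__name__
    return None


def tally_exceptions(snippets, env=None):
    """Count exception class names over a list of generated snippets."""
    counts = Counter()
    for code in snippets:
        name = exception_type(code, env)
        if name is not None:
            counts[name] += 1
    return counts
```

Here an undefined variable surfaces as `NameError`, a misused method as `AttributeError`, and a wrong column or key as `KeyError`, which roughly corresponds to the data-flow, API, and schema errors discussed above.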

Output Errors

Table 7 shows four types of output errors from 50 examples. We find that 56% of cases have incorrect code: challenging NL descriptions and context may make it hard for models to understand the task and generate correct code. In 8% of cases, the code implements the correct function but does not call print() to output the result. 12% of cases are partially correct, where the output mismatch is caused by missing details such as absent parameters. Finally, 24% of cases produce too many outputs.

Case Study

Figure 2 gives an example predicted by JuPyT5 with a high BLEU score but erroneous outputs, which shows the advantage of execution evaluation for DS code generation. The example is a typical DS task whose intent is to explore the shape of a dataframe. The model misunderstands the intent and generates code that displays all of the dataframe's information. Although the expected shape can be found in the output, i.e., 17135 entries and 17 columns, the output is not exactly correct. However, because the code is short and the variable name is long, the overlap between the prediction and the ground truth is high, and the generated code obtains above-average BLEU and CodeBLEU scores. This example reveals the deficiency of surface-form metrics in evaluating code correctness.

Conclusion

In this paper, we propose ExeDS, an evaluation dataset that supports execution correctness evaluation for data science code generation. It consists of 534 typical data science problems from Jupyter Notebooks, each with code context, task description, target code, and desired execution output. By performing experiments with five strong code generation models on ExeDS, we find that models that achieve high surface-form scores do not necessarily produce execution-correct code, and that execution-based metrics can capture more detailed code generation errors. We hope our efforts attract more attention to code execution correctness and to generating executable code.

Limitations

First, only the test set examples have high-quality human annotation and verification, so the training set might be too noisy to train a robust code generation model. Second, the execution metric does not capture other information such as semantic relatedness, variable naming, and API usage, which also matter when judging code quality. Third, our dataset and metrics focus on Python code in the data science domain; it is unclear whether they are applicable to general software code. Fourth, our execution-based automatic evaluation is more time-consuming to compute than surface-form metrics such as EM and BLEU. Finally, evaluating generated code is very different from evaluating natural language: the final goal of code generation is to generate executable and functionally correct code. Despite these limitations, our work can serve as a pilot study that provides insights and possible solutions on how to better evaluate code generation models.

References

Appendix A Finetuning Details

We finetune all the baseline models, except Codex, on our cleaned training set and select the best checkpoint for testing according to perplexity on the dev set. All models are trained on 16 Tesla V100 32GB GPUs. The hyperparameters are presented in Table 8.

At inference time, we use beam search decoding with a beam size of 5.

Appendix B More Examples

In this section, we present 6 examples in Figure 3 - Figure 8 that show the typical types of errors leading to erroneous outputs. We also give an example of a typical error that causes an exception in Figure 9.