Natural Language to Code Generation in Interactive Data Science Notebooks

Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, Charles Sutton

Introduction

Data science is the process of extracting insights from data Wang et al. (2021a), and has become an integral part of decision making and knowledge discovery Donoho (2017). Data scientists and machine learning (ML) practitioners often use computational notebooks, which are interactive environments such as Jupyter notebooks Kluyver et al. (2016) and Google Colab, https://colab.research.google.com/ in their work. As illustrated in Fig. 1, data scientists spend significant amount of time on data wrangling and exploratory data analysis (EDA) Agashe et al. (2019); Wang et al. (2022a). This has motivated research on automating and accelerating the data science workflow in general Aggarwal et al. (2019); Wang et al. (2021a, b), with particular interest in data wrangling and EDA tasks Bavishi et al. (2019); Jain et al. (2021); Karmaker et al. (2020); Nazabal et al. (2020); Kandel et al. (2011).

On the other hand, large language models (LLMs) trained on code can assist developers by translating natural language (NL) intents into executable programs (Chen et al., 2021a; Austin et al., 2021; Chowdhery et al., 2022; Nijkamp et al., 2022; Fried et al., 2022, inter alia), with promising applications in synthesizing code for data wrangling and EDA tasks Jain et al. (2021); Rajkumar et al. (2022); Cheng et al. (2022). Computational notebooks also present unique challenges to LLMs, because notebooks freely mix NL, code, graphics, and execution results Perkel (2021), and because of their interactivity, notebooks feature multiple turns of related NL-to-code problems Heyman et al. (2021). As illustrated in Fig. 1, those problems often have interesting dependency structures. They may share common execution context (e.g. DataFrame df), form semantically coherent turns (e.g. c4,c5c_{4},c_{5}), or exhibit non-trivial long range data dependency (e.g. from c6c_{6} to c2c_{2}, or c7c_{7} to c3c_{3}). This dependency structure is more complex than existing multi-turn text to code tasks with sequentially dependent problems Nijkamp et al. (2022).

Several benchmarks have been proposed to evaluate program synthesis of data science programs from NL intents, but these datasets have several limitations. First, some datasets derive from data science tutorial notebooks Agashe et al. (2019); Chandel et al. (2022), but this data tends to contain NL text (e.g. exercise questions) which is more verbose and elaborate than the concise, ephemeral style that developers write when interacting with code LMs (Barke et al., 2022, more in § 3). Other datasets assume that the developer provides extra information, such as unit tests or input/output examples Chandel et al. (2022); Jain et al. (2022), but such systems pose an extra burden to users who might not normally write such tests or examples during their workflow Pimentel et al. (2019). Finally, existing datasets usually contain independent tasks with isolated contexts (Lai et al., 2022), or limited number of contextually dependent problems Huang et al. (2022), rather than having multiple, related tasks with complex dependencies such as in Fig. 1. Therefore, there is a need for a benchmark that provides realistic NL intents, rich notebook context, and multiple interrelated problems, so as to better reflect real-world usage by data scientists.

To fill this gap, we present ARCADE,Answer Repository for Computational Analysis and Data Engineering a new benchmark for code generation for data wrangling and EDA tasks in computational notebooks (§ 3). ARCADE consists of 1,0821,082 problems spanning across 136 notebooks for 106106 ML datasets. ARCADE features a series of NL utterances written by professional data scientists with the intention of interacting with an AI pair programmer when working in a notebook (e.g., green texts in Fig. 1), with high-quality code solutions using the pandas library. To mitigate the risk of data leakage when evaluating LLMs, 65%65\% of the problems and their notebooks are created from scratch, based on recent ML datasets on Kaggle.https://www.kaggle.com/ ARCADE also challenges LLMs with grounded language understanding, where a model needs to ground relevant semantic concepts in intents (e.g. “min and max” in u1\bm{u}_{1}) to variable states (e.g. the column df[’TIME’]) to understand the intent. Besides featuring concise NL intents and contextually rich problems, more than 67%67\% of problems in ARCADE also require complex solutions using 55 or more pandas API calls, making it more challenging than most existing benchmarks.

Problem: Code Generation in Computational Notebooks

A computational notebook is an interactive computing environment that allows mixing code, text, and graphics. The notebook itself consists of a sequence of markdown or source code cells. Formally, given a partial notebook context with kk cells {ci}i=1k\{c_{i}\}_{i=1}^{k} and a user-specified intent u\bm{u} for the next cell ck+1c_{k+1} (e.g., u1\bm{u}_{1} in Fig. 1 for k=1k=1), we aim to generate code for ck+1c_{k+1} that fulfills the user’s intent Agashe et al. (2019). This process could proceed sequentially with multiple rounds between the user and a system Heyman et al. (2021). To answer subsequent intents (e.g., u4\bm{u}_{4}), a system will leverage the updated notebook context (e.g., {ci}i=15\{c_{i}\}_{i=1}^{5}) which includes previous problems (e.g., u1u3\langle\bm{u}_{1}\sim\bm{u}_{3}\rangle).

We focus on synthesizing programs for data wrangling and exploratory data analysis tasks. Specifically, data wrangling refers to the process of cleaning and transforming data to allow for downstream analysis and machine learning, while exploratory data analysis involves querying the data for insights to support later statistical analysis and decision making. We focus on generating code using pandas, a popular Python library for performing wrangling and EDA tasks in computational notebooks, with built-in objects such as DataFrames and Series to model tabular data types (tables and columns, respectively) .

ARCADE: A Benchmark of pandas Data Science Code Generation

ARCADE consists of 1,0821,082 NL-to-code problems from both existing data science notebooks on GitHub (§ 3.1.1) and new ones curated from scratch (§ 3.1.2). In this section we elaborate on details to create annotated examples from both sources.

We identify candidate code cells performing data wrangling and EDA tasks from existing high-quality data science notebooks, and then manually annotate these cells with NL intents.

We hired a group of data scientists to annotate the notebooks selected above.Eight freelancers reported skill in pandas are hired from Upwork, with an average of 3 years of experience. Annotation primarily consists of judging the quality of candidate code cells, fixing any errors, and creating NL intents summarizing the code. Specifically, the annotators are instructed to frame their intents the way they prefer when interacting with an AI system to help them implement the existing code solution, while keeping the intents natural, concise without redundant elaboration, such as line-by-line explanation. In addition, to make the intents more challenging, we encourage annotators to refer to entities and variables in intents using semantic rewrites without introducing ambiguity (e.g., use “convert all binary columns to bool” instead of listing columns verbatim), reminiscent of synonym substitution for labeling utterances in text-to-SQL Gan et al. (2021).

Creating succinct NL intents without ambiguity could be non-trivial in this open-domain code generation setting, especially when there could be multiple plausible interpretations of an intent. For example, without the underlined part of u5\bm{u}_{5} (Fig. 1), a programmer or a system may propose alternative solutions using different table schema. Therefore, for such open-ended problems where there could be multiple alternative ways to present the answer, we ask annotators to provide extra specification in their intents about the desired output (e.g. schema of output DataFrame, such as the underlined part in u5\bm{u}_{5}). Even with these additional semantic constraints, empirically we observe around 45%45\% intents are still under-specified, making ARCADE a challenging benchmark for handling realistic NL intents with uncertainty. We present more analysis in § 3.2, while introducing a robust evaluation metric that mitigates this issue in § 3.3.

Indeed, re-purposing notebooks in the wild for our benchmark is not an easy task. As an example, many notebooks in JuICe are data science tutorials, which often contains documentation that includes background knowledge, reference materials, and even solution hints. Those extra information makes the code generation task easier, and may not reflect the style of ordinary notebooks authored by data scientists in their day-to-day work. We therefore ask the annotators to clean the notebook and remove such extra information whenever possible. We compile a 35-page annotation guideline to cover corner cases like this. Refer to Appendix A for more details about the guideline.

1.2 Creating Notebooks with Examples from Scratch

The problems derived from high-quality GitHub notebooks in § 3.1.1 capture realistic tasks and notebook contexts, but may result in artificially high evaluation accuracies due to potential leakage of evaluation notebooks to the training data of LLMs, which is a common issue in LLM evaluation Brown et al. (2020).Since JuICe and BigQuery primarily contain source files before or in 2019, this issue could be more problematic. To defend against this data contamination, we additionally annotated 660 problems by creating notebooks from scratch.

To ensure that those newly-created examples can be used to evaluate the generalization ability of code LMs on unseen ML datasets, we create notebooks targeting data wrangling and EDA tasks for 7070 tabular ML datasets that have been recently uploaded to the Kaggle data science platform since February 2022. Those short-listed datasets are manually selected from a pool of 600600 datasets with reasonably complex schema (e.g., having columns with diverse data type), and are verified by our annotators that no older-versioned datasets with similar schema appeared before.

For each ML dataset, the annotators were asked to create one notebook with a series of wrangling and EDA tasks annotated with NL intents.We only invite the top-3 performers for this task since it is harder than labeling existing notebooks. Specifically, we ask annotators to come up with tasks that they would like to perform in order to gain insights into these recently appeared ML datasets in order to build models for them. We follow the same standard to create intents as in § 3.1.1. To make the problems more challenging, annotators are encouraged to create harder tasks whose code solutions require at least 55 pandas API calls. For quality assurance, each notebook is peer reviewed by another annotator, before a final round of review by the first author. Since the annotators have already worked on the prior task of creating examples from existing notebooks, they are fairly familiar with the requirement, and are able to create each problem in 1313 minutes on average.

2 Dataset Analysis

ARCADE consists of 1,0821,082 NL-to-code problems from 133133 notebooks based on 106106 unique ML datasets. We first present some analysis on our dataset, before comparing the statistics of ARCADE with existing datasets in Tab. 1.

Since ARCADE aims to evaluate code LMs in the real-world scenario where data scientists provide succinct NL intents without extra specification (e.g. I/O examples), the intents we collected are often under-specified and may not contain sufficient information to generate a solution that executes to the exact reference output. To understand the patterns of semantic ambiguity in user-issued intents, we conduct a case study, in which we manually examine a small group of random samples. Among those samples, around 55%55\% of them are unambiguous, i.e., they are precise and sufficient to infer the target outputs. Those intents are often numerical queries with limited variety in output type (e.g., How many customers receive an income of more than 95 percentile?), or contain sufficient descriptions of output specification (§ 3.1.1, e.g., List the 3 largest towns, return the name and area). The remaining 45% of intents are under-specified. Of these, 30%30\% lack a description of target columns in output DataFrames (e.g., c4c_{4}, Fig. 1). An additional 25%25\% of intents imply that the desired DataFrames should have a complex schema, such as a nested row index or table header (e.g., Show the time of the day and the fare price for each airline), which is difficult to predict without additional information. Another 25%25\% of ambiguous intents have lists of entities as their answers, and do not specified the desired container type (e.g. List or pandas.Series). Finally, the remaining 20%20\% cases require outputs with more complex structures (e.g., multiple variables), or imply additional post-processing steps such as data imputation.

Interestingly, while nearly half of the intents are under-specified, we observe that 30%\sim 30\% of those cases can be disambiguated by referring to intents or code snippets from prior rounds of problems with similar query patterns in the notebook context. A common scenario is follow-up EDA queries (e.g. How many of them are \ldots) that refine the previous query step (e.g. Show me all \ldots) by narrowing the down search condition, while the output DataFrame structure is the same as previous turns, reminiscent of similar thematic relation patterns in contextual semantic parsing Iyyer et al. (2017); Yu et al. (2019b).

Looking closer into the problems in the Existing Tasks (§ 3.1.1) and New Tasks (§ 3.1.2) splits, the New Tasks split has more challenging problems than Existing Tasks, as measured by the number of pandas API invocations and the AST size in reference code solutions (Tab. 1, Bottom). Fig. 2 plots the histogram of pandas API usage, which shows 67%67\% problems in New Tasks require at least 5 API calls to solve them. As we show in § 5, with more complex held-out problems targeting for recent ML datasets, the New Tasks split serves both as a more robust and challenging benchmark for state-of-the-art code LLMs.

Tab. 1 compares ARCADE with existing datasets for data science code generation in computational notebooks. Specifically, JuICe Agashe et al. (2019) contains exercise problems in assignment notebooks from data science tutorials or coures, where the NL intents are usually elaborative assignment problem definitions. Notebooks in JuICe are not executable so evaluation is performed by surface-level matching (exact match or Bleu) between reference and predicted programs. DSP Chandel et al. (2022) contains problems from a filtered set of JuICe notebooks that are executable and also associated with unit tests for auto-grading. Hence the intents in DSP follow similar patterns as those in JuICe. To ensure that the free-form model-predicted code is compatible with unit tests, DSP uses the unit test code iteself as extra model input besides NL intents to constrain the model to generate code that could be directly consumed by the tests. Next, ExeDS Huang et al. (2022), a concurrent work to this paper, is another set of filtered problems from JuICe. Similar to this work, ExeDS uses hand-annotated intents, and compares the execution output between reference and predicted code for evaluation instead of relying on unit tests (§ 3.3). Moreover, NLGP Heyman et al. (2021) is another collection of the NL-to-code problems in Jupyter notebooks with short annotated intents for simple data manipulation tasks, where most notebooks have one associated problem. Finally, for the sake of completeness, we also list another concurrent work DS-1000 Lai et al. (2022), a collection of data science problems derived from StackOverflow questions. It primarily features problems using synthetic contexts with minimal working examples, and therefore does not concerns with code generation in notebooks with interrelated problems on general ML datasets.

We remark that ARCADE is the only benchmark that satisfies all the following criteria: First, our dataset consists of manually annotated, concise NL intents (“Intents Type” column in Tab. 1). Those utterances are significantly shorter (“Intent Length” column) than verbose markdown texts often found in tutorial notebooks (c.f. JuICe, DSP), and do not require a model to rely on extra specification such as unit tests (c.f. DSP). Those succinct intents are also under-specified, requiring a more robust evaluation metric to consider alternative answers (later in § 3.3). This may also motivate future research on generating more diverse predictions to cover possible interpretations of a problem — which we will explore in § 5 — or even models to explicitly capture such uncertainty in NL intents Lin et al. (2022). Second, ARCADE contains more interrelated problems in a single notebook (“Problems per N.B.”) with complex dependencies (Fig. 1), capturing the essence of interactive computing. This makes our dataset useful in testing an LLM’s skill at understanding rich contexts, including existing user-written code and markdown cells, as well as preceding problems and their solutions (§ 2). Third, ARCADE challenges LLMs with grounded language understanding, where the model needs to link relevant concepts (e.g. “max and min” in u1\bm{u}_{1}) in an intent to the corresponding variable execution states in the context (e.g. content of the TIME column in df). Such necessity of understanding semi-structured tabular data contents Pasupat and Liang (2015) and performing data transformations using an open domain programming language (Python) potentially makes NL grounding in ARCADE more challenging than other applications like database semantic parsing Yu et al. (2019b). Fourth, ARCADE has problems with complex data transformations and richer usage of real-world data science APIs. As evidence for this, note that the number of pandas APIs used in each problem (“No. API” in Tab. 1)Calculated by counting function names in a predefined list of pandas functions, including advanced array indexing operations. Functions with similar names from other libraries (e.g. numpy) may also be included. is on par with DS-1000 and significantly higher than ExeDS, DSP and NLGP. Finally, nearly 70%70\% of problems and their notebooks (New Tasks) are curated from scratch targeting for recent ML datasets, which mitigates the risk of training data contamination of LLMs, and also makes ARCADE a more reliable benchmark to test the generalization ability of LMs in understanding held-out tabular datasets Lee et al. (2021).

To summarize, ARCADE features more realistic NL intents and multiple related problems in the same notebook context to better reflect the real-world scenario where data scientists prompt LLMs using ephemeral NL comments for assistance with complex data wrangling or EDA tasks Barke et al. (2022), therefore making it ideal to evaluate LLMs in this realistic setting. Nevertheless, we remark that existing datasets in Tab. 1 usually contain broader types of problems other than the wrangling and EDA tasks (e.g. fitting ML models) considered in this paper. We leave expanding the task spectrum as important future work.

3 Evaluation by Fuzzy Output Matching

Recent code generation datasets have shifted from surface form evaluation metrics like Bleu Yin and Neubig (2017) to functional correctness, where the execution results of predicted programs on some input examples are matched with the reference output verbatim Hendrycks et al. (2021); Chen et al. (2021a); Austin et al. (2021), or approximately (for complex output like plots, see Chandel et al., 2022) using unit tests. Such evaluation methods assume a model generates code that strictly follows the signature of unit tests, which is impractical in data science notebooks with free-form code.

In this paper we aim to evaluate accuracy of code LLMs in synthesizing code for data science notebooks in the wild, i.e., the model is only conditioned on preceding notebook context besides a short user-issued intent (§ 2). As discussed in § 3.2, those intents are often under-specified and have multiple alternative solutions. We therefore use a set of heuristics to approximately match the execution outputFor code that in-place modifies a variable (e.g. df in c2c_{2}), we treat the modified variable as the output. of a predicted program with the annotated reference to determine if they are functionally equivalent. These heuristics primarily cover two scenarios. First, if the output variable is a container type,List, Tuple, Set, numpy.ndarray, pandas.Series and single-column pandas.DataFrame. we canonicalize variables to the same type. Second, we allow for partial matching between complex DataFrame variables. Formally, consider a reference DataFrame v\bm{v} with a set of columns {vi}\{v_{i}\}, where each viv_{i} is a vector of cell values for the ii-th column. We define that v\bm{v} is equivalent with the output DataFrame of a predicted program v^\hat{\bm{v}}, iff for any vivv_{i}\in\bm{v}, we have viv^v_{i}\in\hat{\bm{v}}. Intuitively, we consider a predicted program as correct if its output DataFrame contains all the columns (and cell entries) in the reference frame. Intuitively, a predicted program is already useful as long as it covers all the necessary information in the reference. In cases of DataFrames, a user could easily create a more compact view of the frame by selecting a subset of target columns.

Empirically, we find our evaluation metric is reliable in identifying solutions with alternative output structures, with a relatively low false-negative rate (under 10%10\%). We present an error analysis in § 6.

PaChiNCo: Adapting Code LMs to Computational Notebooks

In this section we elaborate on PaChiNCo, our code LM for Python notebooks.

PaChiNCo is based on PaLM, a family of Transformer-based LMs developed for few-shot learning for general natural language tasks Chowdhery et al. (2022). Specifically, we use the 62B PaLM model trained on 1.31.3T tokens with a mixture of conversational, webpages and code data (Section F, Chowdhery et al. (2022)). Starting with 62B PaLM as the base model, we perform two stages of fine-tuning, first on Python source code data, followed by another fine-tuning run on a collection of Jupyter notebooks.

We first fine-tune PaLM 62B on a large corpus of Python source code with 64B tokens. This corpus consists of deduplicated Python source files with permissive license collected from GitHub. We finetune PaLM for 1 epoch with a batch size of 256 while keeping the other hyper parameters consistent with Chowdhery et al. (2022). As we show in § 5 and also in Appendix C, this Python-finetuned PaLM 62B model is already a strong code LM, registering significant improvements on ARCADE over the base model, while even outperforming the larger code LM PaLM-Coder 540B on existing benchmarks.

Experiments

To demonstrate the challenges of ARCADE, we compare the performance of PaChiNCo with existing large code language models.

For each problem, we convert the problem into a prompt (§ 5.2) and draw random samples from the LMs using nucleus sampling with a top probability of 0.950.95 and a temperature 0.80.8. Refer to Appendix D for details.

Following prior work Chen et al. (2021a); Austin et al. (2021), we measure model performance using the pass@kk metric, defined as the fraction of problems with at least one correct sample given a sample size kk. To reduce variance, we estimate pass@kk (k30k\leq 30) by drawing 5050 samples for each problem Chen et al. (2021a).

2 LM Prompting Strategies

For a given problem, we conducted two types prompting experiments: prompting using only the notebook context of the problem (§ 5.3), and few-shot prompting with extra exemplars as prompt prefix before notebook context (§ 5.4), in order to impose more control to the style of predicted code.

In our default setting (§ 5.3), for a given problem we turn its notebook context to a prompt to query LLMs. Fig. 3 gives an example. The context consists of preceding cells before the problem, which may include previous rounds of interactions (§ 2), and is followed by the current NL intent. To help LLMs understand intents using variable state information (§ 3.2), we also add NL descriptions of the schema of the dataset imported to the notebook (schema description, e.g. cell 3, Fig. 3), which consist of columns and example values in the DataFrame. Representing structured schema information in NL is a common strategy when prompting LLMs to solve tasks requiring understanding structured data Xie et al. (2022). Next, for the following problems after cell 4, we include reference solutions to previous problems in their prompts, following the multi-turn task-oriented dialogue setting in Andreas et al. (2020). See Appendix N for a complete example.

2.2 Few-shot Prompting

Besides the basic setting using the notebook context of a problem, we also explored prompting using additional NL-to-code exemplars as prompt prefix before the notebook context. The motivation is to nudge the LM to generate code solutions following different coding and commenting styles, such as documenting the “reasoning path” used by the code. We develop four prompting strategies:

Vanilla code strategy as in Fig. 4, completion 4a. The predicted code follows the common practice of chaining multiple API calls in a single line.

Step-by-step (SbS) code strategy as in Fig. 4, completion 4b, which leads to code with fine-grained decomposition structure.

SbS code with preamble, as in Fig. 4, completion 4c, which could further elicit decomposition in predictions.

SbS code with preamble and explanation, as in Fig. 4, completion 4d, with inline NL explanations per step.

3 Results

In our first set of experiments, we evaluate the performance of state-of-the-art code LMs on ARCADE. Results are listed in Tab. 2. As in § 5.2, inputs to those models are only the notebook context (no few-shot prompting). We truncate the prompt up to 900900 tokens using the tokenizer of PaLM.

Next, we discuss results on public code LMs.We also evaluate Codex. Results are in Appendix G. First, among models at similar size and trained with roughly similar amount of Python code data, namely InCoder 6B and CodeGenmulti{}_{\textrm{multi}} 6B, InCoder 6B performs better. This is likely due to that InCoder contains dedicated portion of Jupyter notebooks (8GB) in its training data, which makes the model more suitable for ARCADE. With 4×4\times more Python data, CodeGenmono{}_{\textrm{mono}} 6B takes over, which is in line with the the performance pattern of these two model families on existing code generation datasets (Fried et al., 2022). Moreover, comparing CodeGen models at different size, we observe that pass@kk scales linearly with model size. As we discuss more in Appendix F, the scaling curve for ARCADE (New Tasks) is also more flatten than that for other code generation datasets, which might suggest that it is more challenging for CodeGen.

It is interesting to compare results between the Existing Tasks and New Tasks splits, where we observe two patterns. First, results on Existing Tasks are significantly higher across all models. Second, comparing the improvements after Python and notebook-specific fine-tuning of PaLM 62B, the gain on New Tasks is higher. These results are likely due to two factors. First, an obvious reason is that problems in Existing Tasks are overall simpler than New Tasks (§ 3.2). Second, it is likely that some code data similar to our evaluation notebooks in Existing Tasks split is leaked into the training data of those LMs. Specifically for PaLM, its training mixture contains general Web documents and sampled Python source files. Despite the fact that we went through significant amount of effort to deduplicate the training data against ARCADE during the second-stage notebook fine-tuning (§ 4), given that some of the notebooks used to create Existing Tasks are popular data science tutorials on GitHub, it is still possible PaLM has already seen similar code data from the Web or notebook-converted Python source files. This suggests the importance of evaluating code LMs on held-out data, which is the purpose of New Tasks. We leave investigating training data contamination as important future work.

To better understand PaChiNCo’s performance on problems at different levels of complexity, we plot pass@3030 with respect to the number of pandas function calls in the annotated reference solutions, as shown in Fig. 5. For problems with similar complexity, PaChiNCo generally achieves higher pass rate on Existing Tasks, again suggesting that the New Tasks split is still more challenging even after controlling problem complexity. Refer to Appendix H for a similar plot w.r.t AST size.

ARCADE requires a model to leverage rich programmatic and NL context in test notebooks to generate code solutions for the current cell. To study PaChiNCo’s performance with varying amount of available notebook context, we control the number dd of context cells {ci}i=kdk1\{c_{i}\}_{i=k-d}^{k-1} (§ 2) when generating code for each problem (at cell ckc_{k}) in our dataset. Fig. 6 depicts pass@3030 as a function of the context size dd. Since we use the first preceding cell ck1c_{k-1} to store the NL intent uk\bm{u}_{k} for ckc_{k} (Appendix N), having only one context cell is equivalent to the “cold-start” setting of only using uk\bm{u}_{k} (besides schema description) to predict ckc_{k}. PaChiNCo achieves a pass rate of 44%44\% (existing tasks) and 17%17\% (new tasks) in this challenging setting (d=1d=1), with errors mostly due to failure in referring to variables that the solution relies on, whose information is not present in the short context. Indeed, including additional context cells is crucial for good performance. In particular, having 3 context cells could already lift the pass@3030 to 72%72\% and 36%36\% on the two splits — 1.62×1.6\sim 2\times higher than d=1d=1. The results also start to plateau after including 575\sim 7 context cells, with diminishing returns after including more cells, which is in line with findings in Agashe et al. (2019). Prior work suggests that the plateau point is around three neighboring cells, while in our case, the number if approximately doubled since we need extra cells to include intents in previous turns (Appendix N). Empirically, we observe that using more context helps to reduce schema understanding errors (e.g. using undefined columns in DataFrames). Refer to Appendix I for more details.

Another interesting angle to study the effect of context is through the lens of model accuracy when solving problems ckc_{k} at different locations. Intuitively, problems located later in a notebook (kk is larger) would have more context available, therefore they could be easier to answer Wang and Cho (2016). Fig. 9 shows pass@3030 on problems grouped by their preceding context size, which shows increased task success rate when solving problems with more context, confirming the prior intuition.We remark that our setup is different from multi-turn semantic parsing where later turns are conditioned on the predictions of prior turns, while we use reference solutions (§ 5.2). See § 8 for discussion.

4 Few-shot Prompting Beyond Test Notebook Contexts

Our previous set of experiments demonstrate diminishing returns after including more notebook context cells. As motivated in § 5.2, besides test notebook context, we use few-shot prompting with additional exemplars to teach the model to perform this challenging task, while generating code in different styles. Here, we report both functional correctness (Fig. 7) as well as metrics evaluating the style of model-predicted code (Fig. 8) for problems in New Tasks. We only evaluate PaChiNCo due to that the prompt length (maximal 2,1002,100 sub-tokens) exceeds the limit of public code LMs.

Error Analysis

To understand the types of errors that LLMs make on ARCADE, especially on challenging problems, we conduct an error analysis on PaLM’s predictions on the New Tasks split (Tab. 2). Overall, we notice a significant drop in execution errors after two-stage code fine-tuning (PaLM\mapstoPaChiNCo, § 4). Out of all the incorrect predictions from PaChiNCo under the fuzzy output matching evaluation metric (§ 3.3), roughly 35%35\% of the samples result in execution errors, while the remaining 65%65\% predictions have executable programs but are functionally incorrect. First, for un-executable samples, we present an analysis of the distribution of different execution error types, as illustrated in Fig. 12. The primary source of execution error is KeyError and AttributeError due to reference to non-existing index or columns in DataFrames. While in the prompts we provide NL descriptions of schema for DataFrames loaded to notebooks (§ 5.2), such descriptions for intermediate DataFrames that are later derived in the context are still missing due to limited prompt length, and the model may not be able infer their schema information solely from the source code. This could be especially problematic for APIs that create compound intermediate DataFrames with complex schema, such as pd.groupby, which accounts for that more than 50%50\% of those KeyErrors and AttributeErrors. Similarly, other execution errors such as ValueError and TypeError are often caused by the insufficient knowledge about the DataFrame contents. For example, ValueError occurs when a model tries to calculate the mean of a column which has NaN values. This finding suggests the importance of developing LLMs that could handle longer context Transformers et al. (2022) in order to include more DataFrame information in prompts. We gave a detailed case study on these types of execution errors in Appendix L.

Next, we conduct a manual analysis on 5050 randomly sampled incorrect predictions (with a focus on executable but incorrect predictions). The cause of these errors can be grouped into the following categories: 1. Complex problems requiring non-trivial reasoning or data transformation steps (43%43\%); 2. Errors in interpreting NL intents, such as missing a requirement specified in the intent (e.g. round to two decimal places) in the code solution (26%26\%); 3. Errors caused by under-specified intents (§ 3.2, 19%19\%); 4. False-negatives due to limited coverage of the fuzzy-matching evaluation metric (§ 3.3, 6%6\%); 5. Annotation errors (6%6\%).

The primary source of errors is due to complex problems, which reiterates the motivation of ARCADE — evaluating code LLMs on challenging data wrangling and EDA tasks. The second majority type of errors (misunderstanding intents) suggests room to improve PaChiNCo’s skill in instruction following. Next, a non-trivial amount of errors are caused by under-specified intents, which are common in the setting of prompting LLMs using short instructions (§ 3.2), calling for future research to specifically address this issue. Finally, our evaluation metric based on fuzzy output matching seems effective in identifying plausible alternative solutions. Still, there are non-trivial cases where there are multiple ways of presenting the outputs (e.g. DataFrames with nested columns or different orientations, Fig. 32). We present more detailed examples of these errors in Appendix I.

Case Study: How Useful is Predicted Code with Step-wise Explanations?

In § 5 we show how prompting PaChiNCo to generate code with step-wise explanations is helpful in improving solution diversity (Fig. 10) and also accuracy of self-consistency reranking (Fig. 11). In this section we conduct a qualitative analysis on model-generated code snippets with explanations to explore how they could be useful for novice data scientists.

First, we observe that NL explanations could help users follow the flow of complex data transformations for programs involving a chain of pandas operations. By decomposing and explaining how data is manipulated after individual transformation steps, it is easier for users to understand the solution and track its dataflow behind the scene, especially when some steps involve complex computation (Fig. 19), or the underlying schema is less intelligible (e.g., column names with abbreviations, Fig. 20). Additionally, some inline explanations also describe the output of intermediate steps, which is particularly helpful when these steps involve advanced pandas functions whose output may not be obvious, such as pd.unstack (Fig. 21)

Meanwhile, step-wise NL explanations serve as high-level procedural descriptions of code, which enables users to easily browse through and understand different solution approaches without being distracted by nuances in the actual code implementation (Fig. 22). Moreover, explanations also help users verify the code solutions by identifying potentially incorrect steps and make necessary corrections (Fig. 23). The observations presented here offer insight into potential future avenues to improve the utility of code LMs for developers through the use of step-by-step explanations, which we leave as important future work.

Related Works

In recent years there has been a burgeoning of LMs trained on code Chen et al. (2021a); Austin et al. (2021); Nijkamp et al. (2022); Fried et al. (2022); Li et al. (2022); Tunstall et al. (2022); Xu et al. (2022). Many generalist LMs also have source code as part of their training data mixture Thoppilan et al. (2022); Chowdhery et al. (2022); Wang and Komatsuzaki (2021); Black et al. (2021). Those LLMs have registered impressive performance on a variety of benchmarks for code, such as code translation Lachaux et al. (2020a), program repair Gupta et al. (2017), and natural language to code generation Chen et al. (2021a); Hendrycks et al. (2021); Austin et al. (2021); Li et al. (2022).

Given the sheer amount of tasks involved in the lifecycle of data science and the expertise required, it calls for development of systems to automate this lifecycle Aggarwal et al. (2019); Wang et al. (2021a, b). As an example, automating the process of feature engineering and finding the best-performing models has been the central theme of AutoML research He et al. (2021); Karmaker et al. (2020), with well-established systems Feurer et al. (2015) and evaluation protocols Zöller and Huber (2021). In this paper we focus on automating data wrangling and EDA tasks, which account for nearly the same amount of code and documentations in notebooks as that for feature engineering and building ML models Agashe et al. (2019); Wang et al. (2022a). Along this line of research, AutoPandas Bavishi et al. (2019) synthesizes pandas programs for data transformation given examples of input/output DataFrames. Other approaches aim to generate pandas programs using LLMs using both NL intents and I/O samples Jain et al. (2021) or unit tests Chandel et al. (2022). In this paper we consider code generation in notebook environments with multiple rounds of problems and interdependent contexts (refer to § 3.2 for detailed discussions of recent work). Related to this line, another thread of research focuses on synthesizing problems for visualization plots Amar et al. (2005); Narechania et al. (2020); Fu et al. (2020); Chen et al. (2021b); Wu et al. (2022). ARCADE also includes 57 plotting examples which are not used in this paper. We leave collection and systematic evaluation of plotting tasks as important feature work.

Our work is another application of context-driven code generation, which concerns with mapping a series of contextually dependent utterances to executable programs, such as domain-specific logical forms Zettlemoyer and Collins (2009); Long et al. (2016); Iyyer et al. (2017), SQL queries over databases Suhr et al. (2018); Yu et al. (2019a, b), or general-purpose languages (PLs) like Python Nijkamp et al. (2022). Some of those existing works also feature EDA problems, mainly for querying structured databases, while the problems in ARCADE exhibit more complex query patterns (e.g. string manipulation over semi-structured data, as u1u_{1} in Fig. 1), in addition to extra data wrangling tasks, thanks to the expressiveness of pandas API and the general-purpose PL (Python) compared to domain-specific formalisms. As the first step forward, in this paper we perform evaluation using ground-truth solutions for previous turns (§ 5.2), similar to multi-turn program synthesis for task-oriented dialogue in Andreas et al. (2020). We leave conditioning on model-predicted solutions as context for evaluation Nijkamp et al. (2022) as an important future avenue.

Conclusion

Acknowledgements

We are grateful to Meg Risdal and Goeff Thomas from Kaggle for help with dataset collection, and Miltos Allamanis for research discussion. We thank Jo Chick from the research partnership team, and Rebecca Watson, Ashley Dawe and Kimberly Herrera from Upwork to help with managing the annotation project. We thank Aroma Mahendru for writing the data card section. We also thank Cheriskumar Patel, Preet Patel, and Jayendra Parmar for general assistance with the project.

References

Appendix A Outline of ARCADE Annotation Guideline

In this section we provide a brief summary of the outline in our annotation guideline.

The annotators are given a list of Jupyter notebooks. Each notebook uses pandas to perform certain data analysis tasks. For each notebook, an annotator is asked to:

Identify code cells that contain instructive code snippets that perform data wrangling or exploratory data analysis tasks.

Fix the notebook and make them clean and executable.

For each code snippet identified in Step 1, create natural language descriptions of the task. Also verify the code solution and fix them as appropriate. Finally, remove any redundant text in the notebook (e.g. solution outline or hints for tutorial notebooks) that could give away to the refernce solution.

Specifically, for step 3, in order to collect realistic NL intents, the annotators are given the following high-level description, followed by detailed instructions and examples.

Below we share some suggestions to write good intents.

Keep it natural without redundant explanations. Imagine an AI programmer can help you accomplish simple data wrangling and EDA tasks, what kind of intents will you send to the system? Our goal is to collect real inputs to such a system from data scientists like you.

One idea to write good intents is to keep it concise such that another programmer could quickly understand and implement a solution that executes to the same outputs. You are encouraged to create simple, short intents while describing the desired outputs without much ambiguity.

For each ML dataset we provided, an annotator creates a Colab notebook with code snippets for some interesting data wrangling and exploratory data analysis tasks using this dataset. Each code snippet is paired with its natural language intent, simliar to the process of annotating Existing Tasks. We ask annotators to feel free to work on any tasks that they may find interesting for the given dataset, as long as the code solution for the task should consist of multiple lines and use different pandas API functions. Different from annotating Existing Tasks, we ask them to first create a natural language intent for their task, and then write a code solution in the next cell.

Below is an excerpt from the annotation guideline describing the types of data wranling and EDA tasks to create.

In general, you may create whatever exploratory data analysis tasks that you find interesting for the given datasets. To come up with interesting tasks, you can think in this way: before training your ML models for the dataset, what kind of data wrangling or EDA tasks would you like to perform on the dataset? Below are some more concrete descriptions of such wrangling or EDA tasks:

Data Preprocessing/Wrangling Tasks which involves modifying existing dataframes or creating new ones. Such as normalizing column names, adding new columns, modifying existing columns (e.g., converting string values to date times), generating new dataframes using ops like group_by, and so on. Some datasets we shared are just raw data without any preprocessing or cleaning. Feel free to . Please also refer to Section: Identify Code Snippets to Annotate in our previous annotation guideline for more examples.

Exploratory Data Analysis Tasks that Require Some Wrangling and Preprocessing Answering interesting EDA questions using the given dataset, but some data wrangling steps are required in order to derive the answer. For example, given a dataframe df of user shopping history and credit card expiration dates in the format of df.loc[‘cc_exp’] = ‘08/26’. To answer the EDA question “How many users have a credit card expiring in 2024?”, we need to first convert the expiration year from the string-formatted cc_exp column.

To encourage the annotators to create more complex tasks, we also provide the following high-level instruction:

You should create relatively complex tasks that require multiple steps and also a combination of different pandas APIs to solve them. Avoid problems that can be solved using one-liner code such as df.group_by(...).sort_values(...). An ideal task should be reasonably complex and needs to be broken down into multiple smaller steps to solve, and each step may require using one or multiple pandas functions.

As a general rule of thumb, you should aim at creating tasks that either have at least 50 tokens or use at least 4 pandas APIs (dataframe/series indexing, like df[df[‘continent’] == ‘NA’] is also counted as one API usage). You can find more concrete example tasks at the end of this doc.

Our annotation guideline is 35-pages long in total, which we provide on a per request basis. Please contact pcyin@google.com to request access.

Appendix B Details of Fine-tuning PaChiNCo

We convert computational notebooks for finetuning (§ 4) and evaluation (§ 5.2) to Python source code using nbconverter. Specifically, markdown and code cells in a notebook are concatenated using the special delimiter ‘# In[]:’, and text in markdown cells is commented out using the ‘# ’ prefix. See Fig. 39 for an example of the linearized notebook for Fig. 1 (up to c3c_{3}). Jupyter notebooks that are converted to Python files in such format are common in GitHub repositories, which mitigates the domain transfer gap between general Python code and notebook-specific data, and also allows us to prompt public code LLMs, which might have seen similar data during pre-training.

Appendix C Performance of Python-finetuned PaLM 62B on Code Benchmarks

Appendix D Inference Setup

For CodeGen, we use the inference script from the official GitHub repository. For InCoder, we follow the official inference example script and use the release on Huggingface model hub. We convert each example in our dataset to Python source code to a prompt, as outlined in § 5.2. Notebooks are linearized using nbconverter similar as generating fine-tuning data (Appendix B). One exception is InCoder, for which we follow Fried et al. (2022) and use the Jupyter notebook linearization template used in its pre-training.

At inference time, we left-truncate notebook context up to 900 tokens (measured by PaLM’s vocabulary), which fit in the context window size of all LLMs we evaluated. We also make sure to always include NL schema descriptions in prompts given their importance in understanding NL intents. In addition, for few-shot experiments in § 5.4, we use additional 1,200 tokens to accommodate the prompt prefix, making the total maximal prompt length to be 2,100. Due to its excessive length, we only perform few-shot prompting experiments on PaChiNCo since its rotatory positional embedding Su et al. (2021) could generalize to encode longer contexts at inference time. During sampling, we set the maximum target length to be 512 tokens.

Appendix E pass@111 for PaChiNCo

For reference, we also report pass@11 of PaChiNCo on ARCADE. Samples are generated using the same hyper parameters as in § 5.3, except for the temperature, which is 0.2.

Appendix F Scaling Curve of CodeGen on ARCADE

Fig. 13 depicts the scaling curve of ARCADE with respect to the number of parameters for CodeGenmono{}_{\textrm{mono}} models. The pass rate scales nearly linearly as a function of model size, and the performance has not saturated, especially on the New Tasks split. This shows ARCADE is a reliable dataset to study the scaling behavior of code LLMs. The slope of the curve on New Tasks is also smaller than on other datasets, suggesting that this problem set is more challenging for CodeGen models.

Appendix G Comparing with Codex on ARCADE

For reference, we also report the results on ARCADE using the public Codex API, following the same evaluation setting as in Tab. 2. PaChiNCo significantly outperforms the smaller Codex-cushman-001 API, on par with Codex-davinci-002 on Existing Tasks, while Codex-davinci-002 is stronger on New Tasks. While we cannot gain much insight from the results due to limited knowledge about Codex-davinci-002, through error analysis, we find that Codex-davinci-002 is better at instruction following, especially in understanding NL descriptions of complex DataFrame schema (§ 5.2). We leave improving the instruction following skills of PaChiNCo as interesting future work.

Appendix H Performance on Problems with Different Complexity

Fig. 14 plots pass@3030 with respect to the AST size of reference programs. Similar to Fig. 5, results on New Tasks are generally lower. Meanwhile, it seems that AST size correlates better with pass@kk compared to the number of API usage, while the latter metric offers more intuitive information about the data transformation steps involved.

Appendix I Distribution of Execution Error Types w.r.t Context Size

In Fig. 9 we observe that pass@kk improves when more notebook context is available. Fig. 15 illustrates the distribution of execution error types on failed predictions. Notably, using more notebook context cells significantly reduces the chance of NameErrors caused by using undefined variables in context. The number of KeyErrors is also reduced, indicating that the model makes fewer schema understanding errors when referring to columns in DataFrames.

Appendix J Few-shot Prompting Results on Existing Tasks Split

Appendix K Further Analysis of Solution Diversity

Here we provide further analysis of the diversity in solution patterns, measured by the number of distinct pandas API call sequences used in the samples. Fig. 18 shows a cumulative distribution of the number of solution patterns for different subsets of the predictions: all predictions, only those that execute successfully, and only the correct predictions. In each case, we see that step-by-step prompting leads to increased diversity compared to the baselines, and prompting with NL explanations further increases the diversity. While increased diversity is helpful in finding a correct solution at higher sample size kk, it is also helpful when considering only correct solutions because a user might want to see a variety of solution approaches, whether for educational purposes or to choose the one they like best (which is partially subjective). Refer to § 7 for such examples.

Appendix L Error Analysis Examples

In this section, we provided a preliminary study on selected example model errors. We analyzed two types of errors: execution error and semantic error (executable but incorrect). Execution error has error message from the notebook environment. We can classify these errors into more fine-grained categories in Fig. 12. As the result shows, KeyError is the top error mode in the execution errors. Over 50% of the KeyError are associated with the pd.groupby API call. pd.groupby API call changes the dataframe schema as the model generates the code. For example, pd.groupby(*).mean() will remove non-numeric columns in the dataframe. This requires the model to have a profound understanding of the dataframe schema. We gave an example in Fig. 24. The column shipping_fee is string value which will be removed after df.groupby(ship_state).sum(). The secondary source of execution error is AttributeError, which shares a similar cause to the KeyError. This is because AttributeError is often triggered by calling a non-existing column as an attribute of a pd.dataframe. An example is given in Fig. 25, where the model tries to call the non-existing column signupdate as an attribute of df_users, leading to the AttributeError. These two error modes suggest that building a better schema aware language model is a promising future research direction. We also present Fig. 26 and Fig. 27 as the example for TypeError and ValueError respectively. These two error modes are often caused by insufficient knowledge of the column values. For example, the model tried to compare a string-value column to a integer in Fig. 26, which causes TypeError. Fig. 27 showcased that the model tries to apply numeric operation pd.DataFrame.mean() on a column with NaN values, leading to ValueError.

Semantic errors account for more than 65% of the incorrect predictions. We cannot get any error messages from these executable but incorrect programs. Therefore, we conducted a manual analysis on 50% randomly sampled incorrect predictions. We classified the semantic errors into 5 categories: 1. Complex task requiring non-trivial computation (43%43\%); 2. NL misunderstanding (26%26\%); 3. Under-specified intents (19%19\%); 4. False-negatives of the fuzzy-matching evaluation metric (6%6\%); 5. Annotation Errors (6%6\%). To complement the discussion in § 6, we showcase examples of these types of errors. The primary source of semantic error is complex reasoning. Examples are given in Fig. 28 and Fig. 29. In Fig. 28, the model need to infer that last 10 year can be computed by dt.datetime.today().year - 10 then apply string operation str.contains(accident) to select the required contents. Fig. 29 is another example of complex data wrangling steps. To generate the correct program, the model need to compare the current rank to the past rank, ensure the 2021 rank column exists and then aggregate the information. NL misunderstanding is the secondary source of semantic errors. In Fig. 30, the generated output does not reflect the institute constraint in the intent. Another source of semantic errors is under-specified intent, which happens occasionally in natural language instructions. In Fig. 31, the intent does not specify the output format. The program generated by the model calculates the sum of front and left facing trees while the reference calculates them respectively. The evaluation fails but the model output is not necessarily incorrect in this case. Fig. 32 illustrates another type of semantic error: evaluation coverage. In this example, the model gives an acceptable answer which only differs from the reference in the column order. It does not pass the evaluation due to its limited coverage.

Appendix M Data Card for the Training Data of PaChiNCo

We provide a data card for the training data of PaChiNCo as outlined in § 4, following the structure presented in Chowdhery et al. (2022). We also report training data composition in Tab. 7.

Appendix N Detailed Prompting Examples

In this section we provide detailed examples of prompts used in our experiments. As in § 5.2, there are two categories of experiments in § 5, namely prompting using notebook context (§ 5.3) and few-shot prompting with extra exemplars pre-pended to notebook context (§ 5.4). Here, we list the prompts for u2\bm{u}_{2} in Fig. 1 in these two types of prompting experiments.

In the basic setup without extra few-shot exemplars, the prompt basically consist of all the prior notebook context, including NL descriptions of schema information and previous rounds of problems. Fig. 39 shows the complete prompt for u2\bm{u}_{2} in Fig. 1 in this setup.This prompt is not exactly the same as the one in our dataset. It is adapted to align with the illustrative example in Fig. 1 At inference time, a code LM will complete the last code cell after the cell delimiter ‘# In:’. Note that for InCoder we follow Fried et al. (2022) and use a special template to linearize notebooks (Appendix D).