Training and Evaluating a Jupyter Notebook Data Science Assistant

Shubham Chandel, Colin B. Clement, Guillermo Serrato, Neel Sundaresan

Datasets

We train our models on all Jupyter notebooks from all public GitHub repositories with Jupyter Notebook as the primary language label as of April 2021, excluding repositories from which our Data Science Problems (DSP) evaluation was curated. We first discuss the filtering pipeline which yielded the 1119 high-quality data science questions, each with validating unit tests, that compose DSP.

HumanEval consists of 164 hand-written programming problems with accompanying test cases. These are one-to-one natural language to code pairs which test reasoning, algorithms, and simple mathematics. The Mostly Basic Programming Problems (MBPP) dataset consists of 974 short Python programs constructed by crowd-sourcing to an internal pool of workers who have basic knowledge of Python. Each is a single self-contained Python function solving the specified problem. Data Science Problems consists of 1119 problems from 306 notebooks, with an average of 3.6 problems per notebook, and all are executable with their data dependencies in a default Anaconda environment. These notebooks are curated from GitHub repositories uploaded by students, which we detect by their usage of the nbgrader notebook grading tool.

nbgrader is a tool used by instructors to create and validate assignments by executing the code and checking assert statements written by the instructor. Each of these notebooks consists of cells classified as prompt or context cells, which define the problem to be solved, solution cells, in which the student (or model) should insert an implemented solution, and grading cells, which contain unit tests validating the solution. Figure 1 shows one such example: the prompt cell defines the problem (in this case it loads a dataset and suggests modifying zip codes in the Pandas dataframe). The solution cell is to be filled in by the student, solving the problem described by the context cell(s) preceding it. Finally, in most cases in nbgrader notebooks, and in all cases in DSP, the solution cell is followed by a grading cell containing unit tests on the generated code.

To curate the DSP dataset, we start with a set of 448 GitHub repositories from the Jupyter Interactive Computing (JuICe) (Agashe, Iyer, and Zettlemoyer 2019) development dataset, which contains 33K nbgrader notebooks. Using nbclient (https://github.com/jupyter/nbclient) and a default Anaconda Python 3.9 environment, we attempted to execute all of these 33K notebooks. Each cell was limited to 600 seconds of execution, and any notebook which violated this limit was discarded. Notebooks which did not include or could not load their data dependencies were also discarded, as was any notebook depending on libraries not present in the Python standard library or the Anaconda default data science environment, on non-local modules, or on any import which could not be completed successfully. 2134 notebooks passed through this filter.
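As a rough illustration of the import part of this filter, the sketch below statically lists the modules imported by a notebook's code cells and reports those that cannot be found in the current environment. It is an assumption-laden approximation and not the pipeline used here, which executed the notebooks with nbclient; the helper names `imported_modules` and `missing_modules` are hypothetical.

```python
import ast
import importlib.util


def imported_modules(code):
    """Return the top-level module names imported by one code cell.

    Only absolute imports are collected; relative imports are ignored.
    """
    modules = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules.add(node.module.split(".")[0])
    return modules


def missing_modules(code_cells):
    """Return imported module names that cannot be located in this environment."""
    missing = set()
    for code in code_cells:
        try:
            names = imported_modules(code)
        except SyntaxError:
            # Assumption: cells that are not plain Python (e.g. IPython magics)
            # are skipped by this static check rather than rejected.
            continue
        for name in names:
            try:
                found = importlib.util.find_spec(name) is not None
            except (ImportError, ValueError):
                found = False
            if not found:
                missing.add(name)
    return missing
```

A notebook would be kept by this check only if `missing_modules` returns an empty set; actually executing the notebook remains the stronger test, since a module can be found and still fail at import time.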

These 2134 notebooks were successfully executable, but did not necessarily implement actual problem grading. Therefore we further filtered the notebooks by checking whether the grading cells following a solution cell had assert statements which referenced the method name, function name, variable name, or class name defined in the solution cell. Finally, following these highly stringent criteria for selecting a notebook and the corresponding solution cell, we identified a subset of 306 notebooks with 1119 solution cells, each of which is preceded by context cells defining the problem and followed by a grading cell testing the functional correctness of the solution. Table 3 shows some high-level statistics of the resulting Data Science Problems dataset.
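The assert-reference check can be sketched with Python's `ast` module. This is a minimal illustration under our own assumptions (only top-level definitions in the solution cell are considered, and cells that do not parse as plain Python are rejected); the function names are hypothetical and the actual filter may differ in its details.

```python
import ast


def defined_names(solution_source):
    """Collect function, class, and variable names defined at the top level of a cell."""
    names = set()
    for node in ast.parse(solution_source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, (ast.Assign, ast.AnnAssign, ast.AugAssign)):
            targets = node.targets if isinstance(node, ast.Assign) else [node.target]
            for target in targets:
                for sub in ast.walk(target):
                    # Only names that are being assigned to (Store context) count.
                    if isinstance(sub, ast.Name) and isinstance(sub.ctx, ast.Store):
                        names.add(sub.id)
    return names


def asserts_reference_solution(solution_source, grading_source):
    """True if some assert in the grading cell mentions a name the solution defines."""
    try:
        names = defined_names(solution_source)
        grading_tree = ast.parse(grading_source)
    except SyntaxError:
        return False  # assumption: unparseable cells are rejected
    for node in ast.walk(grading_tree):
        if isinstance(node, ast.Assert):
            used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
            if used & names:
                return True
    return False
```

For example, a solution cell defining `add` paired with a grading cell containing `assert add(1, 2) == 3` is accepted, while a grading cell that only prints the result is not.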

In order to better understand the contents of the DSP notebooks, we randomly sampled 50 and hand-classified the problems therein. Table 1 contains the results of this survey: 39.5% of problems are “math problems” like computing the derivative of a function via back-propagation, 26% are “programming questions” like implementing merge sort, 20% are “data science” problems using Pandas objects and operations on tables, and 12.5% build, train, and evaluate machine learning models. This survey tracks well with our measurement that 35% of problems depend on some data file, as 32.5% of the hand-surveyed problems were data science or machine learning problems. Table 2 shows the top modules found in DSP notebooks, which are dominated by plotting packages and by math and data science packages like Pandas and SciPy.

Pretraining dataset

Our pre-training and training data consist of all Jupyter Notebooks from all GitHub repositories which were labeled by GitHub as consisting primarily of Jupyter Notebooks; the repositories were cloned and processed in April 2021. In total we obtained 1.97 million repositories and from these derived 9.06 million total Jupyter Notebooks, 7.24 million of which were unique. In order to prevent data leakage into DSP, we removed from our training set all repositories which were in the JuICe testing and development sets, and further ensured no duplicates from these holdout sets were present in our training set.

Each notebook consists of a number of cells, and the total number of cells in our corpus is 221 million. Most cells are labeled as a code cell or a Markdown cell by the user; 69.5% of all cells (153 million) are code cells and the rest (67 million) are Markdown cells. Using the whitespace-augmented byte-level byte-pair encoding tokenizer from PyMT5 (Clement et al. 2020), the total number of tokens in the training set is 27.2 billion. Of these, ~38% (10.3 billion) are Markdown tokens and the remaining 16.9 billion are code tokens. For reference, Codex was trained on 100 billion total tokens and the models of Austin et al. (2021) were trained on 2.81 trillion tokens.

Figure 3 shows a t-SNE visualization of the space of how people use notebooks, which lets us judge whether our training and evaluation domains are similar. We randomly sampled a set of 17K notebooks, sampled a subset of their cells, trained FastText embeddings, and reduced the dimensionality with t-SNE. Each point in the representation is a single notebook, and its color is determined by the fraction of Markdown cells the notebook possesses. The Markdown content is a clear signal separating notebooks, as shown in Fig. 3. Sampling ~10 notebooks from each cluster, we hand-labeled them as shown in the figure. The blue notebooks, with low Markdown content, are scratchpads, (surprisingly) research code, and personal projects. The yellow notebooks, with high Markdown content, are pedagogical notebooks, for example tutorials and university assignments.

Based on the clear separation in the training data between Markdown-rich and Markdown-poor regions, we also elected to train the model on a subset of the notebooks containing ‘enough’ Markdown. Training on this subset tests the hypothesis that the model can improve its DSP problem-solving performance by focusing on ‘literate’ code. In subsequent experiments we define the Markdown Focused training subset as notebooks with at least one code cell and in which at least 1/3 of the cells are Markdown cells. This subset contains 4.1M notebooks, or 3/5 of the total training set, and 15.7B training tokens.
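The Markdown Focused criterion is simple enough to state as a predicate. The sketch below assumes a notebook is available as a parsed `.ipynb` dictionary with a `cells` list whose entries carry a `cell_type` field, as in the standard notebook JSON format; the function name is hypothetical.

```python
def is_markdown_focused(notebook):
    """True if the notebook has at least one code cell and at least 1/3 Markdown cells."""
    cells = notebook.get("cells", [])
    n_code = sum(1 for cell in cells if cell.get("cell_type") == "code")
    n_markdown = sum(1 for cell in cells if cell.get("cell_type") == "markdown")
    if n_code == 0:
        return False
    # Integer comparison avoids floating-point issues at exactly 1/3.
    return 3 * n_markdown >= len(cells)
```

Note that cells of other types (for example raw cells) count toward the total number of cells in this sketch, which is one of several reasonable readings of the criterion.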

Models

We use sequence-to-sequence transformers (Vaswani et al. 2017) of the large BART architecture (Lewis et al. 2019), and start all our training with a pre-trained checkpoint from the Python Method Text-to-text Transfer Transformer (PyMT5) (Clement et al. 2020), using the same training hyperparameters therein.

BART is pre-trained with a span-masking objective, in which spans of tokens are masked in the input, and the objective is to reconstruct these missing spans of tokens in the output. PyMT5 was pre-trained by masking out a syntactically defined part of Python methods (either the signature, docstring, or body) and reconstructing the missing third element. As Jupyter notebooks are naturally arranged as cells, we define an analogous cell-infilling pre-training objective.

For each cell in each notebook we prepared one source-target example for our sequence-to-sequence model JuPyT5. In our experiments the source was either $C=1$ context cell (we call this the baseline JuPyT5) or $C=3$ context cells directly prior to the target cell, and in our best model case, one extra cell following the target cell (called the cell infilling model). Figure 1 shows what this looks like for $C=1$ previous context cell including the subsequent grading cell in the source.

Since the target types in pretraining can be both code and natural language, we add control codes to indicate to the model which domain it should target, following CTRL (Keskar et al. 2019) and PyMT5. We used five control tokens: two indicate Markdown and code targets, respectively, and three more were added for other studies not included in this manuscript.
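The construction of one source-target example can be sketched as follows. The control token strings, the cell separator, and the position of the control token within the source are all assumptions made for illustration, as is the function name `make_example`; only the use of $C$ previous cells, an optional following cell, and a control code for the target type comes from the description above.

```python
# Hypothetical placeholder control tokens; the real token strings are not given here.
CONTROL_TOKENS = {"markdown": "<markdown>", "code": "<code>"}
CELL_SEPARATOR = "\n\n"  # assumption: how cells are joined into one source string


def make_example(cells, target_index, num_context=3, include_next=False):
    """Build one (source, target) pair for the cell at `target_index`.

    cells: list of (cell_type, text) tuples in notebook order.
    num_context: number C of cells directly before the target to include.
    include_next: if True, also append the cell following the target
        (the cell-infilling setting).
    """
    cell_type, target = cells[target_index]
    start = max(0, target_index - num_context)
    context = [text for _, text in cells[start:target_index]]
    # Assumption: the control token marks where the target cell belongs.
    source_parts = context + [CONTROL_TOKENS[cell_type]]
    if include_next and target_index + 1 < len(cells):
        source_parts.append(cells[target_index + 1][1])
    return CELL_SEPARATOR.join(source_parts), target
```

With `num_context=1` this corresponds to the baseline setting, and with `include_next=True` the grading cell after a solution cell becomes visible in the source.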

Training Details

Each JuPyT5 model was trained for 5 epochs (on either the entire training set or the Markdown Focused subset) using 80 32GB Tesla V100 GPUs. The hyperparameters for each were kept the same as for PyMT5, except that the batch size was increased to take advantage of data parallelism.

Experiments and Results

For each problem in DSP, we copied the whole notebook context, replacing only the solution cell of the problem being solved with the JuPyT5 output. We do this because the notebooks have dependencies between cells, so letting a model mistake propagate down the notebook can cause execution failures in later problems which are not necessarily the fault of the model. We could define a separate DSP metric in which the model must complete every problem on its own, so that the success or failure of each problem can depend on the others. As we will see, even with this teacher forcing, that is, letting the model see the correct solutions to previous problems, the DSP metric remains quite challenging. We leave further variations on evaluating the notebooks and their problems to future work.
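The teacher-forced evaluation of a single problem can be sketched as below. This is a deliberately simplified stand-in: it runs plain Python cell sources with `exec` in one shared namespace, whereas the evaluation described here executes real notebooks in an Anaconda environment, and it provides no sandboxing or time limit. The function name `passes` and the representation of a notebook as a list of code strings are assumptions.

```python
def passes(code_cells, solution_index, hypothesis):
    """Check one hypothesis for the problem whose solution cell is at `solution_index`.

    code_cells: code cell sources in notebook order, containing the reference
        solutions for every problem (this is the teacher forcing).
    hypothesis: model-generated code that replaces only the chosen solution cell.
    Returns True if every cell up to and including the grading cell directly
    after the solution cell runs without raising an exception.
    """
    cells = list(code_cells)
    cells[solution_index] = hypothesis
    namespace = {}
    try:
        for source in cells[: solution_index + 2]:
            exec(source, namespace)  # unsafe for untrusted code; sketch only
    except Exception:
        return False
    return True
```

Because earlier solution cells keep their reference implementations, a failure on one problem cannot cause failures on later problems in the same notebook.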

For DSP, a problem is marked as passed if and only if the generated code passes the unit tests defined in the grading cell below it. We use the pass@$k$ metric (Chen et al. 2021), an unbiased estimate of the probability that the model correctly solves the problem within $k$ attempts. For JuPyT5, one attempt is one hypothesis generated by sampling with temperature $T=0.8$ and nucleus sampling with top-p of 0.95, which was chosen to optimize the HumanEval performance. Note again that the execution environment was a default Anaconda environment with Python 3.9 and all the code in the GitHub repository originally hosting the DSP notebook.
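For reference, the unbiased pass@$k$ estimator of Chen et al. (2021) for a single problem with $n$ generated samples, of which $c$ pass, is $1 - \binom{n-c}{k}/\binom{n}{k}$; the benchmark score is the average over problems. A minimal sketch follows, where the function names are ours rather than taken from any released code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case any draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Here `pass_at_k(100, 25, 1)` reduces to the plain pass fraction 0.25, while larger $k$ rewards problems that are solved by only a few of the samples.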

DSP Results

Figure 4 shows the result of the first experiment, evaluating DSP pass@$k$ for JuPyT5 from $k=1$ to $k=100$. Similar to Codex and Austin et al. (2021), we observe log-linear behavior of the pass rate as a function of the number of attempts. Figure 4 shows two models, one trained on the whole training set and the other trained on the Markdown Focused dataset described above. We see modest gains in performance from the focused dataset, most pronounced near $k=100$. As the performance improvement was modest even though the data size was reduced by only 2/5, we did not dig deeper into this line of inquiry.

Figure 7 shows the pass@$k$ rate (with $C=3$) for the baseline model and for the cell infilling model (which sees one additional cell following the target cell). We see a very large improvement in performance, so much so that the pass@1 of the cell infilling model is comparable to the pass@100 of the baseline. We believe there are two reasons for this improvement. The first is that in cell infilling the model can see the tests it will be judged by (Austin et al. (2021) found their model performance was much improved by showing the model the tests as well). Our second hypothesis is best explained by the example in Fig. 8: when the model does not see the subsequent assert statements it often generates them itself, and they are not always correct. This could be ameliorated by simply ignoring assert statements in the generated hypothesis, something we leave to future work.
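One possible way to ignore generated assert statements, which was not evaluated here, is to rewrite the hypothesis so that each `assert` becomes a `pass` before execution. The sketch below assumes Python 3.9 or later for `ast.unparse` and leaves hypotheses that do not parse unchanged; note that round-tripping through the AST discards comments and original formatting.

```python
import ast


class _AssertRemover(ast.NodeTransformer):
    """Replace every assert statement with `pass` so enclosing blocks stay non-empty."""

    def visit_Assert(self, node):
        return ast.copy_location(ast.Pass(), node)


def strip_asserts(hypothesis):
    """Return the hypothesis source with all assert statements neutralized."""
    try:
        tree = ast.parse(hypothesis)
    except SyntaxError:
        return hypothesis  # assumption: leave unparseable output for the tests to reject
    return ast.unparse(_AssertRemover().visit(tree))
```

The grading cell's own assert statements are untouched, so correctness is still judged only by the instructor-written tests.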

Figure 7 shows our final experiment with two levels of context, $C=1$ and $C=3$ previous context cells before the target. This yields the most consistent boost in performance regardless of the number of samples drawn, which makes sense considering the model can see solutions to some previous problems. We in fact observe this ‘template modification’ behavior in an example generated in Fig. 10. The model copies the structure of the code in the prompt cell, even adapting the comments in the function (mostly correctly).

The results of all of these experiments are summarized in Tab. 4 for a few selected $k$ values, and generally reflect our observations above that more context is better, a focused dataset is a modest improvement, and seeing the unit tests is a big boost.

HumanEval and MBPP Results

Table 5 and Tab. 6 compare JuPyT5 to baseline models for the Codex HumanEval and MBPP metrics, respectively. We see Codex beats JuPyT5 on HumanEval except when using a much smaller 85M parameter model. This performance gap could be explained by the different formatting between markdown cells and method docstrings (we improved our performance by taking the docstring and presenting it as Markdown). JuPyT5 can beat the Programming Synthesis model at the MBPP metric for all but their largest model. This may not be surprising as the PS model was trained on many English documents which contained some code, and not entirely code like JuPyT5.

Discussion

While our best model was able to solve over 77% of the DSP problems, this is a most optimistic metric, as the deployment scenario may not tolerate 100 hypotheses. If users describe their problem and provide test cases, however, following the test-driven development model, JuPyT5 could be a fairly effective data science assistant. The model also seems to effectively bootstrap off of earlier solutions, as evidenced by the consistent increase in passing performance regardless of the number of samples $k$, and so could become ever more effective as a user develops their program. We did investigate evaluating only a single hypothesis by choosing the sample with the largest log-likelihood per token, which would support a deployment scenario in which no unit tests are provided, but this offered only a modest average improvement over evaluating a single sample.
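Selecting a single hypothesis by log-likelihood per token can be sketched as follows. The input format, a list of generated texts paired with their per-token log-probabilities, and the function name are assumptions for illustration; how these scores are obtained depends on the decoding library.

```python
def best_by_mean_logprob(samples):
    """Return the text whose mean per-token log-probability is largest.

    samples: list of (text, token_logprobs) pairs, where token_logprobs is the
        list of log-probabilities the model assigned to each generated token.
    """
    scored = [
        (sum(logprobs) / len(logprobs), text)
        for text, logprobs in samples
        if logprobs  # skip empty generations
    ]
    if not scored:
        raise ValueError("no non-empty samples to rank")
    return max(scored, key=lambda pair: pair[0])[1]
```

Normalizing by length keeps the ranking from simply favoring the shortest hypothesis, which a raw sum of log-probabilities would tend to do.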

It is perhaps most instructive to discuss the ‘easiest’ and ‘hardest’ DSP problems and how the model could solve them. The ‘easiest’ DSP problems consisted mainly of common Pandas dataframe operations like dropping a column. Two of the hardest problems are shown in Fig. 10. The bottom example problem is to implement the $L_{1}$ norm, but unlike some other DSP problems it does not define the norm, and so the model must rely on having learned its definition during training. This is a scenario which can likely be improved easily by larger model sizes. The top example, however, is at first glance implementing a simple least-squares regression objective, but is posed with a model defined inside. This kind of compound chaining of operations is difficult for these models, a challenge which was also reported by Chen et al. (2021) and Austin et al. (2021).

Finally we discuss Fig. 7, which plots the pass rate of each DSP problem in sorted order, along with the average CodeBLEU (Ren et al. 2020) score of the 100 generated samples. We see there is only perhaps a weak correlation between pass rate and CodeBLEU score, showing that BLEU/CodeBLEU are not useful for determining the correctness of hypothesis programs.

Conclusion

We introduced a new code generation evaluation metric called Data Science Problems, consisting of over 1000 problems, many of which have data dependencies, and all of which are executable with unit tests. We trained a new model, JuPyT5, on almost all publicly available Jupyter Notebooks, and showed it is capable of solving over 77% of the problems. While this is an optimistic estimate, we believe it demonstrates the feasibility of a data science assistant in the form of these large transformer models. While it is clear from the literature that larger models can solve more problems, challenges in complex code synthesis remain, and the DSP benchmark can help our community of researchers to overcome these modeling challenges.
