InCoder: A Generative Model for Code Infilling and Synthesis

Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, Mike Lewis

cs.SE cs.CL cs.LG

Introduction

Large language models trained on vast repositories of code have demonstrated remarkable progress in neural program synthesis and related tasks (Chen et al., 2021a; Austin et al., 2021; Xu et al., 2022; Nijkamp et al., 2022; Chowdhery et al., 2022). However, such models generate code left-to-right, which makes them less directly applicable to many ubiquitous code editing tasks, such as fixing bugs, adding comments, or re-naming variables. We introduce InCoder, a unified model for program synthesis and editing. Like prior work, InCoder is trained to maximize the likelihood of a corpus of code. However, we adopt a causal masking objective (Aghajanyan et al., 2022a), allowing InCoder to infill blocks of code conditioned on arbitrary left and right contexts.

More specifically, we learn to infill by randomly replacing spans of code with a sentinel token and moving them to the end of the sequence (Figure 1, top). The model is trained to predict all tokens in the complete sequence in this permuted ordering. During inference, we can edit code by replacing spans with sentinel tokens, prompting the model with the new sequence, and having it generate new tokens to replace the masked spans (Figure 1, bottom). Because the model can also trivially generate without sentinel tokens, the result is a unified approach for both program synthesis (via left-to-right generation) and editing (via infilling).

We evaluate performance on a range of zero-shot code infilling tasks (Sec. 4), both new and from existing work, including challenging use cases such as type prediction, variable re-naming, comment generation, and completing missing lines of code. Zero-shot infilling with bidirectional context substantially outperforms approaches based on left-to-right-only models, and on several tasks obtains performance comparable to state-of-the-art models fine-tuned on the tasks. Ablation experiments (Sec. 5) show that this does not come at the cost of left-to-right generation ability; our causal masking model achieves similar performance to a standard language model on program synthesis benchmarks (Chen et al., 2021a; Austin et al., 2021) despite its more general training objective.

Infilling and Synthesis via Causal Masking

Neural models for code generation have either utilized a left-to-right (causal) autoregressive language modeling objective (Brown et al., 2020; Chen et al., 2021a) or, as BERT does, a masked language modeling objective (Devlin et al., 2019; Feng et al., 2020). Both approaches have strengths and weaknesses. Causal models only condition on context to the left of the generated tokens, thus preventing infilling, but they can autoregressively generate entire documents. Masked language models, on the other hand, can condition on both the left and right contexts to infill a masked region; however, their training objective is typically limited to generating only about 15% of a document. In this paper, we adopt the recently proposed causal masking objective (Aghajanyan et al., 2022a), which aims to combine the strengths of both causal and masked language models.

At training time, the causal masking procedure samples a number of spans of contiguous tokens in each document to mask (Figure 1, top left). We sample the number of spans from a Poisson distribution with a mean of one, truncated to the support $[1, 256]$, so that there are typically a small number of spans (with a single span around 50% of the time), but the distribution has a long tail (up to 256 spans). Each span’s endpoints are sampled uniformly from the length of the document and the set of sampled spans is rejected and resampled if any spans overlap.
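The sampling procedure above can be sketched as follows. This is a minimal illustration rather than the training code: the function names are hypothetical, the truncation is implemented by rejection (one of several ways to truncate a Poisson distribution), empty spans are rejected as an additional assumption, and the Poisson draw uses Knuth's multiplication method so that only the standard library is needed.

```python
import math
import random

MAX_SPANS = 256  # upper end of the truncated support


def sample_num_spans(rng, mean=1.0, max_spans=MAX_SPANS):
    """Draw from Poisson(mean), rejecting values outside [1, max_spans]."""
    threshold = math.exp(-mean)
    while True:
        # Knuth's method: count uniform draws until their product falls below e^-mean.
        k, product = 0, 1.0
        while True:
            product *= rng.random()
            if product <= threshold:
                break
            k += 1
        if 1 <= k <= max_spans:
            return k


def sample_spans(num_tokens, rng, max_tries=1000):
    """Sample sorted, non-overlapping half-open spans (start, end) over a document."""
    for _ in range(max_tries):
        spans = []
        for _ in range(sample_num_spans(rng)):
            a, b = rng.randint(0, num_tokens), rng.randint(0, num_tokens)
            spans.append((min(a, b), max(a, b)))
        spans.sort()
        non_empty = all(start < end for start, end in spans)  # assumption: no empty spans
        disjoint = all(prev[1] <= cur[0] for prev, cur in zip(spans, spans[1:]))
        if non_empty and disjoint:
            return spans  # otherwise the whole set is rejected and resampled
    raise RuntimeError("could not sample a valid set of spans")
```

Rejecting the whole set of spans on any overlap, as the text describes, implicitly favors small numbers of spans in short documents.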

Once spans are sampled, each span $k$ is replaced with a special mask sentinel token, <Mask:k>. The sequence of tokens in the span is then moved to the end of the document (Figure 1, top right), with the mask sentinel token prepended and a special end-of-mask token <EOM> appended. In other words, when a mask token appears for the first time in the left-to-right ordering, it marks the location the span was removed from; when it appears for the second time, it marks the start of the moved span text. More formally, assume we have a document D with $N$ tokens, and we have sampled one span $\texttt{Span}=\texttt{D}_{i:j}$. Let Left be the left context $\texttt{D}_{0:i}$ and Right be the right context $\texttt{D}_{j:N}$. Then, we maximize the log probability of the masked document:

$$\log P([\texttt{Left};\ \texttt{<Mask:0>};\ \texttt{Right};\ \texttt{<Mask:0>};\ \texttt{Span};\ \texttt{<EOM>}])$$

where ; denotes sequence concatenation. If more than one span were sampled, each would be similarly appended at the end of the document in order. As in standard left-to-right generative language modeling, we compute the probability of the sequence autoregressively and train the model using cross-entropy loss on all tokens except the mask sentinel tokens <Mask:k>, so that the model does not generate these tokens during inference.
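The rearrangement described above can be illustrated with a short sketch. The function names are hypothetical, tokens are represented as plain strings, and the sentinel strings `<Mask:k>` and `<EOM>` stand in for the mask and end-of-mask tokens; a real implementation would operate on vocabulary ids.

```python
def causal_mask(tokens, spans):
    """Replace each half-open span with a sentinel and append the span text at the end."""
    body, tail, cursor = [], [], 0
    for k, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<Mask:{k}>"
        body.extend(tokens[cursor:start])
        body.append(sentinel)  # first occurrence: where the span was removed
        tail.extend([sentinel, *tokens[start:end], "<EOM>"])  # second occurrence: span text
        cursor = end
    body.extend(tokens[cursor:])
    return body + tail


def loss_mask(sequence):
    """True for positions that contribute to the loss (everything but mask sentinels)."""
    return [not token.startswith("<Mask:") for token in sequence]
```

For a document `a b c d e f` with the single span `b c` masked, `causal_mask` yields `a <Mask:0> d e f <Mask:0> b c <EOM>`, which matches the ordering [Left; mask; Right; mask; Span; end-of-mask].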

2.2 Inference

During inference, the model can either be used for left-to-right generation in the standard way (by sampling autoregressively from the model, without using any special tokens), or it can insert code at arbitrary locations in an existing document by inserting <Mask:k> tokens at the desired location(s) and continuing generation at the end of the document. Assuming for simplicity of notation that we want to insert text at only a single location, we can generate a span to insert between the location’s Left and Right context sequences by sampling tokens autoregressively from the distribution

$$P(\cdot \mid [\texttt{Left};\ \texttt{<Mask:0>};\ \texttt{Right};\ \texttt{<Mask:0>}])$$

until either an <EOM> token is generated or a task-dependent stopping criterion is achieved. (In practice, we use a variation of this method that inserts an extra sentinel token into the context to counter a bias on infill length caused by the Transformer’s fixed context window size; see Section B.2.) When applied to code, this allows us to perform tasks that benefit from the bidirectional context in a zero-shot fashion, as shown in Figure 1, bottom. For example, we can perform Python docstring generation conditioned on both the left context (function signature) and right context (function implementation). We can also infill multiple dependent regions, e.g., generate imports required by a function that the model is generating. See Section B.2 for details, including multi-region infilling.
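A minimal sketch of single-location infilling, under stated assumptions: `next_token` is a hypothetical callable standing in for one decoding step of the model (it receives the token sequence so far and returns the next token), tokens are plain strings, and `<Mask:0>` and `<EOM>` are placeholder spellings of the sentinel tokens. The sketch implements the basic procedure only, not the extra-sentinel variation mentioned above.

```python
def build_infill_prompt(left, right, k=0):
    """[Left; mask; Right; mask]: the model continues with the span text."""
    sentinel = f"<Mask:{k}>"
    return [*left, sentinel, *right, sentinel]


def infill(left, right, next_token, max_new_tokens=64):
    """Generate a span until the end-of-mask token, then splice it between the contexts."""
    prompt = build_infill_prompt(left, right)
    span = []
    for _ in range(max_new_tokens):  # simple task-independent stopping criterion
        token = next_token(prompt + span)
        if token == "<EOM>":
            break
        span.append(token)
    return [*left, *span, *right]
```

Because the prompt ends with the repeated sentinel, infilling reduces to ordinary left-to-right decoding, which is what allows a single model to serve both synthesis and editing.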

Models

Our primary model is InCoder-6.7B, a 6.7B-parameter Transformer (Vaswani et al., 2017) language model. We use the same architecture as the dense 6.7B models described in Artetxe et al. (2021); the Fairseq architecture description can be found in Table 6 in the appendix. All experiments use this model unless stated otherwise (we train smaller models for comparison in Section 5).

To train our models, we collect a corpus of (1) public code with permissive, non-copyleft, open-source licenses from GitHub and GitLab and (2) StackOverflow questions, answers, and comments. Our primary focus in this paper is on the Python language, but we also include code files from 28 total languages and StackOverflow content from all available languages. We decontaminate our pre-training corpus by removing all datasets which we use in our evaluation experiments. See Section A.1 for details. Our final pre-training corpus contains a total of 159 GB of code, 52 GB of it in Python, and a total of 57 GB of content from StackOverflow. See Figure 3 for size by language.

Infilling Experiments

Our primary evaluation is performing zero-shot infilling for a diverse set of tasks: inserting lines of code, predicting function return types, generating docstrings, renaming variables, and inserting missing code tokens. We formulate each task as filling in one or more masked-out regions of code.

To evaluate how InCoder benefits from bidirectional context when generating infills, we compare three different inference methods: the causal masking inference procedure described in Section 2, a standard left-to-right generation approach (left-to-right single), and a left-to-right generation and reranking approach (left-to-right reranking). Since our model is also able to generate left-to-right, we can compare all three inference methods using the same InCoder-6.7B model and thus avoid any confounding effects due to a change in the model. For all three inference methods, we obtain generations from the model using top-$p$ (nucleus) sampling (Holtzman et al., 2020) with $p=0.95$ and a temperature tuned for each task and inference method using the task’s development data. For all generation experiments, we prefix prompts with metadata indicating that the code generated should be Python (see Section A.3 for metadata details).
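For readers unfamiliar with nucleus sampling, a single decoding step can be sketched as below. This is a generic illustration of top-$p$ sampling with temperature over a list of logits, not the decoding code used in the experiments; the function name is hypothetical, and the temperature must be positive (greedy decoding is a separate case).

```python
import math
import random


def nucleus_sample(logits, p=0.95, temperature=1.0, rng=random):
    """Sample an index from the smallest set of tokens whose probability mass reaches p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Lower temperatures concentrate mass on the most likely tokens, which shrinks the nucleus and makes sampling closer to greedy decoding.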

Left-to-right single.

This baseline does not use the context to the right of the masked location at all. It generates a single completion for the location by conditioning on the left context and sampling tokens autoregressively from the model $P(\cdot\mid\texttt{Left})$ until a task-specific stop condition is reached (e.g., for comment generation, when a comment-ending delimiter is produced).

Left-to-right reranking.

This baseline uses only the left context to propose candidates to infill the blank, but uses both the left and right contexts to choose among these candidates. Concretely, we first generate $K$ possible completions for the blank region, $\texttt{Span}_{1},\ldots,\texttt{Span}_{K}$, following the same procedure as left-to-right single, using $K=10$ unless otherwise specified. We then evaluate each candidate by substituting it into the blank and scoring the completed document. We use either the total log probability of the completed document, $\log P([\texttt{Left};\ \texttt{Span}_{k};\ \texttt{Right}])$, or, following Chen et al. (2021a), the log probability averaged across the number of tokens in the completed document. We select between these two scoring methods for each task using performance on the task’s development data.
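The reranking step can be sketched as follows. Here `token_logprobs` is a hypothetical callable standing in for the model: it takes a completed token sequence and returns one log probability per token. The two scoring modes correspond to the total and length-averaged log probabilities described above.

```python
def rerank(candidates, left, right, token_logprobs, average=False):
    """Pick the candidate span whose completed document scores highest."""
    def score(span):
        logps = token_logprobs([*left, *span, *right])
        total = sum(logps)
        return total / len(logps) if average else total

    return max(candidates, key=score)
```

The two modes can disagree: total log probability tends to favor shorter completions, while averaging removes that length penalty, which is one reason to select between them per task on development data.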

4.1 Infilling Lines of Code (HumanEval)

We create an infilling benchmark for complete lines of code from the HumanEval dataset (Chen et al., 2021a). This dataset provides comment descriptions of functions paired with a canonical implementation of each function and several input–output pairs that the function should pass. HumanEval was introduced as a benchmark for the synthesis of entire Python functions; we evaluate our models on this original synthesis setting in Section C.6. We use this dataset because it affords functional testing of completed code (as opposed to relying solely on an evaluation of the code surface form), which is particularly important when infilling longer regions that have more potential ways to be completed correctly. We construct two infilling tasks from the dataset, for single lines and multiple lines:

Single-line infilling.

In this task, we mask out each non-blank line of code in the canonical function implementation in turn (creating $N$ examples for a function with $N$ non-blank lines). The task is to generate a single-line completion for the blank conditioned on the natural language description of the function and the code lines before and after the blank. We evaluate using (1) pass rate: the rate at which the completed function passes all of the function’s input–output pairs (i.e., analogous to the pass@1 metric from Chen et al. (2021a)) and (2) exact match: the percentage of times that the completed lines exactly match the masked lines in the canonical implementation. Performance is averaged across all examples generated for all programs in the dataset.

Multi-line infilling.

This task is constructed in the same way as single-line infilling above but allows each masked region to contain multiple lines of code, creating $N\times(N+1)/2$ examples for a function with $N$ non-blank lines. We again evaluate completions using pass rate and exact match, averaged across all infilling examples.
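The construction of both benchmarks can be sketched as below. The function names are hypothetical, and one detail is an assumption: masked regions are taken to start and end on non-blank lines, with any blank lines between them included in the region.

```python
def non_blank_indices(lines):
    return [i for i, line in enumerate(lines) if line.strip()]


def single_line_examples(lines):
    """One (left, masked, right) example per non-blank line: N examples."""
    return [(lines[:i], lines[i:i + 1], lines[i + 1:]) for i in non_blank_indices(lines)]


def multi_line_examples(lines):
    """One example per contiguous region of non-blank lines: N * (N + 1) / 2 examples."""
    idx = non_blank_indices(lines)
    return [
        (lines[:idx[a]], lines[idx[a]:idx[b] + 1], lines[idx[b] + 1:])
        for a in range(len(idx))
        for b in range(a, len(idx))
    ]
```

The count follows from choosing a start line and an end line no earlier than the start: $N + (N-1) + \dots + 1 = N(N+1)/2$.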

Inference details.

To choose when to end the infill produced by our inference methods, we truncate the candidates generated by the left-to-right (L-R) baselines to the actual number of lines in the blanked-out region. For our causal-masked (CM) infilling method, we end the infill when the model generates the <EOM> token. For the L-R single and CM infilling methods, we sample using a temperature of 0.2. For the L-R rerank method, we use a temperature of 0.8 to sample $K=10$ candidates and rescore with the total log probability of the completed function.
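The two stopping rules can be sketched as simple post-processing steps. The function names are hypothetical; the L-R rule operates on generated text and the CM rule on a token list, with `<EOM>` as a placeholder spelling of the end-of-mask token.

```python
def truncate_to_lines(completion, num_lines):
    """L-R baselines: keep only as many lines as the blanked-out region contained."""
    return "".join(completion.splitlines(keepends=True)[:num_lines])


def cut_at_end_of_mask(tokens, end_token="<EOM>"):
    """CM infilling: keep tokens generated before the end-of-mask token."""
    return tokens[:tokens.index(end_token)] if end_token in tokens else tokens
```

Note that the L-R rule uses the true length of the masked region, information that the CM method does not receive.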

Results.

Table 1 shows the results for the single-line (left) and multi-line settings (right). In both settings, CM infilling improves substantially over the L-R single baseline and the L-R reranking baseline. Note that these results are computed by averaging over all examples, which includes masked regions at all positions in functions (including the beginning, when no left context is available, and end, when no right context is available). Figure 2 shows a finer-grained comparison, where we group examples by the fraction of lines in the canonical function which are contained in the context to the right of the infill. The CM infilling method sees larger improvements over the L-R baselines as more right-sided context becomes available (i.e., when the blanked region occurs earlier in the function). For multi-line infilling, performance increases with increasing amounts of right context, as having a large right context implies that fewer lines are available to be removed when constructing infill examples, so the resulting examples are often easier to infill.

We also compare against two alternate zero-shot methods for incorporating right-sided context: (1) an encoder-decoder code model trained with a denoising infilling objective (PLBART, Ahmad et al. 2021), and (2) templated prompting of large left-to-right generative code models (the cushman-001 and davinci-001 Codex models available through OpenAI’s API). See Section C.1 for details on these experiments. (PLBART outputs tokenized code, which prevents us from computing meaningful exact match metrics.) InCoder outperforms all models in both single-line and multi-line infilling, despite having lower performance in left-to-right generation than Codex (see Table 11), demonstrating that causal masking training benefits infilling.

4.2 Docstring generation (CodeXGLUE)

We next evaluate documentation string (docstring) generation, where models must generate a natural language docstring that summarizes a Python code snippet. Right context may be particularly useful for docstring generation, as conditioning on the function body can allow models to generate more informative descriptions. Prior neural code generation models are fine-tuned on supervised docstring-code pairs to perform this task (e.g., Clement et al. 2020; Chen et al. 2021a; Lu et al. 2021; Ahmad et al. 2021); we instead evaluate our model zero-shot, with no explicit supervision.

We use the CodeXGLUE code-to-text docstring generation task (Lu et al., 2021), which is constructed from CodeSearchNet (Husain et al., 2019), consisting of docstring-code pairs scraped from publicly available GitHub repositories. The L-R single candidate baseline is prompted with the function signature in the left context preceding the docstring. The CM infilling and L-R reranking methods also observe the right context, consisting of the function body.

We compare models following the original automatic evaluation setup for the task. In Table 2, we report smoothed 4-gram BLEU scores for all models, using the reference docstrings provided in the dataset. These references have been preprocessed to strip extraneous content (e.g., argument definitions) from the original scraped docstrings. We use greedy generation for the CM infilling and L-R single candidate generation methods and sample $K=10$ candidates at temperature 0.8 with average log probability scoring for the L-R reranking method (selected by tuning on the validation set of the task). For all inference methods, we stop generation if the model generates a newline. We also include the performance of the supervised baseline from the CodeXGLUE paper: an encoder-decoder model with a CodeBERT encoder fine-tuned on $\sim$250K training examples from the dataset. Our zero-shot performance approaches the performance of the fine-tuned CodeBERT model.

4.3 Return Type Prediction

Predicting return type hints for Python functions is a challenging structured generation task (see Figure 1, “type inference”). We evaluate on two datasets: one we construct from CodeXGLUE and the dataset from TypeWriter OSS (Pradel et al., 2020).

CodeXGLUE.

We develop a benchmark for return type prediction using the same Python CodeXGLUE dataset used in the code-to-text (docstring generation) task. We run an abstract syntax tree (AST) processor on all functions in the development and test sets of this dataset to (1) identify functions with a PEP 484 (https://peps.python.org/pep-0484/) return type hint annotation that is not None and (2) remove all other type hints (e.g., for function arguments and variable declarations) from the function. This leaves 232 functions in the development set and 469 functions in the test set.

The task is to condition on the function signature and body and predict the type hint. We compare the type hints predicted by our various methods to the annotated type hint in the original function, using exact match accuracy on the normalized type hint. We normalize each type hint by first parsing the type to an AST and then un-parsing the AST to a surface form string, then compute exact match on these surface forms. We note that this metric is somewhat noisy, given that human-annotated type hints can be inaccurate, and that exact match does not reason about type unification or equivalence (e.g., there is no partial credit given for predicting Optional[str] rather than Union[None, str]).
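The normalization step can be sketched with Python's standard `ast` module. This is an illustration of the parse-then-unparse idea rather than the evaluation script itself; the function names are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def normalize_type_hint(hint):
    """Parse a type hint as an expression and unparse it to a canonical surface form."""
    return ast.unparse(ast.parse(hint.strip(), mode="eval").body)


def type_exact_match(predicted, reference):
    return normalize_type_hint(predicted) == normalize_type_hint(reference)
```

Normalization removes purely surface differences such as spacing, so `Dict[str,int]` and `Dict[str, int]` match, while semantically equivalent but differently written hints such as `Optional[str]` and `Union[None, str]` still count as mismatches.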

To compare our three generation methods, we stop generation when a : is generated, which ends the type hint and signals the start of the function body. We tune inference hyperparameters on the development set, and we use a temperature of 0.2 for left-to-right single, 0.8 for left-to-right reranking, and greedy generation for causal masked infilling. Results on the test set are given in Table 3(a). Conditioning on the right context (i.e., the function body) gives some benefit in the left-to-right reranking setting, but gives a substantial improvement via our causal masked infilling.

TypeWriter OSS.

Some recent work has developed supervised machine learning approaches for predicting type annotations for dynamically-typed languages including Python (Xu et al., 2016; Allamanis et al., 2020; Pradel et al., 2020) and TypeScript (Hellendoorn et al., 2018; Wei et al., 2020; Jesse et al., 2021). We compare our zero-shot model to one such approach for Python, TypeWriter (Pradel et al., 2020), which combines a neural architecture for type hint prediction with a search-based incremental type validation procedure.

To compare to the supervised TypeWriter approach, we obtain its predictions on the open-source software (OSS) dataset used in that work (Pradel et al., 2020), consisting of Python functions from GitHub. Unfortunately, we could not evaluate on their full evaluation set since much of it was included in our model’s training data. We filter to instances that were not included in our training data, for which we were able to obtain files and extract functions and types from via AST parsing, and which have non-None return type hints. This leaves 2,092 examples (about $12\%$ of their evaluation set). We otherwise emulate their exact setup, which allows our model to condition on file imports, the function body, and the function signature to predict return type hints. We use the same inference hyperparameters as we did for CodeXGLUE type hint prediction.

We present our results in two tables: Table 3(b), containing metrics across non-None types, and Table 10 in the Appendix, which includes None types as well (following Pradel et al. 2020). Our model makes a prediction for every example, so it has identical precision, recall, and F1 scores. We again see benefits from causal masked infilling’s ability to condition on the function body when generating return types, and find that our zero-shot model outperforms the supervised TypeWriter model.

4.4 Variable Name Prediction

Variable name prediction is a less-constrained code generation task that requires modeling bidirectional context. We again use the test set from the CodeXGLUE code-to-text task (docstring generation) and run an AST transform to isolate a variable name and either mask all of its occurrences (infilling) or keep only the context to the left of its first occurrence (left-to-right mode). In the infilling setting, since we generate one infill per masked occurrence of the variable, we select the most common prediction as our single prediction. Furthermore, we only evaluate on variable names containing four or more characters. For our re-ranking, we consider a candidate set of 25 variables. We present our results in Table 4. We again see substantial benefits from using both left and right context: left-to-right reranking and causal-masked infilling both outperform the left-to-right single baseline (which uses only the left context). Causal-masked infilling substantially improves on the left-to-right reranking method, demonstrating the value of conditioning on the right context when proposing candidate completions.
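The aggregation over masked occurrences amounts to a majority vote, sketched below. The function names are hypothetical, and tie-breaking (here, the earliest prediction among the most frequent) is an assumption, since the text does not specify it.

```python
from collections import Counter


def vote(predictions):
    """Return the most common predicted name across all masked occurrences."""
    return Counter(predictions).most_common(1)[0][0]


def is_evaluated(variable_name, min_chars=4):
    """Only variable names with four or more characters are scored."""
    return len(variable_name) >= min_chars
```

For example, if the model infills three occurrences as `count`, `count`, and `total`, the single prediction used for scoring is `count`.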

Ablation Experiments

To analyze the effects of training a model with causal masking (rather than the standard language modeling objective), as well as of model size and training data, we train several variations of our model. We compare model pass@1 scores on the HumanEval (Chen et al., 2021a) and MBPP (Austin et al., 2021) left-to-right synthesis benchmarks, with results in Table 5.

Objective.

Comparing 1.3B parameter models trained on the same training data with the causal masked (CM) objective (row 2) and the standard left-to-right language modeling (LM) objective (row 3), we see that the causal-masked model obtains slightly higher performance on the HumanEval and MBPP tasks in pass@1 score. This provides further evidence that causal masking training does not hurt the model’s ability to perform standard left-to-right generation, at least up to the 1.3B parameter scale, in line with the findings of Bavarian et al. (2022).

Model size.

With data fixed, increasing model size consistently improves performance (comparing the 6.7B and 1.3B CM models in rows 1 and 2, and the 1.3B and 2.3B LM models in rows 3 and 6).

Effects of data.

We compare models trained on our entire dataset of multiple code languages and StackOverflow (multi lang + SO, described in Section A.1) to data ablations that train only on Python code files and StackOverflow (Python + SO) and only Python code files (Python). We find that training on multiple languages gives a slight reduction in performance on these Python evaluations. However, comparing rows 4 and 5, we see that including StackOverflow data in training substantially improves performance on both HumanEval and MBPP. This suggests that future work on generative code models for language-guided synthesis tasks should consider using StackOverflow or other corpora that mix natural language and code as training data.

Qualitative Examples

We show a variety of qualitative examples from our model in Section D.2 in both the infilling and left-to-right generation modes: docstring generation, metadata conditioning, class attribute inference from class usage, comment-conditioned code editing, StackOverflow title and tag generation, and zero-shot bidirectional translation of technical jargon between Chinese and English.

Related Work

There has been a flurry of recent work on training large-scale neural language models on source code. Existing models differ in their architectural design and training objectives, e.g., decoder-only language models (Austin et al., 2021; Chen et al., 2021a; Izadi et al., 2022; Xu et al., 2022; Nijkamp et al., 2022), encoder-only masked language models (Feng et al., 2020; Kanade et al., 2020), and encoder-decoder models (Ahmad et al., 2021; Li et al., 2022; Roziere et al., 2021; Wang et al., 2021). Decoder-only language models have grown in popularity as they can perform zero-shot program synthesis by generating in a left-to-right fashion. On the other hand, InCoder is a decoder-only causally-masked language model that can infill arbitrary spans of text. This allows the model to perform program synthesis and many other code infilling tasks.

Infilling Models

Many real-world applications require infilling sequences using left and right context, e.g., editing sentences (Shih et al., 2019), restoring ancient text (Assael et al., 2019), and fixing bugs in source code. Unfortunately, standard left-to-right language models cannot directly infill text, and popular masked language models are mainly trained to infill very short spans (Chan et al., 2019; Devlin et al., 2019; Raffel et al., 2020; Roziere et al., 2021). Recent work addresses this by changing model architectures, inference procedures, and training objectives (Aghajanyan et al., 2022a; Stern et al., 2019; West et al., 2021; Aghajanyan et al., 2022b). Most related to our approach is the work of Donahue et al. (2020) and CM3 (Aghajanyan et al., 2022a), who train left-to-right language models to fill in masked token segments of varying lengths; and the work of Alon et al. (2020), who train an infilling-capable, AST-structured generative model of code on a smaller scale. In addition, concurrent to our work, OpenAI developed a fill-in-the-middle (FIM) training objective similar to the causal masking objective we use, trained code models with it, and evaluated on the HumanEval infilling tasks we introduce here (Bavarian et al., 2022). Similar to our findings in Section 5, they find that the infilling capability does not adversely affect left-to-right performance. Our objective, in contrast, allows infilling multiple regions of code, and we demonstrate the benefits of infilling across a broader range of natural programming tasks.

Machine Learning for Code Assistance

There is an extensive literature on using machine learning models to aid human programmers. This includes methods to infer variable types (Pradel et al., 2020; Wei et al., 2020), generate unit tests (Fraser & Arcuri, 2011), repair programs (Gupta et al., 2017; Yasunaga & Liang, 2020; Chen et al., 2021c; Yasunaga & Liang, 2021), and verify program correctness (Ryan et al., 2020). Our model can infill arbitrary spans of code, allowing it to complete many of these tasks, as well as perform standard left-to-right generation, in a single approach.

Machine Learning for Program Synthesis

Program synthesis approaches directly generate programs from a specification of functionality (Gulwani et al., 2017). Such models work by taking e.g., input-output examples (Balog et al., 2017; Gulwani, 2011; Chen et al., 2021b; Bavishi et al., 2019), partial implementations (Solar-Lezama et al., 2006), or natural language descriptions (Zelle & Mooney, 1996; Yu et al., 2018; Yin et al., 2018; Kulal et al., 2019; Chen et al., 2021a) of the desired program as input. Our InCoder model differs from this past work as it can both synthesize and infill arbitrary spans of code, conditioning on natural language and partial implementations.

Conclusion

We demonstrated that using a causal masking objective when training a generative model of code enables strong zero-shot performance on many challenging and practical code infilling and editing tasks. The model’s additional infilling capability does not appear to harm its ability to do standard left-to-right generation: ablation and comparison experiments show that our causal-masked models have comparable performance to similarly-resourced models on standard left-to-right language-to-code synthesis benchmarks. Looking forward, we expect our model performance to continue to increase with more parameters, data, and training steps (Kaplan et al., 2020; Henighan et al., 2020). Moreover, fine-tuning would allow our models to be better able to condition on natural language instructions and other indications of human intent (Zhong et al., 2021; Wei et al., 2022; Ouyang et al., 2022). Finally, our model lays a foundation for future work on supervised infilling & editing via model fine-tuning, as well as performing iterative decoding, where the model can be used to refine its own output (Ghazvininejad et al., 2019).

References

Appendix A Data

We obtained code files and repository metadata from GitHub and GitLab through the sites’ public APIs over a period ending on December 9th, 2021. We obtained approximately 670,000 public non-fork repositories which GitHub/GitLab detected as containing primarily Python, JavaScript, or Jupyter Notebook files, and with either an MIT, Apache 2.0, BSD-2, or BSD-3 clause license. We included all code from a list of 28 languages (determined by file extension) contained in these repositories. We include source files from C, C++, CSS, C#, Common Lisp, Dart, Forth, Go, HTML, Haskell, Java, JavaScript, Julia, Jupyter, Kotlin, Lua, Matlab, PHP, Perl, Python, R, Ruby, Rust, SQL, Scala, Shell, Swift, and TypeScript, although the great majority of files are Python and JavaScript. See Figure 3. Since Python files can also be contained in non-majority-Python repositories, we also included all other Python and Jupyter files obtainable through the GitHub archive on BigQuery (https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) that we did not already obtain from GitHub directly. We only include repositories with one of the above permissive licenses. We preprocess Jupyter notebooks by including all text and code (with Markdown formatting removed from text cells), with cells demarcated by XML-style tags (see Section A.3).

Deduplication.

Recent work has shown that deduplicating training data can improve model performance and reduce the risk of memorizing training data (Allamanis, 2019; Lee et al., 2022; Kandpal et al., 2022). Our deduplication scheme removes code files using exact match on the sequence of alphanumeric tokens in the file. We implement match checking using a Bloom filter (Bloom, 1970) whose keys are the file extension, the number of tokens in the file, and an MD5 hash (Rivest, 1992) of the sequence of tokens, which is highly accurate at identifying files with exactly matching token sequences. This removed approximately 75% of the corpus by file size (reducing from 1 TB to 250 GB) as there are numerous duplicated repositories, library dependencies included as source files, and common boilerplate code files (e.g., for Python web frameworks). We also use regular expressions to detect email addresses in the code files and replace them with a dummy address (removed@example.com) to reduce the risks of the model memorizing real email addresses or hallucinating fake ones.
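The deduplication key and the email replacement can be sketched as follows. Several details are assumptions made for illustration: the regular expressions for alphanumeric tokens and for email addresses are simple stand-ins, the function names are hypothetical, and an exact Python set replaces the Bloom filter (a Bloom filter trades a small false-positive rate for much lower memory use on a large corpus).

```python
import hashlib
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9]+")  # assumed definition of an alphanumeric token
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")  # illustrative only


def dedup_key(extension, text):
    """Key made of the file extension, the token count, and an MD5 hash of the tokens."""
    tokens = TOKEN_RE.findall(text)
    digest = hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()
    return (extension, len(tokens), digest)


def deduplicate(files):
    """Keep the first (extension, text) pair seen for each key."""
    seen, kept = set(), []  # a set stands in for the Bloom filter here
    for extension, text in files:
        key = dedup_key(extension, text)
        if key not in seen:
            seen.add(key)
            kept.append((extension, text))
    return kept


def redact_emails(text):
    return EMAIL_RE.sub("removed@example.com", text)
```

Because only alphanumeric tokens enter the hash, files that differ solely in whitespace or punctuation are treated as duplicates.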

Decontamination.

To ensure that our code generation models can be evaluated on several current code generation benchmarks, we perform data decontamination: removing overlap between our training data and the evaluation sets of these benchmarks. We remove any repositories contained in the validation and test sets of CodeSearchNet (Husain et al., 2019), as these are used to construct validation and test sets for several of the tasks in CodeXGLUE (Lu et al., 2021). We also search for occurrences of code from the HumanEval (Chen et al., 2021a) dataset using substring match, but we did not find any matches in our training set. The solutions to the problems in this dataset, at the time that we obtained our files from GitHub, were only contained in JSON files. We additionally removed repositories in the evaluation sets of the JuICe tasks (Agashe et al., 2019), although we did not evaluate our model on these tasks in the present work.

Filtering.

Our filtering is similar to past work on generative models of code (Chen et al., 2021a; Nijkamp et al., 2022; Xu et al., 2022): we remove files that contain any line longer than 3000 tokens, have an average line length greater than 100 tokens, have fewer than 40% of their characters being alphanumeric or underscores, or appear to be automatically generated, which we determine using substring match on a small number of phrases produced by automatic code and documentation generation systems (our filters target files automatically generated by the Protocol Buffer Compiler, Django, Epydoc, Pdoc, Sphinx, MkDocs, or MathJax). Our decontamination and filtering steps together remove roughly 10% of Python files.
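A minimal sketch of these filtering rules is given below. The thresholds come from the text; everything else is an assumption made for illustration: tokens are approximated by whitespace splitting, empty files are dropped, and ILLUSTRATIVE_MARKERS holds a single example phrase rather than the actual phrase list, which the paper does not enumerate.

```python
# Assumption: an example marker phrase only; the real phrase list is not given.
ILLUSTRATIVE_MARKERS = ("Generated by the protocol buffer compiler",)


def keep_file(text, markers=ILLUSTRATIVE_MARKERS):
    """Return True if the file passes all filters, False if it should be removed."""
    lines = text.splitlines()
    if not lines:
        return False  # assumption: empty files are dropped
    # Assumption: line length in "tokens" is approximated by whitespace splitting.
    line_lengths = [len(line.split()) for line in lines]
    if max(line_lengths) > 3000:
        return False
    if sum(line_lengths) / len(line_lengths) > 100:
        return False
    # Fraction of characters that are alphanumeric or underscores.
    alnum = sum(1 for ch in text if ch.isalnum() or ch == "_")
    if alnum / len(text) < 0.4:
        return False
    # Substring match against phrases emitted by code/documentation generators.
    if any(marker in text for marker in markers):
        return False
    return True
```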

A.2 StackOverflow

The second component of our corpus consists of questions, answers, and comments from StackOverflow. The Pile (Gao et al., 2020), which was used to train recent generative code models that we compare to in Section 5, also contains these questions and answers but does not contain comments. We include all questions that have at least one answer, up to ten answers with a non-negative score (sorted by score) per question, and up to five comments per question/answer. Qualitatively, we find that comments, together with the infilling ability of the model, allow our model to have some capability to do interactive code editing guided by language (see Figure 11).

A.3 Metadata

We include some metadata on the code files and StackOverflow questions/answers directly in our training data to allow attribute-conditioned generation (Keskar et al., 2019; Zellers et al., 2019) and attribute prediction. For code file data, our attributes are the code filename, the file extension (as a proxy for language), the file source (GitHub or GitLab), and, for GitHub repositories, the number of stars binned into six buckets. We size bins using an inverse binary logarithmic scale, so that bucket 0 corresponds to the repositories with star counts up to the 50th percentile, bucket 1 corresponds to the 50th to 75th percentiles, and bucket 5 to those in the 97th percentile and above. To allow this metadata to be optional when performing left-to-right prompting of the model, we insert each attribute at the beginning of its document with a probability of 50% (allowing the model to learn metadata conditioning); otherwise, we insert it at the end of its document (allowing metadata prediction). See Figures 6(a) and 6(b) for examples. For StackOverflow, our metadata attributes are the question tags for the topic (e.g., python,django) and the number of votes for each question and answer, binned in the same way as repository stars. We insert comments directly after the questions or answers they were written for. See Figure 6(c) for examples.
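The binning and attribute placement can be sketched as follows. This is an illustration under stated assumptions: star_bucket takes a percentile rank in [0, 1] rather than a raw star count, the treatment of values exactly on a bucket boundary is a choice made here, and the attribute string passed to attach_attribute is a hypothetical placeholder rather than the serialization actually used in the training data.

```python
import math
import random

NUM_BUCKETS = 6


def star_bucket(percentile_rank):
    """Map a percentile rank in [0, 1] to a bucket on an inverse binary log scale.

    Bucket 0 covers ranks below 0.5, bucket 1 covers [0.5, 0.75), and so on;
    the last bucket absorbs everything from roughly the 97th percentile up.
    """
    if percentile_rank >= 1.0:
        return NUM_BUCKETS - 1
    return min(NUM_BUCKETS - 1, math.floor(-math.log2(1.0 - percentile_rank)))


def attach_attribute(document, attribute, rng):
    """Prepend the attribute with probability 0.5, otherwise append it."""
    if rng.random() < 0.5:
        return attribute + "\n" + document  # supports metadata conditioning
    return document + "\n" + attribute      # supports metadata prediction


# Hypothetical attribute format, for illustration only.
example = attach_attribute("print('hi')", "stars-bucket: 2", random.Random(0))
```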

A.4 Tokenization

To increase the amount of context that our code model can condition on, the length of documents that the model can generate, and the efficiency of training and inference, we train a byte-level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019). We allow tokens to extend across whitespace (excluding newline characters) so that common code idioms (e.g., import numpy as np) are represented as single tokens in the vocabulary. This substantially improves the tokenizer’s efficiency, reducing the total number of tokens required to encode our training corpus by 45% relative to the byte-level BPE tokenizer and vocabulary of GPT-2.

A.5 Corpus statistics

See Figure 3 for a plot showing code corpus composition (after deduplication and filtering) by total file size for the most common languages, as determined by file extension.

Appendix B Model and Inference Details

Our primary model is InCoder-6.7B, a 6.7B-parameter Transformer (Vaswani et al., 2017) language model. We use the same architecture as the dense 6.7B models described in Artetxe et al. (2021); the Fairseq architecture description can be found in Table 6. InCoder-6.7B was trained on 248 V100 GPUs for 24 days. We perform one epoch on the training data, using each training document exactly once. Our implementation uses the causal masking implementation (Aghajanyan et al., 2022a) available in Fairseq (Ott et al., 2019), with PyTorch (Paszke et al., 2019) as the underlying library. Our per-GPU batch size was 8, with a maximum token sequence length of 2048. We clip all gradient norms to 1.0 and use the Adam optimizer (Kingma & Ba, 2015) with $\beta_{1}=0.9$, $\beta_{2}=0.98$. For our learning rate scheduler, we use the built-in polynomial decay learning rate scheduler available in PyTorch (Paszke et al., 2019) with 1500 warmup updates. Fairscale was used to improve memory efficiency by fully sharding model states (Baines et al., 2021).

We compare the validation perplexity of the 6.7B parameter model and a smaller 1.3B parameter model (see Section 5 for details on the training of this 1.3B model) in Figure 5, showing comparable scaling laws to those reported by Aghajanyan et al. (2022a). Our models have also not yet saturated and would benefit from further training; we report the performance of the 6.7B model on the HumanEval Python function synthesis benchmark (Chen et al., 2021a) (see Section C.6 for a description of this benchmark) and see a consistent increase in performance over the course of training (Figure 5).

B.2 Inference details

In practice, to generate a single infill we sample from the distribution $P(\cdot \mid [\texttt{Left}; \texttt{<Mask:0>}; \texttt{Right}; \texttt{<Mask:1>}; \texttt{<Mask:0>}])$, where we insert an artificial <Mask:1> token. Not inserting <Mask:1> gives an implicit size hint to the model that the <Mask:0> token should be expanded to fill the rest of the 2048 token context window. Instead, inserting a <Mask:1> token indicates to the model that some amount of the document is omitted after the right context. We found that including this substantially improved the ability of the model to predict <EOM> appropriately when generating an infill for <Mask:0>. See Aghajanyan et al. (2022a) for more.

More generally, when inserting at multiple locations, we condition on the document with multiple mask sentinel tokens inserted and a final mask token appended. For example, to insert at two locations we use [A; <Mask:0>; C; <Mask:1>; E; <Mask:2>] and infill the masks in order, appending the appropriate sentinel tokens to signal the start of generation for the next span, i.e., the completed document for two insertion locations is represented by $[\texttt{A}; \texttt{<Mask:0>}; \texttt{C}; \texttt{<Mask:1>}; \texttt{E}; \texttt{<Mask:2>}; \texttt{<Mask:0>}; \texttt{B}; \texttt{<EOM>}; \texttt{<Mask:1>}; \texttt{D}; \texttt{<EOM>}]$, where regions B and D have been infilled.
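The sequence layout for multi-span infilling can be sketched as below. This is a schematic illustration only: segments are kept as list items rather than token ids, and generate is a hypothetical callable standing in for sampling from the model until it emits <EOM>; neither it nor the function names are part of the released code.

```python
EOM = "<EOM>"


def mask(i):
    """Sentinel token for the i-th masked span."""
    return f"<Mask:{i}>"


def build_prompt(segments):
    """Interleave known segments with sentinels, ending with an extra sentinel.

    For segments [A, C, E] this gives [A, <Mask:0>, C, <Mask:1>, E, <Mask:2>],
    where the final sentinel is the artificial one appended after the right context.
    """
    sequence = []
    for i, segment in enumerate(segments):
        sequence.append(segment)
        sequence.append(mask(i))
    return sequence


def infill(segments, generate):
    """Fill each gap in order; generate(sequence) is a hypothetical model call
    returning the span produced before <EOM>."""
    sequence = build_prompt(segments)
    spans = []
    for i in range(len(segments) - 1):
        sequence.append(mask(i))      # signals the start of generation for span i
        span = generate(sequence)
        spans.append(span)
        sequence.extend([span, EOM])
    return sequence, spans


def splice(segments, spans):
    """Reassemble the edited document in its natural order."""
    out = []
    for i, segment in enumerate(segments):
        out.append(segment)
        if i < len(spans):
            out.append(spans[i])
    return out
```

With two segments (Left, Right), build_prompt followed by the first appended sentinel reproduces the single-infill conditioning sequence given in the preceding paragraph.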

Appendix C Experimental Details and Supplementary Results

We describe our adaptation of models from prior work to the zero-shot infilling setting, for the experiments described in Section 4.

We use PLBART-Large (Ahmad et al., 2021), an encoder-decoder model trained on code (including 220 GB of Python) using a BART (Lewis et al., 2019) masked denoising objective. We pre- and post-process each HumanEval infilling example as needed for PLBART: we represent each example as a stream of space-separated tokens (as identified by Python’s built-in lexer) with newlines and indentations replaced by control characters, and use a mask token to represent the line to be infilled. We extract the infilled region from the output by searching for the longest suffix of the left context contained in the output, and (as in our left-to-right baselines) take the ground-truth number of lines following this left context suffix as the infill.

Left-to-right with templated prompting (Codex).

We perform zero-shot prompting on the Codex code-cushman-001 and code-davinci-001 OpenAI API models using the following template:

We take [code after the infill mask] as the indicator of completion.

C.2 Code Cloze (CodeXGLUE)

CodeXGLUE cloze is created from CodeSearchNet to evaluate CodeBERT and consists of a short natural language description followed by code in several programming languages. We evaluate on the max/min subtask, where the model has to decide if the given mask should be filled with either max or min. Since there are only two options in this task, we can closely compare the causal-masked infilling and left-to-right setups by scoring both options and selecting the sequence with the highest likelihood.

Table 7 contains the main results. Using the causal-masked infill format with a single token (containing min/max) as the masked region (CM infill-token) performs better than using just the left context, but not as well as scoring the entire sequence left to right. Masking a larger region (CM infill-region), containing the left prefix and 10 right-side tokens in the masked region, performs comparably to scoring the whole sequence. Infill region length and tokenization can affect the performance; see Section C.3 for more details and more comparisons.

Note that scoring sequences that differ only in their infills is more computationally expensive in the left-to-right setup than in the CM infilling setup: the Transformer intermediate activations can be cached and shared across identical sequence prefixes, and in the CM infill setup all sequence differences occur at the ends.
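The caching argument can be made concrete with a small sketch that builds both candidate layouts for a toy example and measures how long a prefix the two candidates share. The exact sequences scored in the paper are not reproduced here; the CM layout below assumes the single-infill format of Section B.2 (with the trailing <Mask:1> and <EOM>), and the token lists are illustrative.

```python
def left_to_right_candidates(left, right, options):
    """One full sequence per option, with the option in its natural position."""
    return [left + [option] + right for option in options]


def cm_infill_candidates(left, right, options):
    """Causal-masked layout: the option is moved to the end of the sequence."""
    prefix = left + ["<Mask:0>"] + right + ["<Mask:1>", "<Mask:0>"]
    return [prefix + [option, "<EOM>"] for option in options]


def shared_prefix_len(a, b):
    """Number of leading positions at which two sequences agree."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


left = ["return", "Math", "."]
right = ["(", "a", ",", "b", ")"]
l2r = left_to_right_candidates(left, right, ["max", "min"])
cm = cm_infill_candidates(left, right, ["max", "min"])
```

In this toy case the left-to-right candidates diverge right after the left context, so the entire right context must be re-encoded per option, whereas the CM candidates share everything up to the infilled token.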

C.3 Cloze and single token infill details

As shown in Table 8, breaking tokenization (-break) on infill decreases the performance using all scoring methods. For example, whereas Math.max( was a single token in the full sequence, the sequence is broken into Math., max, and ( for infilling. Infilling with the original tokenization increases the performance slightly, but does not match full left-right scoring. We suspect this is because the model was not trained on infilling single tokens, unlike CodeBERT. A way to fix this is to include a larger region on the left and a few more tokens on the right. This will only slightly increase the scoring complexity. To show that our model uses the right context, we compare it with scoring the left-only model. More precisely, the sequences being scored are

C.4 Comparison to OpenAI’s Code API

We evaluate OpenAI’s proprietary code-davinci-002 system, accessed through their API, on our single-line HumanEval infilling task, with results given in Table 9. There is limited public information about this system, including on its training data or procedure (although Bavarian et al. 2022 describes their FIM objective as early research that helps power the model), how it performs infills, or whether any postprocessing is done on model outputs, but we report its performance to help gauge the difficulty of our new task. For both code-davinci-002 and our InCoder-6.7B model, conditioning on right-sided context improves performance, with the most substantial improvements from infilling.

C.5 Additional Type Hint Prediction Setting

Our results in Section 4.3 filtered out functions from the TypeWriter prediction set which had a return type hint of None, as these type hints are overrepresented in the dataset compared to naturally-occurring code, due to the filtering process used to construct it. For a closer comparison to the setting used in the original TypeWriter paper, we present results including these functions in Table 10. Given TypeWriter’s static analysis capabilities, and the overrepresentation of None types in this evaluation set, we add a simple post-processing step (return checks) that predicts None if the function does not have any non-trivial return statements, which captures some of the effect of TypeWriter’s analysis capabilities. In all settings, our zero-shot infill approach outperforms the left-to-right baselines, and obtains performance comparable to the supervised TypeWriter approach when return checks are used.
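A return check of this kind could look like the sketch below, written with Python's ast module. It is an assumption-laden illustration rather than the authors' code: a return is treated as trivial when it is bare or returns the literal None, nested functions, lambdas, and classes are skipped, and generators (yield) are not handled specially. The function names are introduced here.

```python
import ast

_NESTED_SCOPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda, ast.ClassDef)


def _own_returns(node):
    """Yield return statements belonging to this function, skipping nested scopes."""
    for child in ast.iter_child_nodes(node):
        if isinstance(child, _NESTED_SCOPES):
            continue
        if isinstance(child, ast.Return):
            yield child
        yield from _own_returns(child)


def has_nontrivial_return(function_source):
    """True if the function returns something other than (implicit or literal) None."""
    function = ast.parse(function_source).body[0]
    for ret in _own_returns(function):
        if ret.value is None:
            continue  # bare `return`
        if isinstance(ret.value, ast.Constant) and ret.value.value is None:
            continue  # `return None`
        return True
    return False


def apply_return_check(function_source, model_prediction):
    """Override the model's predicted return type with None when the check fires."""
    if not has_nontrivial_return(function_source):
        return "None"
    return model_prediction
```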

C.6 Comparison to Left-to-Right Generative Models on Code Synthesis

We compare to past published work on generative code models on the HumanEval (Chen et al., 2021a) and MBPP (Austin et al., 2021) benchmarks, which require models to condition on natural language descriptions (docstrings) to produce Python programs (typically a single function), and evaluate overall functional accuracy (pass rate) across examples using several test cases for each program.

We evaluate our InCoder-6.7B model in zero-shot evaluation on both of these benchmarks. For HumanEval, we follow past work by prompting with function signatures and docstring descriptions, sample 200 candidate program completions, and compute pass@1, pass@10, and pass@100 using the unbiased sampling estimator of Chen et al. (2021a). For MBPP, which does not include function signatures, we prompt only with the docstring description and compute pass@1 (Chowdhery et al., 2022) using a single candidate. While this setting is not directly comparable to the three-shot setting where the models of Austin et al. (2021) and Chowdhery et al. (2022) performed best, we found that our model did not benefit from additional examples in the prompt, which we attribute to the much smaller size of our model (6.7B, versus 137B or 540B parameters) and the sensitivity of in-context learning to model scale. We use top-p sampling with $p=0.95$, with a temperature of 0.2 for pass@1 and 0.8 for pass@10 and pass@100.
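For reference, the unbiased pass@k estimator of Chen et al. (2021a) computes, for a problem with n sampled candidates of which c pass the tests, the quantity 1 - C(n-c, k) / C(n, k), and averages it over problems. A minimal sketch follows; the function names and the input format of mean_pass_at_k are choices made here for illustration.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = 200 samples per problem, as used for HumanEval above, the same set of samples yields estimates for k = 1, 10, and 100.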

We compare our InCoder-6.7B model to models from past work (which have all been left-to-right only) in Table 11, giving the model size and training data summary statistics as reported (or estimated, in cases when a paper only reports token counts, as tokenizer efficiencies vary) in these papers. While differences in details of the Transformer model architectures, datasets, and training procedures across papers and experimental setups make a rigorous comparison impossible, we note that our model achieves roughly comparable performance on the HumanEval metrics to CodeGen-Multi (Nijkamp et al., 2022), which is also a $\sim$6B parameter model trained on roughly the same amount of Python code, as well as AlphaCode’s 1.1B decoder-only model (Li et al., 2022), which also uses a similar amount of Python training data.