CoditT5: Pretraining for Source Code and Natural Language Editing

Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

Introduction

Large language models pretrained on massive amounts of data have led to remarkable progress in recent years, with models like BART (Lewis et al., 2020), GPT (Radford et al., 2019; Brown et al., 2020), and T5 (Raffel et al., 2020) yielding huge improvements for a vast number of text generation tasks. Inspired by this, a new research initiative has emerged around building large models that are pretrained on source code and technical text to address software-related tasks. This includes models like PLBART (Ahmad et al., 2021), CodeGPT-2 (Lu et al., 2021), and CodeT5 (Wang et al., 2021a). While these models demonstrate impressive performance on generation tasks like code summarization, code generation, and code translation, it is unclear if they are well-suited for the editing nature of many software-related tasks. For instance, bug fixing (Tufano et al., 2019b) entails editing source code to resolve bugs, automated code review (Tufano et al., 2021) requires editing source code to incorporate feedback from review comments, and comment updating (Panthaplackel et al., 2020; Liu et al., 2021; Lin et al., 2021; Gao et al., 2021) pertains to updating outdated natural language comments to reflect code changes.

In principle, such editing tasks can be framed as standard generation tasks in which an input sequence (e.g., buggy code snippet) is completely re-written to form the output sequence (e.g., fixed code snippet). In this way, existing pretrained conditional generation models can be fine-tuned to autoregressively generate a sequence from scratch. However, this can be problematic in practice (Panthaplackel et al., 2020). When applying large generation models like PLBART and CodeT5 to these tasks, we find that they can generate output which merely copies the input without performing any edits (up to 34.25%) or even deviates substantially from the input, introducing irrelevant changes. We provide an example of automated code review in Figure 1, where a reviewer prescribes edits that need to be made to a given code snippet: “Generally better to qualify than making static import”. Using the code snippet and this comment, PLBART generates an output sequence which copies the original code, without applying any edits. While the output is valid and a likely sequence according to PLBART’s language model, it makes no edits based on the reviewer’s comments.

We attribute these weaknesses to the fact that such models rely on pretraining objectives designed for generating code (or software-related natural language) in sequence by exploiting patterns with respect to preceding tokens. Therefore, a model has to learn to implicitly perform edits by generating tokens one by one in accordance with the underlying probability that it has learned for which tokens belong alongside one another, rather than being aware of where information should be retained or modified.

Intuitively, edit-based generation requires a different approach that more frequently refers back to the input sequence, and can often be characterized by localized operations (e.g., insertion, deletion, substitution). To guide a model in discerning edit locations in the input sequence and reasoning about the necessary edit operations, we design a novel pretraining objective that explicitly models edits. Our approach is inspired by content planning in natural language generation, where a skeleton of key elements is first generated and used to guide more accurate and precise generation of full text (Reiter and Dale, 1997; Pichotta and Mooney, 2016; Martin et al., 2018; Fan et al., 2019). Specifically, during decoding, a model first generates an edit plan that explicitly details the edit operations. Then, it proceeds to autoregressively generate the target edited sequence, during which it attends to the edit plan. Through this, we effectively encourage the model to learn to better reason about edits and how they should be applied to form the target sequence. Using this objective, we develop CoditT5, a large language model for software-related edit tasks that is pretrained on more than 5.9 million open-source programming language code snippets and 1.6 million natural language comments from the CodeSearchNet (Husain et al., 2019) training data.

For evaluation, we fine-tune CoditT5 on three downstream tasks: comment updating, bug fixing, and automated code review. For each of these tasks, we show that CoditT5 outperforms state-of-the-art models as well as large pretrained standard generation-based models. Through this, we demonstrate that our model and the proposed edit-based pretraining objective generalize across tasks and are better suited for editing tasks in the software domain.

Furthermore, in our evaluation, we find that our edit-based model, CoditT5, can be further improved if combined with a standard generation-based model. We find that the edit-based and standard generation-based models are complementary to one another. Namely, while the edit-based model provides better explicit modeling of concrete edits, a standard generation-based model provides certain advantages in terms of the contextual coherence of the generated target sequence. To exploit this complementary nature of these models, we combine the two models through reranking strategies which require no additional training. Our results show that the combined approaches outperform the two models individually by up to 19.35%.

We summarize our main contributions as follows:

We formulate a novel pretraining objective that entails first generating a plan consisting of edit operations to be applied to the input sequence followed by the resulting target sequence.

We build and release CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language with the new pretraining objective.

Upon task-specific fine-tuning, we show that CoditT5 achieves improved performance over existing models for three distinct downstream editing tasks (comment updating, bug fixing and automated code review), demonstrating its effectiveness and generalizability.

We show that by combining our edit-based CoditT5 model with a standard generation model through simple reranking strategies, we can beat each of the individual models and achieve new state-of-the-art in all three tasks, demonstrating the complementary nature of edit-based and standard generation models.

Our code and data are publicly available at https://github.com/EngineeringSoftware/CoditT5.

Background

We first give a high-level overview of the building blocks that are necessary to understand our approach.

Conditional sequence generation entails generating an output sequence given an input sequence. Many tasks are framed in this manner, including machine translation (e.g., translating a sentence from French to English) (Bahdanau et al., 2015), text summarization (e.g., generating a brief summary for a given news article) (Rush et al., 2015), and code generation (e.g., generating a code snippet for a given natural language specification) (Yin and Neubig, 2017).

In recent years, conditional sequence generation tasks have been addressed with encoder-decoder models. An encoder-decoder model consists of two neural components: an encoder and a decoder. The input sequence is fed into the encoder, which produces learned vector representations of the tokens in that sequence. These learned vector representations are then passed into the decoder, which generates the output sequence one token at a time. Specifically, the decoder predicts the next token by reasoning over the input sequence and the tokens generated at previous time steps.

Transformers (Vaswani et al., 2017) are powerful neural models that are commonly adopted as the encoder and decoder in the encoder-decoder framework. These models rely on an attention mechanism to learn representations for tokens by relating them to other tokens in the sequence. Namely, a transformer-based encoder will learn representations for each token in the input sequence by “attending” to other input tokens. For the decoder, when generating a token at timestep $t$, it will “attend” to the representations of the output tokens generated from timestep 1 to $t-1$ as well as the representations of tokens from the input sequence. Transformer models can become very large, with huge numbers of attention heads and of encoder and decoder layers.

2.2. Large Pretrained Language Models

Large pretrained language models generally refer to the class of large transformer-based models that are trained on large amounts of unlabeled data (collected from webpages, news articles, etc.) with unsupervised training objectives. This includes a vast number of models like GPT (Radford et al., 2019; Brown et al., 2020), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020).

BART and T5 models are pretrained using denoising autoencoding unsupervised training objectives. Namely, a noising function is first applied to a given input sequence inp to form inp′. Common noising functions include Token Masking: tokens in the input sequence are randomly masked; Token Deletion: random tokens are deleted from the input sequence; Token Infilling: a span of tokens is sampled and replaced with a mask token; and Sentence Permutation: sentences in the document are shuffled in a random order. Then, inp′ is fed into a model’s encoder, and the encoder’s learned representation is passed into the decoder, which generates an output sequence, out, that is expected to resemble the original input sequence (inp). In other words, the model is trained to “denoise” inp′, using a training objective that minimizes the error between out and the original input, inp. Through this, the model learns to extract meaning from the input sequence and also generate fluent and coherent output. Therefore, by pretraining on massive amounts of data, the model develops an understanding of how things in the world relate to one another as well as a strong language modeling capability.

Since large pretrained language models are trained using unsupervised training objectives on huge amounts of data, they cannot generally be directly applied to downstream tasks (e.g., translation, summarization). Fine-tuning is a common technique to transfer the knowledge learned during pretraining to target downstream tasks. Specifically, the pretrained model is further trained for the downstream task on some amount of supervised data.

2.3. Large Pretrained Language Models for Software Engineering

Inspired by the success of large pretrained models in Natural Language Processing (NLP), a number of machine learning models pretrained on source code and technical text have been proposed for solving various software-related problems.

For instance, inspired by BART, Ahmad et al. (2021) developed PLBART, which is a large pretrained language model that can be fine-tuned for a number of code understanding (e.g., code summarization) and generation (e.g., code translation) tasks. Similarly, inspired by T5, Wang et al. (2021a) built a larger model, CodeT5, which is pretrained on six programming languages together with their natural language comments collected from open-source repositories. Specifically, it is pretrained to incorporate information from identifiers in the code. CodeT5 has shown promising results in code-related generation tasks such as code summarization and code generation, and in code-related understanding tasks such as clone detection and vulnerability identification. However, the aforementioned models are designed for generation, and they are only implicitly aware of edit operations, if at all.

CoditT5

CoditT5 is built upon the encoder-decoder framework with the same architecture as CodeT5. As shown in Figure 2, the model is pretrained with our proposed objective: generating the edit-based output sequence given the corrupted input sequence. In this section, we first explain our proposed pretraining objective (Section 3.1). We then discuss how we build CoditT5 by pretraining on this objective, including the data used for pretraining (Section 3.2), and additional details of the pretraining setup (Section 3.3).

We formulate a new pretraining objective that is designed to encourage a model to explicitly reason about edits. At a high level, this objective falls under the realm of denoising autoencoding in which an input sequence is first corrupted with noising functions and the model is trained to denoise the corrupted sequence by generating an output sequence that matches the original input sequence. While existing models like PLBART and CodeT5 pretrained using this setup perform very well on various generation tasks (e.g., code summarization/generation), we find that they do not generalize well when fine-tuned on editing tasks. Namely, they are susceptible to learning to copy the original input sequence instead of actually performing edits, up to 34.25% of the time (Table 3).

We propose the following edit-based output sequence representation (shown in Figure 2): [Edit Plan] <s> [Target Sequence], where the model is trained to generate an edit plan (1) consisting of explicit edit operations that must be applied to the corrupted sequence to reconstruct the original input sequence, followed by a separation token (<s>), and finally the target sequence (2) that matches the original input sequence. This is inspired by the concept of content planning, originating from natural language generation (Reiter and Dale, 1997). In content planning, a high-level plan is first outlined, specifying the discourse structure of the content to be generated, and then lexical realization is performed to generate the text.

The edit plan entails the specific edit operations that are needed to recover the original input sequence. For example, in Figure 2, the input sequence: “@param users List of user objects” is corrupted by masking “users” and removing token “user”: “@param [MASK] List of objects”. With this, a model must first reason about the fact that [MASK] in the corrupted input sequence needs to be replaced with “users” and “user” should be inserted between “of” and “objects” when producing the target sequence. To construct the sequence of edit operations, we closely follow the format proposed by Panthaplackel et al. (2020):

<Operation> [span of tokens] <OperationEnd>

Here, <Operation> is either Insert or Delete. We also include the Replace operation, with a slightly different structure (since both the old content to be replaced as well as the new content to replace it with must be specified):

<ReplaceOld> [span of old tokens] <ReplaceNew> [span of new tokens] <ReplaceEnd>

To determine the specific edit operations for a given example, we use difflib (https://docs.python.org/3/library/difflib.html) to compute the optimal set of edits needed to transform the corrupted input sequence into the original input sequence. Multiple edit operations are placed in the same order as the span of tokens under editing appears in the input sequence (for example, the edit plan in Figure 2 consists of two edit operations).
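The construction of an edit plan can be illustrated with a minimal sketch based on `difflib.SequenceMatcher`. This is not the released implementation: the function name `build_edit_plan` and the literal tag strings are assumptions made for illustration, and whitespace splitting stands in for the real tokenizer.

```python
import difflib


def build_edit_plan(corrupted, original):
    """Return edit-plan tokens that turn `corrupted` into `original`.

    Both arguments are lists of tokens. The tag strings below are
    assumed names, chosen only for this illustration.
    """
    matcher = difflib.SequenceMatcher(a=corrupted, b=original, autojunk=False)
    plan = []
    # Opcodes come back ordered by position in `corrupted`, so the plan
    # follows the order in which edited spans appear in the input.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            plan += ["<Insert>", *original[j1:j2], "<InsertEnd>"]
        elif tag == "delete":
            plan += ["<Delete>", *corrupted[i1:i2], "<DeleteEnd>"]
        elif tag == "replace":
            plan += ["<ReplaceOld>", *corrupted[i1:i2],
                     "<ReplaceNew>", *original[j1:j2], "<ReplaceEnd>"]
        # "equal" spans are unchanged and are left out of the plan.
    return plan


corrupted = "@param [MASK] List of objects".split()
original = "@param users List of user objects".split()
print(" ".join(build_edit_plan(corrupted, original)))
```

For the running example, this yields one Replace operation (for the masked token) followed by one Insert operation (for the deleted token), matching the two-operation plan described above.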

3.1.2. Target Sequence

One might ask whether we could simply apply the sequence of edit operations in the generated edit plan to the corrupted input sequence directly to recover the original input sequence heuristically. For example, if we align “<ReplaceOld> [MASK] <ReplaceNew> user <ReplaceEnd>” with a corrupted input sequence “@param [MASK] List of user objects”, it is very clear that all we need to do is replace [MASK] with “user” and no additional generation is needed. However, there are two main issues with this. First, not all operations will be specified in a deterministic manner. For example, if the edit plan is “<Insert> user <InsertEnd>”, it is not clear where the new token “user” should be added. Second, the generated edit plan does not correspond to contiguous output tokens since it consists of fragmented information (edit operations and token spans) rather than a complete sentence. As a result, neural language models may fail to generate correct edit plans due to their lack of language properties such as fluency and coherency (Panthaplackel et al., 2020).

Therefore, we need an additional step for learning to apply edits while simultaneously maintaining fluency and coherency. For this reason, once the edit plan is outlined as a sequence of edit operations, the target sequence (which is expected to recover the original input sequence) must also be generated: “@param users List of user objects”. The decoder generates tokens in a left-to-right manner, meaning that when generating a token at a given timestep, it is aware of all tokens generated in previous timesteps. So, when generating the target sequence, the decoder can exploit the sequence of edits that was generated in the edit plan earlier. In this way, the model can reason about the edits and the generation simultaneously.

3.1.3. Noising Functions

To support learning across a diverse set of edit actions during pretraining, we consider multiple noising functions for corrupting the input sequence: 1) randomly masking spans with the special [MASK] token, which requires the model to replace it with the correct spans; 2) inserting [MASK] tokens at random positions, which requires the model to identify the useless spans and delete them; and 3) deleting spans of tokens in the input sequence, which requires the model to pinpoint the positions and add back the missing pieces.
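The three noising functions can be sketched as simple list operations. This is an illustrative sketch only: the function names are hypothetical, positions are passed in explicitly rather than sampled, and replacing a whole span with a single [MASK] token is an assumption.

```python
MASK = "[MASK]"


def mask_span(tokens, start, length):
    """Replace tokens[start:start+length] with one [MASK].

    Recovering the original requires a Replace operation.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]


def insert_mask(tokens, position):
    """Insert a spurious [MASK] at `position`.

    Recovering the original requires a Delete operation.
    """
    return tokens[:position] + [MASK] + tokens[position:]


def delete_span(tokens, start, length):
    """Drop tokens[start:start+length].

    Recovering the original requires an Insert operation.
    """
    return tokens[:start] + tokens[start + length:]


original = "@param users List of user objects".split()
# Mask "users", then delete "user", as in the running example.
corrupted = delete_span(mask_span(original, 1, 1), 4, 1)
print(" ".join(corrupted))  # @param [MASK] List of objects
```

Each function returns a new list and leaves its input untouched, so several corruptions can be chained on the same example.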

3.2. Pretraining Data

Following prior work, we pretrain CoditT5 on large amounts of source code and natural language comments from the CodeSearchNet (Husain et al., 2019) dataset, which consists of functions in six programming languages (Java, Python, Ruby, PHP, Go, and JavaScript) together with their natural language comments. CodeSearchNet is widely used to pretrain large language models, such as CodeT5 (Wang et al., 2021a) and UniXcoder (Guo et al., 2022). We use the training set of the processed CodeSearchNet dataset provided by Guo et al. (2022), which contains 6.1 million programming language code snippets (functions/methods) and 1.9 million natural language comments.

3.2.2. Data Preparation

To enable CoditT5 to capture common edit patterns, we want the pretraining dataset to reflect the common activities conducted by software developers. Specifically, in the pretraining dataset, the probability of each edit operation applied to the spans in the input sequence and the length (number of tokens) of the corrupted span should be consistent with the distributions and sizes of real-world edits in downstream editing tasks.

To this end, we collect statistics for source code edits from the training sets of the bug fixing and automated code review downstream tasks and statistics for natural language edits from the comment updating’s training set. As shown in Table 1, we collect the probability of each edit operation (insert, delete and replace) to be performed on a span; the average number of tokens in each span that is edited; and the average number of spans that are edited in each input sequence. For each example in the pretraining dataset, we then uniformly sample the spans and the edit operations that should be applied in accordance with the statistics collected from the downstream datasets.
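The sampling step can be sketched as follows. The numbers in `EXAMPLE_STATS` are placeholders invented for illustration and are not the values in Table 1; the function name `sample_edits` and the dictionary layout are likewise assumptions.

```python
import random

# Placeholder statistics for illustration only; NOT the values from Table 1.
EXAMPLE_STATS = {
    "op_probs": {"insert": 0.3, "delete": 0.2, "replace": 0.5},
    "avg_span_len": 2,
    "avg_num_spans": 2,
}


def sample_edits(num_tokens, stats, rng):
    """Sample (operation, start, length) triples for one input sequence."""
    ops = list(stats["op_probs"])
    weights = [stats["op_probs"][op] for op in ops]
    span_len = min(stats["avg_span_len"], num_tokens)
    edits = []
    for _ in range(stats["avg_num_spans"]):
        # Operation type follows the collected operation probabilities.
        op = rng.choices(ops, weights=weights, k=1)[0]
        # Span start is drawn uniformly over all valid positions.
        start = rng.randrange(num_tokens - span_len + 1)
        edits.append((op, start, span_len))
    # Order edits by position in the input sequence.
    return sorted(edits, key=lambda edit: edit[1])


edits = sample_edits(num_tokens=10, stats=EXAMPLE_STATS, rng=random.Random(0))
print(edits)
```

This sketch does not prevent sampled spans from overlapping; a fuller version would need to resolve or resample overlapping spans before corrupting the input.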

Similar to CodeT5 (Wang et al., 2021a), we use the RoBERTa (Liu et al., 2019) tokenizer to tokenize all sequences (input, edit plan, target). More concretely, the tokenizer splits words in the sequence into tokens (subwords) that are used by the model. Moreover, we remove input sequences that are shorter than 3 tokens or longer than 512 tokens after tokenization, which leaves us with 5.9 million programming language code snippets and 1.6 million natural language comments. We do this because inputs that are too short are usually incomplete and CodeT5 is designed to handle sequences of at most 512 tokens. Table 2 presents the statistics of the pretraining dataset.
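The length filter amounts to a simple bounds check on the tokenized sequence. The sketch below assumes the sequences have already been tokenized into lists; the function names are hypothetical.

```python
def within_length_bounds(tokens, min_len=3, max_len=512):
    """Return True if min_len <= number of tokens <= max_len."""
    return min_len <= len(tokens) <= max_len


def filter_corpus(tokenized_sequences):
    """Keep only the sequences that satisfy the length bounds."""
    return [seq for seq in tokenized_sequences if within_length_bounds(seq)]


corpus = [["a", "b"], ["a", "b", "c"], ["x"] * 512, ["x"] * 513]
print([len(seq) for seq in filter_corpus(corpus)])  # [3, 512]
```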

3.3. Pretraining Setup

CoditT5 consists of 12 encoder and decoder layers, 12 attention heads, and a hidden dimension size of 768. The total number of parameters is 223M. Model parameters are initialized from the CodeT5-base model, and we further pretrain it on the CodeSearchNet pretraining dataset (Section 3.2) using our proposed objective (Section 3.1).

We implement CoditT5 using PyTorch 1.9.0 and use 16 NVidia 1080-TI GPUs, Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz for pretraining for 4 days. For fine-tuning, we run the experiments on 4 NVidia 1080-TI GPUs, Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with the same hyper-parameters as CodeT5.

Experimental Design

To assess CoditT5 and our proposed pretraining objective, we fine-tune the model on three software-related downstream tasks. Note that during fine-tuning, the model is still trained to generate the edit-based output sequence. However, at test time, we discard the edit plan and take the generated target sequence as the final model output. Namely, we use the generated sequence after the separation token <s> as the model’s prediction.
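Extracting the prediction at test time can be sketched as a string split on the separation token. This sketch assumes the decoded output is a plain string and that the separator appears literally as `<s>`; the function name, the fallback behavior when no separator is generated, and the example string are all assumptions for illustration.

```python
SEPARATOR = "<s>"  # assumed literal form of the separation token


def split_output(generated):
    """Split a decoded string into (edit_plan, target_sequence)."""
    plan, found, target = generated.partition(SEPARATOR)
    if not found:
        # No separator was generated: treat the whole string as the
        # target (a design assumption of this sketch).
        return "", generated.strip()
    return plan.strip(), target.strip()


# Made-up decoded output, for illustration only.
generated = ("<ReplaceOld> radians <ReplaceNew> degrees <ReplaceEnd> "
             "<s> @return yaw in degrees")
plan, prediction = split_output(generated)
print(prediction)  # @return yaw in degrees
```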

The task of comment updating entails automatically updating a natural language comment to reflect changes in the corresponding body of code (Panthaplackel et al., 2020). For instance, in Example 2 in Figure 5, the old @return comment needs to be revised based on the changes in the method. Instead of directly returning the yaw Euler angle measured in radians, the unit of the return value is changed to degrees in the new version, with the method call Math.toDegrees().

Given a buggy code snippet, the task of bug fixing entails generating a fixed code snippet, which no longer contains the bug (Tufano et al., 2019a).

Given a code snippet under review and a brief natural language sentence prescribing code edits, automated code review requires automatically generating the revised code snippet, which captures the recommended changes (Tufano et al., 2021). For example, in Figure 1, emptyList() should be changed to Collections.emptyList() because the reviewer suggests not using static import.

4.2. Data for Downstream Tasks

We use datasets that have been established and previously used for each of the three tasks. The statistics of the datasets are shown in Table 4. Unlike pretraining, where the goal is to recover the corrupted input sequences, during fine-tuning CoditT5 is trained to generate an edit plan for completing the downstream editing task, which can be applied to a part of the input (e.g., old comment), followed by the target sequence (e.g., new comment).

For comment updating, Panthaplackel et al. (2021b) have released a corpus of Java method changes paired with changes in the corresponding comments (spanning @return, @param, and summary comments). This dataset also comes with a clean subset of the test set which was manually curated. The input sequence used for fine-tuning is formed by concatenating the old comment and code edits. The code edits follow the representation described in Section 3.1.1, except that an additional Keep operation is included to denote spans that are left unchanged.

For bug fixing, we consider the Java BugFixPairs-Small ($B2F_{s}$) and BugFixPairs-Medium ($B2F_{m}$) datasets, originally released by Tufano et al. (2019a). Chakraborty and Ray (2021) supplemented these datasets with additional context, namely natural language guidance from the developer, and the method where the patch should be applied. $B2F_{s}$ contains shorter methods with a maximum length of 50 tokens, and $B2F_{m}$ contains longer methods of up to 100 tokens. The input sequence used for fine-tuning is formed with the buggy code, natural language guidance, and code context.

We use the automated code review dataset released by Tufano et al. (2021), which consists of Java methods (before and after the review) paired with pull request comments, derived from pull request reviews on GitHub and Gerrit. To reduce the vocabulary size, they further abstracted Java methods by replacing identifiers and literals with special tokens. In this work, we use the data with concrete tokens. The input sequence used for fine-tuning is formed using the code snippet before review and the pull request comment from reviewers.

4.3. Baselines

We consider two large standard generation language models trained with denoising autoencoding pretraining objectives which are not edit-based: PLBART and CodeT5. Both of these are fine-tuned to directly generate the target output sequence. Furthermore, to better assess the value of actually pretraining using the proposed objective instead of simply fine-tuning a model to generate an edit-based output sequence, we also consider fine-tuning CodeT5 to generate the specialized edit-based output sequence representation. We refer to this as CodeT5 (w/ edit-based output). We fine-tune each of these models using the same input context as CoditT5.

4.3.2. Task-Specific Baselines

We additionally compare against the state-of-the-art models for each of the downstream tasks.

For comment updating, the state-of-the-art model is Panthaplackel et al. (2020), which entails Recurrent Neural Network (RNN) based encoders for representing the old comment and code edits, and an RNN-based decoder for decoding edits. These edits are parsed at test time and reranked based on similarity to the old comment and likelihood based on a comment generation model.

For bug fixing, the state-of-the-art model is essentially PLBART fine-tuned on $B2F_{s}$ and $B2F_{m}$ to generate the fixed code (Chakraborty and Ray, 2021).

For automated code review, no baselines are available for the specific version of the dataset we used with concrete identifiers and literals (rather than the one with abstracted identifiers and literals). Therefore, we rely on those described in Section 4.3.1 and establish new baselines for this version of the dataset.

4.4. Evaluation Metrics

For comment updating, we report performance on the same metrics that have been used previously to benchmark models for this task (Panthaplackel et al., 2020). This includes: xMatch (whether the model prediction exactly matches the ground truth), common metrics that measure lexical overlap for evaluating text generation (BLEU-4 (Papineni et al., 2002), for which we measure 1- to 4-gram overlap and compute the average, and METEOR (Banerjee and Lavie, 2005)), and common metrics for measuring text editing (GLEU (Napoles et al., 2015) and SARI (Xu et al., 2016)). For bug fixing, we use xMatch, as done in prior work (Chakraborty and Ray, 2021). For automated code review, we report performance on xMatch and BLEU-4, which have been used previously to benchmark models for this task (Tufano et al., 2021).
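As an illustration of the simplest of these metrics, xMatch can be sketched as the percentage of predictions that are identical to the ground truth. Comparing after whitespace tokenization is an assumption of this sketch, not a detail stated above, and the function name is hypothetical.

```python
def xmatch(predictions, references):
    """Percentage of predictions that exactly match their reference.

    Sequences are compared token by token after whitespace splitting,
    so differences in spacing alone do not count as mismatches.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    matches = sum(p.split() == r.split()
                  for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)


predictions = ["return Collections.emptyList();", "return null;"]
references = ["return  Collections.emptyList();", "return result;"]
print(xmatch(predictions, references))  # 50.0
```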

Evaluation

We organize our evaluation around three main research questions:

RQ1: How does our edit-based model, CoditT5, compare to generation and task-specific baselines for edit-related tasks?

RQ2: Does our proposed pretraining objective help a model in better reasoning about and performing edits?

RQ3: Can a standard generation model complement CoditT5 by integrating the two models?

We present results in Tables 5-8. Note that the results shown in the last two rows in each of the tables are explained later in Section 5.3. We perform statistical significance testing using bootstrap tests (Berg-Kirkpatrick et al., 2012) with confidence level 95%.

RQ1: How does our edit-based model, CoditT5, compare to generation and task-specific baselines for edit-related tasks?

We find that CoditT5 (and most of the pretrained models) drastically outperforms Panthaplackel et al. (2020) (a non-pretrained model) across metrics for comment updating. This demonstrates the value of large language models pretrained on vast amounts of data using unsupervised pretraining objectives.

Next, across all three tasks, CoditT5 achieves higher performance than the two standard generation-based pretrained models, significantly outperforming PLBART and CodeT5 for most of the metrics, highlighting the benefit of explicitly modeling edits for these editing tasks. In fact, CodeT5 (w/ edit-based output), which explicitly models edits only during fine-tuning rather than pretraining, outperforms CodeT5 on edit-based metrics (xMatch, SARI). This further underlines the utility of the edit-based output sequence representation that we developed.

Nonetheless, across most metrics, CoditT5 still outperforms CodeT5 (w/ edit-based output), which is not pretrained using the pretraining objective but uses the same edit-based output sequence representation during fine-tuning. This demonstrates the importance of actually pretraining with this representation rather than relying on fine-tuning alone.

5.2. Evaluating our Pretraining Objective

While we observe that CoditT5 tends to achieve slightly lower performance than CodeT5 on generation-based metrics (BLEU-4, METEOR) for two of the tasks, we find that it significantly outperforms CodeT5 on metrics which capture whether the correct edits are generated, such as xMatch, as well as GLEU and SARI for comment updating. This suggests that CoditT5 is indeed better at editing. By inspecting the outputs of the two models, we find that CodeT5 tends to make drastic and unnecessary edits while CoditT5 appears to be better at making more fine-grained edits. For example, in Figure 3, CodeT5 generates output that completely discards critical statements in the code, whereas CoditT5 is able to correctly localize the part of the input code that needs to be changed and make the edits properly. We attribute this to the fact that CodeT5 is not designed to reason about edits while CoditT5 is. We further evaluate the influence of our proposed pretraining objective on this editing capability.

RQ2: Does our proposed pretraining objective help a model in better reasoning about and performing edits?

First, we compare how often CoditT5 naively copies the input content without actually performing any edits against two pretrained models which use generation-based pretraining objectives. We report the percentages in Table 3. CoditT5 copies substantially less often than PLBART and CodeT5, which shows that it learns to perform edits more frequently with our proposed edit-based pretraining objective and indicates that it is suitable for editing tasks.

CoditT5’s decoder is encouraged to generate a target sequence that follows the outlined edit plan; however, we do not constrain the decoder in any way to do this. We do not want potential errors in the edit plan to propagate to the target sequence. Nonetheless, we find that in the majority of cases (74%-92%), the target sequence is consistent with the edit plan, as shown in Table 9. More concretely, the target sequence generally resembles what would be produced if the edit operations in the edit plan were applied to the original content. This suggests that the pretraining objective does in fact guide the model in reasoning about edits.

For cases in which there is ambiguity or errors in the edit plan, we find that CoditT5 still often manages to generate the correct target sequence, by disregarding unreasonable edits or disambiguating ambiguous edits. We show two examples in automated code review in Figure 4 with the Java method before review, the generated edit plan, and the generated target sequence. In Example 1, the edit plan is ambiguous since there are multiple instances of “(” and it does not specify which one(s) should be deleted. However, the generated target sequence is correct, as the model was able to correctly reason about the most appropriate edit locations. In Example 2, the edit plan is imprecise and blindly following this plan would result in syntactically incorrect code, but the model still managed to perform the correct edits and produced valid output by ignoring the fallacious edit. Overall, we find that both components of the edit-based output sequence representation used in the pretraining objective (edit plan and target sequence) are critical.

5.3. Integrating CoditT5 and CodeT5

CoditT5 is designed to complement a generation model by providing more explicit guidance for edits. However, a model that is trained to generate edits can struggle with coherence and fluency since it is not actually trained to generate consecutive text (Panthaplackel et al., 2020). By including the generation of the target sequence in the pretraining objective, we do mitigate this to some extent, even when there are ambiguities or errors in the edit plan. However, there appears to be a trade-off between performing the correct edits while maintaining performance with respect to generation metrics. More specifically, in Tables 5-8, CoditT5 outperforms CodeT5 with respect to xMatch (and SARI for comment updating), but underperforms with respect to BLEU-4. To exploit the slight superiority of CodeT5 in this respect, we consider incorporating CodeT5 into our approach.

RQ3: Can a pure generation model complement CoditT5 by integrating the two models?

We combine the two models using simple likelihood-based reranking strategies at test time (with no additional training). Namely, at test time, CoditT5 and CodeT5 each generate 20 candidates using beam search. While all previous experiments considered only the top prediction, here we consider all 20 candidates for reranking. We compute a reranking score for each candidate, and the candidate with the highest reranking score becomes the final model prediction. We investigate two different reranking strategies:

To exploit the language-specific norms learned by CodeT5, we rerank the candidates generated by CoditT5 based on the probability score CodeT5’s language model assigns to the corresponding target sequences (namely, the part of each CoditT5 output that follows the edit plan).

We compute the length-normalized conditional log probability score of CodeT5 generating the target sequence, conditioned on the same input:

$$\mathrm{score}(T \mid I) = \frac{1}{N} \log P(T \mid I)$$

where $T$ is the target sequence, $I$ is the model’s input, and $N$ is the length of $T$. We also length-normalize the log probability of the candidate, as scored by CoditT5, and then add the two log-probability scores together to obtain the reranking score.

Conversely, we also rerank the output of CodeT5 based on the likelihood of CoditT5, such that the generated sequence can be assessed in terms of explicit edits. We first parse the output of CodeT5 into the edit-based output sequence representation (as described in Section 3.1.1), that is, an edit plan concatenated with the model’s output in the same way as in CoditT5’s own output format. Then we compute the likelihood of CoditT5 generating this sequence, conditioned on the same input. Finally, we add the length-normalized log probability score of CoditT5 to the score originally assigned by CodeT5 (after applying log and length-normalizing).
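Both strategies reduce to the same computation: for each candidate, sum the length-normalized log probability under the model that generated it and under the other model, then keep the highest-scoring candidate. The following is a minimal sketch of that step. The `Candidate` class, its field names, and the use of per-token log probabilities as input are assumptions for illustration; obtaining those log probabilities from CoditT5 and CodeT5 is outside the sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """One beam-search candidate (hypothetical container for this sketch)."""
    text: str
    # Per-token log-probs under the model that generated the candidate.
    generator_token_logprobs: Sequence[float]
    # Per-token log-probs under the other model, which scores the candidate.
    scorer_token_logprobs: Sequence[float]


def length_normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Compute (1 / N) * log P(T | I) from per-token log-probabilities."""
    if not token_logprobs:
        raise ValueError("a candidate must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def rerank_score(candidate: Candidate) -> float:
    """Sum of the two length-normalized log-probability scores."""
    return (
        length_normalized_logprob(candidate.generator_token_logprobs)
        + length_normalized_logprob(candidate.scorer_token_logprobs)
    )


def rerank(candidates: Sequence[Candidate]) -> Candidate:
    """Return the candidate with the highest reranking score."""
    if not candidates:
        raise ValueError("no candidates to rerank")
    return max(candidates, key=rerank_score)
```

The two token sequences may differ in length, since one model may score the full edit-based representation while the other scores only the target sequence; normalizing each score by its own length keeps the two terms comparable.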

3.2. Results

We provide results in the bottom two rows of Tables 5-8. By reranking the output of CoditT5 using CodeT5, we are able to achieve improved performance on all the metrics including BLEU-4 across tasks (and the other generation-based metric, METEOR, for comment updating). To illustrate this, consider Example 1 in Figure 5, with a buggy code snippet and outputs corresponding to CoditT5 before and after reranking. We observe that CoditT5 correctly localizes the bug and correctly identifies that the edit entails initializing an $\mathtt{ArrayList}$ in the return statement. However, the generated target sequence is a defective code snippet which does not properly initialize an $\mathtt{ArrayList}$ with the correct type $\mathtt{TagVFilter}$. By leveraging CodeT5’s likelihood score, we are able to effectively filter out the defective prediction and obtain the correct output.

By reranking the output of CodeT5 using CoditT5, we see significant improvements with respect to CodeT5 on metrics that more directly evaluate whether the correct edits were performed, including xMatch as well as GLEU and SARI for comment updating. This suggests that the edit-based and generation-based models are indeed complementary to one another. As a case study, consider Example 2 in Figure 5. CodeT5 produces a sequence which simply copies the old comment, without capturing the code changes. While this may be a likely comment sequence according to CodeT5’s language model, copying without applying any edits corresponds to an edit plan that CoditT5 is unlikely to generate.

By combining CoditT5 and CodeT5 through reranking, we can further boost performance substantially across most metrics for all three tasks, outperforming the two models individually and achieving a new state of the art.

Limitations

Other Programming Languages. The downstream editing tasks we studied in this work are all in Java. Since CoditT5 is pretrained on a dataset consisting of six programming languages, we expect it to also perform well on editing tasks in other programming languages, but we leave empirically verifying this as future work.

Data Contamination. CoditT5 is pretrained on data collected from open-source projects. It is possible that examples similar to those in the pretraining data appear in the test sets of the downstream tasks. While prior work (Brown et al., 2020) has shown that data contamination may have little impact on the performance of pretrained models in natural language processing tasks, future work can investigate this problem for pretrained models for software engineering.

Related Work

In this section, we consider the most closely related work on learning edits, large pretrained models for code, pretrained models for code edits, and combining complementary models.

Learning Edits. Prior work has studied learning edits in both natural language and programming languages. We follow the approach of explicitly representing edits as sequences with edit actions. Our edit representation is inspired by Panthaplackel et al. (2020, 2021b), who studied learning comment edits based on code edits. Brody et al. (2020); Tarlow et al. (2020); Chen et al. (2021a); Yao et al. (2021) represented code as ASTs (abstract syntax trees) and code edits as edit actions over the AST nodes rather than tokens. We do not focus on editing structured data (ASTs) because this approach cannot be generalized to natural language and cannot be easily combined with large pretrained models, which are primarily based on sequences of tokens.

Alternatively, edits can be encoded into vector representations (or embeddings). Guu et al. (2018) studied learning edit embeddings for natural language generation in a prototype-then-edit style. Yin et al. (2018) studied learning code edits as embeddings and then applying them to natural language insertion and code bug fixing. Hashimoto et al. (2018) developed a retrieve-and-edit framework for text-to-code generation, where the edits are learned as parameters of a seq2seq model. Similarly, Li et al. (2021) proposed a retrieve-and-edit framework for the code summarization task, where the model first learns an edit vector and then generates the revised summary conditioned on it. Although learning edits as embeddings can be effective for individual tasks, it is not suitable for the pretraining and fine-tuning paradigm, because there is a large domain gap between the edit embeddings learned on different tasks. Moreover, edit embeddings are less explainable than the explicit edit representations we use.

Another line of work that carries out the idea of learning edits is the copy mechanism, including copying individual tokens (Vinyals et al., 2015; Gu et al., 2016) and spans (Zhou et al., 2018; Panthaplackel et al., 2021a), which helps the model to “keep” unchanged tokens and focus on generating the edited part. Iv et al. (2022) built a T5-based model to update existing articles based on given new evidence. The model is trained to output a copy token instead of the copied sentence, and a special reference token before the updated text that identifies the evidence supporting the update. Ding et al. (2020) trained the model to emit, at the same time, pointers that indicate the edit positions and the new tokens to be inserted. Similarly, Tarlow et al. (2020); Chen et al. (2021a) augmented the transformer-based decoder with pointers to the input graph representation of the code, which specify the input locations to edit. Although related, this line of work is orthogonal to ours, which learns edits through pretraining.

Large Pretrained Models for Code. Motivated by the success of large pretrained models for many NLP tasks, domain-specific models that are pretrained on source code and technical text have emerged, including CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2020), CodeGPT-2 (Lu et al., 2021), CodeT5 (Wang et al., 2021a), PLBART (Ahmad et al., 2021), PyMT5 (Clement et al., 2020), SynCoBERT (Wang et al., 2021b), SPT-Code (Niu et al., 2022), Codex (Chen et al., 2021b) and UniXcoder (Guo et al., 2022). Similar to our approach, GraphCodeBERT, CodeT5, SynCoBERT, SPT-Code and UniXcoder also designed specialized pretraining objectives driven by their targeted tasks. As we showed in this work, the combination of an edit-based language model and a standard language model can achieve better performance than using the standard language model alone.

Pretrained Models for Code Edits. Prior work has already explored applying pretrained models to editing tasks, even though these models are not well-suited to them. Chakraborty and Ray (2021) used PLBART for code bug fixing, which we compared to in our work. Similarly, Drain et al. (2021) further pretrained a BART model on 67K Java repositories mined from GitHub and fine-tuned it specifically on the bug fixing dataset (Tufano et al., 2019b). Wang et al. (2021a); Mastropaolo et al. (2021) both pretrained a T5 model on CodeSearchNet and used it for bug fixing, which we included as a baseline (CodeT5). Codex (Chen et al., 2021b) showed promising performance on editing tasks by specifying the existing code as a prompt and providing an edit instruction to the model. Tufano et al. (2022) and Li et al. (2022) both proposed a transformer-based encoder-decoder model pretrained on large code-review-specific data for code review related tasks, including code change quality estimation, review comment generation, and code refinement. While these models demonstrate impressive performance on various tasks, none of them is fundamentally well-suited for editing tasks. In this work, we develop CoditT5 with a novel pretraining objective for generating edit sequences, which can complement a generation model such as CodeT5 for editing tasks.

Combining Complementary Models. We used reranking (Neubig et al., 2015; Kriz et al., 2019) to combine complementary models in this work. Ensembling (LeClair et al., 2021) is another approach for combining complementary models for generation tasks, but requires additional training. Co-training (Blum and Mitchell, 1998) and tri-training (Zhou and Li, 2005) approaches, although shown to be very effective in combining complementary models, are designed for classification models rather than generation models.

Conclusion

In this paper, we present a novel edit-driven pretraining objective and use it to develop CoditT5, a pretrained language model for software-related editing tasks. CoditT5 is pretrained on large amounts of source code and natural language comments to perform edits, and we evaluate this model by fine-tuning it on three distinct downstream tasks: comment updating, bug fixing and automated code review. By outperforming task-specific baselines and pure generation baselines across tasks, we demonstrate the suitability of CoditT5 (and our pretraining objective) for editing tasks and its generalizability. We additionally find that a pure generation-based model and CoditT5 can complement one another through simple reranking strategies, which outperform each of the models individually and also achieve new state-of-the-art performance for the three downstream editing tasks that we consider.