On Multi-Modal Learning of Editing Source Code

Saikat Chakraborty, Baishakhi Ray

I Introduction

Programmers often develop software incrementally, adding gradual changes to the source code. In a continuous software development environment, programmers modify their source code for various reasons, including adding functionality, fixing bugs, and refactoring. It turns out that many of these changes follow repetitive edit patterns, which has led to a surge of research effort to automatically generate code changes learned from past examples.

In particular, Neural Machine Translation (NMT) models have been successful in learning automatic code changes. At the core, these models contain an encoder and a decoder: the encoder encodes the code that needs to be edited, and the decoder sequentially generates the edited code. Such NMT models are trained with a large corpus of previous edits to learn generic code change patterns. At inference time, given a code fragment that needs to be edited, a trained NMT model should automatically generate the corresponding edited code.

However, learning such generic code changes is challenging. A programmer may change an identical piece of code in different ways in two different contexts, and both can be correct patches (see Figure 1). For example, an identical code fragment pictures = new ArrayList $<>$ () was changed in two different ways, to pictures = new HashSet $<>$ (); and to pictures = new LinkedList $<>$ (), in two different code contexts. Without knowing the developers’ intention and the edit context, automated code editing tools have no way to predict the intended patch. For instance, in the above example, LinkedList was used to fix a sublist-related problem. Once such an intention is known, it is easy to choose the LinkedList-related patch from the alternatives. Thus, such an additional modality of information can improve the performance of automated code-editing tools.

In fact, given just a piece of code without any additional information, it is unlikely that even a human developer can work out how to change it. Consider another real-life example shown in Figure 2. If a programmer only considers the edited expression in line 9, it is difficult to decide how to modify it. However, with additional information modalities, i.e., the guidance (lines 1 and 2) and the context (the whole method before the patch), the correct patch often becomes evident to the programmer, since the guidance effectively summarises how to change the code and the context provides the necessary ingredients for generating a concrete patch. We hypothesize that such multi-modal information could be beneficial to an automated code-editing tool. To that end, we design Modit, a multi-modal code editing engine that is based on three information modalities: (i) the code fragment that needs to be edited, (ii) developers’ intention written in natural language, and (iii) explicitly given edit context.

In particular, Modit is based on a transformer-based NMT model. As input, Modit takes the code that needs to be edited (e.g., the lines that need to be patched), additional guidance describing developers’ intent, and the context of the edits that is explicitly identified by the developer (e.g., the surrounding method body or surrounding lines of code). Note that previous works also provided the context and the edit location while generating edits; however, these were fed together to the model as a unified code element. Thus, the model had the burden of identifying the edit location and then generating the patch. In contrast, isolating the context from the edit location and feeding them to the model as different modalities provides Modit with additional information about the edits.

Curating developers’ intent for a large number of edits that can train the model is non-trivial. As a proof of concept, we leverage the commit messages associated with the edits to simulate developers’ intent automatically. We acknowledge that commit messages could be noisy and may not always reflect the change summary. Nonetheless, our extensive empirical results show that, even with such noisy guidance, Modit performs better in generating correctly edited code.

Being a model that encodes and generates source code, Modit needs to both clearly understand and correctly generate programming languages (PL). While several previous approaches designed sophisticated tree/grammar-based models to embed the knowledge of PL into the model, the most recent transformer-based approaches showed considerable promise when pre-trained with a large volume of source code. Since these models are pre-trained with billions of tokens of source code written by actual developers, and transformers are known to learn distant dependencies between the nodes, these models can learn about code structures during the pre-training step. Among such pre-trained models, PLBART learns jointly to understand and generate source code and has shown much promise in generative tasks. Thus, we chose PLBART as the starting point to train Modit, i.e., we initialize Modit’s model with learned parameters from PLBART.

We evaluate Modit on two different datasets ($B2F_{s}$ and $B2F_{m}$) proposed by Tufano et al., consisting of an extensive collection of bug-fix commits from GitHub. Our empirical investigation shows that a summary of the change written in natural language as additional guidance from the developer improves Modit’s performance by narrowing down the search space for change patterns. The code-edit context, presented as a separate information modality, helps Modit to generate edited code correctly by providing necessary code ingredients (e.g., variable names, method names, etc.). Modit generates $\sim$ 3.5 times more correct patches than Codit, showing that Modit is robust enough to learn PL syntax implicitly. Furthermore, Modit generates two times as many correct patches as a large transformer model.

Additionally, our empirical investigation reveals that when we use one encoder to encode all information modalities rather than learning from individual modalities separately, the model learns representations based on inter-modality reasoning. In contrast, a dedicated encoder for each individual modality only learns intra-modality reasoning. Our experiment shows that a multi-modal/single-encoder model outperforms a multi-modal/multi-encoder model by up to 46.5%.

We summarize our main contributions in this paper as follows.

We propose Modit, a novel multi-modal NMT-based tool for automatic code editing. Our extensive empirical evaluation shows that automatic code editing can be vastly improved with additional information modalities like code context and developer guidance.

We empirically investigate different design choices for Modit. We provide a summary of the lessons that we learned in our experiments. We believe such lessons are valuable for guiding future research.

We prototype and build Modit and open-source all our code and data at https://git.io/JOudU.

II Background

Neural Machine Translation (NMT) is a well-studied field that has been very successful in translating a sentence from one language to another. At a very high level, the input to an NMT model is a sentence ( $X=x_{1},x_{2},...,x_{n}$ ), which is usually a sequence of tokens ( $x_{i}$ ), and the output is also a sentence ( $Y=y_{1},y_{2},...,y_{m}$ ), a sequence of tokens ( $y_{i}$ ). While learning to translate from $X$ to $Y$ , NMT models learn the conditional probability distribution $P(Y|X)$ . Such probability distributions are learned w.r.t. model parameters $\theta$ , where the training process optimizes $\theta$ to maximize the expected probability of the target sentences in a dataset. An NMT model usually contains an encoder and a decoder. The encoder processes the input sentence and generates vector representations of it. The decoder starts after the encoder and sequentially generates the target sentence by reasoning about the encoder-generated input representation. While sequentially generating the target sentence, the decoder usually performs a heuristic search (for instance, beam search) to balance exploration and exploitation.
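The conditional distribution and the training objective described above are commonly written as follows. This is the standard autoregressive formulation from the NMT literature rather than a formula stated in this paper; $\mathcal{D}$ (the training set of input/output pairs) and $y_{<t}$ (the tokens generated before step $t$) are notations introduced here for illustration.

$$P(Y|X;\theta)=\prod_{t=1}^{m}P(y_{t}\mid y_{<t},X;\theta),\qquad \theta^{*}=\arg\max_{\theta}\sum_{(X,Y)\in\mathcal{D}}\log P(Y|X;\theta)$$

Under this factorization the decoder scores one token at a time, which is why a search procedure such as beam search is needed to approximate the most probable full sequence.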

In the past few years, Software Engineering (SE) has seen a wide spectrum of adaptations of NMT. Some prominent applications of NMT in SE include Program Synthesis, Code Summarization, Edit Summarization, Code Edit Generation, and Automatic Program Repair. These research efforts capitalize on NMT’s capability to understand and generate complex patterns and establish NMT as a viable tool for SE-related tasks.

II-B Transformer Model for Sequence Processing

The Transformer model revolutionized sequence processing with the attention mechanism. Unlike the traditional RNN-based model, where input tokens are processed sequentially, the transformer assumes a soft dependency between each pair of tokens in a sequence. Such dependency weights are learned in the form of attention weights based on the task of the transformer. While learning the representation of a token, the transformer learns to attend to all the input tokens. From a conceptual point of view, the transformer converts a sequence to a complete graph (https://en.wikipedia.org/wiki/Complete_graph), where each node is a token. The weights of the edges are attention weights between tokens, which are learned based on the task of the transformer. The transformer encodes each token’s position in the sequence (positional encoding) as part of the input. In this way, the transformer learns long-range dependencies. Since its inception, the transformer has been very successful in different NLP understanding and generation tasks. The transformer’s ability to reason about long-range dependencies has proved useful for several source code processing tasks, including code completion, code generation, and code summarization.

II-C Transfer Learning for Source Code

In the past few years, transfer learning has shown promise for a wide variety of SE tasks. Such transfer learning aims at learning a task-agnostic representation of source code and reusing that knowledge for different tasks. One way to learn such a task-agnostic representation is to pre-train a model with a large collection of source code. The learning objective of such pre-training is often understanding the code or generating the correct code. A pre-trained model is expected to embed the knowledge about source code in its parameters. Such pre-trained models are later fine-tuned for task-specific objectives. CuBERT, CodeBERT, and GraphCodeBERT are all transformer-based encoder models which are pre-trained to understand code. Such models are primarily trained using objectives such as Masked Language Modeling, replaced token prediction, and semantic link prediction.

For code generation, CodeGPT pre-trains a transformer-based model to generate general-purpose code sequentially. More recently, PLBART pre-trained a transformer-based model jointly for understanding and generating code with denoising auto-encoding. PLBART consists of an encoder and a decoder. The encoder is presented with code into which slight noise has been induced (for instance, token replacement), and the decoder is expected to generate the noise-free code. Since the code editing task requires both understanding code and generating code, we chose PLBART as the base model for Modit.
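The denoising setup can be pictured with a toy noise function. The sketch below is only an illustration of the idea, not PLBART’s actual noising procedure: the `<mask>` symbol, the 30% noise ratio, and the name `add_noise` are all assumptions made for this example.

```python
import random

MASK = "<mask>"  # hypothetical noise symbol, chosen for illustration only


def add_noise(tokens, ratio=0.3, seed=0):
    """Replace a random subset of tokens with MASK.

    The returned list plays the role of the noisy encoder input; the
    original `tokens` list is what the decoder would be trained to produce.
    """
    if not tokens:
        return []
    rng = random.Random(seed)
    n_noisy = max(1, int(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_noisy))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


clean = "int x = list . size ( ) ;".split()
noisy = add_noise(clean)
# One denoising training pair: (noisy input, clean target).
print(" ".join(noisy), "->", " ".join(clean))
```

Because the target is the untouched sequence, the same pair exercises both halves of the model: the encoder must understand corrupted code and the decoder must emit well-formed code.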

III Modit

Figure 3 shows an overview of Modit’s working procedure. Modit is a multi-layer encoder-decoder model consisting of a Transformer-based encoder and a Transformer-based decoder. Both the encoder and the decoder consist of 6 layers. Modit works on three different modalities of information: (i) the code that needs to be edited ( $e_{p}$ ), (ii) natural language guidance from the developer ( $\mathcal{G}$ ), and (iii) the context code where the patch is applied ( $C$ ). We acknowledge that $e_{p}$ is essentially a substring of $C$ . However, by explicitly extracting and presenting $e_{p}$ to Modit, we provide Modit with additional information about the change location. Thus, despite being a part of the context, we consider $e_{p}$ a separate modality. Modit’s pipeline consists of three steps. First, the pre-processing step processes and tokenizes these input modalities (§III-A). Then the encoder in Modit encodes the processed input, and the decoder sequentially generates the patched code as a sequence of tokens (§III-B). In the final step, Modit post-processes the decoder-generated output and prepares the edited code (§III-C).

Input Consolidation. In the pre-processing step, Modit generates a consolidated multi-modal input ( $X$ ) from the three input modalities (i.e., $e_{p}$ , $\mathcal{G}$ , and $C$ ). Modit combines these input modalities as a sequence separated by a special <s> token, i.e., $X=$ $e_{p}$ <s> $\mathcal{G}$ <s> $C$ . In the example shown in Figure 2, $e_{p}$ is newJson.charAt(1)) != wrappingQuote , $\mathcal{G}$ is fix problem which occurred when the resulting json is empty, and $C$ is the whole function before the edit (see Input Modalities in Figure 3). Modit generates a consolidated multi-modal input sequence as newJson.charAt(1)) ... <s> fix problem which occurred ... <s> private String ... }.
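The consolidation step amounts to joining the three modalities in a fixed order with a separator between them. A minimal sketch follows; the function name `consolidate`, the separator string, and the example inputs are hypothetical and are not taken from the released implementation.

```python
SEP = "<s>"  # assumed separator string; substitute the model's actual special token


def consolidate(code_to_edit, guidance, context, sep=SEP):
    """Join the three modalities into one sequence: e_p SEP G SEP C."""
    parts = [code_to_edit.strip(), guidance.strip(), context.strip()]
    return f" {sep} ".join(parts)


# Hypothetical inputs, made up for this illustration.
x = consolidate(
    "a.size() > 0",
    "use isEmpty instead of size check",
    "boolean hasItems(List<String> a) { return a.size() > 0; }",
)
print(x)
```

Keeping the order fixed lets the model learn which segment plays which role from position alone, without separate segment labels.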

Tokenization. Modit uses the sentence-piece tokenizer. The sentence-piece tokenizer divides every token into a sequence of subtokens. Such subword tokenization is similar to the byte-pair encoding previously used in the automatic code editing literature. We use PLBART’s sentence-piece tokenizer, which is trained on billions of tokens of code from GitHub. After tokenizing the consolidated input $X$ from Figure 2, we get _new Json . char At ( 1 ) ... <s> _fix _problem _which _oc cur red ... <s> _private _String ... _}.

III-B Encoder-Decoder Model

The input to Modit’s encoder-decoder model is a sequence of subtokens generated in the previous step.

Transformer Encoder. Given an input sequence $X=x_{1},x_{2},...,x_{n}$ , the encoder learns the representation of every subtoken at layer $l$ as $R^{e}_{l}(x_{i})$ using self-attention computed as

$$R^{e}_{l}(x_{i})=\sum_{j=1}^{n}a_{i,j}\,R^{e}_{l-1}(x_{j})\qquad(1)$$

where $R^{e}_{l-1}(x_{j})$ is the representation of subtoken $x_{j}$ as generated by layer $l-1$ , and $a_{i,j}$ is the attention weight of subtoken $x_{i}$ to $x_{j}$ . Such attention weights are learned by multi-head attention. The representation generated by the final layer (i.e., $R^{e}_{6}(x_{i})$ ) is the final representation for every subtoken $x_{i}$ in the input. Note that the encoder learns the representation of a subtoken (Equation 1) using all subtokens in the sequence. Thus, the learned representation of every subtoken contains information about the whole input sequence. Since we encode all the information modalities in one sequence, the learned representation of every subtoken encodes information about the other modalities.
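The weighted sum described above, in which each subtoken’s new representation is an attention-weighted combination of all subtokens’ previous-layer representations, can be sketched in a few lines. This is a deliberately simplified, single-head version: the attention weights come from raw dot products, whereas a real transformer layer uses learned projections, several heads, residual connections, and feed-forward sublayers. The toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention_layer(prev):
    """One simplified layer: R_l(x_i) = sum_j a_ij * R_{l-1}(x_j).

    `prev` holds one vector per subtoken. Each output row is a convex
    combination of all input rows, so it carries information from the
    whole sequence.
    """
    out = []
    for r_i in prev:
        weights = softmax([dot(r_i, r_j) for r_j in prev])  # a_i1 .. a_in
        dim = len(r_i)
        out.append(
            [sum(w * r_j[k] for w, r_j in zip(weights, prev)) for k in range(dim)]
        )
    return out


# Three toy subtokens standing in for a code, a guidance, and a context subtoken.
layer0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer1 = self_attention_layer(layer0)
print(layer1)
```

The first input vector has a zero second component, yet its output does not: that nonzero value comes entirely from the other two subtokens, which is the inter-modality mixing the single-encoder design relies on.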

Transformer Decoder. The decoder in Modit is a transformer-based sequential left-to-right decoder consisting of 6 layers. It sequentially generates one subtoken at a time using previously generated subtokens and the final representation ( $R^{e}_{6}(x_{i})$ ) from the encoder. The decoder contains two modules: (i) self-attention, and (ii) cross-attention. The self-attention module works similarly to the self-attention in the encoder. First, the decoder generates the representation $R^{d}_{l}(y_{i})$ of the last generated token $y_{i}$ with self-attention over all previously generated tokens $(y_{1},y_{2},...,y_{i})$ . This self-attention follows the same mechanism described in Equation 1. After learning the decoder representation by self-attention, the decoder applies attention over the encoder-generated input representation using the following equation,

$$D_{l}(y_{i})=\sum_{j=1}^{n}\alpha^{l}_{i,j}\,R^{e}_{6}(x_{j})\qquad(2)$$

where $\alpha^{l}_{i,j}=softmax\left(dot\left(R^{e}_{6}\left(x_{j}\right),R^{d}_{l}\left(y_{i}\right)\right)\right)$ is the attention weight between output subtoken $y_{i}$ and input subtoken $x_{j}$ . The softmax generates an attention probability distribution over the input tokens. Finally, the decoder-learned representation $D_{l}(y_{i})$ is projected to the vocabulary to predict the most likely subtoken from the vocabulary as the next token.
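The cross-attention step and the final projection to the vocabulary can be illustrated with a small pure-Python sketch. It follows the description above (softmax over dot products between the decoder state and each encoder representation, then a weighted sum), but the vectors, the two-word vocabulary, and its embeddings are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(decoder_state, encoder_states):
    """alpha_j = softmax_j(dot(enc_j, dec)); attended = sum_j alpha_j * enc_j."""
    alphas = softmax([dot(enc, decoder_state) for enc in encoder_states])
    dim = len(decoder_state)
    attended = [
        sum(a * enc[k] for a, enc in zip(alphas, encoder_states)) for k in range(dim)
    ]
    return alphas, attended


def predict_next(attended, output_embeddings):
    """Project onto a toy vocabulary and return the most likely subtoken."""
    vocab = list(output_embeddings)
    probs = softmax([dot(attended, output_embeddings[tok]) for tok in vocab])
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs[best]


encoder_states = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]    # stand-ins for encoder outputs
decoder_state = [3.0, 0.0]                                # stand-in for the decoder state
toy_vocab = {"isEmpty": [1.0, 0.0], "size": [0.0, 1.0]}   # hypothetical embeddings

alphas, attended = cross_attention(decoder_state, encoder_states)
token, prob = predict_next(attended, toy_vocab)
print(alphas, token)
```

Here the decoder state aligns most strongly with the first encoder vector, so most of the attention mass lands there and the projection favors the vocabulary entry pointing in the same direction.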

In summary, the encoder learns the representation of every subtoken in the input using all input subtokens, essentially encoding the whole input information in every input subtoken representation. The decoder’s self-attention mechanism allows the decoder to attend to all previously generated subtokens, helping it generate the correct token at the correct place. The cross-attention allows the decoder to attend to the encoded representation, implicitly letting the model decide when to copy from the input and when to choose new tokens from the vocabulary. We initialize the end-to-end encoder-decoder in Modit using the pre-trained weights of PLBART.

III-C Output Generation

The decoder in Modit continues predicting subtokens until it predicts the end-of-sequence token. During inference, Modit uses beam search to generate the sequence of subtokens. Once the decoder finishes, Modit post-processes the top-ranked sequence from the beam search. First, Modit removes the end-of-sequence token. It then detokenizes the subtoken sequence into a code token sequence. In this step, Modit merges generated subtokens that are fragments of a code token into one code token. For the example shown in Figure 2, Modit generates the subtoken sequence _! _json . is Empty () _&& _( _new Json . char At ( 1 ) _!= _wrap ping Quote _) . After detokenization, Modit generates ! json.isEmpty() && ( newJson.charAt(1) != wrappingQuote ) .
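The post-processing described above can be sketched as a small function. This is a minimal illustration rather than the tool’s actual code: the name `detokenize` and the `</s>` end-of-sequence string are assumptions, and the underscore is used as the word-start marker only because that is how the example is printed here (SentencePiece itself uses the character U+2581, which avoids clashes with underscores inside identifiers).

```python
def detokenize(subtokens, marker="_", eos="</s>"):
    """Merge subtokens back into code tokens.

    A subtoken that starts with `marker` begins a new code token; any other
    subtoken is glued onto the previous one. The assumed `eos` symbol is
    dropped if present.
    """
    pieces = [tok for tok in subtokens if tok != eos]
    return "".join(pieces).replace(marker, " ").strip()


generated = (
    "_! _json . is Empty () _&& _( _new Json . char At ( 1 ) "
    "_!= _wrap ping Quote _)"
).split() + ["</s>"]

print(detokenize(generated))
# ! json.isEmpty() && ( newJson.charAt(1) != wrappingQuote )
```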

IV Experimental Design

To prove our concept of Modit, we experiment on two different datasets (i.e., $B2F_{s}$ and $B2F_{m}$ ) proposed by Tufano et al. In these two datasets, they collected large collections of bug-fix code changes along with commit messages from Java projects on GitHub. Each example in these datasets contains the Java method before the change ( $C_{p}$ ), the method after the change ( $C_{n}$ ), and the commit message for the change. There are some examples ( $<100$ ) with corrupted bytes in the commit message, which we could not process. We excluded such examples from the dataset. Table I shows statistics of the two datasets we used in this paper. $B2F_{s}$ contains smaller methods with a maximum length of 50 tokens, and $B2F_{m}$ contains bigger methods with up to 100 tokens in length. The average size of the change (edit distance) is 7.39 and 8.83 in $B2F_{s}$ and $B2F_{m}$ , respectively.

IV-B Data Preparation

For the datasets described in Section IV-A, we extract the input modalities and the expected output to train Modit. For every method pair (i.e., before edit $C_{p}$ , after edit $C_{n}$ ) in those datasets, we use GumTree to extract a sequence of tree edit locations. We identify the root of the smallest subtree of $C_{p}$ ’s AST that encompasses all the edit operations. We call the code fragment corresponding to that subtree the code to be edited ( $e_{p}$ ) and use it as Modit’s first modality. Similarly, we extract the code corresponding to the smallest subtree encompassing all the edit operations from $C_{n}$ and use that as the code after edit ( $e_{n}$ ). We use the commit message associated with the function pair as Modit’s second modality, guidance ( $\mathcal{G}$ ). Finally, we use the full method before edit ( $C_{p}$ ) as Modit’s third modality, context ( $C$ ).
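Finding the root of the smallest subtree that encompasses all edit operations is a lowest-common-ancestor computation. The sketch below assumes each edited AST node is identified by its path from the root, written as a tuple of child indices; that representation and the function name are assumptions for illustration, and the code does not call GumTree or mirror its API.

```python
def smallest_enclosing_subtree(edit_paths):
    """Return the path of the deepest node whose subtree contains every edit.

    With nodes addressed by root-to-node paths, that node is the longest
    common prefix of all the edit paths.
    """
    if not edit_paths:
        raise ValueError("at least one edit location is required")
    prefix = list(edit_paths[0])
    for path in edit_paths[1:]:
        common = 0
        for a, b in zip(prefix, path):
            if a != b:
                break
            common += 1
        prefix = prefix[:common]
    return tuple(prefix)


# Three hypothetical edit locations that all sit under node (0, 2).
edits = [(0, 2, 1), (0, 2, 3, 0), (0, 2, 1, 4)]
print(smallest_enclosing_subtree(edits))  # (0, 2)
```

The source text of the node at the returned path would then serve as the code to be edited, and the same procedure on the post-edit tree would yield the expected output.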

IV-C Training

After converting every example in the datasets into Modit’s input ( $e_{p}$ , $\mathcal{G}$ , $C$ ) and expected output ( $e_{n}$ ), we use this combined dataset to train Modit. For training Modit, we use Label Smoothed Cross Entropy as the loss function. We use the Adam optimizer with a learning rate of $5\times 10^{-5}$ . We train Modit for 30 epochs; after every epoch, we run beam search inference on the validation dataset. We stop training if the validation performance does not improve for five consecutive validations.
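The two training ingredients named above, a label-smoothed loss and patience-based early stopping, can be sketched as follows. The loss uses the common formulation in which the one-hot target is mixed with a uniform distribution; the smoothing value 0.1 is an assumption, since the text does not state it, and the function names and validation scores are made up for the example.

```python
import math


def label_smoothed_nll(log_probs, target, epsilon=0.1):
    """Cross entropy against (1 - epsilon) * one_hot + epsilon * uniform."""
    vocab_size = len(log_probs)
    nll = -log_probs[target]
    uniform_term = -sum(log_probs) / vocab_size
    return (1.0 - epsilon) * nll + epsilon * uniform_term


def epochs_until_stop(validation_scores, max_epochs=30, patience=5):
    """Return (last epoch run, best score) under early stopping.

    Training halts once `patience` consecutive validations fail to beat
    the best score seen so far, or after `max_epochs`.
    """
    best, stale, last_epoch = float("-inf"), 0, 0
    for epoch, score in enumerate(validation_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return last_epoch, best


log_probs = [math.log(0.5), math.log(0.25), math.log(0.25)]
print(label_smoothed_nll(log_probs, target=0))

scores = [10.0, 12.0, 11.0, 11.0, 11.0, 11.0, 11.0, 13.0]
print(epochs_until_stop(scores))  # (7, 12.0)
```

In the second example the later score of 13.0 is never reached, which is the usual trade-off of early stopping: it saves compute at the risk of missing a late improvement.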

IV-D Evaluation Metric

We use top-1 accuracy as the evaluation metric throughout the paper. As a proof of concept, we evaluate all techniques with beam size 5. When the generated patched code matches the expected patched code $e_{n}$ exactly, we count it as correct; otherwise, we count it as incorrect. Note that this is the most stringent metric for evaluation. Previous approaches discussed filtering out infeasible patches from a ranked list of top-k patches using test cases. However, we conjecture that such test cases may not always be available for general-purpose code edits. Thus, we only compare top-1 accuracy.
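The metric can be stated compactly in code. In this sketch only the highest-ranked beam candidate is compared against the reference; comparing whitespace-split token sequences instead of raw strings is an assumption made here so that spacing differences alone are not counted as mismatches. The example candidates are hypothetical.

```python
def top1_accuracy(ranked_predictions, references):
    """Percentage of examples whose best-ranked candidate equals the reference.

    `ranked_predictions[i]` is the beam output for example i, best first.
    """
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not references:
        return 0.0
    correct = sum(
        1
        for beam, ref in zip(ranked_predictions, references)
        if beam and beam[0].split() == ref.split()
    )
    return 100.0 * correct / len(references)


beams = [
    ["a.isEmpty()", "a.size() == 0"],  # correct at rank 1
    ["x + 1", "x - 1"],                # correct only at rank 2, so it does not count
]
refs = ["a.isEmpty()", "x - 1"]
print(top1_accuracy(beams, refs))  # 50.0
```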

IV-E Research Questions

Modit contains several design components: (i) use of multi-modal information, (ii) use of a transformer initialized with a pre-trained model, and (iii) use of an end-to-end encoder-decoder (using PLBART) to generate patches instead of separately using a pre-trained encoder or a pre-trained decoder, as used by previous tools. First, we are interested in evaluating Modit w.r.t. state-of-the-art methods. In particular, we evaluate how these three design choices affect Modit’s performance. So, we start by investigating, RQ1. How accurately does Modit generate edited code w.r.t. other techniques?

Modit uses three input modalities. Our next evaluation target is how these individual modalities affect Modit’s performance. Thus we ask, RQ2. What is the contribution of the different input modalities to Modit’s performance?

Finally, recall from Section III-A that Modit encodes all the input modalities as one sequence and uses one encoder for the consolidated multi-modal input. An alternative to this encoding mechanism is to encode each input modality with a dedicated encoder. Our next evaluation aims at finding the best strategy to encode input modalities. Hence, we investigate, RQ3. What is the best strategy to encode multiple input modalities?

V Empirical Results

In our first research question, we evaluate Modit’s performance w.r.t. other techniques and the effect of Modit’s design components.

RQ1. How accurately does Modit generate edited code w.r.t. other techniques?

Experimental Setup. We carefully chose the baselines to understand the contribution of the different design choices of Modit. We evaluated our model in two experimental settings. First, we train different baseline models where the full model is trained from scratch. In this setting, the first baseline we consider is an LSTM-with-attention NMT model. Various existing code patching approaches used such settings. The second baseline is a Transformer-based $\mathcal{S}2\mathcal{S}$ model. We consider two different-sized transformers. This enables us to contrast the effect of model size on code-editing performance. The Transformer-base model consists of six encoder layers and six decoder layers. The Transformer-base model’s architecture is the same as Modit’s architecture. Furthermore, we consider another transformer with a much larger architecture. Transformer-large contains twelve encoder layers and twelve decoder layers, with three times as many learnable parameters as the Transformer-base model. The final baseline in this group is Codit, which is a tree-based model. Comparison w.r.t. Codit allows us to contrast externally given syntax information (in the form of a CFG) with syntax learned by transformers (i.e., Modit). We use all three input modalities (see Figure 3 for an example) as input to the LSTM and the Transformers. Using auxiliary modalities is non-trivial with Codit since the input to Codit must be a syntax tree. Thus, we use uni-modal input ( $e_{p}$ ) with Codit.

In the second setting, we consider different pre-trained models, which we fine-tune for patch generation. Figure 4 shows schematic diagrams of the pre-trained models we compared in this evaluation. The first two models we considered are CodeBERT and GraphCodeBERT. Both of these models are pre-trained encoders primarily trained to understand code. To use these for the patching task, we add a six-layered transformer-based decoder along with the encoder. The decoder is trained from scratch (see Figure 4(a)). Another pre-trained baseline is CodeGPT. GPT is a single left-to-right decoder model primarily pre-trained to generate code. For the code editing task, the input and the output are combined into one sequence, separated by a special token. Jiang et al. showed the effectiveness of GPT for the source code patching task (see Figure 4(b)). In contrast to these pre-trained models, Modit uses PLBART, an end-to-end encoder-decoder model trained to understand and generate code simultaneously (see Figure 4(c)). For a fair comparison, we evaluate these pre-trained models with uni-modal input ( $e_{p}$ ) and multi-modal input ( $e_{p}$ <s> $\mathcal{G}$ <s> $C$ ) separately.

Results. Table II shows the accuracy of the top-1 predicted patch by Modit along with the different baselines. The LSTM-based $\mathcal{S}2\mathcal{S}$ model predicted 6.14% and 1.04% correct patches in $B2F_{s}$ and $B2F_{m}$ , respectively. The Transformer-base model achieves 11.18% and 6.61% top-1 accuracy in those datasets, which improves further to 13.40% and 8.63% with the Transformer-large model. Codit predicts 6.53% and 4.79% correct patches in $B2F_{s}$ and $B2F_{m}$ , respectively. Note that Codit takes external information in the form of a CFG; thus, the patches Codit generates are syntactically correct. Nevertheless, the transformers, even the smaller model, are better at predicting the correct patch. We conjecture that the transformer model can implicitly learn the code syntax without direct supervision.

In contrast to the models trained from scratch, when we fine-tune a pretrained model, it generates significantly more correct patches than models trained from scratch. For instance, Modit (initialized with pretrained PLBART) generates 168% and 248% more correct patches than the Transformer-base model (with randomly initialized parameters), despite both of these models having the same architecture and the same number of parameters. In fact, the smallest fine-tuned model (CodeGPT) performs much better than the larger model trained from scratch (Transformer-large).
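The percentages in this comparison are relative gains over the baseline’s accuracy, not differences in percentage points. The short check below recomputes them from the top-1 accuracies quoted in the text; the function name is introduced here for illustration.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Top-1 accuracies (percent) as quoted in the text.
modit = {"B2F_s": 29.99, "B2F_m": 23.02}
transformer_base = {"B2F_s": 11.18, "B2F_m": 6.61}

gains = {name: relative_gain(modit[name], transformer_base[name]) for name in modit}
for name, gain in gains.items():
    print(name, round(gain))
# B2F_s 168
# B2F_m 248
```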

All the fine-tuned models exhibit better performance when the input data are multi-modal, with various degrees of improvement. With all three input modalities, CodeBERT generates 7% and 2.2% more correct patches in $B2F_{s}$ and $B2F_{m}$ , respectively, compared to a uni-modal CodeBERT model. In the case of Modit, such improvement is 11.07% in $B2F_{s}$ and 16.23% in $B2F_{m}$ . The $\mathcal{G}$ in the multi-modal data often contains explicit hints about how to change the code. For instance, in the example shown in Figure 2, the guidance explicitly says there is a problem with the json when it is empty. Furthermore, with $C$ in the input, the model can identify the different variables and methods used in the method and potentially copy something from the context. We conjecture that the information from these two additional input modalities (i) reduces the search space for change patterns, and (ii) helps models copy relevant identifiers from the context.

Among the fine-tuned multi-modal models, Modit generates 15.12% more correct patches than CodeBERT, 16.82% more than GraphCodeBERT, and 5.49% more than CodeGPT in $B2F_{s}$ . In the case of the $B2F_{m}$ dataset, Modit generates 34.38%, 25.72%, and 30.50% more correct patches than CodeBERT, GraphCodeBERT, and CodeGPT, respectively. To understand these results better, let us look at some of the examples.

Figure 5 shows an example patch where Modit correctly generated the expected patch but CodeGPT could not. If we look closely, we can see that the code to be changed ( $e_{p}$ ) is a boolean expression where the two clauses are combined with &&. While only the first clause, one.isSimilar(two), is the expected output, CodeGPT chooses the second clause, one.toString().equals(two.toString()), from the original. Recall from Figure 4(b) that CodeGPT processes the combined input and output sequence (separated by a special token) in a left-to-right fashion. Thus, the encoded representations of the input tokens do not contain information about the whole input sequence. In contrast, Modit uses a pre-trained bi-directional encoder, which helps Modit to understand the input fully. Based on the examples we have seen and the empirical results, we conjecture that, for code-editing tasks, the model must fully understand the input in a bi-directional fashion.

Figure 6 shows an example where Modit generated the correct patch but CodeBERT could not. Note that the guidance text explicitly asks for code refactoring, implying that the patched code should be semantically similar to the original code. Like the original code, the patched code should return true when first == null and false otherwise. An automated code change tool should not add additional code features when refactoring. However, CodeBERT generated a patch that introduced an additional clause, first.get() == null, in the return expression, which makes CodeBERT’s generated code semantically different from the original. Modit was able to generate the correct patch for this example.

Finally, we summarize the empirical lessons we learned in this research question as follows.

Multi-modal input improves Code-Editing capability, irrespective of the underlying model used. The guidance often narrows the edit pattern search space, and the context narrows down the token generation search space.

Transformer models (especially larger ones) are robust enough to learn the code’s syntax information without direct supervision. When a pre-trained model is used to initialize transformer parameters, the improvement is notably higher.

For the code-editing task, both understanding the input and correctly generating the output are important. While a pre-trained encoder understands the code and a pre-trained decoder generates correct code, an end-to-end pre-trained encoder-decoder model (e.g., PLBART) is the best choice to fine-tune for this task.

Result (RQ1): Modit generates 29.99% and 23.02% correct patches in the top-1 position for the two datasets, outperforming CodeBERT by up to 34.38%, GraphCodeBERT by up to 25.72%, and CodeGPT by up to 30.50%. Pre-trained models tend to be more effective than models trained from scratch for code editing: Modit improves the performance by up to 167% over the best model trained from scratch.

Modit combines multiple modalities of information to generate patches. Now we investigate,

RQ2. What is the contribution of the different input modalities to Modit’s performance?

Experimental Setup. In this experiment, we investigate the contribution of the different input modalities to Modit’s performance. Recall from Section III-A that we use three inputs in Modit (i.e., $e_{p}$ , $C$ , $\mathcal{G}$ ). Here, we investigate different combinations of such input modalities. More precisely, we investigate the influence of three information sources: (i) the code that needs to be changed ( $e_{p}$ ), (ii) the context ( $C$ ), and (iii) the guidance ( $\mathcal{G}$ ). Note that, by presenting $e_{p}$ as a separate information modality, we are essentially providing Modit with information about the location of the change. To study the effect of such presentation, we study an alternative experimental setup, where we annotate the change location inside the context with two unique tokens that mark its start and its end.

Result. Table III shows Modit’s performance with different combinations of input modalities. When we present only the context to Modit, it predicts 13.05% correct patches in $B2F_{s}$ and 4.50% in $B2F_{m}$ , which improves to 17.89% and 4.51% in those two datasets, respectively, when we add $\mathcal{G}$ . Note that in these two scenarios, the model does not explicitly know which portion of the code needs to be edited; it sees the whole method and predicts (only) the patched code ( $e_{n}$ ). In addition to learning how to patch, the model implicitly learns where to apply the patch in this setup. To test whether the identification of such a location is the performance bottleneck, we surround the code that needs to be patched with two special marker tokens. SequenceR also proposed such annotation of buggy code. Surprisingly, such annotation resulted in comparable (slightly worse in one case) performance by Modit.

In the next set of experiments, we extract the code that needs to be edited ($e_{p}$) and present it as a separate input modality. First, we present only $e_{p}$, without the other two modalities, and generate the edited code ($e_{n}$). This results in 26.67% top-1 accuracy in $B2F_{s}$ and 19.79% in $B2F_{m}$. Ding et al. attributed such improvement to the reduced search space due to shorter input. Our result corroborates their empirical findings. Moreover, when we add the $\mathcal{G}$ modality to $e_{p}$, Modit’s performance improves to 28.76% and 21.63% in $B2F_{s}$ and $B2F_{m}$, respectively.

In our final set of experiments in this research question, we augment $e_{p}$ with the $C$ . In this evaluation setup, Modit predicts 29.79% correct patches in the $B2F_{s}$ and 21.40% in the $B2F_{m}$ , which is improved further to 29.99%, and 23.02% correct patches in those two datasets when we add $\mathcal{G}$ .

Figure 7 shows an example where Modit with all modalities successfully generates the correct patch. The text guidance ($\mathcal{G}$) hints that the variable expression should somehow be associated with the construction of TypeCheckInfo in the patched code. Without this guidance, however, Modit generated a wrong patch by accessing existing parameters from this object. Essentially, without the guidance, Modit refactored the input code.

Figure 8 shows the effect of context as an input modality to Modit. The before-edit version of the code ($e_{p}$) passed the wrong parameter (m) to the sendMessage function. When the context ($C$) is presented to Modit, it sees another variable (sent) in the context. In contrast, without the context ($C$), Modit did change the parameter but passed m.toString(), resulting in a wrong patch.

When we extract the buggy code and present it along with the context, we see a large performance improvement (see the difference between $\Phi_{c}$ and $\Phi_{ec}$ in Table III). We hypothesize that, when only the context (i.e., the full code) is presented ($\Phi_{c}$), the model struggles to identify which portion of the context needs to be edited, since any portion of the code is a likely candidate for patching. However, when we extract the exact code that needs to be edited and present it as a separate input modality, Modit can focus on patching just that code, using the other modalities (including the context) as supporting sources of information. In a recent study, Ding et al. pointed out the need for effective ways to include context in NMT-based code editors. Our empirical results show that Modit’s way of including context as a separate modality is a potential solution to that problem.

In summary, each of the modalities contributes to the overall performance of Modit. The lessons learned in these experiments are:

Additional textual guidance helps patch generation. Such guidance can provide important clues about how to modify the code and sometimes provides ingredients necessary for the change.

Adding context explicitly to the input enables the model to select appropriate identifiers for patching.

Isolating the buggy code helps the model focus on the necessary part of the code while leveraging auxiliary information from the other modalities.

Result IV-E: All three modalities (code to be edited, context, and guidance) are essential for Modit to perform at its best. Without any one of them, performance decreases. Modit’s performance improves by up to 37.37% when additional textual guidance is used as an input modality. The context modality improves Modit’s performance by up to 6.4%.

We investigate alternative ways to combine multiple input modalities. We ask:

RQ3. What is the best strategy to encode multiple input modalities?

Experimental setup. To validate Modit’s design choice of appending all input modalities into one sequence, we test alternative ways to combine input modalities. In particular, we follow the design choice proposed by Lutellier et al., who used multiple encoders to encode $e_{p}$ and $C$. Tufano et al. also leverage a similar idea to encode input code and code review messages. The multi-encoder model we use is shown in Figure 9.

In a multi-encoder setting, we first encode each input modality with a dedicated encoder. After the encoders finish encoding, we concatenate the encoded representations and pass them to the decoder to generate the patched code. To retain maximum effectiveness, we initialize each individual encoder with pre-trained weights from CodeBERT. For a fair comparison, we consider a single-encoder model (also initialized with CodeBERT) as the baseline. When presenting the inputs to the single-encoder model, we concatenate the input modalities with a unique separator token. Finally, to test the robustness of our empirical findings, we use two different experimental settings. In the first setting, we use all three input modalities and compare a tri-encoder model with a single-encoder model. In the second, we consider bimodal input data, $e_{p}$ and $\mathcal{G}$, and compare a dual-encoder model with a single-encoder model.
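To make the difference between the two input layouts concrete, the following is a minimal sketch of how the inputs might be assembled. The separator string `<SEP>`, the ordering of the modalities, and both function names are assumptions made for illustration; they are not taken from Modit’s implementation.

```python
from typing import List

SEP = "<SEP>"  # hypothetical separator token, not the one used in the paper


def single_encoder_input(e_p: str, guidance: str, context: str,
                         sep: str = SEP) -> str:
    """Concatenate all modalities into ONE sequence for a single encoder.

    The ordering (code to edit, guidance, context) is an assumption.
    """
    return f" {sep} ".join([e_p, guidance, context])


def multi_encoder_inputs(e_p: str, guidance: str, context: str) -> List[str]:
    """Keep every modality as its OWN sequence, one per dedicated encoder."""
    return [e_p, guidance, context]


e_p = "pictures = new ArrayList<>();"
guidance = "fix sublist problem"
context = "void init() { pictures = new ArrayList<>(); }"

joint = single_encoder_input(e_p, guidance, context)
separate = multi_encoder_inputs(e_p, guidance, context)

print(joint)
print(len(separate), "separate encoder inputs")
```

In the single-encoder layout, every token ends up in the same sequence and can therefore attend to tokens of the other modalities; in the multi-encoder layout, the modalities only meet after encoding, when their representations are concatenated for the decoder.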

Result. Table IV shows the result of multi-encoder models. For tri-modal input data, if we use three different encoders, the model can predict 20.63% correct patches in the $B2F_{s}$ and 11.69% correct patches in the $B2F_{m}$ . In contrast, if we use a single encoder, the model’s predictive performance increases to 26.05% and 17.13% top-1 accuracy in the $B2F_{s}$ and the $B2F_{m}$ , respectively.

In the bimodal dataset (where the input modalities are $e_{p}$ and $\mathcal{G}$), the dual-encoder model predicts 23.12% correct patches in the top-1 position for $B2F_{s}$ and 15.49% for $B2F_{m}$. Its single-encoder counterpart predicts 23.81% correct patches for $B2F_{s}$ and 17.46% for $B2F_{m}$. The empirical results show that the single-encoder model performs better than the multi-encoder model in both experimental setups. We find similar results with GraphCodeBERT.

To explain why the single encoder performs better than multiple encoders, let us look at how the encoders work. Figure 10 depicts how an encoder generates representations for input tokens. Note that the encoders we use in this research question are transformer-based; recall from Section II that a transformer generates the representation of an input token by learning its dependency on all other tokens in the sequence. When we present all the input modalities to a single encoder, it generates the representation of each token w.r.t. both the other tokens in the same modality and the tokens from the other modalities. For instance, in Figure 10(a), the encoder generates $X_{2}$’s representation considering $X_{1}$, $Y_{1}$, and $Y_{2}$. In contrast, in Figure 10(b), $X_{2}$’s representation is learned only w.r.t. $X_{1}$, since encoder1 does not see the input modality $Y$. Thus, we conjecture that the representations learned when all input modalities are presented to one single encoder are more robust than those learned with multiple encoders.
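The argument above can be illustrated with a toy self-attention computation. This is only a sketch under strong simplifying assumptions: one attention head, identity query/key/value projections, and hand-picked two-dimensional embeddings. It is not the encoder used in the experiments; it only shows which tokens can influence $X_{2}$’s representation in each layout.

```python
import math
from typing import List

Vector = List[float]


def self_attention(vectors: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
    return outputs


def x2_single_encoder(x: List[Vector], y: List[Vector]) -> Vector:
    """X2's representation when X and Y share one encoder (joint sequence)."""
    return self_attention(x + y)[1]


def x2_multi_encoder(x: List[Vector], y: List[Vector]) -> Vector:
    """X2's representation when X has its own encoder; Y is never seen."""
    return self_attention(x)[1]


X = [[1.0, 0.0], [0.0, 1.0]]        # toy embeddings for X1, X2
Y_a = [[1.0, 1.0], [0.5, -0.5]]     # one choice of Y1, Y2
Y_b = [[-1.0, 2.0], [2.0, 0.0]]     # a different choice of Y1, Y2

print("single encoder:", x2_single_encoder(X, Y_a), x2_single_encoder(X, Y_b))
print("multi encoder: ", x2_multi_encoder(X, Y_a), x2_multi_encoder(X, Y_b))
```

Changing the tokens of modality $Y$ changes $X_{2}$’s representation only in the single-encoder layout; with a dedicated encoder for $X$, the representation of $X_{2}$ is independent of $Y$.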

Finally, we summarize the lessons we learned in this research question:

In multi-modal translation, using a single encoder results in better performance than using a separate encoder for each modality.

A single encoder generates input representations through inter-modality reasoning (attention) and hence learns more robust representations than multiple encoders.

Result IV-E: Encoding all the input modalities with a single encoder is the best way to learn in a multi-modal setting. A single encoder improves code-editing performance by up to 46.5% over the corresponding multi-encoder setting.
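As a worked check of the relative improvement, the short sketch below recomputes the gains from the top-1 accuracies reported in the text for Table IV. The helper name `relative_improvement` and the dictionary layout are illustrative choices only.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (multi-encoder accuracy, single-encoder accuracy), as reported in the text.
top1_accuracy = {
    ("tri-modal", "B2F_s"): (20.63, 26.05),
    ("tri-modal", "B2F_m"): (11.69, 17.13),
    ("bimodal", "B2F_s"): (23.12, 23.81),
    ("bimodal", "B2F_m"): (15.49, 17.46),
}

gains = {
    setting: relative_improvement(multi, single)
    for setting, (multi, single) in top1_accuracy.items()
}

for (modalities, dataset), gain in gains.items():
    print(f"{modalities:9s} {dataset}: {gain:.1f}%")

best_gain = max(gains.values())
print(f"largest relative improvement: {best_gain:.1f}%")
```

The largest gain comes from the tri-modal setting on $B2F_{m}$: $(17.13 - 11.69) / 11.69 \approx 46.5\%$.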

VI Discussion

An alternative modeling approach for code editing is to generate a sequence of edit operations (i.e., INSERT, DELETE, UPDATE), where the model must know the precise location of an edit operation (often a node in the AST) before applying it. Throughout this paper, we also assumed that such an edit location is known to Modit. This assumption may pose a threat to the usefulness of Modit in a real development scenario. To mitigate this threat, we perform an experiment where we pass the whole function as input to Modit and expect the whole edited function to be generated.
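To illustrate what a sequence of edit operations looks like, the sketch below derives INSERT, DELETE, and UPDATE operations between two token sequences using Python’s standard `difflib` module. This token-level view is a simplification chosen for illustration; the approaches mentioned above often operate on AST nodes, and the function name `edit_operations` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

EditOp = Tuple[str, int, List[str], List[str]]

# Map difflib's opcode tags to the operation names used in the text.
TAG_TO_OP = {"replace": "UPDATE", "delete": "DELETE", "insert": "INSERT"}


def edit_operations(before: List[str], after: List[str]) -> List[EditOp]:
    """Return (operation, position in `before`, old tokens, new tokens) tuples."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    operations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged tokens need no edit operation
        operations.append((TAG_TO_OP[tag], i1, before[i1:i2], after[j1:j2]))
    return operations


before = ["pictures", "=", "new", "ArrayList", "<>", "(", ")", ";"]
after = ["pictures", "=", "new", "LinkedList", "<>", "(", ")", ";"]

for operation in edit_operations(before, after):
    print(operation)
```

Each operation carries the position at which it applies, which is why such models need the precise edit location before an operation can be applied.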

The number of possible tokens (e.g., identifiers) in source code is virtually infinite. Vocabulary explosion has been a major challenge when processing source code with Machine Learning techniques. Previous research efforts have addressed this problem using several different heuristics. For instance, Tufano et al. used identifier abstraction, which drastically reduces the vocabulary size and makes it easier for the model to learn patterns. Recent studies found that Byte-Pair Encoding partially solves the open-vocabulary problem by sub-dividing rare words into relatively less rare sub-words. Such sub-division is also learned from large corpora of data. All the pre-trained models used in this paper use sub-word tokenization techniques: CodeBERT and GraphCodeBERT use the RoBERTa tokenizer, CodeGPT uses the GPT tokenizer, and PLBART uses the sentence-piece tokenizer. The use of such tokenizers removes the burden of identifier abstraction. Our investigation shows that, in some cases, pre-trained models perform better with concrete tokens than with abstract tokens (see Table VI for detailed results). Thus, we recommend using inputs and outputs with concrete tokens when a pre-trained model is used.
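The idea behind Byte-Pair Encoding can be sketched with a toy, character-level implementation. This is a simplified illustration on a made-up identifier corpus, not the byte-level tokenizers shipped with the pre-trained models; the function names and the corpus are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def apply_merge(symbols: List[str], pair: Pair) -> List[str]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Pair]:
    """Learn merge rules by repeatedly merging the most frequent symbol pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {tuple(apply_merge(list(symbols), best)): freq
                 for symbols, freq in vocab.items()}
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Split a (possibly unseen) word into sub-words using the learned merges."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical identifier frequencies standing in for a training corpus.
corpus = {"getUser": 10, "getList": 8, "setUser": 6, "userList": 4}
merges = learn_bpe(corpus, num_merges=10)

# "getUserList" never occurs in the corpus, yet it can still be segmented.
pieces = segment("getUserList", merges)
print(pieces)
```

An identifier that was never seen during training is still representable as a sequence of known sub-words, which is why sub-word tokenization reduces the need for identifier abstraction.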

There has been substantial research effort to capture the repetitiveness in how developers edit source code. This research shows the potential of automatic refactoring, boilerplate code, etc. These efforts include (semi-)automatic tools involving traditional program analysis techniques (e.g., clone detection, dependency analysis, graph matching). Another research direction aims at learning source code edits from previous edits and applying those edit patterns in similar contexts. Some of these efforts target very specific code changes. For example, Nguyen et al. proposed a graph-matching-based approach for automatically updating API usage, and Tansey et al. proposed semantics-preserving transformations of Java classes for automated refactoring. Other lines of work address more general-purpose code changes learned from open source repositories. Such approaches tackle automated code editing in a data-driven way, where the edit patterns are learned from example changes. In this research, we also investigated general-purpose source code changes in the wild. Closer to Modit, the technique proposed by Rolim et al. constrains source code generation with additional input/output specifications or test cases. Nevertheless, we argue that textual guidance could be a very good surrogate specification.

VII-B NMT for Code Change Modeling

NMT has been studied for the past couple of years for learning automatic source code changes. Tufano et al. presented an initial investigation of using NMT to learn general-purpose code changes. Chakraborty et al. proposed a tree-based hierarchical NMT model for general-purpose source code changes. Instead of viewing code as a sequence of tokens, they first generated a syntax tree by sampling from a context-free grammar and then used another model to fill in the identifiers. To reduce the search space, they performed scope analysis to search for suitable identifiers. Chen et al. proposed a copy-mechanism-based NMT model for APR, where the input is the code before the change along with the context, and the output is the code after the change. Their work treated the input in a uni-modal way, where the whole code is one single modality. In this work, we consider a multi-modal way of modeling, where we isolate the code fragment that needs to be changed from its context and present that code fragment concatenated with the context to the model. Lutellier et al. treated the code that needs to be changed and the context as two different modalities and used separate encoders. However, our empirical evidence shows that using one encoder to encode all the modalities results in the best performance. More recently, Ding et al. presented empirical evidence that, instead of generating a whole code element (i.e., context+change) of the target version, generating only the sequence of changes might perform better for code change modeling. Recent works have proposed models for generating such edit sequences. Such models may augment or outperform NMT-based code editing; we leave this investigation as future work.

VII-C Machine Learning for Source Code Analysis

In recent years, Machine Learning, especially Deep Learning, has been widely adopted across different areas of software engineering due to the availability of large collections of source code on open source platforms (e.g., GitHub, Bitbucket, etc.). Applications of ML-based source code analysis include bug detection, clone detection, code completion, vulnerability detection, code summarization, code translation, etc. Recent works have also attempted to learn general-purpose transferable representations of source code, which can later be used for various source-code-related tasks. The approaches for learning such transferable representations can be broadly divided into two categories. The first category (e.g., Code2Vec) aims at learning explicit representations for tokens in the code. The second category (e.g., CodeBERT) transfers syntactic and semantic interactions between code components in the form of pre-trained models. In this approach, a model for a specific task is initialized with a general-purpose pre-trained model that has been trained to understand and generate code. In this paper, we empirically found that such pre-trained models (PLBART) increase accuracy by up to 248% in patch generation.

Bias in the dataset. Both $B2F_{s}$ and $B2F_{m}$ are collections of bug-fix commits, and thus there is a threat that these datasets may exhibit a specific bias towards bug-fix patches. While the commits in these datasets are filtered and classified as bug-fix commits, the changes were made by real developers as part of the development life cycle. Unlike other bug-fix datasets, $B2F_{s}$ and $B2F_{m}$ do not isolate the bug. Thus, we conjecture that the possibility of any such bias is minimal.

Noise in commit message. We used commit messages as guidance for code editing. While previous research efforts showed that commit messages are very useful for summarizing the changes in a commit, other research efforts also highlighted the noise present in commit messages. To mitigate this threat, we carefully chose the dataset on which we tested Modit. The original authors of the dataset reported that, after careful manual investigation, 97.6% of the commits in their datasets are true positives. Despite this threat, Modit’s performance seems to improve with the commit message as an additional input.

VIII-B Construct Validity

In general, developers write the commit message after they have edited the code, in theory summarizing the edits they made. In this paper, we assumed an experimental setup where the developer writes the summary before editing the code. This assumption may pose a threat to the applicability of Modit in the real world, since in some cases the developer may not know what edits they are going to make prior to the actual editing. Regardless, we consider Modit a proof of concept, with which we empirically show that, if a developer has the idea of the change in mind, that idea can help an automated code editor.

VIII-C Internal Threat

All Deep Learning-based techniques are sensitive to hyper-parameters. Thus, using sub-optimal hyper-parameters can pose a threat to the validity of Modit, especially when comparing with other baselines. Since we compared with other pre-trained models, we cannot modify their architectures and dimensions. As for other hyper-parameters (i.e., learning rate, batch size, etc.), we use the exact same hyper-parameters described in the respective papers. Nevertheless, we open source our code and data for broader dissemination.

In this paper, we highlight that an automatic code edit tool should, in general, possess knowledge about the underlying programming language. It can also benefit from additional information such as the edit context and the developers’ intention expressed in natural language. To that end, we design, present, and evaluate Modit, a multi-modal NMT-based automated code editor. Our in-depth evaluation shows that Modit improves code editing by leveraging knowledge about the programming language through pre-training. In addition, we show that leveraging additional modalities of information benefits the source code editor. Our empirical evaluation reveals some critical lessons about the design choices in building an automated code editor, which we believe will guide future research in automatic code editing.

This work is supported in part by NSF grants SHF-2107405, SHF-1845893, IIS-2040961, IBM, and VMWare. Any opinions, findings, conclusions, or recommendations expressed herein are those of the authors and do not necessarily reflect those of the US Government, NSF, IBM or VMWare.