Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair

Haoye Tian, Kui Liu, Abdoul Kader Kaboreé, Anil Koyuncu, Li Li, Jacques Klein, Tegawendé F. Bissyandé

Introduction

Automation in software engineering has recently reached new heights with the promising results recorded in the research direction of automated program repair (APR) (Le Goues et al., 2019; Monperrus, 2018). While a few techniques try to model program semantics and synthesize execution constraints towards producing quality patches, they often fail to scale to large programs. Instead, the large majority of research contributions (monperrus2018living) explore search-based approaches where patch candidates are generated and then validated against an oracle.

In the absence of strong program specifications, test suites represent affordable approximations that are widely used as the oracle in APR. In their seminal approach to test-based APR, Weimer et al. (Weimer et al., 2009) considered that a patch is acceptable if it makes the program pass all test cases in the test suite. Since then, a number of studies (Qi et al., 2015; Smith et al., 2015) have explored the overfitting problem in patch validation: a given patch is synthesized to pass a test suite and yet is incorrect with respect to the intended program specification. Since limited test suites only weakly approximate program specifications, a patched program can indeed satisfy the requirements encoded in the test cases, and present a behavior outside of those tests that are significantly different from the behavior initially expected by the developer.

Overfitting patches constitute a key challenge in generate-and-validate APR approaches. Recent evaluation campaigns (Liu et al., 2019c, d; Jiang et al., 2018; Saha et al., 2019; Wen et al., 2018; Liu et al., 2018c; Koyuncu et al., 2019; Liu et al., 2019b; Koyuncu et al., 2020) on APR systems are stressing on the importance of estimating the correctness ratio among the valid patches that can be found. To improve this ratio, researchers are investigating several research directions. We categorize them in three main axes that focus on actions before, during or after patch generation:

test-suite augmentation: Yang et al. (Yang et al., 2017) proposed to generate better test cases to enhance the validation of patches, while Xin and Reiss (Xin and Reiss, 2017) opted for increasing test inputs.

post-processing of generated patches: Long and Rinard (Long and Rinard, 2016) studied some heuristics to discard patches that are likely overfitting.

curation of repair operators: approaches such as CapGen (Wen et al., 2018) successfully demonstrated that carefully-designed (e.g., fine-grained fix ingredients) repair operators can lead to more correct patches.

Our work is related to the latter thrust. So far, the state-of-the-art works targeting the identification of patch correctness are mainly implemented based on computing the similarity of test case execution traces (Xiong et al., 2018). Ye et al. (Ye et al., 2019a) followed up by presenting preliminary results suggesting that statically-extracted code features at the syntax level could be used to predict overfitting patches. While such an approach is appealing, the feature engineering effort can be huge when researchers target generalizable approaches. To cope with this problem, Csuvik et al. (Csuvik et al., 2020) have proposed a preliminary small-scale study on the use of embeddings: leveraging pre-trained natural language sentence embedding models, they claim to have been able to filter out 45% incorrect patches generated for 40 bugs from the QuixBugs dataset (Ye et al., 2019b).

This paper. Embeddings have been successfully applied to various prediction tasks in software engineering research (Soto and Le Goues, 2018; Wang et al., 2016; Liu et al., 2019a; Allamanis et al., 2014). For patch correctness prediction, the literature does not yet provide extensive experimental results to guide future research. Our work fills this gap. We investigate in this paper the feasibility of leveraging advances in deep representation learning to produce embeddings that are amenable to reasoning about correctness.

We investigate different representation learning models adapted to natural language tokens and source code tokens that are more specialized to code changes. Our study considers both pre-trained models and the retraining of models.

We empirically investigate whether, with learned representations, the hypothesis of minimal changes incurred by correct patches remains valid: experiments are performed to check the statistical difference between similarity scores yielded by correct patches and those yielded by incorrect patches.

We run exploratory experiments assessing the possibility to select cutoff similarity scores between learned representations of buggy code and patched code fragments for heuristically filtering out incorrect patches.

Finally, we investigate the discriminative power of deep learned features in a classification training pipeline aimed at learning to predict patch correctness.

Background

Our work deals with various concepts and techniques from the fields of program repair and machine learning. We present the relevant details in this section to facilitate readers’ understanding of our study design and the scope of our experiments.

Defining patch correctness is a non-trivial challenge in APR. Until the release of empirical investigations by Smith et al. (Smith et al., 2015), actual correctness (w.r.t. program behavior intended by developers) was seldom used as a performance criterion of patch generation systems. Instead, experimental results were focused on the number of patches that make the program pass all test cases. Such patches are actually only plausible. Qi et al. (Qi et al., 2015) demonstrated in their study that an overwhelming majority of plausible patches generated by GenProg (Le Goues et al., 2012), RSRepair (Qi et al., 2014) and AE (Weimer et al., 2013)) are overfitting the test suite while actually being incorrect. To improve the probability of program repair systems to generate correct patches, researchers have mainly invested in strengthening the validation oracle (i.e., the test suites). Opad (Yang et al., 2017), DiffTGen (Xin and Reiss, 2017), UnsatGuided (Yu et al., 2019), PATCH-SIM/TEST-SIM (Xiong et al., 2018) generate new test inputs that trigger behavior cases which are not addressed by APR-generated patches.

More recent works (Ye et al., 2019a; Csuvik et al., 2020) are starting to investigate static features and heuristics (or machine learning) to build predictive models of patch correctness. Ye et al. (Ye et al., 2019a) presented the ODS approach which relates to our study since it investigated machine learning with static features extracted from Java program patches. Their approach however builds on carefully hand-crafted features, which may not generalize to other programming languages or even to varied datasets. The study of Csuvik et al. (Csuvik et al., 2020) is also close to ours as it explores BERT embeddings to define similarity thresholds. Their work remains preliminary (it does not investigate the discriminative power of features) and has been performed at a small scale (single pre-trained model on 40 one-line bugs from simple programs).

2. Distributed Representation Learning

Learning distributed representations have been widely used to advance several machine learning tasks. In particular, in the field of Natural Language Processing embedding techniques such as Word2Vec (Le and Mikolov, 2014), Doc2Vec (Le and Mikolov, 2014) and BERT (Devlin et al., 2019) have been successfully applied for different semantics-related tasks. By building on the hypothesis of code naturalness (Hindle et al., 2012; Allamanis et al., 2018), a number of software engineering research works have also leveraged the aforementioned approaches for learning distributed representations of code. Alon et al. (Alon et al., 2019) have then proposed code2vec, an embedding technique that explores AST paths to take into account structural information in code. More recently, Hoang et al. (Hoang et al., 2020) have proposed CC2Vec, which further specializes to code changes.

Our work explores different techniques across the spectrum of distributed representation learning. We therefore consider four variants from the seemingly-least specialized to code (i.e., Doc2Vec) to the state of the art for code change representation (i.e., CC2Vec).

Doc2Vec (Le and Mikolov, 2014) is an unsupervised framework mostly used to learn continuous distributed vector representations of sentences, paragraphs and documents, regardless of their lengths. It works on the intuition, inspired by the method of learning word vectors (Mikolov et al., 2013), that the document representation should be good enough to predict the words in the document Doc2Vec has been applied in various software engineering tasks. For example, Wei and Li (Wei and Li, 2017) leveraged Doc2Vec to exploit deep lexical and syntactical features for software functional clone detection. Ndichu et al. (Ndichu et al., 2019) employed Doc2Vec to learn code structure representation at AST level to predict JavaScript-based attacks.

2.2. BERT

BERT (Devlin et al., 2019) is a language representation model that has been introduced by an AI language team in Google. BERT is devoted to pre-train deep bidirectional representations from unlabelled texts. Then a pre-trained BERT model can be fine-tuned to accomplish various natural language processing tasks such as question answering or language inference. Zhou et al. (Zhou et al., 2019) employed a BERT pre-trained model to extract deep semantic features from code name information of programs in order to perform code recommendation. Yu et al. (Yu et al., 2020) even leveraged BERT on binary code to identify similar binaries.

2.3. code2vec

code2vec (Alon et al., 2019) is an attention-based neural code embedding model developed to represent code fragments as continuous distributed vectors, by training on AST paths and code tokens. Its embeddings have notably been used to predict the semantic properties of code fragments (Alon et al., 2019), in order, for instance, to predict method names. In a recent work, however, Kang et al. (Kang et al., 2019) reported an empirical study, which highlighted that the yielded token code2vec embeddings may not generalize to other code-related tasks such as code comment generation, code authorship identification or code clone detection. code2vec remains however the state of the art in code embeddings: Compton et al. (Compton et al., 2020) recently leveraged code2vec to embed Java classes and learn code structures for the task of variable naming obfuscation.

2.4. CC2Vec

CC2Vec (Hoang et al., 2020) is a specialized hierarchical attention neural network model which learns vector representations of code changes (i.e., patches) guided by the associated commit messages (which is used as a semantic representation of the patch). As the authors demonstrated in their in large empirical evaluation, CC2Vec presents promising performance on commit message generation, bug fixing patch identification, and just-in-time defect prediction.

Study Design

First, we overview the research questions that we investigate. Then we present the datasets that are leveraged to answer these research questions. Finally, we discuss the actual training of (or use of pre-trained) models for embedding the code changes.

RQ1: Do different representation learning models yield comparable distributions of similarity values between buggy code and patched code? A widespread hypothesis in program repair is that bug fixing generally induce minimal changes (Chen et al., 2017; Xiong et al., 2017; Jiang et al., 2019, 2018; Liu et al., 2019c, d; Wen et al., 2018; Weimer et al., 2009; Liu et al., 2018b; Martinez and Monperrus, 2015; Barr et al., 2014). We propose to investigate whether embeddings can be a reliable means for assessing the extent of changes through computation of cosine similarity between vector representations.

RQ2: To what extent similarity distributions can be generalized for inferring a cutoff value to filter out incorrect patches? Following up on RQ1, We propose in this research question to experiment ranking patches based on cosine similarity of their vector representations, and rely on naively-defined similarity thresholds to decide on filtering of incorrect patches.

RQ3: Can we learn to predict patch correctness by training classifiers with code embeddings input? We investigate whether deep learned features are indeed relevant for building machine learning predictors for patch correctness.

2. Datasets

We collect patch datasets by building on previous efforts in the community. An initial dataset of correct patches is collected by using five literature benchmarks, namely Bugs.jar (Saha et al., 2018), Bears (Madeiral et al., 2019), Defects4J (Just et al., 2014), QuixBugs (Lin et al., 2017) and ManySStuBs4J (Karampatsis and Sutton, 2020). These are developer patches as committed in open source project repositories.

We also consider patches generated by APR tools integrated into the RepairThemAll framework. We use all patch samples released by Durieux et al. (Durieux et al., 2019). This only includes sample patches that make the programs pass all test cases. They are thus plausible. However, no validation information on correctness was given. In this work, we proceed to manually validate the generated patches, among which we identified 900 correct patches. The correctness validation follows the criteria defined by Liu et al. (Liu et al., 2020).

In a recent study on the efficiency of program repair, Liu et al. (Liu et al., 2020) released a labeled dataset of patches generated by 16 APR systems for the Defects4J bugs. We consider this dataset as well as the labeled dataset that was used to evaluate the PATCH-SIM (Xiong et al., 2018) approach.

Overall, Table 1 summarizes the data sets that we used for our experiments. Each experiment in Section 4 has specific requirements on the data (e.g., large patch sets for training models, labeled datasets for benchmarking classifiers, etc.). For each experiment, we will recall which sub-dataset has been leveraged and why.

3. Model input pre-processing

Samples in our datasets are patches such as the one presented in Figure 1 extracted from the Defects4J dataset. Our investigations with representation learning however require input data about the buggy and patched code. A straightforward approach to derive those inputs would be to consider the code files before and after the patch. Unfortunately, depending on the size of the code file, the differences could be too minimal to be captured by any similarity measurement. To that end, we propose to focus on the code fragment that appears in the patch. Thus, to represent the buggy code fragment (cf. Figure 2), we keep all removed lines (i.e., starting with ‘-’) as well as the patch context lines (i.e., those not starting with either ‘-’, ‘+’ or ‘@’). Similarly, the patched code fragment (cf. Figure 3) is represented by added lines (i.e., starting with ’+’) as well as the same context lines. Since tool support for the representation learning techniques BERT, Doc2Vec, and CC2Vec require each input sample to be on a single line, we flatten multi-line code fragments into a single line.

In contrast to BERT, Doc2Vec, and CC2Vec, which can take as input some syntax-incomplete code fragments, code2vec requires the fragment to be fully parsable in order to extract information on Abstract Syntax Tree paths. Since patch datasets include only text-based diffs, code context is generally truncated and is likely not parsable. However, as just explained, we opt to consider only the removed/added lines to build the buggy and patched code input data. By doing so, we substantially improved the success rate of the JavaExtractor tool used to build the tokens in the code2vec pipeline.

4. Embedding models

When representation learning algorithms are applied to some training data, they produce embedding models that have learned to map a set of code tokens in the vocabulary of the training data to vectors of numerical values. These vectors are also referred to as embeddings. Figure 4 illustrates the process of embedding buggy code and patched code for the purpose of our experiments.

The embedding models used in this work are obtained from different sources and training scenarios.

BERT. In the first scenario, we consider an embedding model that initially targets natural language data, both in terms of the learning algorithm and in terms of training data. The network structure of BERT, however, is deep, meaning that it requires large datasets for training the embedding model. As it is now custom in the literature, we instead leverage a pre-trained 24-layer BERT model, which was trained on a Wikipedia corpus.

Doc2Vec. In the second scenario, we consider an embedding model that is trained on code data but using a representation learning technique that was developed for text data. To that end, we have trained the Doc2Vec model with code data of 36,364 patches from the 5 repair benchmarks (cf. Table 1).

code2vec. In the third scenario, we consider an embedding model that primarily targets code, both in terms of the learning algorithm and in terms of training data. We use in this case a pre-trained model of code2vec, which was trained by the authors using ~14 million code examples from Java projects.

CC2Vec. Finally, in the fourth scenario, we consider an embedding model that was built in representation learning experiments for code changes. The pre-trained model that we leveraged from the work of Hoang et al. (Hoang et al., 2020) is embedding each patch into a single vector. We investigate the layers and identify the middle CNN-3D layer as the sweet spot to extract embeddings for buggy code and patched code fragments. Figure 5 illustrates the process.

Experiments

We present the experiments that we designed to answer the research questions of our study. For each experiment, we state the objective, overview the execution details before presenting the results.

Objective: We investigate the capability of different learned embeddings to capture the similarity/dissimilarity between code fragments. The experiments are performed towards providing answers for two sub-questions:

Is correct code actually similar to buggy code based on learned embeddings?

To what extent is buggy code more similar to correctly-patched code than to incorrectly-patched code?

Experimental Design: We perform two distinct experiments with available datasets to answer RQ-1.1 and RQ-1.2.

[Experiment ❶] Using the four embedding models considered in our study (cf. Section 3.4), we produce the embeddings for buggy and patched code fragments associated to 36k patches from five benchmark shown in Table 2. In this case, the patched code fragment represents correct code since it comes from labeled benchmark data (generally representing developer fix patches). Given those embeddings (i.e., code representation vectors), we compute the cosine similarity between the vector representing the buggy code fragment and the vector representing the patched code fragment.

[Experiment ❷] To compare the similarity scores of correct code fragment vs incorrect code fragment to the buggy code, we consider combining datasets with correct patches and datasets with incorrect patches. Note that, all patches in our experiments are plausible since we are focused on correctness: plausibility is straightforward to decide based on test suites. Correct patches are provided in benchmarks. However, incorrect patches associated to all benchmark bugs are not available. We rely on the dataset released by Liu et al. (Liu et al., 2020): 674 plausible but incorrect patches generated by 16 repair tools for 184 Defects4J bugs are considered from this dataset. Those 674 incorrect patches are selected within a larger set of incorrect patches by adding the constraint that the incorrect patch should be changing the same code location as the developer-provided patch in the benchmark: such incorrect patch cases may indeed be the most challenging to identify with heuristics. We propose to compare the similarity scores between the incorrect code and buggy code associated to the dataset with the similarity scores between correct code and buggy associated to all benchmarks, all Defects4J benchmark data, or only the subset of Defects4J that corresponds to the 184 patches for which relevant incorrect patches are available.

Results: Figure 6 presents the boxplots of the similarity distributions with different embedding models and for samples in different datasets. Doc2Vec and code2vec models appear to yield similarity values that are lower than BERT and CC2Vec models.

Figure 7 zooms in the boxplot region for each embedding model experiment to overview the differences across different benchmark data. We obverse that, when embedding the patches with BERT, the similarity distribution for the patches in Defects4J dataset is similar to Bugs.jar and Bears dataset, but is different from the dataset ManySStBs4J and QuixBugs. The Mann-Whitney-Wilcoxon (MWW) tests (Wilcoxon, 1945; Mann and Whitney, 1947) confirm that the similarity of median scores for Defects4J, Bugs.jar and Bears is indeed statistically significant. MWW tests further confirms the statistical significance of the difference between Defects4J and ManySStBs4J/QuixBugs scores.

Defects4J, Bugs.jar and Bears include diverse human-written patches for a large spectrum of bugs from real-world open-source Java projects. In contrast, ManySStuBs4J only contains patches for single statement bugs. Quixbugs dataset is limited by its size and the fact that the patches are built by simply mutating the code of small Java implementation of 40 algorithms (e.g., sort, sieve, etc.).

While CC2Vec and Doc2Vec exhibit roughly similar patterns with BERT (although at different scales), the experimental results with code2vec present different patterns across datasets. Note that, due to parsing failures of code2vec, we eventually considered only 118 Bears patches, 123 Bugs.jar patches, 46 Defects4J patches, 20,840 ManySStuBs4J patches and 8 QuixBugs. The change of dataset size could explain the difference with the other embedding models.

✍ RQ1.1 $\blacktriangleright$ Learned representations of buggy and correct code fragments exhibit high cosine similarity scores. Median scores are similar for patches that are collected with similar heuristics (e.g., in the wild patches vs single-line patches vs debugging example patches). The pre-trained BERT natural language model captures more similarity variations than the CC2Vec model, which is specialized for code changes. $\blacktriangleleft$ In the second experiment, we further assess whether incorrectly-patched code exhibits different similarity score distributions than correctly-patched code. Figure 8 shows the distributions of cosine similarity scores for correct patches (i.e., similarity between buggy code and correct code fragments) and incorrect patches (i.e., similarity between buggy code and incorrect code fragments). The comparison is done with different scenarios specified in Table 3.

The comparisons do not include the case of embeddings for code2vec. Indeed, unlike the previous experiment where code2vec was able to parse enough code fragments, for the considered 184 correct patches of Defects4J, code2vec failed to parse most of the relevant code fragments. Hence, we focus the comparison on the other three embedding models (BERT, Doc2Vec and CC2Vec). Overall, we observe that the distribution of cosine similarity scores is substantially different for correct and incorrect code.

We observe that the similarity distributions of buggy code and patched code from incorrect patches are significantly different from the similarities for correct patches. The difference of median values is confirmed to be statistically significant by an MWW test. Note that the difference remains high for BERT, Doc2Vec and CC2Vec whether the correct code is the counterpart of the incorrect ones (i.e., the scenario of Balanced-Defects4J) or whether the correct code is from a larger dataset (i.e., Imbalanced-all and Imbalanced-Defects4J scenarios).

2. [Filtering of Incorrect Patches based on Similarity Thresholds]

Objective: Following up on the findings related to the first research question, we investigate the selection of cut-off similarity scores to decide on which APR-generated patches are likely incorrect. Results from this investigation will provide insights to guide the exploitation of code embeddings in program repair pipelines.

Experimental design: To select threshold values, we consider the distributions of similarity scores from the above experiments (cf. Section 4.1). Table 4 summarizes relevant statistics on the distributions on the similarity scores distribution for correct patches. Given the differences that were exhibited with incorrect patches in previous experiments, we use, for example, the 1st quartile value as an inferred threshold value.

Given our previous findings that different datasets exhibit different similarity score distributions, we also consider inferring a specific threshold for the QuixBugs dataset (cf. statistics in Table 5). We do not compute any threshold based on ManySStuBs4J since it has not yet been applied to program repair tools.

Our test data is constituted of 64,293 patches generated by 11 APR tools in the empirical study of Durieux et al. (Durieux et al., 2019). First, we use the four embedding models to generate embeddings of buggy and patched code fragments and compute cosine similarity scores. Second, for each bug, we rank all generated patches based on the similarity score between the patched code and the buggy, where we consider that the higher the score, the more likely the correctness. Finally, to filter incorrect candidates, we consider two experiments:

Patches that lead to similarity scores that are lower to the inferred threshold (i.e., 1st Quartile in previous experimental data) will be considered as incorrect. Patches where patched code exhibit higher similarity scores than the threshold are considered likely correct.

Another approach is to consider only the top-1 patches with the highest similarity scores as correct patches. Other patches are considered incorrect.

In all cases, we systematically validate the correctness of all 64,293 patches to have the correctness labels, for which the dataset authors did not provide (all plausible patches having been considered as valid). First, if the file(s) modified by a patch are not the same buggy files in the benchmark, we systematically consider it as incorrect: with this simple scheme, 33 489 patches are found incorrect. Second, with the same file, if the patch is not making changes at the same code locations, we consider it to be incorrect: 26 386 patches are further tagged as incorrect with this decision (cf. Threats to validity in Section 5). Finally, for the remaining 4 418 plausible patches in the dataset, we manually validate correctness by following the strict criteria enumerated by Liu et al. (Liu et al., 2020) to enable reproducibility. Overall, we could label 900 correct patches. The remainders are considered as incorrect.

Results: By considering the patch with the highest (top-1) similarity score between the patched code and buggy code as correct, we were able to identify a correct patch for 10% (with BERT), 9% (with CC2Vec) and 10% (with Doc2Vec) of the bug cases. Overall we also misclassified 96% correct patches as incorrect. However, only 1.5% of incorrect patches were misclassified as correct patches.

Given that a given bug can be fixed with several correct patches, the top-1 criterion may not be adequate. Furthermore, this criterion makes the assumption that a correct patch indeed exists among the patch candidates. By using filtering thresholds inferred from previous experiments (which do not include the test dataset in this experiment), we can attempt to filter all incorrect patches generated by APR tools. Filtering results presented in Table 6 show the recall scores that can be reached. We provide experimental results when we use 1st Quartile and Mean values of similarity scores in the ”training” set as threshold values. The threshold are also applied by taking into account the datasets: thresholds learned on QuixBugs benchmark are applied to generated patches for QuixBugs bugs.

3. [Classification of Correct Patches with supervised learning]

Objective: Cosine similarity between embeddings (which was used in the previous experiments) considers every deep learned feature as having the same weight as the others in the embedding vector. We investigate the feasibility to infer, using machine learning, the weights that different features may present with respect to patch correctness. We compare the prediction evaluation results with the achievements of related approaches in the literature.

Experimental design: To perform our machine learning experiments, we first require a ground-truth dataset. To that end, we rely on labeled datasets in the literature. Since incorrect patches generated by actual APR tools are only available for the Defects4J bugs, we focus on labeled patches provided by two independent teams (Liu et al. (Liu et al., 2020) and Xiong et al. (Xiong et al., 2018)). Very few patches generated by the different tools are actually labeled as correct, leading to an imbalanced dataset. To reduce the imbalance issue, we supplement the dataset with developer (correct) patches as supplied in the Defects4J benchmark. Eventually, our dataset shown in Table 7 included 1000 patches after removing duplicates to avoid data bias.

Our ground truth dataset patches are then fed to our embedding models to produce embedding vectors. As for previous experiments, the parsability of Defects4J patch code fragments prevented the application of code2vec: we use pre-trained models of BERT (trained with natural language text) and CC2Vec (trained with code changes) as well as a retrained model of Doc2Vec (trained with patches).

Since the representation learning models are applied to code fragments inferred from patches (and not to the patch themselves), we collect the embeddings of both buggy code fragment and patched code fragment for each patch. Then we must merge these vectors back into a single input vector for the classification algorithm. We follow an approach that was demonstrated by Hoang et al. (Hoang et al., 2020) in a recent work on bug fix patch prediction: the classification model performs best when features of patched code fragment and buggy code fragment are crossed together. We thus propose a classification pipeline (cf. Figure 9) where the feature extraction for a given patch is done by applying subtraction, multiplication, cosine similarity and euclidean similarity to capture crossed features between the buggy code vector and the patched code vector. The resulting patch embedding has 2*n+2 dimensions where n is the dimension of input code fragment embeddings.

Results: We compare the performance of different predictors (varying the embeding models) using different learners (i.e., classification algorithms). Results presented in Table 8 are averaged from a 5-fold cross validation setup. All classical metrics used for assessing predictors are reproted: Accuracy, Precision, Recall, F1-Measure, Area Under Curve (AUC). Logistic Regression (LR) applied to BERT embeddings yield the best performance measurements: 0.720 for F1 and 0.808 for AUC.

✍ RQ3.1 $\blacktriangleright$ An ML classifier trained using Logistic Regression with BERT embeddings yield very promising performance on patch correctness prediction (F-Measure at 72.0% and AUC at 80.8%). $\blacktriangleleft$ [Comparison against the state of the art]. There are two related works for patch prediction which were both evaluated on 139 patches released by Xiong et al. (Xiong et al., 2018). PATCH-SIM (Xiong et al., 2018) compares execution traces of patched programs to identify correctness. ODS (Ye et al., 2019a) leverages manually-crafted features to build machine learning classifiers.

We consider the 139 patches as test set and the remainder in our dataset ( $870=1000-130$ 9 patches in the ground truth dataset by Xiong et al. (Xiong et al., 2018) were duplicates (e.g., Patch151 $\equiv$ Patch23).) for training. Note that the 139 patches are associated to bug cases where repair tools can generate patches. These patches may thus be substantially different from the rest in our dataset. Indeed our best learner (Logistic Regression with BERT embeddings) yields an AUC of 0.765. The Receiver Operating Characteristic (ROC) curve is presented in Figure 10.

In the validation of PATCH-SIM (Xiong et al., 2018), the authors aimed for avoiding to filter out any correct patches. Eventually, when guaranteeing that no correct patch is excluded, they could still exclude 62 (56.3%) incorrect patches. If we constrain the threshold of our predictor to avoid misclassifying any correct patch (threshold value = 0.219), our predictor is able to exclude up to 43 (39.4%) incorrect patches, which represents a reasonably promising achievement since no dynamic information is used (in contrast to PATCH-SIM). Table 9 overviews the prediction results comparison.

We also compare the predictive power of our models against that of ODS (Ye et al., 2019a), which builds on manually engineered features. We directly compare against the results reported by the authors on the 139 test patches. While the pre-trained BERT model associated with Logistic Regression (LR) achieves better AUC than ODS LR-based model (0.765 vs 0.705), ODS Random Forest-based model achieves a higher AUC at 0.841. Note however that ODS has been trained on over 13 thousand patches (including patches for bugs associated to the test set patch), our training dataset includes only 870 patches (i.e., $\sim$ 1/20th of their dataset).

Tables 10 and 11 provide confusion matrices for different cut-off thresholds of the classifiers for ODS and our BERT embeddings-based classifiers: TP (true positives) represent correct patches that were classified as such; TN (true negatives) represent incorrect patches that were classified as such; FP (false positives) represent incorrect patches that were classified as correct; and FN (false negatives) represent correct patches that were classified as incorrect. Overall, the BERT-based predictor is very sensitive to the cut-off thresholds while ODS is less sensitive. We also note that BERT embeddings applied to Random Forrest does not yield good performance: decision trees are indeed known to be good for categorical data and request large datasets for training. In our case, the data set is small, while ODS has a training dataset that is about 20 times larger. The hand-crafted features of ODS may also help split the patches into categories while our deep learned features are based on a large vocabulary of natural language text.

We observe nevertheless that LR classifiers fed with BERT embeddings are able to recall high numbers of incorrect patches (#TN is high and #FP is low on threshold ¿ 0.5). In contrast ODS consistently recalls correct patches (however with high false positives). These experimental results suggest that both approaches can be used in a complementary way. In future work, we will propose an approach that carefully merges deep learned features to hand-crafted features towards yielded a better predictors of patch correctness.

Discussions

We enumerate a few insights from our experiments with representation learning models and discuss some threats to validity.

[Code-oriented embedding models may not yield the best embeddings for training predictors.] Our experiments have revealed that the BERT model which was pre-trained on Wikipedia is yielding the best recall in the identification of incorrect patches. There are several possible reasons to that: Bert implements the deepest neural network and builds on the largest training data. Its performance suggests that code-oriented embeddings should aim for being accurate with small training datasets in order to become competitive against BERT. While we were completing the experiments, a pre-trained CodeBERT (Feng et al., 2020) model has been released (on April 27). In future work, we will investigate its relevance for producing embeddings that may yield higher performance in patch correctness prediction. In any case, we note that CC2Vec provided the best embeddings for yielding the best recall in identifying correct. patches (using similarity thresholds). This suggests that future research should investigate the value of merging different representations or combining the eventual prediction probabilities to improve performance on identifying correct patches and excluding incorrect patches.

[The small sizes of the code fragments lead to similar embeddings.]. Figure 11 illustrates the different cosine similarity scores that can be obtained for the BERT embeddings of different pairs of short sentences. Although the sentences are semantically (dis)similar, the cosine similarity scores are quite close. This explains why recalling correct patches based on a similarity threshold was a failed attempt ( $\sim 5\%$ for APR-generated patches for. Defects4J+Bears+Bugs.jar bugs). Nevertheless, experimental results demonstrated that deep learned features were relevant for learning to discriminate.

[Embeddings are most suitable when applied to simple ML algorithms.] Because embeddings are yielded from neural networks, they are actually formed by complex crossed features. When they are fed to a complex discriminant model such as Random Forrest, it may lead to overfitting with small datasets. Our experiments however show that simple Logistic Regression yields the best AUC, suggesting that this learner was able to better identifying discriminating features for the prediction task.

2. Threats to validity

Our empirical study carries a number of threats to validity that we have tried to mitigate.

Threats to External Validity. There are a variety of representation learning models in the literature. A threat to validity of our study is that we may have a selection bias by considering only four embedding models. We have mitigated this threat by considering representative models in different scenarios (pre-trained vs retrained, code change specific vs natural language oriented).

Another threat to validity is related to the use of Defects4J data in evaluating the ML classifiers. This choice however was dictated by the data available and the aim to compare against related work.

Finally, with respect to the explored models, the attention system of CC2Vec requires some execution parameters to perform well. Since the relevant code was not available, we use use a non-attention version instead, potentially making CC2Vec embeddings be under-performing. We release the artifacts for future comparisons by the research community.

Threats to Internal Validity. A major threat to internal validity lies in the manual assessment heuristics that we applied to the RepairThemAll-generated dataset. We may have misclassified some patches due to mistakes or conservatism. This threat however holds for all APR work that relies on manual assessment. We mitigate this threat by following clear and reproducible decision criteria, and by releasing our labelled datasets for the community to reviewsee: https://github.com/SerVal-DTF/DL4PatchCorrectness.

Threats to Construct Validity. For our experiment, the considered embedding models are not perfect and they may have been under-trained for the prediction task that we envisioned. For this reason, the results that we have reported are likely an under-estimation of the capability of representation learning models to capture discriminative features for the prediction of patch correctness. Our future studies on representation learning will address this threat by considering different re-training experiments.

Related Work

To assess the performance of fixing bugs of repair tools and approaches, checking the correctness of patches is key, but not trivial. However, this task was largely ignored or unconcerned in the community until the analysis study of patch correctness conducted by Qi et al. (Qi et al., 2015). Thanks to their systematic analysis of the patches reported by three generate-and-validate program repair systems (i.e., GenProg, RSRepair and AE), they shown that the overwhelming majority of the generated patches are not correct but just overfit the test inputs in the test suites of buggy programs. In another study, Smith et al. (Smith et al., 2015) uncover that patches generated with lower coverage test suites overfit more. Actually, these overfitting patches often simply break under-tested functionalities, and some of them even make the “patched” program worse than the un-patched program. Since then, the overfitting issue has been widely studied in the literature. For example, Le et al. (Le et al., 2018) revisit the overfitting problem in semantics-based APR systems. In (Le et al., 2019), they further assess the reliability of authors and automated annotations in assessing patch correctness. They recommend to make publicly available to the community the patch correctness evaluations of the authors. Yang and Yang (Yang and Yang, 2020) explore the difference between the runtime behavior of programs patched with developer’s patches and those by APR-generated plausible patches. They unveil that most APR-generated plausible patches lead to different runtime behaviors compared to correct patches.

To predict the correctness of patches, one of the first explored research directions relied on the idea of augmenting test inputs, i.e., more tests need to be proposed. Yang et al. (Yang et al., 2017) design a framework to detect overfitting patches. This framework leverages fuzz strategies on existing test cases in order to automatically generate new test inputs. In addition, it leverages additional oracles (i.e., memory-safety oracles) to improve the validation of APR-generated patches. In a contemporary study, Xin and Reiss (Xin and Reiss, 2017) also explored to generate new test inputs, with the syntactic differences between the buggy code and its patched code, for validating the correctness of APR-generated patches. As complemented by Xiong et al. (Xiong et al., 2018), they proposed to assess the patch correctness of APR systems by leveraging the automated generation of new test cases and measuring behavior similarity of the failing tests on buggy and patched programs.

Through an empirical investigation, Yu et al. (Yu et al., 2019) summarized two common overfitting issues: incomplete fixing and regression introduction. To assist alleviating the overfitting issue for synthesis-based APR systems, they further proposed UnsatGuided that relies on additional generated test cases to strengthen patch synthesis, and thus reduce the generation of incorrect overfitting patches.

Predicting patch correctness with thanks to an augmented set of test cases heavily relies on the quality of tests. In practice, tests with high coverage might be unavailable (Ye et al., 2019a). In our paper, we do not rely on any new test cases to assess patch correctness, but leverage representation learning techniques to build representation vectors for buggy and patched code of APR-generated patches.

To predict overfitting patches yielded by APR tools, Ye et al. (Ye et al., 2019a) propose ODS, an overfitting detection system. ODS first statically extracts 4,199 code features at the AST level from the buggy code and generated patch code of APR-generated patches. Those features are fed into three machine learning algorithms (logistic regression, KNN, and random forest) to learn an ensemble probabilistic model for classifying and ranking potentially overfitting patches. To evaluate the performance of ODS, the authors considered 19,253 training samples and 713 testing samples from the Durieux et al. empirical study (Durieux et al., 2019). With these settings, ODS is capable of detecting 57% of overfitting patches. The ODS approach relates to our study since both leverage machine learning and static features. However, ODS only relies on manually identified features which may not generalize to other programming languages or even other datasets.

In a recent work, Csuvik et al. (Csuvik et al., 2020) exploit the textual and structural similarity between the buggy code and the APR-patched code with two representation learning models (BERT (Devlin et al., 2019) and Doc2Vec (Le and Mikolov, 2014)) by considering three patch code representation (i.e., source code, abstract syntax tree and identifiers). Their results show that the source code representation is likely to be more effective in correct patch identification than the other two representations, and the similarity-based patch validation can filter out incorrect patches for APR tools. However, to assess the performance of the approach, only 64 patches from QuixBugs (Ye et al., 2019b) have been considered (including 14 in-the-lab bugs). This low number of considered patches raises questions about the generalization of the approach for fixing bugs in the wild. Moreover, unlike our study, new representation learning models (code2vec (Alon et al., 2019) and CC2Vec (Hoang et al., 2020)) dedicated to code representation have not been exploited.

In the literature, representation learning techniques have been widely explored to boost program repair tasks. Long and Rinard explored the topic of learning correct code for patch generation (Long and Rinard, 2016). Their approach learns code transformation for three kinds of bugs from their related human-written patches. After mining the most recent 100 bug-fixing commits from each of the 500 most popular Java projects, Soto and Le Goues (Soto and Le Goues, 2018) have built a probabilistic model to predict bug fixes for program repair. To identify stable Linux patches, Hoang et al. (Hoang et al., 2019) proposed a hierarchical deep learning-based method with features extracted from both commit messages and commit code. Liu et al. (Liu et al., 2018a) and Bader et al. (Bader et al., 2019) proposed to learn recurring fix patterns from human-written patches and suggest fixes. Our paper is not aiming at proposing a new automated patch generation approach. We indeed rather focus on assessing representation learning techniques for predicting correctness of patches generated by program repair tools.

Conclusion

In this paper, we investigated the feasibility of statically predicting patch correctness by leveraging representation learning models and supervised learning algorithms. The objective is to provide insights for the APR research community towards improving the quality of repair candidates generated by APR tools. To that end, we, first investigated the use of different distributed representation learning to capture the similarity/dissimilarity between buggy and patched code fragments. These experiments gave similarity scores that substantially differ for across embedding models such as BERT, Doc2Vec, code2vec and CC2Vec. Building on these results and in order to guide the exploitation of code embeddings in program repair pipelines, we investigated in subsequent experiments the selection of cut-off similarity scores to decide which APR-generated patches are likely incorrect. This allowed us to filter out between 31.5% and 94.9% incorrect patches based on brute cosine similarity scores. Finally, we investigated the discriminative power of the deep learned features by training machine learning classifiers to predict correct Patches. DecisionTree, Logistic Regression and Naive Bayes are tried with code embeddings from BERT, Doc2Vec and CC2Vec. Logistic Regression with BERT embeddings yielded very promising performance on patch correctness prediction with metrics like F-Measure at 0.72% and AUC at 0.8% on a labeled deduplicated dataset of 1000 patches. We further showed that the performance of these models on static features is promising when comparing against the state of the art (PATCH-SIM (Xiong et al., 2018)), which uses dynamic execution traces. Experimental results suggests that the deep learned features can be complementary to hand-crafted features (such as those engineered by ODS (Ye et al., 2019a)).

Availability. All artifacts of this study are available in the following public repository:

https://github.com/SerVal-DTF/DL4PatchCorrectness

Acknowledgements

This work is supported by the Project 1015-YAH20102, the National Natural Science Foundation of China (Grant No.61802180), the Natural Science Foundation of Jiangsu Province (Grant No.BK20180421), the National Cryptography Development Fund (Grant No.MMJJ20180105) and the Fundamental Research Funds for the Central Universities (Grant No.NE2018106).