On the Efficiency of Test Suite based Program Repair: A Systematic Assessment of 16 Automated Repair Systems for Java Programs

Kui Liu, Shangwen Wang, Anil Koyuncu, Kisub Kim, Tegawendé F. Bissyandé, Dongsun Kim, Peng Wu, Jacques Klein, Xiaoguang Mao, Yves Le Traon

cs.SE

Introduction

In the last decade, Automated Program Repair (APR) (Gazzola et al., 2019; Monperrus, 2018a; Le Goues et al., 2019) has extensively grown as a prominent research topic in the software engineering community. Figure 2 overviews the research activities of this topic. The associated literature includes a broad range of techniques that use heuristics (e.g., via random mutation operations (Le Goues et al., 2012b)), constraints solving (e.g., via symbolic execution (Nguyen et al., 2013)), or machine learning (e.g., via building a code transformation model (Gupta et al., 2017)) to drive patch generation. A living review of automated program repair research appears in (Monperrus, 2018b), which shows that the research in this field has been revived with the seminal work, ten years ago, of Weimer et al. (Weimer et al., 2009) on generate-and-validate approaches. Patches are generated to be applied on a buggy program until the patched program meets the desired behaviour. In the absence of formal specifications of the desired behaviour, test suites are leveraged as affordable partial specifications for validating generated patches. Over the years, the community has incrementally advanced the state-of-the-art with numerous test-based approaches that have been shown effective in generating valid patches for a significant fraction of defects within well-established benchmarks (Just et al., 2014; Madeiral et al., 2019; Lin et al., 2017; Saha et al., 2018).

Several studies have revisited the constraints and performance of program repair systems, and have thus contributed to shaping research directions towards improving the state-of-the-art. For example, Qi et al. (Qi et al., 2015) have early shown that repair tools generate mostly overfitting patches (i.e., patches that pass the incomplete test suites) but are actually incorrect. Their study led to assessment results being now carefully presented in a way that highlights the capability of new approaches to correctly repair programs. Motwani et al. (Motwani et al., 2018) then questioned whether state-of-the-art approaches can deal with hard and important bugs. Liu et al. (Liu et al., 2019a) recently revealed significant biases with fault localization configurations in APR system evaluations. More recently, Durieux et al. (Durieux et al., 2019) have shown that state-of-the-art tools may actually be overfitting the associated study benchmarks.

Performance measurement of repair systems has evolved to progressively consider the number of correctly-fixed bugs or the diversity of benchmark bugs (Durieux et al., 2019) that are fixed. Another performance aspect that deserves investigation is the efficiency of the patch generation system. It is however mentioned in only a few assessment reports (Xiong et al., 2017; Ghanbari et al., 2019). Yet, efficiency is a key property for bringing program repair into general use within practitioners’ settings. Indeed, APR aims to alleviate the manual effort involved in resolving software bugs, and holds this promise in two scenarios: in production, it is expected to drastically reduce the time-to-fix delays and minimize downtime; in a development cycle, APR can help suggest changes to accelerate debugging. Yet, until now, literature approaches (Jiang et al., 2018; Liu et al., 2019c; Saha et al., 2019; Liu et al., 2019c; Xiong et al., 2017) have mainly focused on highlighting the increased performance on eventually fixing more and more benchmark bugs. In recent work, Ghanbari et al. (Ghanbari et al., 2019) raised the efficiency issue and built on the time cost criterion to demonstrate the efficiency of their PraPR tool (which does not require re-compiling source code). This criterion, which was already mentioned in a few of the previous work (Xiong et al., 2017; Wen et al., 2018; Liu and Zhong, 2018), however, has limitations with respect to generalizability (cf. Section 2): execution time is (1) dependent on many variables that are unrelated to the approach implemented in the repair system; and (2) is generally unstable.

We postulate that the efficiency of test-based program repair should be assessed along with the following question: how many attempts does the repair system make before catching a valid patch? In previous work, Qi et al. (Qi et al., 2013) have formulated this question into a metric that served to assess the effectiveness of fault localization techniques in a platform-agnostic manner. To the best of our knowledge, little attention has been paid to measuring repair efficiency by estimating the number of validated patch candidates.

In this paper, we report on the results of a large scale empirical study on the efficiency of test-based program repair systems. Our study considers 16 APR systems targeting Java programs, and performs a systematic assessment under identical and controlled fault localization configurations. The objective of this work is to contribute a comprehensive analysis of repair efficiency to the literature with respect to generated patches for a large spectrum of APR systems. Eventually, we gather insights on how the strategies of approaches in the literature affect repair efficiency. Overall, we mainly find that:

So far, efficiency is not a widely-valued performance target. We found that state-of-the-art APR tools are the least efficient. This calls for an industry investigation of the impact of efficiency on adoption (or lack thereof).

Across time, repair tools subsume each other in terms of which benchmark bugs can be fixed. Unfortunately, effectiveness (i.e., how many bugs are eventually fixed) is increased at the expense of efficiency (i.e., how many repair attempts are made before a given bug is fixed).

Template-based repair systems are generally inefficient as they produce too many patch candidates. However, when the templates are mined from clean datasets or are specialized to specific bugs, efficiency can be substantially improved.

Literature approaches develop a few strategies, such as constraint solving or donor code search, which contribute to drastically reducing the nonsensical or in-plausible patches.

APR systems that implement random search over the repair search space require large sets of patch candidates to increase the likelihood of hitting a correct patch.

Implementation details can diversely influence the repair efficiency of an APR approach.

Background and Motivation

Test suite based program repair systems commonly implement a three-step pipeline as illustrated in Figure 2: fault localization, which produces a ranked list of suspicious code locations that should be modified to fix the bug; patch generation, which implements the change operators that are applied on the buggy code locations; and patch validation, which executes the test cases to check that the patched program meets the behaviour (approximatively) specified by the test suite.

If a patch candidate can pass all the given test cases (both previously-passing and previously-failing test cases on the buggy version), it is regarded as a valid patch. This criterion was first used by Weimer et al. (Weimer et al., 2009) in their seminal work on GenProg, and has become the de-facto metric of repair performance (Le Goues et al., 2019). Nevertheless, as later studies have revealed, even if a generated patch can pass all test cases, it might break a necessary behaviour or introduce other faults, which are not covered by the given test suite (Smith et al., 2015). Besides, a developer may not accept the patch due to several reasons such as coding convention (Kim et al., 2013; Monperrus, 2014). All such valid patches in terms of the test suite are therefore now referred to as plausible since they require further investigations to ensure that they are correct, i.e., acceptable to developers. In the literature, correctness is generally assessed manually by comparing the APR-generated patch against the developer-provided patch available in the benchmark.

Studies in the literature, such as the recent work of Durieux et al. (Durieux et al., 2019) on benchmark overfitting, generally focus on information about plausible patches given that correctness is hard to assess. Our work is the first to explore artifacts from the literature, where researchers provide correctness labels of their generated patches, in order to extract and categorize implicit rules used by the community to define correctness. We expect that these rules will be studied and augmented by the community to enable systematic assessment of correctness. Efficiency of APR tools has been assessed in the literature (Xiong et al., 2017; Wen et al., 2018; Hua et al., 2018; Ghanbari et al., 2019) via measuring the time-to-generate-and-validate patches. Table 1 presents the time cost of the PraPR (Ghanbari et al., 2019) state-of-the-art repair tool on Defects4J (Just et al., 2014) program samples. On average, for each Closure bug, PraPR generated and validated more than 29 thousand patches, approximately 10 times more than the average number of patches that are generated and validated for each Chart bug. Yet, the time cost for Closure bugs is 20 times more than the time cost for Chart bugs. This suggests that it is challenging to define a generically-suitable time budget for repairing bugs. We further note that correlation tests did not reveal any linear correlation between the time cost of repairing a bug and benchmark properties such as the number of test cases or program sizes. Consequently, time cost may not be a reliable metric for efficiency.

To further highlight the biases that execution time may carry, we refer to literature settings of time budgets for running APR systems: ACS (Xiong et al., 2017) and SimFix (Jiang et al., 2018) are evaluated with repair time budgets of 30 minutes and 5 hours, respectively. Furthermore, in (Jiang et al., 2018), assessment comparison between ACS and SimFix does not consider the bias related to the difference between the execution platforms. A comparison of performance (in terms of how many bugs each tool can fix) may, therefore, be misleading: a given bug may have been fixed by one tool because the time budget is sufficient while it cannot be fixed by the other due to lack of time.

With two simple experimental runs of compiling and testing Defects4J samples, we confirm our concerns: time budgets could introduce biases for different bugs. Indeed, as revealed in Figure 3, different machine configurations may lead to drastically divergent compiling and testing time: irrespectively of projects. The Mann–Whitney–Wilcoxon tests (Mann and Whitney, 1947; Wilcoxon, 1945) confirm that the first machine consumes statistically significantly more CPU time than the second machine either for compilation or for testing Defects4J buggy programs. These results definitively suggest that time cost is not a reliable metric to enable reproducible and comparable experiments on the efficiency of program repair.

Instead, we propose to rely on the metric of number of generated patch candidates, which should be intrinsic to the approach and agnostic of machine configuration variabilities.

Study Design

This section presents the design details of this empirical study.

Overall, our investigation into the efficiency of test-based APR systems seeks answers for the following research questions (RQs):

RQ1. Repairability across time: We first revisit the classical performance criterion of APR systems, which is about the repairability (i.e., effectiveness): how many bugs can be fixed by test suite based repair approaches? Our investigation goes beyond previous studies in the literature by (i) systematically assessing a large range of repair systems under the same configurations (see Section 3.3.2); and (ii) exploring not only plausibility but also the correctness of patches (see Section 3.3.3). Eventually, we investigate the evolution across time of effectiveness to better discuss the need for revisiting efficiency as an important complementary performance criterion.

RQ2. Patch generation efficiency: Based on the experimental outputs of benchmarking repair systems in RQ1, we can investigate the efficiency of test-based repair: how many patch candidates are generated and checked before fixing a given bug? Although program repair is often regarded as a background/offline task, efficiency remains critical since resource budgets are limited. Therefore, efficiency may have adverse effects on the adoption of the repair system and even on its effectiveness. In this RQ, we extensively review two cases of invalid patches whose generation may undermine efficiency: nonsensical and in-plausible patches (see Section 3.5).

RQ3. Fault Localization noise impact on efficiency: Finally, given that fault localization is known to provide noisy inputs to repair, we investigate its impact on efficiency to highlight repair directions for mitigations. Mainly, we question whether some repair strategies are more or less resilient to repair attempts on wrong code locations. Our study differs from recent work (Liu et al., 2019a) in the literature, which explores the bias of fault localization on repairability with only one repair system.

2. Subject Selection

Our study focuses on APR systems targeting Java programs. Java is indeed today the most targeted language in the community of program repair. Furthermore, a well-formed dataset of real-world Java program bugs is available, with the necessary tool support to readily compile and execute programs. Although we initially planned to consider all repair approaches proposed in the last decade, we were limited by the fact that many APR tools are not open-source or even publicly available.

In the end, APR systems considered for our study are systematically selected based on the following criteria:

Availability: our study involves the execution of APR tools, thus APR approaches without publicly available tools are excluded.

Executability: some APR approaches provide publicly available tools, which however cannot be executed as-is for diverse issues (e.g., ssFix (Xin and Reiss, 2017) failed to execute because of an online connection to a private search engine fails). We exclude such approaches from the study.

Configurability: to limit biases, we need to configure the different tools to use the same input information (e.g., fault localization details). We, therefore, exclude APR approaches whose tools cannot be readily configured. For example, HDRepair (Le et al., 2016a) implementation is tied to an assumption that exact information on the faulty method is first available.

Standalone: finally, our selection ensures that we focus on APR approaches where the tools can be run if provided with Java program source code and the available test suite. Therefore, any tool that would require extra data is excluded (e.g., LSRepair (Liu et al., 2018b) requires run-time code search over Github repositories).

We consider two sources of information to identify Java APR tools: the community-led program-repair.org website and the living review of APR by Monperrus (Monperrus, 2018b). As of July 2019, 31 APR tools were targeting Java programs listed in the literature. After systematically examining these tools, 16 are found to satisfy our criteria and are therefore finally selected. Table 2 enumerates all Java-based APR tools and provides arguments for rejection/consideration. We categorize them into three main categories: heuristic-based (Le Goues et al., 2019), constraint-based (Le Goues et al., 2019), and template-based (Kim et al., 2013) repair approaches.

These approaches construct and iterate over a search space of syntactic program modifications (Le Goues et al., 2019). Associated tools include jGenProg (Martinez and Monperrus, 2016), GenProg-A (Yuan and Banzhaf, 2018), ARJA (Yuan and Banzhaf, 2018), RSRepair-A (Yuan and Banzhaf, 2018), SimFix (Jiang et al., 2018), jKali (Martinez and Monperrus, 2016), Kali-A (Yuan and Banzhaf, 2018), and jMutRepair (Martinez and Monperrus, 2016). jGenProg and GenProg-A are Java implementations of GenProg (Weimer et al., 2009), which generates patches by searching donor code from existing code with the genetic programming method. ARJA is also a genetic programming approach to optimizing the exploration of the search space by combining three different approaches. RSRepair-A is a Java implementation of RSRepair (Qi et al., 2014), a Random-Search-based Repair tool, which tries to repair faulty programs with the same mutation operations as GenProg but uses random search, rather than genetic programming, to guide the patch generation process. SimFix utilizes code change operations from existing patches and similar code to build two search spaces, of which intersection is further used to search fix ingredients for repairing bugs. jKali and Kali-A are Java implementations of Kali (Qi et al., 2015) that fixes bugs with three actions: removal of statements, modification of if conditions to true/false, and insertion of return statements. jMutRepair implements the mutation-based repair approach (Debroy and Wong, 2010) for Java programs, with three kinds of mutation operators (i.e., relational, logical and unary) to fix buggy if-condition statements.

These approaches generally focus on fixing a single conditional expression that is more prone to defects than other types of program elements. Nopol (Xuan et al., 2017), DynaMoth (Durieux and Monperrus, 2016) ACS (Xiong et al., 2017), and Cardumen (Martinez and Monperrus, 2018) are dedicated to repairing buggy if conditions and to adding missing if preconditions. Nopol relies on an SMT solver to solve the condition synthesis problem. DynaMoth leverages the runtime context, which is a collection of variable and method calls, to synthesize conditional expressions. ACS is proposed to refine the ranking of ingredients for condition synthesis. Cardumen repairs bugs by synthesizing patch candidates at the level of expressions with its mined templates from the program under repair to replace the buggy expression.

These approaches are also often referred to as pattern-based and include kPAR (Liu et al., 2019a), AVATAR (Liu et al., 2019b), FixMiner (Koyuncu et al., 2020) and TBar (Liu et al., 2019c). kPAR is the Java implementation of PAR (Kim et al., 2013) that repairs bugs with fix patterns manually summarized from human-written patches. FixMiner automatically mines fix patterns from the code repository for patch generation. AVATAR relies on the fix patterns for static analysis violations. TBar combines diverse fix patterns collected from the literature.

Note that, technically, template-based repair approaches can be viewed as heuristics-based approaches. In this study, however, we separate them in their category to highlight their specificity. Finally, there exist some repair approaches that are enhanced by machine learning techniques. Le Goues et al. (Le Goues et al., 2019) refer to them as learning-based repair approaches. One example of such approaches is the Prophet tool by Long and Rinard (Long and Rinard, 2016b): it learns from a corpus of code a model of correct code, which indicates how likely a given piece of code is w.r.t. the code corpus. Our criteria of subject selection however excluded all learning-based repair as they are generally not “standalone”.

3. Experiment Settings

We now overview the inputs (i.e., buggy programs and fault localization information) and the validation process used in our study.

The APR literature includes several benchmarks (Saha et al., 2018; Just et al., 2014; Madeiral et al., 2019; Kim et al., 2013). In recent work, Durieux et al. showed that APR system may overfit the study benchmarks in terms of repairability. Since our objective is on efficiency, we focus on a single commonly used benchmark in the literature. We consider Defects4J (Just et al., 2014) as it has been widely employed to assess approaches (Le et al., 2016a; Liu et al., 2018b; Jiang et al., 2018; Wen et al., 2018), or to conduct various APR studies (Long and Rinard, 2016a; Wen et al., 2019; Wang et al., 2019b; Sobreira et al., 2018), as well as other software engineering research (Le et al., 2016b; Qi et al., 2013; Allamanis et al., 2018; Pearson et al., 2017). Defects4J consists of 395 bugs across six Java open source projects. Its dissection information (Sobreira et al., 2018) shows that the dataset contains a diversity of bug types. Our experiments thus consist of running each selected APR tool to generate patches in an attempt to fix each Defects4J bug. Overall, our experiments led to 347,603 repair attempts (each attempt requiring program compilation and testing against the test suite).

3.2. Fault Localization

As reported by Liu et al. (Liu et al., 2019a), repair performance of APR tools could be biased by fault localization settings. To minimize such potential bias, we take on the challenge and implementation effort to re-configure all APR tools so that they are using the same fault localization information for each Defects4J bug. In our experiments, we employ the latest release of GZoltar v1.7.2, an on-hand test automation framework. Note that early versions of this tool were widely used in the APR community (Martinez and Monperrus, 2016; Xiong et al., 2017; Jiang et al., 2018; Wen et al., 2018). However, Liu et al. revealed that the new version yields better results in the context of program repair (Liu et al., 2019a). For sorting suspicious statements, we use the Ochiai(Abreu et al., 2007) ranking metric. Eventually, APR tools are fed with a ranked list of suspicious source code statements that should be changed within the buggy program to repair it.

3.3. Patch Validation

Patch validation is performed by APR systems based on the execution outcome of regression and bug-triggering test cases, i.e., test cases that are passed by the buggy program and those that, because they are not passed, reveal the existence of a bug. If a patch candidate can make the revised buggy program pass the entire test suite successfully, it is considered as a valid patch. Such a patch, however, could be incorrect if it is just overfitting the test suite (Qi et al., 2015; Xiong et al., 2018). Thus, the community has adopted the terminology of plausible (Qi et al., 2015) patches to refer to patches that pass all test cases.

In recent literature, following the criticism on overfitting, researchers are shifting towards investigating correctness (Xiong et al., 2018; Le et al., 2019). So far, this has been a manual effort based on a recurrent criterion: a plausible patch is considered as correct when it is semantically similar to the developer’s patch in the benchmark. Unfortunately, the scope of semantics for APR is not explicitly defined as it is subjective.

We propose in this work to provide a first attempt of explicitly determining semantic similarity among patches. Our objective is to reduce the threat of subjectivity and enable reproducible experiments. To that end, we call on the community and consider labels of patches within APR research artifacts. We manually revisit patches that are generated by APR tools and which researchers have considered as correct in the literature. The objective is to unveil the implicit rules that researchers use to make the decisions on correctness. We find that there are broadly two scenarios when comparing a generated patch against the developer-provided patch:

Identical patches: in this case, the two patches are exactly identical, excluding variations in whitespace, layout, and comments.

Semantically-similar patches: in this case, the patches are not identical, although developers regard that they have the same effect on the program behavior. In Table 3 we summarize a taxonomy of correctness decision based on our study of patches labeled as correct by the research community. This taxonomy is based on the patches generated by ACS, SimFix, AVATAR, FixMiner, kPAR, and TBar whose authors investigated correctness and provided their manually labeled patches as research artifacts.

We applied these rules to determine whether a plausible patch is a correct one when it is syntactically different from the patch that a developer wrote. In the second column, “tool_name $\rightarrow$ bugID” denotes that the patch generated by the tool is identified as correct. The patches in the grey background are generated by APR tools while the patches in the white background are patches written by the developers.

In the remainder of this paper, for the experiments with the 16 APR tools, we will systematically build on the rules of Table 3We enumerated only 10 rules in this paper due to space limitation. Please visit https://github.com/SerVal-DTF/APR-Efficiency for more rules and detailed descriptions. to label plausible patches as correct. Thus, unless a generated patch is identical to the developer patch, it must fall under rules R1-10 to be labeled as correct. Our rules are certainly not exhaustive neither for defining semantic similarity nor for defining patch correctness. We call on a community effort to augment these rules to enable reproducible research.

Due to space constraints, we only detail here a single rule. Consider rule R5: In the illustration example, the developer patch ensures that boundaries are checked by calling a function that implements it. In contrast, a patch generated by ACS (Xiong et al., 2017) directly inserts the necessary code to check the boundary. Both patches, which are not syntactically identical, are semantically similar.

In the end, plausible and correct patches have the following relationship: Let $P$ and $C$ be sets of plausible and correct patches, respectively. It always holds $C\subseteq P$ . We compute $\frac{|C|}{|P|}$ as the Correctness Ratio (CR) of generated plausible patches that are correct.

3.4. Halting Threshold

In the APR community, it is commonly accepted that patch generation processes are halted if a system runs out of the time budget before being able to find a valid patch. As discussed in Section 2, time can be a biased metric. Therefore, in this study, we propose to halt the repair systems by setting a threshold of repair attempts for a given bug. We set the threshold of attempts as 10,000. This number is selected based on the reported average number (9,696.5) of patch candidates generated by PraPR (Ghanbari et al., 2019) for its fixed bugs. Given that PraPR works at the mutation level and does not require re-compilation, the number of attempts could be higher than that of other tools and it is high enough for the 16 APR tools employed in this study.

4. Terminology

Given that correct patches are first and foremost plausible patches, we propose in this work to use the term valid patches when referring to all plausible patches (including correct ones). Unless otherwise specified, we will also refer to as plausible all valid patches that have not yet been manually assessed as correct. We consciously avoid the term incorrect since the definition of correctness in Section 3.3.3 is sound, to some extentThe developer-patch provided in the benchmark, which we use as ground truth, may be erroneous as well., but is not complete (i.e, there are some cases of semantic similarity that are missed).

5. Efficiency Metric: NPC

As motivated in Section 2, we employ as efficiency metric in this study the number of patch candidates (NPC) generated by APR tools until the first plausible patch is found. This metric was initially proposed by Qi et al. (Qi et al., 2013) as a proxy to measure the performance of fault localization techniques based on program repair tools. JAID (Chen et al., 2017) and PraPR (Ghanbari et al., 2019) recently used them to highlight the performance of their approaches. Nevertheless, efficiency has not been systematically assessed before. In this study, we further differentiate generated patches that turn out to be invalid into two groups:

Nonsensical patch: Such a patch cannot even make the patched buggy program successfully compile (Kim et al., 2013; Monperrus, 2014).

In-plausible patch: Such a patch lets the patched buggy program successfully compile, but fails to pass some test cases in the available test suite.

Our efficiency metric is then computed by summing the number of patches in each category:

In practice, $NPC_{valid}==1$ since the generation of patches is halted as soon as the first valid patch is found. In this study, since we aim to investigate the repair efficiency, we focus on bugs for which the repair attempts were successfully concluded. Thus, our experimental data do not mention the cases where many patch candidates are generated but none of them was valid. We leave this investigation as a future study.

Study Results

We now provide experimental data as well as the key insights that are relevant to our research questions.

Table 4 provides execution outcomes of 16 repair tools on the Defects4J benchmark. We count the number of bugs that are plausibly fixed by each tool implementation, and further provide the number of plausible patches that can be considered as correct following the rules of patch validation (cf. Section 3.3.3).

$\bullet$ [Template-based repair tools are the most effective.] We observe that kPAR, FixMiner, AVATAR and TBar, which are template-based repair tools, present better repair performance than other tools in terms of the number of fixed bugs. The state-of-the-art, SimFix, also performs among the top. Note that, although it is classified as heuristics-based, and does not use templates explicitly, it performs transformations based on similar changes, and thus has been presented in previous studies (Liu et al., 2019c) as template-based.

$\bullet$ [Patch ordering strategies are necessary to increase the likelihood of hitting correct patches.] Among the 16 repair tools, ACS exhibits the highest ratio of plausible patches that are found to be correct. This experimental finding confirms the strategy used by the authors to increase “precision”Precision is the terminology employed by its authors to refer to the ratio of correct patches to plausible patches. in patch generation: these are dependency-based ordering, document analysis, and predicate mining.

$\bullet$ [Through time, repair tools tend to subsume their predecessors in terms of which bugs are fixed.] Table 5 provides statistics on the percentage of fixed bugs that are overlapping between two repair tools. In this table, the tools in column headers and row headers are ordered chronologically concerning the date of approach publication. Note that jGenProg ranked based on the GenProg publication year although the tool itself was implemented years later. We note that the upper-right side of the table is relatively darker than the rest: the percentages of overlapping are higher for these cells. These results suggest that, overall, the bugs that are fixed by earlier tools are also generally covered by more recent tools. Besides, evolution trends presented in Figure 4 show that, although the number of bugs that are fixed by the different tools over the years is increasing, the number of new bugs is increasing with small increments. This result suggests that the strategies implemented in new approaches tend to have similar outcomes as merging past techniques to cover previous bug sets that were fixed each via different approaches.

$\bullet$ [Recent APR tools tend to correctly fix more bugs than their predecessors.] In the right part of Figure 4, a visible breakthrough is the sharp increase of the light grey area indicating that recent tools increasingly correctly fix bugs which have not been fixed by previous tools. We further summarize in Figure 5 the number of bugs that each tool can correctly fix exclusively or not. SimFix, ACS, AVATAR, and TBar are leading repair tools that generate correct fixes for more bugs. In contrast, jGenProg, GenProg-A, jMutRepair, RSRepair-A, jKali, Kali-A, DynaMoth, and Cardumen do not correctly fix any Defects4J bug that is not also correctly fixed by another tool.

$\bullet$ [Implementation details can make a difference.] Finally, we observe that Java-targeted implementations of GenProg (i.e, jGenProg and GenProg-A) and Kali (i.e., jKali and Kali-A) by different research groups yield diverging repair performance on the same benchmark.

2. RQ2: Patch Generation Efficiency

Following our motivation argument in Section 2, we use the $NPC$ scores (i.e., number of generated patch candidates that are checked until a valid patch is found) to measure repair efficiency of APR tools. For each tool, the results focus on Defects4J bugs that are fixed (i.e., a valid patch was eventually found). Indeed, through efficiency, we attempt to measure the ability of the APR tool to avoid wasting computing resource, time and energy in patch validation towards generating a valid patch.

Figure 6 overviews the general distributions of $NPC$ scores of the 16 repair tools on the Defects4J benchmark. For all tools, the median $NPC$ is lower than 250 patch candidates. However, the distribution spread among bugs is not only significant for several (8 out of 16) tools but also varies across tools.

$\bullet$ [Efficiency is not yet a widely-valued performance target.] SimFix, TBar and kPAR exhibit the highest $NPC$ scores which can go beyond 1,000 patch candidates for some bugs. Correlating this data with repairability findings (Section 4.1), we note that tools with highest repairability scores also have the highest $NPC$ scores (hence, lower efficiency). In particular, we note that APR approaches, which rely on change patterns (i.e., standard template-based tools) or heuristically search for donor code based on code similarity (e.g., SimFix), produce the largest number of patch candidates. They are effective since they end-up finding a valid patch, but they are not efficient as they generate too many patches (comparing against other approaches) for repair attempts. On the other hand, constraint-based APR tools (e.g., ACS) have the lowest $NPC$ scores. There is, therefore, an insight that constraint-solving and synthesis strategies, although they might require more computing effort to traverse the search space, eventually yield patches which waste less resource during test-based validation.

$\bullet$ [The state-of-the-art can avoid generating nonsensical patches.] Figure 7 illustrates the contribution of nonsensical and in-plausible patches to the $NPC$ scores. The distributions of nonsensical patches are interesting with respect to different claims in the literature. Indeed, to motivate their seminal work on template-based program repair, Kim et al. (Kim et al., 2013), authors of the PAR tool, stated that pioneer genetic programming based repair tools had the limitation that it could generate nonsensical patches. Our empirical assessment results back up this claim. However, our results also reveal that template-based repair tools (e.g., kPAR and TBar) have not fulfilled the claimed promise since they produce the largest numbers of nonsensical patches. This finding calls for a triaging strategy targeting nonsensical patches within the search space. In this regard, our experimental results highlight three tools (i.e., DynaMoth, Nopol, and SimFix), which do not generate any nonsensical patches.

Nopol uses an SMT solver to address the condition patch synthesis problem. DynaMoth leverages the runtime context, collects variable and method calls to synthesize conditional expression patches. SimFix heuristically searches similar code from the intersection of two search spaces: one is for donor code and the other one is for code change actions, to generate patches. A noteworthy result is that, while Nopol and DynaMoth overall generate few candidates, SimFix generates the largest number of patch candidates, none of which is ever found nonsensical. This finding suggests that code similarity has a large influence and can be useful for effectively triaging the repair search space.

Besides Nopol, Dynamoth, and SimFix, five repair tools (i.e, jMutRepair, jKali, Kali-A, Cardumen and ARJA) generate significantly more in-plausible patches than nonsensical ones. jMutRepair, jKali and Kali-A are implemented with simple mutation operators that are unlikely to prevent the programs from compiling. However, these mutation operations can lead to test failures. ARJA’s efficiency w.r.t. nonsensical patch generation is likely due to the combination of different search strategies that drive its genetic programming.

$\bullet$ [The more templates an APR system considers, the more nonsensical and in-plausible patches it will generate.] TBar contains more fix templates than kPAR, FixMiner and AVATAR since it merges all literature templates. Therefore, each suspicious buggy location has a higher probability in TBar to be matched with more templates, leading to more patch candidates than other tools. This finding highlights the importance of strategies for fix template matching and donor code searching to improve the repair efficiency of template-based repair tools.

$\bullet$ [Specialized templates increase the efficiency of APR tools.] Among the template-based repair tools, kPAR has the smallest number of templates. Indeed it includes 10 templates manually prepared by Kim et al. (Kim et al., 2013), while AVATAR includes 11, TBar integrates 35 and FixMiner considers 28. Nevertheless, experimental results for NPC scores (cf. Figure 6) and the dissection in non-sensical and in-plausible categories (cf. Figure 7) reveal that kPAR is the least efficient. According to the authors’ source code of the tools, these tools use the same search space traversal strategy and implementation. Therefore, the only difference being about the included templates, we can safely conclude that the nature of these templates is driving the efficiency performance. AVATAR indeed focuses on templates obtained by curated datasets of fixes: all mined code changes are for static analysis violations which are systematically validated as actual fixes. FixMiner, on the other hand, augments its templates with relevant contextual information to ensure that they are applied on code locations that are syntactically similar to the locations where the templates where mined.

$\bullet$ [Correct patches are sparse in the search space.] Long et al. (Long and Rinard, 2016a) presented an initial study which revealed that correct patches can be considered as sparse in the search space and that overfitting patches (Qi et al., 2015; Xiong et al., 2018; Le et al., 2018, 2019) (i.e., only plausible but not correct) are vastly more abundant. We extend their study to consider the cases of in-plausible patches that are produced ”before any plausible patch” (i.e., including if it is correct) vs. ”before a correct patch” (i.e., only if the plausible is correct). Figure 8 illustrates the distributions of $NPC_{in-plausible}$ scores for all fixed bugs and only correctly-fixed ones. We observe that for tools such as TBar, AVATAR, FixMiner, and kPAR, the median of $NPC_{in-plausible}$ scores for correctly-fixed bugs is lower than the median for all fixed bugs. This means that, when a correct patch can be found, the number of in-plausible patches that are generated before is fewer than when only a plausible patch can be found. The situation is the converse for SimFix and ARJA. Therefore, we note that for most tools, a correct patch is more efficiently found when the search space is less noised (i.e., fewer in-plausible patches).

Table 6 provides more detailed statistics to drive an in-depth correlation study around efficiency and correctness. Based on the mean values, except for ACS, ARJA, and AVATAR, APR tools tend to generate more patch candidates when considering all bugs than when considering only the correctly-fixed ones. This tendency is much more apparent for search-based APR techniques such as jGenProg (Martinez and Monperrus, 2016), GenProg-A (Yuan and Banzhaf, 2018), SimFix (Jiang et al., 2018), and RSRepair-A (Yuan and Banzhaf, 2018). Although TBar is a template-based approach, it has characteristics of search-based tools since its search-space has been enlarged by incorporating any fix templates in the literature.

The previous experimental data overall suggest that simply giving more time to the APR tool to repair a buggy program does not guarantee to find correct patches. On the contrary, it seems that when allowing less attempts, the correctness ratio is improved. We propose to simulate a simple strategy of threshold setting to investigate the impact on the correctness ratio (i.e., ratio of correctly-fixed bugs to plausibly-fixed bugs). We consider a scenario where the APR tool is halted when a certain number of in-plausible patches is checked.

Table 7 presents the results on how correctness ratio is influenced when we set a threshold on the number of in-plausible patches: basically, we propose to stop the repair attempts by a given tool if a certain number of generated patches turned out to be in-plausible (i.e., do not pass the test cases). We observe that the ratio of generated plausible patches to be correct is increased at varying degrees for 14 (out of 16) repair tools. Nopol and ACS do not show any improvement: initially, they produce few in-plausible patches. It should be noted that this result should be put in perspective as when discussing precision and recall: threshold setting, while useful to increase correctness ratio, may also lead to an overall reduction of the number of bugs that are correctly fixed.

3. RQ3: Impact of Fault Localization Noise

A recent study by Liu et al. (Liu et al., 2019a) has reported empirical results suggesting that fault localization results can adversely affect the performance of the repair. The authors experimented on a single tool, kPAR, and focused on repairability (i.e., how many bugs are not fixed due to localization errors). Our study already takes steps to avoid the bias of presenting various experimental results with APR tools which use different fault localization inputs. Thus, we have put an effort to harmonize all fault localization configurations for the 16 APR tools under study (cf. Section 3.3.2).

To evaluate the impact of fault localization noise for different tools, we propose to compare results obtained so far with our standard spectrum-based fault localization (GZoltar+Ochiai) against experimental results where the APR systems are directly given the ground-truth fix locations. We compare the results both in terms of repairability and repair efficiency.

First, we measure the impact on repairability, where we estimate for each repair tool how many bugs can be fixed by each APR system if it is precisely pointed to the ground-truth fix locations? Table 8 illustrates the details on the impact of repairability. Except for Cardumen, we observe that in general the correctness ratio improves (by up to 30 percentage points) if the fix locations are provided. It suggests that false-positive bug locations, hence fault localization noise, has an impact on the likelihood to generate correct patches. There are however anecdotical cases that are noteworthy:

$\bullet$ [Ground truth incompleteness.] Although our configuration of fault localization did not yield the developer-provided fix position for bug Lang-35, ACS patch generation eventually produced a correct patch for this bug. This patch, which targets a different code location, was found semantically-similar to the developer-provided patch following rule R2 (cf. Section 3.3.3). This finding reminds us that the benchmark that is used is not a complete ground-truth, neither for repair-oriented fault localization nor for patch generation.

$\bullet$ [Fix location is different from bug location.] We observe that jKali now fails to correctly fix respectively 2 when it is given the developer-provided fix locations. This finding suggests that the repair tool is rather misled, in the cases of specific bugs, when it is given the right bug positions. Instead, some sibling positions are better inputs to drive correct fixing. However, data in Table 8 show fault localization has different impacts on performance for plausible fixing than for correct fixing.

Furthermore, based on results of overlapping in repairability (in terms of plausible patches) performance as depicted in Figure 9, we note that many bugs are only fixed (plausibly) when the fault localization does not precisely point to the fix locations. This is a surprising but interesting finding to be investigated by APR-targeted fault localization research.

$\bullet$ [Mockito bugs are not repairable.] Another immediate observation that we make from the experimental results in Table 8 is that bugs from the Mockito project are not easy to fix. According to reported results in Table 8, only three tools (i.e., FixMiner, AVATAR, and TBar) are able to fix Mockito bugs even if ground-truth fix locations are provided. We carefully proceed to investigate the possible reasons for this situation: 13 Mockito bugs (i.e., bug IDs 1-10 and 18-20) are associated to program code that cannot be compiled under JDK 7 (which is the JDK that is mentioned in the requirements of Defects4J). Our results further confirm a recent study (Wang et al., 2019b) by Wang et al., who reported that the state-of-the-art SimFix and CapGen are not able to fix any Mockito bugs even when provided with ground-truth fix locations. Our study enlarges the scope of their studies. In the end, our systematic assessment results for all bugs better sheds light on a common phenomenon in the literature where Mockito project bugs are not considered when reporting repair performance. These results call for modular configuration of execution environment as well as for better integration of advances in fault localization to support APR systems. Besides Mockito bugs, many bugs in other projects cannot be fixed since they are not precisely localized. Overall, consider again Figure 9. For all tools (except jMutRepair), we observe that some bugs are fixed only when the actual fix locations are directly given to the system.

3.2. Impact of fault localization noise on repair efficiency

We investigate the $NPC$ scores, i.e., the number of patch candidates that are generated by the different APR systems when they are pointed to the developer-provided fix locations. Figure 10 shows the corresponding distribution of $NPC$ scores for each repair tool.

$\bullet$ [Template-based program repair tools are highly sensitive to fault localization noise.] We observe from Figure 10 that, except for DynaMoth, Nopol, and ACS, the remaining 13 repair tools have significantly smaller distribution ranges of $NPC$ scores than the distribution ranges when the APR system was run under our generic fault localization configuration (cf. Figure 6). A straightforward explanation is that, under a typical fault localization configuration, a repair tool will attempt to generate patch candidates for each suspicious statement that is ranked by the fault localization. When the fault localization is noisy (i.e., top suspicious statement(s) are false positives), more in-plausible and even non-sensical patches might be generated. In particular, for repair tools that are based on pattern matching and code similarity (i.e., SimFix, and the template-based repair tools) the gap of repair efficiency has reduced substantially by an order of magnitude when correct fix locations are given to the tool. For example, the median $NPC$ score of SimFix is around 200 when using our generic configuration of fault localization, but is around 20 when using directly correct fix locations. Such tools are thus more sensitive to fault localization noise than other tools. In conclusion, we confirm the finding of the study of Liu et al. (Liu et al., 2019a). However, we delimitate its validity to template-based repair tools. Other tools, e.g., constraint-based repair tools such as ACS or Nopol, which use specific techniques to triage the search space do not present any increase in repair efficiency when pointed to the fix locations. This finding suggests that they have limited sensitivity to fault localization noise.

Threats to Validity

External validity. Our study considers only the Defects4J benchmark and only java repair tools. All findings might thus be valid only for this configuration. Nevertheless, this threat is mitigated by the fact that we use a large set of repair tools and a renowned defect benchmark to study a performance criterion that was largely ignored in the literature.

Internal validity. Our implementation of fault localization as well as the manual assessment of patch correctness may threaten the validity of some of our conclusions. We mitigate this threat by reusing common fault localization components from the repair literature as well as by enumerating and sharing the rules for defining patch correctness. Two authors were in charge of assessing the correctness and they cross-reviewed each other’s decisions. In case of conflict other authors were called to create a consensus.

Construct validity. By construct, to limit resource exhaustion, we added a threshold on the number of patches to validate. However, this threshold may penalize some tools. We mitigate this threat by carefully selecting a threshold based on empirical results on PraPR, a recent related work which mutates directly bytecode, allowing it to generate many more patches (since no compilation is needed).

Related Work

Initially, evaluation of test-based program repair was focused on counting the number of bugs fixed by a repair tool out of all bugs in a benchmark (Weimer et al., 2009; Le Goues et al., 2012b; Kim et al., 2013; Le et al., 2016a). However, valid patches are sometimes incorrect as they overfit on incomplete test suites (Qi et al., 2015), and might cause issues during maintainance (Fry et al., 2012; Smith et al., 2015). Thus, plausibility and correctness became widely accepted to define metrics for assessing repairability of repair tools (Xiong et al., 2017; Wen et al., 2018; Chen et al., 2017; Hua et al., 2018; Saha et al., 2017, 2019; Ghanbari et al., 2019; Liu et al., 2018b, 2019c, 2019b; Liu et al., 2019a; koyuncu2018fixminer). In this study, we also follow the metric to revisit the repairability of repair tools. Nevertheless, we differ from studies in the literature by ensuring that APR tools use the same controlled configuration for fault localization.

Along with the performance evaluation, serval studies simply reported the repair efficiency in terms of CPU time consumption of fixing bugs (Weimer et al., 2009; Xiong et al., 2017; Wen et al., 2018; Hua et al., 2018; Ghanbari et al., 2019). However, it could be biased to assess the efficiency with time cost for various reasons (cf. Section 2). Instead, we leverage the number of patch candidates generated by repair tools to measure the repair efficiency, which should be intrinsic to the repair approaches. Ghanbari et al. (Ghanbari et al., 2019) provided information on the number of patch candidates generated by PraPR. This information, however, could not be put into perspective against other tools. Our study fills this gap.

To boost the development of program repair, various empirical studies have been conducted in this direction. Le Goues et al. (Le Goues et al., 2012a) re-assessed GenProg on real bugs, while several studies on overfitting followed (Qi et al., 2013, 2015; Le et al., 2018; Xiong et al., 2018; Le et al., 2019; Wang et al., 2019a). Yang et al. (Yang et al., 2017) explored better test cases for better program repair. Yi et al. (Yi et al., 2018) empirically investigated the effectiveness of test-suite metrics in controlling the repairing reliability of GenProg. Motwani et al. (Motwani et al., 2018) investigated to what extent important bugs can be fixed by 9 APR tools. Liu et al. (Liu et al., 2019a) investigated the FL bias in benchmarking APR tools with only one APR tool. Durieux et al. (Durieux et al., 2019) conducted a large-scale empirical study for Java APR tools to investigate their repairability on different benchmarks. Empirical studies for APR tools have been studied from different scenarios in the literature, but these studies mainly focus on the traditional APR tools and the latest state-of-the-art tools (e.g., ACS (Xiong et al., 2017), SimFix (Jiang et al., 2018) and TBar (Liu et al., 2019c)) have not been studied systematically. Our study fills this gap by looking back at 10 years of test-based program repair research and focusing on the under-valued performance criterion that is efficiency.

Conclusion

This paper reports on a large-scale study on the efficiency of test suite based program repair. Efficiency is defined based on the number of patch candidates that are generated before a repair system can hit a valid patch. Our study comprehensively runs 16 repair systems from the literature under identical configuration of fault localization. Our experiments explore repairability (i.e., repair effectiveness), repair efficiency as well as the impact of fault localization on both performance criteria. Beyond the statistical data, we call on the community to invest in strategies for making repair efficient in order to facilitate adoption in a software industry where computing resources are managed sometimes with parsimony.

Artefacts: All data and tool support for replication are available at https://github.com/SerVal-DTF/APR-Efficiency.git