Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers

Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Neel Sundaresan

I Introduction

Software testing is recognized as one of the most crucial, challenging, and expensive parts of the software lifecycle. It comes as no surprise that the software testing research community has invested significant effort in designing approaches that aim at supporting or automating software testing activities. An example of this endeavor is the work targeting the automatic generation of unit tests. While these works represent a notable achievement towards the goal of automated testing, they come with several limitations, recently highlighted by studies in industrial settings.

One of the major challenges these tools aim to overcome is the generation of accurate assert statements, which are often found to be incomplete or inadequate to properly test the behavior of a software component. Generating meaningful assert statements is one of the key challenges in automatic test case generation. Assert statements represent the basic blocks of software testing, used by developers to check conditions or states in a program and reason about program correctness.

Watson et al. recently proposed ATLAS, an RNN-based approach which aims at learning from thousands of unit test methods how to generate meaningful assert statements. Inspired by this work, we improve upon it in substantial ways.

In this paper we present an approach to generate accurate assert statements based on a state-of-the-art transformer model and relying on transfer learning to achieve best-in-class performance, predicting the correct assert in 62% of the cases in the very first attempt, which represents an 80% relative improvement for top-1 accuracy over the previous work.

Transfer learning is a technique which first trains a model in an unsupervised fashion on large quantities of unlabeled data, and then finetunes it on a downstream task like classification or translation. This technique has emerged as a standard route to achieve state-of-the-art results in natural language processing (NLP) tasks, obtaining higher performance and requiring far fewer resources than training each task from scratch. The intuitive explanation for this success is that a model which learns to generate text or fill in blanks will have developed biases reflected in the data which can help it learn to perform particular language tasks faster and with higher performance.

In this paper, we extend this idea to source code, treating it as a language, for the task of automated assert statement generation. We pretrain a sequence-to-sequence transformer model on large source code and English language corpora, and finetune it on an assert statement generation task.

In our extensive empirical evaluation we assess several properties of our approach, such as intrinsic model metrics as well as extrinsic metrics related to the generated asserts. Finally, we evaluate the proposed approach in a scenario where it is used to support automated test case generation tools, such as EvoSuite, by augmenting the test cases with the assert statements generated by our approach.

To summarize, this paper provides the following contributions:

An approach that generates accurate assert statements based on a sequence-to-sequence transformer model. Our approach can predict the correct assert in 62% of the cases in the very first attempt, and reaches up to 84% correctness when allowing the model to suggest more asserts.

We empirically demonstrate the benefits of pretraining on both English and source code corpora for the downstream assert generation task, resulting in performance improvements in terms of a variety of intrinsic metrics such as BLEU score, validation loss, and syntactic correctness.

We investigate how our proposed approach can be used to augment existing test cases, such as those generated by EvoSuite, with additional assert statements that lead to test coverage improvements.

II Approach

Figure 1 provides an overview of the training pipeline we followed in building our models specialized in assert statement generation.

We begin with the state-of-the-art BART Transformer (Sec. II-A), which serves as the reference architecture for our models. We employ two pretraining stages: English Pretraining (Sec. II-B), where we perform semi-supervised pretraining on a large corpus of English text, and Code Pretraining (Sec. II-C), where the model is pretrained on Java source code. Next, we perform the finetuning on the task of generating assert statements for unit test cases (Sec. II-D), relying on a labelled dataset of test cases and methods under test. Finally, we evaluate variants of the models (Sec. II-E) obtained with different levels of pretraining, indicated with different arrows in Figure 1.

BART is a denoising autoencoder which utilizes the standard sequence-to-sequence transformer architecture, substituting ReLU activation functions with GeLU.

We select the BART model architecture because it facilitates finetuning for the downstream translation task of assert statement generation, providing a more advanced set of noising transformations, which include token masking, token deletion, infilling, and statement permutation. The model is pretrained by corrupting documents and optimizing the cross-entropy loss between the decoder’s output and the original input sequence.

We pretrain the BART large model architecture, which has 12 layers in both the encoder and the decoder. The model is trained in mixed precision, using the Adam stochastic optimization procedure with $\epsilon=10^{-6}$, $\beta_{1}=0.9$, and $\beta_{2}=0.98$; we apply an inverse square root learning rate schedule with a base learning rate of 0.0001, a warmup period of 5000 update steps, and local gradient accumulation with a frequency of 4 update steps.
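The learning rate schedule described above can be sketched as follows. This is a minimal illustration, assuming a linear warmup from zero to the base learning rate followed by a decay proportional to the inverse square root of the step count; the exact warmup shape and the function name `inverse_sqrt_lr` are assumptions, not details taken from the paper.

```python
import math


def inverse_sqrt_lr(step: int, base_lr: float = 1e-4, warmup_steps: int = 5000) -> float:
    """Learning rate at a given 1-indexed update step.

    Assumption: linear warmup from 0 to base_lr over warmup_steps,
    then decay proportional to 1 / sqrt(step).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        # Warmup phase: grow linearly until the base rate is reached.
        return base_lr * step / warmup_steps
    # Decay phase: the rate equals base_lr at the end of warmup and
    # halves every time the step count quadruples.
    return base_lr * math.sqrt(warmup_steps / step)
```

With gradient accumulation at a frequency of 4, each update step in this schedule would correspond to four forward and backward passes whose gradients are summed before the optimizer is applied.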

II-B English Pretraining

In this stage we pretrain a model in a semi-supervised fashion on a large corpus of English text, with the goal of learning semantic and statistical properties of natural language.

The pretraining is performed for 40 epochs on 160GB of English text extracted from books, Wikipedia, and news articles, comprising a total of X lines of text.

II-B2 Training Strategy

BART is trained in an unsupervised manner. Given corrupted text, its objective is to reconstruct the original text. The particular type of noise used in this work involves masking 30% of all tokens, with masks covering spans of tokens with lengths following a Poisson distribution parameterized by $\lambda=3$ , as well as permuting all sentences.
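The noising step described above can be illustrated with a small sketch. It assumes that each masked span is replaced by a single mask token, that spans do not overlap or touch, and that zero-length spans are skipped for simplicity; the names `MASK`, `sample_poisson`, `infill_tokens`, and `permute_sentences` are hypothetical and do not come from the paper.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask symbol


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def infill_tokens(tokens, mask_ratio=0.3, lam=3.0, rng=None):
    """Mask about mask_ratio of the tokens in Poisson-length spans.

    Each span is replaced by a single MASK token.
    """
    rng = rng or random.Random()
    budget = round(len(tokens) * mask_ratio)
    # Sample span lengths until the masking budget is used up.
    spans, remaining = [], budget
    while remaining > 0:
        length = min(sample_poisson(lam, rng), remaining)
        if length > 0:  # simplification: zero-length spans are skipped
            spans.append(length)
            remaining -= length
    kept = len(tokens) - budget
    if len(spans) > kept + 1:
        raise ValueError("mask_ratio too high for non-adjacent spans")
    # Place each span in a distinct gap between kept tokens, so that
    # spans never overlap and never sit next to each other.
    gaps = sorted(rng.sample(range(kept + 1), len(spans)))
    span_at_gap = dict(zip(gaps, spans))
    out, pos = [], 0
    for gap in range(kept + 1):
        if gap in span_at_gap:
            out.append(MASK)
            pos += span_at_gap[gap]  # skip the masked original tokens
        if gap < kept:
            out.append(tokens[pos])
            pos += 1
    return out


def permute_sentences(sentences, rng=None):
    """Return the sentences of a document in random order."""
    rng = rng or random.Random()
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled
```

During pretraining, the corrupted sequence would be fed to the encoder and the decoder would be trained to reproduce the original, uncorrupted text.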

II-C Code Pretraining

In this stage we pretrain a model on a source code corpus written in Java, with the goal of learning the syntax and properties of source code.

We collect this code corpus by crawling all public, non-fork Java repositories on GitHub with at least 50 stars. We then deduplicate at the file level using a hash function. After filtering for permissive licenses and applying heuristics such as the fraction of non-ASCII characters, we are left with 25GB of training data from 26,000 repositories. For pretraining validation, we use the 239 test Java repositories from CodeSearchNet, which comprise 600MB.
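The file-level deduplication and character-based filtering can be sketched as below. The paper does not state which hash function or threshold was used, so SHA-256 and the 10% non-ASCII cutoff are assumptions chosen only for illustration, as are all function names.

```python
import hashlib


def file_digest(content: str) -> str:
    """Hash of the file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def non_ascii_fraction(content: str) -> float:
    """Fraction of characters outside the 7-bit ASCII range."""
    if not content:
        return 0.0
    return sum(1 for ch in content if ord(ch) > 127) / len(content)


def deduplicate_and_filter(files, max_non_ascii=0.1):
    """Keep the first copy of each distinct file that passes the filter.

    files: iterable of (path, content) pairs.
    max_non_ascii: assumed threshold, not taken from the paper.
    """
    seen, kept = set(), []
    for path, content in files:
        digest = file_digest(content)
        if digest in seen:
            continue  # exact duplicate of a file already seen
        seen.add(digest)
        if non_ascii_fraction(content) > max_non_ascii:
            continue  # likely not regular source code
        kept.append((path, content))
    return kept
```

Hashing the full content detects only exact duplicates; near-duplicate files that differ in whitespace or comments would pass through this sketch unchanged.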

II-C2 Training Strategy

A similar pretraining strategy to English pretraining is employed. The source code files are corrupted by deleting 20% of all tokens independently and rotating half of all documents. This pretraining is performed for 10 epochs.
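A minimal sketch of this corruption step follows, assuming that each token is dropped independently with probability 0.2 and that each document is rotated with probability 0.5 around a uniformly chosen start token. The order of the two operations and all function names are assumptions for illustration.

```python
import random


def delete_tokens(tokens, delete_prob=0.2, rng=None):
    """Drop each token independently with probability delete_prob."""
    rng = rng or random.Random()
    return [t for t in tokens if rng.random() >= delete_prob]


def rotate_document(tokens, rng=None):
    """Rotate the document so it starts at a randomly chosen token."""
    rng = rng or random.Random()
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]


def corrupt_code(tokens, delete_prob=0.2, rotate_prob=0.5, rng=None):
    """Apply token deletion, then rotate about half of the documents."""
    rng = rng or random.Random()
    noisy = delete_tokens(tokens, delete_prob, rng)
    if rng.random() < rotate_prob:
        noisy = rotate_document(noisy, rng)
    return noisy
```

As with English pretraining, the model would be trained to reconstruct the original file from the corrupted token sequence.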

II-D Asserts Finetuning

In this stage we finetune a model on the task of generating assert statements for unit test cases. Specifically, we represent this task as a translation task, where the source is the partially written unit test along with the method under test, and the target is the correct assert statement that the developer wrote for that unit test.

To perform the finetuning, we rely on the publicly available dataset of unit test methods used to evaluate ATLAS. This dataset comprises Test Methods (i.e., methods within a unit test case), Focal Methods (i.e., the methods under test), and Asserts (i.e., the assert statements within the Test Methods).

This dataset has been mined from more than 9 thousand open-source GitHub projects containing unit test cases defined with JUnit. The authors first extract methods beginning with the @Test annotation as candidate Test Methods. From these candidate methods the authors select those that specify a single assert statement. Next, for each Test Method they pinpoint the Focal Method (i.e., the method that the current Test Method is testing) using a heuristic which looks at the last method invocation before (or within) the assert statement. Finally, each Test Method is modified by replacing the assert statement with a placeholder.

Each data point in the dataset is referred to as a Test-Assert Pair (TAP), and can be seen as a triplet $TAP_{i}=\{tm^{\prime}_{i},fm_{i},a_{i}\}$ where $tm^{\prime}_{i}$ is the Test Method where the assert statement has been replaced with a placeholder, $fm_{i}$ is the Focal Method, and $a_{i}$ is the assert statement. This data is organized as a parallel corpus, a common format for machine translation tasks, where the source sentence $s_{i}=\{tm^{\prime}_{i}+fm_{i}\}$ is the concatenation of the Test Method and Focal Method, while the target sentence $t_{i}=a_{i}$ is the assert statement to predict. Figure 2 provides an example of a TAP. The test method $tm_{i}$ is testLength() and its corresponding focal method $fm_{i}$ is length(). The test method creates two sets of bits and checks that their length is the same, using the assertEquals statement. The source sentence is the concatenation of the test method and focal method, where the assert statement is replaced with a placeholder. The corresponding target output is the assert to be predicted.
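The construction of a TAP and its source/target pair can be sketched as follows. The placeholder string, the single-space separator between test and focal method, and the names `Tap`, `make_tap`, and `to_parallel_example` are assumptions for illustration; the Java snippet in the usage note is invented and is not the example from Figure 2.

```python
from dataclasses import dataclass

PLACEHOLDER = "<AssertPlaceHolder>"  # assumed placeholder token


@dataclass
class Tap:
    test_method: str   # tm'_i: test method with the assert replaced
    focal_method: str  # fm_i: method under test
    assert_stmt: str   # a_i: assert statement to predict


def make_tap(test_method_src: str, assert_stmt: str, focal_method_src: str) -> Tap:
    """Replace the single assert in the test method with the placeholder."""
    if test_method_src.count(assert_stmt) != 1:
        raise ValueError("expected exactly one occurrence of the assert")
    masked = test_method_src.replace(assert_stmt, PLACEHOLDER)
    return Tap(masked, focal_method_src, assert_stmt)


def to_parallel_example(tap: Tap, with_focal: bool = True):
    """Return (source, target) as used in a parallel corpus."""
    source = tap.test_method
    if with_focal:
        # s_i is the concatenation of tm'_i and fm_i.
        source = source + " " + tap.focal_method
    return source, tap.assert_stmt
```

For instance, calling `make_tap("testSize() { s.push(1); assertEquals(1, s.size()); }", "assertEquals(1, s.size())", "size() { return count; }")` and then `to_parallel_example` would yield a source ending in the focal method body and the original assert as the target. Passing `with_focal=False` produces the input variant without the focal method.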

For our models, we use the raw version of the dataset – corpus comprising the original source code tokens – rather than the abstract version (where some tokens are replaced with IDs), since we aim to exploit the rich semantics of all the variable and method names. We keep the original split of the dataset in training, validation, and testing sets (80%, 10%, 10%) for a fair comparison. Table I reports the number of instances in the dataset.

II-D2 Training Strategy

The finetuning process is a translation task, where we train the model to learn the mapping $s_{i}\rightarrow t_{i}$ as a conditional probability $P(a_{i}|tm^{\prime}_{i}+fm_{i})$ .

During training we use the cross entropy loss and the Adam optimizer and monitor the loss on the validation set for early stopping.
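For reference, the conditional probability above is typically factorized token by token in an autoregressive decoder, and the cross-entropy loss is the negative log-likelihood of the target assert tokens. The following is the standard formulation for sequence-to-sequence models, written in the notation of this section; the token index $j$ and the symbol $a_{i,j}$ (the $j$-th token of $a_{i}$) are introduced here only for illustration.

$$
P(a_{i}\mid tm^{\prime}_{i}+fm_{i})=\prod_{j=1}^{|a_{i}|}P\left(a_{i,j}\mid a_{i,<j},\,tm^{\prime}_{i}+fm_{i}\right),
\qquad
\mathcal{L}(\theta)=-\sum_{i}\sum_{j=1}^{|a_{i}|}\log P_{\theta}\left(a_{i,j}\mid a_{i,<j},\,tm^{\prime}_{i}+fm_{i}\right)
$$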

We use shared vocabulary embeddings between the Encoder and Decoder for optimization reasons and because our input and output language is the same (i.e., Java source code).

II-E Model Variants

At the end of these stages, we obtain four different variants of the model, based on the level of pretraining performed:

BART_Scratch: A model which has not been pretrained on any corpus but directly finetuned on the assert generation task. This model represents the orange line in Figure 1.

BART_English: A model which has been pretrained on the English corpus and then finetuned for the assert generation task. This model represents the green line in Figure 1.

BART_Code: A model pretrained on the source code corpus, then finetuned on the assert generation task. This model represents the purple line in Figure 1.

BART_English+Code: A model pretrained first on English and further pretrained on source code corpus, then finetuned on the assert generation task. This model represents the blue line in Figure 1.

III Experimental Design

The goal of our empirical study is to determine if our approach can generate accurate assert statements for unit test cases, in one or very few attempts. We investigate whether our approach outperforms the previous RNN-based approach ATLAS. Additionally, we explore the impact of different pretraining stages on assert generation performance, as well as the effect of incorporating the focal method in the input to the model.

Our experiments aim at answering the research questions described in the following paragraphs.

RQ1: How does our approach compare with ATLAS?

We compare the performance of our models against ATLAS, the RNN-based approach available in the literature. Specifically, to ensure a fair comparison, we perform the finetuning process of our models on the exact same training set, and evaluate and compare the performance on the same test set. To compare the models’ performance we use the top-k accuracy metric, which measures the accuracy of a model with different numbers of attempted predictions. In particular, if the target assert statement $a_{i}$ for the i-th input in the test set is in the top-k predictions of the model, we count it as a correct prediction at $k$. Similarly to ATLAS, we experiment with $k=\{1,\dots,50\}$ with a maximum beam size of 50.
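The top-k accuracy metric described above can be computed as in the following sketch. It assumes exact string matching between a prediction and the target assert, and that each list of predictions is already ranked from most to least likely; the function name is hypothetical.

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of inputs whose target appears in the top-k predictions.

    predictions: one ranked list of candidate asserts per test input.
    targets: the developer-written assert for each test input.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(predictions, targets) if target in ranked[:k]
    )
    return hits / len(targets)
```

Because the top-k set always contains the top-(k-1) set, this metric is non-decreasing in $k$, which is why the accuracy curves rise monotonically as more predictions are allowed.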

RQ2: What is the effect of the pretraining process on the assert generation task?

We investigate whether pretraining our models on English corpus, on source code, and both corpora has any noticeable impact on the downstream performances of the models on the assert generation task. In particular, we compare the BART_Scratch model, which has not been pretrained on any corpus, against BART_English which was pretrained on English corpus, BART_Code which was pretrained on source code, and BART_English+Code which was pretrained on both English and source code. This comparison is performed considering:

Extrinsic metrics: the top-k accuracy on the assert generation task;

Intrinsic metrics: (i) best validation loss and the number of training steps required to reach it (e.g., faster convergence); (ii) BLEU4 score, a common metric for machine translation tasks, evaluated on the test set; (iii) the syntactic correctness of the generated asserts, determined using a Java Parser.

RQ3: What is the effect of the Focal Method on the performance of the model?

In Section II-D we described the finetuning process, where the input to the Encoder is the concatenation $\{tm^{\prime}_{i}+fm_{i}\}$ of the Test Method $tm^{\prime}_{i}$ and Focal Method $fm_{i}$ . In this research question our goal is to understand the effect on performances of the Focal Method as input to the model, when generating the assert statements. In particular, we select our best model obtained from RQ1 and RQ2 and compare against an equivalent model (i.e., same preprocessing steps) but finetuned on a parallel corpus that does not contain the focal method, trying to learn the probability $P(a_{i}|tm^{\prime}_{i})$ . Specifically, this can be seen as an auto-completion task, where the source is the partially written test method $tm^{\prime}_{i}$ and the expected target output is the assert $a_{i}$ . We compare the performances of the models with or without the focal method in terms of top-k accuracy on the test set.

RQ4: What is the quality of the generated asserts?

In the last research question we investigate the quality of the assert statements generated by the model. We manually analyze instances of correct predictions and inspect those that do not match the target assert statement. We report qualitative examples and discussion.

RQ5: Can our approach be used to improve automatically generated test cases?

The goal of this research question is to provide a preliminary investigation of the potential benefits of using our approach to improve automated test case generation tools, such as EvoSuite. Specifically, we aim at enhancing test cases generated by EvoSuite by inserting additional asserts generated by our approach. We evaluate the potential benefits in terms of test coverage boost and qualitatively discuss the additional asserts.

For this investigation we select a small but reproducible testbed using defects4j. We rely on defects4j since it provides a reliable infrastructure to generate, compile, execute, and evaluate test cases. Specifically, we select Lang-1-f, which represents the fixed version of the first bug in the defects4j collection belonging to the project Apache Commons Lang.

To generate test cases with EvoSuite, we use the defects4j built-in command gen_tests.pl -g evosuite -p Lang -v 1f. This command invokes EvoSuite test generation on the first fixed revision of Lang, which will generate test cases for the class affected by the bug (i.e., NumberUtils). We let EvoSuite generate test cases for 500 seconds ( $\sim 8$ minutes) and compute the test coverage using defects4j which relies on Cobertura, singularly for each test case.

Next, we select the 18 unique focal methods of the class, without considering overloaded copies of the methods, and the corresponding test cases generated by EvoSuite. We select a single best test case for each of the focal methods. Once we have the mapped test case pairs, we generate additional assert statements for each pair using our approach. Specifically, for each focal method we generate the top-10 predictions and select a single assert from them, which we insert as the last statement within the EvoSuite test case. Finally, we execute the newly augmented test cases and recompute the test coverage for each of them.
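The augmentation step can be sketched as below. The paper does not specify how the single assert is chosen among the top-10 predictions, so the `accept` callback is a hypothetical stand-in for that choice (for example, a manual check or a compile-and-pass check). The sketch also assumes the test method source is dedented and that its last `}` closes the method body.

```python
def first_accepted(candidates, accept, k=10):
    """Return the first of the top-k candidates that the check accepts."""
    for candidate in candidates[:k]:
        if accept(candidate):
            return candidate
    return None


def append_assert(test_method_src: str, assert_stmt: str, indent: str = "    ") -> str:
    """Insert the assert as the last statement of the test method.

    Assumption: the final closing brace in the source ends the method.
    """
    closing = test_method_src.rfind("}")
    if closing == -1:
        raise ValueError("no closing brace found")
    # Normalize the statement so it ends with exactly one semicolon.
    stmt = assert_stmt.strip().rstrip(";") + ";"
    head = test_method_src[:closing].rstrip()
    return head + "\n" + indent + stmt + "\n" + test_method_src[closing:]
```

The augmented source would then be compiled and executed so that coverage can be recomputed for each test case.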

IV Experimental Results

In this section we report and discuss the results of our empirical study.

RQ1: How does our approach compare with ATLAS?

Figure 3 reports the top-k accuracy for our four variations of the model as well as for ATLAS. The x-axis represents the $k$ value, ranging from 1 to 50, while the y-axis indicates the percentage of correct asserts in the test set. For ATLAS we report the results as they appear in the original work, considering the best model (i.e., Abstract Model). It is worth noting that the ATLAS line is shaped as a step function because the authors report only the values of $k$ in increments of 5 (e.g., $1,5,10,\dots$), while we computed the top-k accuracy at each integer value $k$ from 1 to 50, hence the smoother curve.

The results show that our models outperform ATLAS at any $k$ value. In particular, BART_English+Code can correctly predict the target assert statement (originally written by the developer) in 62% of the cases in just the first attempt. This represents an $80\%$ relative improvement over the top-1 ATLAS accuracy of 27%.

The impressive results on the top-1 accuracy are particularly important in terms of usability and applicability of this approach beyond research, in actual development environments. Practically, developers would obtain accurate and relevant assert statements in one or very few suggestions, without the need of going through a long list while discarding incorrect recommendations.

Summary for RQ1. Our approach outperforms ATLAS with a relative improvement of 80% on top-1 accuracy.

RQ2: What is the effect of the pretraining process on the assert generation task?

Figure 3 shows a massive gap between the performance of the model without pretraining (BART_Scratch) and the models with English (BART_English), source code (BART_Code), and both (BART_English+Code) pretraining. Specifically, the performance gap between BART_Scratch and the models with a single pretraining phase (BART_Code or BART_English) is 22-25%, while the gap between BART_English and BART_English+Code is 1.54-2.25%, in favor of the model which was pretrained on both English and source code.

These results highlight the importance of pretraining for the performance of downstream tasks. Particularly striking is the effect of pretraining on natural language English text for a downstream task involving source code. This result emphasizes the significance of having a model which understands the semantics of variable and method names in the code.

While additional pretraining on source code appears to have a limited impact on performances, compared to pretraining only on English, it still delivers consistent improvements, which could potentially be higher on bigger test sets.

Intrinsic Metrics

In terms of intrinsic metrics, Figure 4 shows the cross-entropy loss on the validation set during training for the four model variations. Similarly to what we observed with the extrinsic metric, we note a substantial gap between the model without pretraining (BART_Scratch) and the three models with English (BART_English), source code (BART_Code), and both (BART_English+Code) pretraining. Comparing the English-only and the English+Code models, the additional pretraining on source code has three evident effects: (i) lower initial loss (0.21 vs 0.31); (ii) lower best loss (0.13 vs 0.15); (iii) faster convergence ($\sim$2500 training steps earlier).

Table III reports the intrinsic metrics computed for the four model variations. Specifically, BART_English+Code obtains the best BLEU4 and validation loss. Regarding the syntactic correctness, the model pretrained on both English and source code obtains the best value for the top single prediction, however when computing the correctness considering the top 25, and 50 generated asserts for each input in the test set, the model pretrained exclusively on source code achieves the highest correctness score. This result is somewhat predictable, since BART_Code has been pretrained and finetuned exclusively on source code, thus it should have the most consistent results in terms of syntax.

Overall, we observe a significant positive effect of pretraining on English and source code for both extrinsic and intrinsic metrics. While additional pretraining on source code appears to have a smaller impact than English pretraining alone, it is worth noting that we observe consistent improvements across all the analyzed metrics, corroborating the beneficial effect of the source code pretraining. Additionally, the small gap could be due to the nature of the downstream task, where the output is a single-line assert statement, which could closely resemble a natural language sentence.

Summary for RQ2. Pretraining has a significant positive effect on downstream performance. Pretraining on English text boosts performance by 23-25%, while further pretraining on source code can yield an additional $\sim$2% improvement.

RQ3: What is the effect of the Focal Method on the performances of the model?

To answer this research question, we compared the model that achieves the best performance in RQ1 and RQ2 (BART_English+Code) against a similar model (i.e., same pretraining phases) but with different finetuning. Specifically, we selected the same model checkpoint after the source code pretraining, and finetuned the model on a modified dataset without the Focal Method as input. Figure 5 shows the top-k accuracy of the models with (solid line) and without (dashed line) the Focal Method. The results show that the model which takes the Focal Method as input is more accurate in generating assert statements in $\sim$10% of the cases. That is, there are certain assert statements that are not covered by the model without the Focal Method, even when 50 different predictions are generated. This result highlights the essential information provided by the Focal Method to inform the model on generating specific assert statements.

Summary for RQ3. The Focal Method provides essential information which allows the model to generate $\sim$ 10% more accurate asserts compared to a generic auto-completion model.

RQ4: What is the quality of the generated asserts?

To answer this research question we analyze and discuss examples of generated assert statements. Figure 6 provides examples of common, complex, and equivalent assert statements generated by our best model BART_English+Code. The list of common asserts comprises statements that are correctly predicted by the model (i.e., match the original assert) and that are often found in test cases in different contexts. For example, these asserts usually check that the result is equal to the expected value, or that a given list contains an element that was previously added. These types of asserts are usually predicted in the very first attempts of the model. While these represent simple assert statements, they still require the model to detect the variables used within the test/focal method and their relationship.

The list of complex assert statements showcases some of the challenging asserts correctly predicted by the model. These asserts involve multiple method calls, parameters, attributes, and variables that are less common.

The list of equivalent assert statements shows generated asserts that do not exactly match the target assert (i.e., these are not counted as correct asserts in the top-k accuracy), but are equivalent to the developer’s assert. For example, the model suggests getting the class literal directly with AbstractService.class, while the developer uses the method getClass() which, in turn, uses the same class literal. In another instance, the developer checks that status == 0 is true, while the model suggests an equivalent check with assertEquals(0, status). Similarly, the model suggests using assertSame on two objects, rather than the == equivalence. Note that for all these cases, the model was eventually able to predict the correct assert (i.e., perfect match) in the subsequent attempts.

Finally, Figure 7 reports two complete examples with source and target, correctly predicted by the model. In the first example, the generated assert checks that the event object created with the eventFactory is of the correct class instance. In the second example, the model generates a complex assert statement involving numerical literals and variables previously used to set-up the testing environment. Additionally, the example in Figure 2 described in Sec. II-D was also correctly predicted by the model in the very first attempt.

These results highlight the need for additional metrics beyond simple accuracy, in particular metrics that can recognize cases where the generated assert statement is different yet equivalent to the one created by human developers, as well as non-equivalent asserts that are also correct in that context. Additionally, there could be many locations in the code where the developers did not introduce assert statements but the model could suggest reasonable ones, which are currently not captured by the quantitative metrics. The main goal of this research question was precisely to fill this gap with a qualitative and manual analysis.

Summary for RQ4. Our models can generate common assert statements as well as complex ones involving method calls, parameters, and unusual variables. In several cases, the model generates equivalent assert statements to the developer’s assert.

RQ5: Can our approach be used to improve automatically generated test cases?

Table IV reports the absolute (and percentage) line and condition coverage at class-level, for each of the 18 public methods considered in the experiment. The table shows the results for the original EvoSuite test cases, those augmented by our model, as well as the delta improvement in the last column.

For 13 out of 18 methods, our approach generated asserts that improved the line and/or condition coverage, covering 1-3 additional lines and 1-4 additional conditions. For 4 methods our approach generated correct asserts which did not improve the coverage, while for one method (i.e., createBigDecimal) our approach did not generate any correct assert within the top-10 predictions.

Figure 8 shows all the generated asserts which have been used to augment the EvoSuite test cases, in the same order as the methods reported in Table IV. We can notice that the first three asserts invoke the focal method with an actual numerical value, which results in additional test coverage, since the original EvoSuite test case tested the same focal methods with a null or empty string, resulting in the execution of a different branch. The fifth assert, related to the focal method toDouble, invokes the focal method using a non-numerical string "foo", covering three additional lines and one more condition in the focal method with respect to the EvoSuite test case.

Let us now focus on the four assert statements that did not improve the coverage, corresponding to the focal methods min and max, shown as the sixth through third asserts from the bottom of Figure 8. Three of these asserts simply perform additional checks on the return variables used by EvoSuite, namely long0, float0, byte0. These asserts do not invoke the focal method, thus not resulting in additional coverage, but instead focus on testing additional properties of the return values. Finally, the assert assertEquals(4, NumberUtils.min(4, 5, 7)) correctly invokes the focal method and asserts the correct behavior, but executes lines and branches already tested by the original test case (with different values).

Overall, these results show that our approach can be helpful in augmenting existing or automatically generated test cases with additional accurate assert statements. In most of the cases reported in our experiment, we found the asserts to slightly improve the test coverage.

Summary for RQ5. Our approach can be used to augment existing test cases, such as those generated by EvoSuite, with additional assert statements. Our experiments show that these asserts can lead to improved test coverage.

V Discussion & Future Work

Our experimental analysis showed promising results of our approach in generating accurate assert statements for unit test cases. For our future work, we envision two possible scenarios where we can deploy our model with the goal of improving automation in software testing activities.

Our approach could be used to support developers in writing test cases more efficiently, by suggesting assert statements while defining the test case. In this scenario, we plan to implement our approach as a plugin for an IDE, which is then used by developers while writing code as a code completion tool. Our approach could work side by side with existing code completion approaches, such as Pythia. The results of RQ3 clearly show that our approach is more accurate than standard code completion, leading us to suggest a hybrid approach. In this hybrid approach, a standard code completion tool would perform inference on our model when the developer is writing test cases.

V-B Improving Automated Test Case Generation Tools

The results of RQ5 show that our approach can be used to augment test cases generated by automated test case generation tools, such as EvoSuite, Randoop, and Agitar. In this scenario, our approach could be integrated within an automated test case generation tool, or used as an external tool which augments and revises assert statements in the newly generated test cases.

VI Threats to Validity

Threats to construct validity concern the relationship between theory and observation and are mainly related to the measurements we performed. In this context, data leakage could represent a threat to the validity of our study. Data leakage refers to the unintentional and accidental sharing of data between the training and test sets. In our case, the threat arises during the pretraining stage on a large amount of source code, where the model may have observed code similar to what is found in the finetuning test set. We mitigated this threat by constructing the finetuning process differently from the pretraining, where the code is organized in a dissimilar fashion. Specifically, during the pretraining the test method did not contain the placeholder, and was not adjacent to the focal method. We empirically validated the mitigation of the threat by evaluating the pretrained model (without finetuning) on the test set, in order to observe its performance. The results show that the model was not able to generate correct assert statements, thus confirming our hypothesis. It is also worth noting that data leakage is avoided within the finetuning dataset (i.e., training, validation, test sets).

Internal validity threats concern factors internal to our study that could influence our results. The performance of our approach depends on the hyperparameter configuration and pretraining process. We did not perform a hyperparameter search, since these large models require substantial training time; instead, we reuse configurations suggested in the literature. We experiment with different pretraining stages and report the results of our experiments.

Threats to external validity concern the generalization of our findings. In our context, the threat arises when comparing our BART Transformer model (with 400M trainable parameters) against the RNN-based model (with 4M trainable parameters), since the two differ in capacity and number of parameters. While we note that comparing models with the same number of parameters could yield different results, the authors of ATLAS did not observe improvements when increasing the number of layers and units of the Encoder-Decoder architecture. Moreover, we rely on the existing literature comparing Transformer and RNN architectures.

VII Related Work

Our work is related to several existing approaches in the area of automated software testing. In particular, there is a class of approaches that aims at generating test methods and synthesizing assert statements, such as Evosuite, Randoop, and Agitar. Among these, Evosuite is one of the most popular tools for test generation in Java. It relies on mutation testing in order to generate appropriate assert statements. Specifically, it first introduces mutants within the method under test, then attempts to generate asserts with the goal of killing those mutants. During this process, Evosuite optimizes towards maximizing the number of killed mutants while generating as few asserts as possible. Randoop generates assert statements by relying on user-specified contracts. These statements are then refined using random testing and by analyzing execution traces of the statements it creates.
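
The trade-off described for mutation-based assert generation (kill as many mutants as possible with as few asserts as possible) can be sketched as a greedy set cover. This is an illustrative approximation, not Evosuite's actual algorithm; the `select_asserts` function and the `kills` mapping from candidate asserts to mutant identifiers are hypothetical.

```python
from typing import Dict, List, Set


def select_asserts(kills: Dict[str, Set[str]]) -> List[str]:
    """Greedy set cover: pick asserts until every killable mutant is killed.

    `kills` maps each candidate assert to the set of mutant ids it kills.
    """
    remaining: Set[str] = set().union(*kills.values())
    chosen: List[str] = []
    while remaining:
        # Pick the assert killing the most still-alive mutants
        # (ties are broken by the alphabetical order of the assert text).
        best = max(sorted(kills), key=lambda a: len(kills[a] & remaining))
        gained = kills[best] & remaining
        if not gained:
            break  # defensive: no candidate kills any remaining mutant
        chosen.append(best)
        remaining -= gained
    return chosen
```

For example, given three candidates where `assertEquals(3, s.size())` kills mutants `m1` and `m2`, `assertTrue(s.contains(x))` kills `m2` and `m3`, and `assertNotNull(s)` kills only `m1`, the sketch keeps the first two asserts and drops the third as redundant.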

The major difference between these works and our approach is the learning component. Specifically, the aforementioned works rely on handcrafted rules or heuristics to generate assert statements, for example via predefined mutations. Instead, we aim to learn from developers’ code which assert statements are most effective for the particular context (i.e., test case and focal method).

Additionally, recent works have shed light on the importance of generating accurate and complex assert statements to detect real faults in the system. In particular, Almasi et al. show that, while Evosuite and Randoop were able to uncover several faults in real programs, nearly half of the undetected faults could have been detected with a more appropriate assert statement.

These limitations motivated the work of Watson et al., where the authors proposed ATLAS, an RNN-based approach which aims at generating meaningful assert statements by learning from developers’ code. Inspired by this work, we improve upon it in several substantial ways. First, we employ a different and more advanced deep learning architecture based on transformer models. Second, unlike ATLAS, we take advantage of English and source code semi-supervised pretraining to significantly boost the performance of the models on the assert generation task. Lastly, we investigate qualitative cases and intrinsic metrics as well as the effect of the focal method, which provides additional beneficial context to the model. These contributions culminated in an approach that significantly outperforms the previous work, with an 80% relative improvement in top-1 accuracy.

Our work is also related to a broad body of literature on transfer learning, unsupervised language model pretraining, and denoising pretraining. In this paper, we extend these ideas to source code as a language, combining English and source code pretraining modes, and fine-tuning on a downstream translation task from the automated software engineering domain.

VIII Conclusion

In this paper we presented an approach for generating accurate assert statements for unit test cases. The core of our approach is a state-of-the-art transformer model which has been pretrained, in a semi-supervised fashion, on both English and source code corpora. This pretraining process allows the model to learn the semantics of the natural language and its words as well as the syntax of the source code. The model was then finetuned on the assert generation task, which we represent as a translation task, where the input is the focal method along with the partially generated test case, and the output is the desired assert statement.
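
The translation-task framing can be made concrete with a small sketch that builds one (source, target) pair. The placeholder token name, the `make_example` helper, and the ordering of test method and focal method in the source sequence are assumptions made for illustration; they are not specified in this section.

```python
from typing import Tuple

# Assumed token name marking where the assert belongs in the test method.
PLACEHOLDER = "<AssertPlaceHolder>"


def make_example(focal_method: str, test_method: str, assert_stmt: str) -> Tuple[str, str]:
    """Build one (source, target) pair by masking the assert in the test method."""
    if assert_stmt not in test_method:
        raise ValueError("assert statement not found in test method")
    # Replace only the first occurrence so the rest of the test stays intact.
    masked_test = test_method.replace(assert_stmt, PLACEHOLDER, 1)
    # Concatenation order (test method, then focal method) is an assumption.
    source = masked_test + "\n" + focal_method
    return source, assert_stmt
```

The model would then be finetuned to map `source` to `target`, that is, to produce the assert statement that fills the placeholder given the surrounding test code and the focal method.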

The resulting model is able to generate accurate assert statements, with a 62% top-1 accuracy, matching the exact assert statement originally written by the developer. This represents an 80% relative improvement over the previous RNN-based approach.

In our empirical evaluation, we experimented with different pretraining levels, showing the beneficial impact of pretraining on both English and source code in terms of extrinsic and intrinsic metrics. We qualitatively analyzed the assert statements predicted by our model, and identified both common and complex asserts. Interestingly, we found many cases in which the predicted assert statement did not syntactically match the original statement, yet was semantically equivalent and correct. Finally, we empirically demonstrated how our proposed approach can be used to augment existing test cases, such as those generated by EvoSuite, with additional assert statements that lead to test coverage improvements.

We believe that these results are particularly important in terms of the usability and applicability of this approach beyond research, in actual development environments. Practically, developers would obtain accurate and relevant assert statements in one or very few suggestions, allowing them to write complete and robust test cases.