CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Shuyan Zhou, Uri Alon, Sumit Agarwal, Graham Neubig

Introduction

Natural-language-to-code generation (NL $\rightarrow$ Code) has seen sharply growing popularity recently due to the emergence of large language models (LLMs) trained on vast amounts of natural language and code (Chen et al., 2021; Fried et al., 2022; Zhou et al., 2023; Austin et al., 2021; Allal et al., 2023). LLMs have reached such a high NL $\rightarrow$ Code accuracy that they are now useful for the broad programming audience and actually save developers’ time when implemented in tools such as GitHub’s Copilot. This sharp rise in LLMs’ usability was achieved thanks to their ability to accurately generate long completions, which span multiple tokens and even lines, rather than only a single next-token as in early models (Allamanis and Sutton, 2013; Movshovitz-Attias and Cohen, 2013). Nevertheless, evaluating and comparing different models has remained a challenging problem (Xu et al., 2022) that requires an accurate and reliable evaluation metric for the quality of the models’ generated outputs, and existing metrics are sub-optimal.

The most common evaluation metrics are token-matching methods such as BLEU (Papineni et al., 2002), adopted from natural language processing. These metrics are based on counting overlapping n-grams in the generated code and the reference code. CrystalBLEU (Eghbali and Pradel, 2022) extends BLEU by ignoring the 500 most frequently occurring n-grams, arguing that they are trivially shared between the prediction and the reference. Nonetheless, both BLEU and CrystalBLEU rely on the lexical exact match of tokens, which does not account for diversity in implementation, variable names, and code conventions. Figure 1 shows an example: given the reference code in Figure 1(a), both BLEU and CrystalBLEU prefer (rank higher) the non-equivalent code in Figure 1(b) over the functionally equivalent code in Figure 1(c).

CodeBLEU (Ren et al., 2020) attempts to lower the requirement for a lexical exact match, by relying on data-flow and Abstract Syntax Tree (AST) matching as well; nevertheless, valid generations may have different ASTs and data flow from the reference code, which may lead to low CodeBLEU score even when the prediction is correct. Further, partial predictions may be useful for a programmer, but accepting them may lead to partial code that does not parse, and thus cannot be fully evaluated by CodeBLEU (for example, predicting the first line of a for loop, without the loop’s body).

Execution-based evaluation attempts to address these problems by running tests on the generated code to verify its functional correctness (Chen et al., 2021; Athiwaratkun et al., 2022; Li et al., 2022; Wang et al., 2022; Lai et al., 2022). This provides a direct measure of the functionality of the generated code while being agnostic to diversity in implementation and style. However, execution-based evaluation requires datasets that are provided with hand-written test cases for each example, which is costly and labor-intensive to create; thus, only few such datasets exist. Additionally, executing model-generated code is susceptible to security threats, and thus should be run in an isolated sandbox, which makes it technically cumbersome to work with iteratively.

Our approach

In this work, we introduce CodeBERTScore, an evaluation metric for code generation, leveraging self-supervised pretrained models of code such as CodeBERT (Feng et al., 2020), and adopting best practices from BERTScore (Zhang et al., 2020). First, CodeBERTScore encodes the generated code and the reference code independently with pretrained models, with the inclusion of natural language instructions or comments. Then, we compute the cosine similarity between the encoded representations of each token in the generated code and each token in the reference code. Finally, the best matching token vector pairs are used to compute precision and recall. CodeBERTScore allows comparing code pairs that are lexically different while taking into account (1) the programmatic or natural-language context, if such context is provided; (2) the contextual information of each token; and (3) implementation diversity. Our approach is illustrated in Figure 2.

Example

A concrete example is shown in Figure 1: while BLEU and CrystalBLEU prefer (rank higher) the non-equivalent code in Figure 1(b) given the reference code in Figure 1(a), CodeBERTScore prefers the code in Figure 1(c), which is functionally equivalent to the reference (Figure 1(a)). We note that in this example, the variable names are identical across all three code snippets. When the variable names of the reference are different than the candidate’s, it is even harder for token-matching approaches such as BLEU and CrystalBLEU to compare the reference with the candidates, while CodeBERTScore can trivially match variable names according to their semantic similarity and their functional role in the code.

Contributions

In summary, our main contributions are: (a) CodeBERTScore: a self-supervised metric for NL $\rightarrow$ Code evaluation, based on BERTScore, which leverages the benefits of pretrained models, while not requiring labeling or manually annotated data. (b) An extensive empirical evaluation across four programming languages, showing that CodeBERTScore is more correlated with human preference and more correlated with execution correctness than all previous approaches including BLEU, CodeBLEU, and CrystalBLEU. (c) We pretrain and release five language-specific CodeBERT models to use with our publicly available code, for Java, Python, C, C++, and JavaScript. As of the time of this submission, our models have been downloaded from the Huggingface Hub more than 1,000,000 times.

Evaluating Generated Code

A larger value of $f(\hat{y},y^{*})$ indicates that the generated code is more accurate with respect to the reference code, and the way $f$ ranks different candidates is more important than the absolute value of $f(\hat{y},y^{*})$ . Ideally, if a prediction $\hat{y}_{1}$ is more functionally equivalent to $y^{*}$ and preferred by human programmers over a prediction $\hat{y}_{2}$ , a good metric should rank $\hat{y}_{1}$ higher than $\hat{y}_{2}$ . That is, we seek a function $f$ such that $f(\hat{y}_{1},y^{*})>f(\hat{y}_{2},y^{*})$ .

Background: BERTScore

BERTScore (Zhang et al., 2020) was proposed as a method for evaluating mainly machine translation outputs. The idea in BERTScore is to encode the candidate sentence (the prediction) and the reference sentence (the ground truth) separately, using a BERT-based model, which encodes each sequence of tokens as a sequence of vectors. Then, BERTScore computes the cosine similarity between every vector from the candidate sequence and every vector from the reference sequence.

Given these similarity scores, BERTScore computes sentence-level precision by taking the maximum similarity score for every candidate vector and averaging, and computes recall by taking the average of the maximum similarity scores for every reference vector. Intuitively, a high BERTScore-recall is obtained, for example, if every vector from the reference sentence has at least one vector from the candidate sentence that is highly cosine-similar to it; a high BERTScore-precision is obtained if every vector from the candidate sentence is highly cosine-similar to at least one vector from the reference sentence. Ultimately, the final score is the F1 score, computed as the harmonic mean of precision and recall.
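The greedy matching described above can be illustrated with a minimal sketch. It assumes the token vectors have already been produced by some encoder; the function names `cosine` and `greedy_match_scores` and the toy 2-dimensional vectors are hypothetical and are not part of BERTScore's released implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from two lists of token vectors."""
    # sim[i][j]: similarity of candidate token i and reference token j.
    sim = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]
    # Precision: best reference match for each candidate token, averaged.
    precision = sum(max(row) for row in sim) / len(candidate_vecs)
    # Recall: best candidate match for each reference token, averaged.
    recall = sum(max(col) for col in zip(*sim)) / len(reference_vecs)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-d "embeddings" (hypothetical); real vectors come from a BERT-like encoder.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
```

In this toy case every candidate vector has an exact match in the reference, so precision is 1, while the third reference vector is only partially matched, which lowers recall.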

3 CodeBERTScore

Our approach generally follows BERTScore, with the following main differences:

We encode the context (the natural language instruction or comment) along with each of the generated and reference code snippets, but without using the encoded context in the final similarity computation, essentially computing $f(\hat{y},y^{*},x)$ rather than $f(\hat{y},y^{*})$ .

Given the precision and recall, in addition to computing the F1 score, we also compute F3 to weigh recall higher than precision, following METEOR (Banerjee and Lavie, 2005).

As our underlying BERT-like model, we use programming language-specific models that we pretrain and release, rather than models that were intended for natural language only.

We use a BERT-like pretrained model $\mathcal{B}$ to encode the reference and candidate. In our experiments, $\mathcal{B}$ is a CodeBERT model that we further pretrained using the masked language modeling objective (Devlin et al., 2019) on language-specific corpora, but $\mathcal{B}$ can be any transformer-based model whose internal hidden states we can access.

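We concatenate the context $x$ with each of the reference and the candidate, resulting in $x\cdot y^{*}$ and $x\cdot\hat{y}$ . We use the tokenizer $\mathcal{T}_{\mathcal{B}}$ provided with the model $\mathcal{B}$ :

$$\mathcal{T}_{\mathcal{B}}(x\cdot y^{*})=\left<x_{1},...,x_{k},y^{*}_{1},...,y^{*}_{m}\right>$$
$$\mathcal{T}_{\mathcal{B}}(x\cdot\hat{y})=\left<x_{1},...,x_{k},\hat{y}_{1},...,\hat{y}_{n}\right>$$

to get sequences of tokens. We run a standard “forward pass” with the model $\mathcal{B}$ for each tokenized sequence, resulting in sequences of vectors:

$$\mathcal{B}(\left<x_{1},...,x_{k},y^{*}_{1},...,y^{*}_{m}\right>)=\left<{\bm{x}}_{1},...,{\bm{x}}_{k},{\bm{y}}^{*}_{1},...,{\bm{y}}^{*}_{m}\right>$$
$$\mathcal{B}(\left<x_{1},...,x_{k},\hat{y}_{1},...,\hat{y}_{n}\right>)=\left<{\bm{x}}_{1},...,{\bm{x}}_{k},\hat{{\bm{y}}}_{1},...,\hat{{\bm{y}}}_{n}\right>$$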

Finally, we mask out the encoded context tokens ${\bm{x}}_{1},...,{\bm{x}}_{k}$ as well as all non-alphanumeric tokens (parentheses, brackets, dots, commas, whitespaces, etc.) except for arithmetic operators, from each of the encoded reference and encoded candidate. This results in encoded reference tokens ${\bm{y}}^{*}=\left<{\bm{y}}^{*}_{1},...,{\bm{y}}^{*}_{m}\right>$ , encoded candidate tokens $\hat{{\bm{y}}}=\left<\hat{{\bm{y}}}_{1},...,\hat{{\bm{y}}}_{n}\right>$ , and their corresponding masks ${\bm{m}}^{*}$ and $\hat{{\bm{m}}}$ . We denote ${\bm{y}}[{\bm{m}}]$ as the remaining encoded tokens in ${\bm{y}}$ after selecting only alphanumeric token vectors according to the mask ${\bm{m}}$ .
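The masking step can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the operator set `ARITHMETIC_OPERATORS`, the rule that a token is kept if it contains at least one letter or digit, and the helper names are all hypothetical choices made for this example.

```python
# Hypothetical operator list; the exact set kept in practice is an assumption.
ARITHMETIC_OPERATORS = {"+", "-", "*", "/", "%", "**", "//"}

def keep_token(token):
    """True if a token should take part in the similarity computation."""
    stripped = token.strip()
    if stripped in ARITHMETIC_OPERATORS:
        return True
    # Assumption: keep tokens that contain at least one letter or digit.
    return any(ch.isalnum() for ch in stripped)

def build_mask(tokens, num_context_tokens):
    """Mask out the leading context tokens and punctuation-only tokens."""
    return [
        i >= num_context_tokens and keep_token(tok)
        for i, tok in enumerate(tokens)
    ]

def apply_mask(items, mask):
    """Select the items (tokens or their vectors) whose mask entry is True."""
    return [item for item, keep in zip(items, mask) if keep]

# The first three tokens play the role of the context x_1, ..., x_k.
tokens = ["#", "add", "one", "x", "=", "x", "+", "1", ")"]
mask = build_mask(tokens, num_context_tokens=3)
kept = apply_mask(tokens, mask)
```

The same mask would be applied to the sequence of encoded vectors, so that context tokens influence the representations of the code tokens without being scored themselves.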

Similarity Computation

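We compute the cosine similarity between the encoded reference and candidate tokens, following Zhang et al. (2020):

$$\text{sim}(y^{*}_{i},\hat{y}_{j})=\frac{{{\bm{y}}^{*}_{i}}^{\top}\cdot\hat{{\bm{y}}}_{j}}{\|{\bm{y}}^{*}_{i}\|\cdot\|\hat{{\bm{y}}}_{j}\|}$$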

Although this compares the individual tokens $y^{*}_{i}$ and $\hat{y}_{j}$ , their vector representations ${{\bm{y}}^{*}_{i}}$ and $\hat{{\bm{y}}}_{j}$ contain information about their context, and thus about their semantic role in the code.

CodeBERTScore

We use the similarity matrix (see Figure 2), formed by the similarity scores between ${\bm{y}}^{*}$ and $\hat{{\bm{y}}}$ , to compute precision, recall, and F1, by taking the maximum across the rows and columns of the similarity matrix, and then averaging. Following Banerjee and Lavie (2005), we also compute F3 by giving more weight to recall, as shown in Figure 3. Additional details regarding token weighting and scaling are provided in Appendix A.

Experimental Setup

We evaluate CodeBERTScore across multiple datasets and programming languages. We first show that CodeBERTScore is more correlated with human preference than previous metrics, using human-rated solutions for the CoNaLa dataset (Yin et al., 2018a; Evtikhiev et al., 2022). We then show that CodeBERTScore is more correlated with functional correctness, using the HumanEval dataset (Chen et al., 2021). We also show that CodeBERTScore achieves a higher newly proposed distinguishability than other metrics (Appendix F). Finally, we analyze some of the design decisions and their implications.

We used CodeBERT (Feng et al., 2020) as our base model ( $\mathcal{B}$ ) and continued its self-supervised pretraining (Gururangan et al., 2020) with the masked language modeling (MLM) objective (Devlin et al., 2019) on Python, Java, C++, C, and JavaScript corpora. We trained a separate model for each programming language, for 1,000,000 steps for each language, using a batch size of 32 and an initial learning rate of $5\times 10^{-5}$ , decayed linearly to $3\times 10^{-5}$ . Our implementation is based on the widely used HuggingFace Transformers library (Wolf et al., 2019) and BERTScore (https://github.com/Tiiiger/bert_score), and it supports any transformer-based model available on the HuggingFace hub.

Dataset

We trained each model on the language-specific subset of the CodeParrot (Tunstall et al., 2022) dataset (https://huggingface.co/datasets/codeparrot/github-code-clean), which consists of overall 115M code files from GitHub, further filtered by keeping only files having average line length lower than 100, more than 25% alphanumeric characters, and non-auto-generated files. Even after 1,000,000 training steps, none of the models have completed even a single epoch, meaning that every training example was seen only once at most.
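The two numeric filters mentioned above can be illustrated with a short sketch. The function name `passes_filters` and its parameter names are hypothetical, and the detection of auto-generated files is deliberately left out because the text does not specify how it is done.

```python
def passes_filters(source, max_avg_line_length=100, min_alnum_fraction=0.25):
    """Apply the average-line-length and alphanumeric-fraction heuristics.

    Assumption: both quantities are computed over the raw file text,
    with the line terminators counted in the character total.
    """
    lines = source.splitlines()
    if not lines:
        return False
    avg_line_length = sum(len(line) for line in lines) / len(lines)
    alnum_fraction = sum(ch.isalnum() for ch in source) / len(source)
    return (avg_line_length < max_avg_line_length
            and alnum_fraction > min_alnum_fraction)

ordinary_file = "def f(x):\n    return x + 1\n"
minified_like_file = "x" * 500          # one very long line
punctuation_only_file = "{}[]();\n" * 10  # no alphanumeric characters

results = [
    passes_filters(ordinary_file),
    passes_filters(minified_like_file),
    passes_filters(punctuation_only_file),
]
```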

Comparing Different Metrics

We compare CodeBERTScore with existing metrics that are commonly used on code generation evaluation. We use human annotated preference and execution-based results as the ground truth and measure their correlation with these metrics.

We used three major correlation metrics. Following best practices in natural language evaluation, we used Kendall-Tau ( $\tau$ ), Pearson ( $r_{p}$ ) and Spearman ( $r_{s}$ ) to measure the correlation between each metric’s scores and the references. The detailed equations can be found in Appendix C.

Human preference experiments

We evaluate different metrics on CoNaLa (Yin et al., 2018b), a natural language to Python code generation benchmark collected from StackOverflow. We use the human annotation of Evtikhiev et al. (2022) to measure the correlation between each metric and human preference. More details are provided in Section B.1.

Functional correctness experiments

We evaluate functional correctness using the HumanEval (Chen et al., 2021) benchmark. Each example in HumanEval contains a natural language goal, hand-written input-output test cases, and a human-written reference solution. While the original HumanEval is in Python, Cassano et al. (2022) translated HumanEval to 18 programming languages, and provided the predictions of the Codex model (Chen et al., 2021) (code-davinci-002) and their corresponding functional correctness (https://huggingface.co/datasets/nuprl/MultiPL-E). We used Java, C++, Python, and JavaScript for these experiments, which are some of the most popular programming languages in open-source projects (https://octoverse.github.com/2022/top-programming-languages). More details are provided in Section B.2.

Hyperparameters

We tuned only the following hyperparameters for CodeBERTScore: whether to use F1 or F3, and which layer of the underlying model to extract the encoded tokens from, which we examine in Section 5. We used F1 in the human preference experiments and F3 in the functional correctness experiments. We perform 3-fold cross-validation and report average results across the three folds. As for the layer to extract the token vectors from, we used layer 7 for CoNaLa, and in HumanEval we used layer 7 for Java, 10 for C++, 11 for JavaScript, and 9 for Python.

Results

Table 2 shows the correlation between different metrics and human preference. CodeBERTScore achieves the highest correlation with human preference, across all correlation metrics. While Evtikhiev et al. (2022) suggested that chrF and ROUGE-L are the most suitable metrics for evaluating code generation models in CoNaLa, CodeBERTScore outperforms these metrics by a significant margin. For example, CodeBERTScore achieves Kendall-Tau correlation of 0.517 compared to 0.470 of chrF and 0.420 of ROUGE-L. These results show that generated code that is preferred by CodeBERTScore also tends to be preferred by human programmers.

Correlation with functional correctness

Table 1 shows the correlation between different metrics and functional correctness: CodeBERTScore achieves the highest or comparable Kendall-Tau and Spearman correlation with functional correctness across all four languages. METEOR achieves a comparable correlation with CodeBERTScore in Java and JavaScript, and its correlation is surprisingly better than other baseline metrics. However, in C++ and Python, CodeBERTScore is strictly better. Overall on average across languages, CodeBERTScore is more correlated with functional correctness than all baselines.

Analysis

We conducted a series of additional experiments to understand the importance of different design decisions, and to gain insights on applying CodeBERTScore to new datasets and scenarios.

In all experiments in Section 4, we used the language-specific model which we continued to pretrain on each language. But what if we wish to use CodeBERTScore in a language in which we don’t have a language-specific model? We compare the language-specific models to CodeBERT-base in Figure 4. Generally, CodeBERT-base achieves close performance to a language-specific model. However, in most HumanEval experiments and correlation metrics, using the language-specific model is beneficial. These results show that language-specific models are often preferred if such models are available, but CodeBERT-base can still provide close performance even without language-specific pretraining.

Which transformer layer should we use?

We further investigate the impact of using hidden states from different layers of the model, that is, the layer from which the vectors in Equation 2 are taken, on the computation of CodeBERTScore. The results are shown in Figure 5: generally, the deeper the layer, the higher the average correlation between CodeBERTScore and functional correctness, across all programming languages. However, in almost all languages, performance reaches its maximum before the last layer, and decreases at the following layers. This suggests that higher layers encode the semantic information of each token more accurately, but the final layers may be more task-specific. These observations are consistent with Tenney et al. (2019), who found that lower layers in BERT tend to process shallow information, while higher layers encode deeper semantic meaning in natural language.

Does encoding natural language context help?

One major difference between CodeBERTScore and BERTScore is that CodeBERTScore leverages the context for the generated code, such as the natural language instruction or intent that was given as input for generation. We find that using context increases the correlation; for example, it raises the Kendall-Tau of CodeBERTScore from 0.50 to 0.52. While this paper mainly focuses on natural language instructions, we believe that CodeBERTScore can thus benefit other programming scenarios as well, for example when generating code given the human-written comments, or generating code given the preceding code context.

CodeBERTScore allows soft matching of tokens

The heatmaps in Figure 6 show the similarity scores between tokens in CodeBERTScore. For example, both shutil.rmtree and os.rmdir in Figure 6(a) delete a folder; CodeBERTScore aligns each token to a respective token in the other expression, even though the two spans do not share many identical tokens.

In Figure 6(b), both code snippets calculate a square root, where one uses math.sqrt(x) and the other uses x ** 0.5. An exact surface-form-matching metric such as chrF would assign a low similarity score to this code pair, as they only share the token x. However, CodeBERTScore assigns non-zero scores to each token with meaningful alignments, such as matching [sq,rt] with [_0,5], since a square root is the 0.5-th power.

Additionally, we study the robustness of CodeBERTScore to adversarial perturbations. We found that token-based metrics such as chrF are much more prone to matching trivial tokens rather than tokens that preserve the semantic meaning of the code. Examples can be found in Appendix E.

Additional discussion and experiments regarding the distinguishability of CodeBERTScore are provided in Appendix F. Additional general examples are provided in Appendix G.

Related Work

Metrics such as BLEU (Papineni et al., 2002) evaluate code generation by counting matching n-grams between generated and reference code. CrystalBLEU (Eghbali and Pradel, 2022) refines this approach by disregarding trivially shared n-grams, while ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) emphasize recall and balance of precision and recall respectively. However, these metrics, relying on exact lexical matches, often fail to capture semantically equivalent but lexically different code snippets. Unlike these, CodeBERTScore captures the wide, two-sided context of each token, which n-grams cannot capture.

Static analysis-based metrics

CodeBLEU (Ren et al., 2020) incorporates data-flow and Abstract Syntax Tree (AST) matching, in addition to token-matching. However, valid code may not always align in ASTs and data-flows. Additionally, partial code, although potentially useful, may not parse, thus cannot be fully evaluated by CodeBLEU. Further, as highlighted by subsequent studies (Wang et al., 2022), CodeBLEU does not correlate well with execution accuracy.

Execution-based Metrics

To alleviate previous issues, execution-based evaluation counts a generated code snippet as correct if it produces the required outputs when run with given inputs (Chen et al., 2021; Athiwaratkun et al., 2022; Li et al., 2022; Wang et al., 2022; Lai et al., 2022; Huang et al., 2022). However, execution-based evaluation requires datasets that are provided with manually crafted test cases for each example, which is costly and labor-intensive to create; thus, only few such datasets exist. In contrast, CodeBERTScore is completely unsupervised and does not depend on any specific dataset. Further, executing model-generated code is susceptible to security threats, and thus should be run in an isolated sandbox, which makes it technically cumbersome to work with iteratively.

Conclusion

In this paper, we present CodeBERTScore, a simple evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020), using pretrained language models of code, and leveraging the natural language context of the generated code. We perform an extensive evaluation across four programming languages which shows that CodeBERTScore is more correlated with human preference than all prior metrics. Further, we show that generated code that receives a higher score by CodeBERTScore is more likely to function correctly when executed. Finally, we release five programming language-specific pretrained models to use with our publicly available code. These models were downloaded more than 1,000,000 times from the HuggingFace Hub. Our code and data are available at https://github.com/neulab/code-bert-score.

Acknowledgement

We thank Misha Evtikhiev, Egor Bogomolov, and Timofey Bryksin for the discussions, and for the data from their paper (Evtikhiev et al., 2022). We thank anonymous reviewers for the valuable feedback. We are grateful to Yiwei Qin for the discussions regarding the T5Score paper (Qin et al., 2022); the idea to use functional correctness as a meta-metric was born thanks to the discussion with her. We are also grateful to Aryaz Eghbali and Michael Pradel for the discussions about CrystalBLEU (Eghbali and Pradel, 2022). This material is partly based on research sponsored in part by the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government. This project was also partially supported by a gift from AWS AI.

Limitations

CodeBERTScore requires a GPU for computing the metric, while traditional metrics such as BLEU require only a CPU. This adds a hardware requirement to the evaluation of models of code, while most previous approaches are computationally cheaper (e.g., by counting n-grams). However, since training and testing neural models require a GPU anyway, we can safely assume that a GPU is available. Further, BERT-base models are encoder-only and non-autoregressive; this means that they require only a single “forward pass”, compared to encoder-decoder models (e.g., T5) and decoder-only models (e.g., GPT-3) that need to autoregressively generate token after token, using a forward pass for each output token. Thus, the additional time consumption by encoder-only models (e.g., BERT) is negligible, especially when evaluating encoder-decoder or decoder-only models as the NL $\rightarrow$ Code generators.

Another point to consider is that CodeBERTScore relies on a strong underlying BERT-based model, while methods such as BLEU do not have many “moving parts” or hyperparameters to tune. However, this is mostly an advantage, since CodeBERTScore can be further improved in the future using stronger base models.

References

Appendix A Additional Details

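A more general F score $F_{\beta}$ uses a positive factor $\beta$ , where recall is considered $\beta$ times as important as precision:

$$F_{\beta}=\frac{(1+\beta^{2})\cdot\text{precision}\cdot\text{recall}}{\beta^{2}\cdot\text{precision}+\text{recall}}$$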

As found in METEOR (Banerjee and Lavie, 2005), using $F_{\beta}$ with $\beta=3$ , thus preferring recall over precision, results in a higher correlation with human preference in machine translation. In our experiments, we found that this applies to NL $\rightarrow$ Code as well.
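As a small worked illustration, the sketch below assumes the standard definition $F_{\beta}=(1+\beta^{2})PR/(\beta^{2}P+R)$ and compares F1 with F3 on hypothetical precision and recall values; the function name `f_beta` and the numbers are ours, chosen only for the example.

```python
def f_beta(precision, recall, beta=1.0):
    """General F score: recall counts beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing matches
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical values: a precise candidate that misses part of the reference.
p, r = 0.9, 0.6
f1 = f_beta(p, r, beta=1)
f3 = f_beta(p, r, beta=3)
```

With these values F3 lies closer to the recall than F1 does, which is the intended effect of choosing $\beta=3$.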

Token Weighting

Following Zhang et al. (2020), we compute the inverse document frequency (idf), according to a language-specific test set, and weigh each token according to its negative log frequency.
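A minimal sketch of idf weighting is given below. It assumes a plus-one smoothed idf, $\log\frac{N+1}{\mathrm{df}+1}$, so that tokens unseen in the corpus still receive a finite weight; this smoothing, the function names, and the toy corpus are assumptions made for illustration rather than details stated in the text.

```python
import math
from collections import Counter

def compute_idf(tokenized_docs):
    """Return a function mapping a token to its smoothed idf weight."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document

    def idf(token):
        # Assumed plus-one smoothing; unseen tokens have doc_freq 0.
        return math.log((n_docs + 1) / (doc_freq[token] + 1))

    return idf

def idf_weighted_average(tokens, best_scores, idf):
    """Average per-token best-match scores, weighted by idf."""
    weights = [idf(tok) for tok in tokens]
    total = sum(weights)
    if total == 0.0:
        return sum(best_scores) / len(best_scores)  # fall back to plain mean
    return sum(w * s for w, s in zip(weights, best_scores)) / total

# Toy corpus (hypothetical): "return" is frequent, "print" is rare.
corpus = [["return", "x"], ["return", "y"], ["print", "x"], ["return", "z"]]
idf = compute_idf(corpus)
weighted = idf_weighted_average(["return", "print"], [1.0, 0.0], idf)
```

Because the rare token `print` carries more weight than the frequent token `return`, failing to match it pulls the weighted score well below the unweighted mean of 0.5.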

Scaling

As observed by Zhang et al. (2020), the cosine similarity scores of hidden states tend to lie in a limited range. Thus, we can linearly scale the resulting scores, using an empirical base scalar $b$ :

$$\widehat{\text{CodeBERTScore}}=\frac{\text{CodeBERTScore}-b}{1-b}$$

Appendix B Evaluation Details

B.1 Human Preference

For each example, Evtikhiev et al. (2022) asked experienced software developers to grade the generated code snippets from five different models. The grades range from zero to four, with zero denoting that the generated code is irrelevant and unhelpful, and four meaning that the generated code solves the problem accurately. Overall, there are 2860 annotated code snippets (5 generations $\times$ 472 examples) where each snippet is graded by 4.5 annotators on average.

B.2 Functional Correctness

We evaluate functional correctness using the HumanEval (Chen et al., 2021) benchmark. Each example in HumanEval contains a natural language goal, hand-written input-output test cases, and a human-written reference solution. On average, each example has 7.7 test cases and there are 164 examples in total. While the original HumanEval is in Python, Cassano et al. (2022) translated HumanEval to 18 programming languages, and provided the predictions of the Codex model (Chen et al., 2021) (code-davinci-002) and their corresponding functional correctness (https://huggingface.co/datasets/nuprl/MultiPL-E). We used Java, C++, Python, and JavaScript for these experiments, which are some of the most popular programming languages in open-source projects (https://octoverse.github.com/2022/top-programming-languages). Notably, Cassano et al. (2022) did not translate the reference solutions to the other languages, so we collected these from HumanEval-X (Zeng et al., 2022) (https://huggingface.co/datasets/THUDM/humaneval-x). The reference score of every example is either 1 (“correct”, if it passes all test cases) or 0 (“incorrect”, otherwise).

Appendix C Correlation Metrics

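Kendall-Tau ( $\tau$ ) measures the ordinal/rank association between a metric such as CodeBERTScore and the reference measurement. It is calculated as:

$$\tau=\frac{|\text{concordant}|-|\text{discordant}|}{|\text{concordant}|+|\text{discordant}|}$$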

where $|\text{concordant}|$ represents the number of pairs where two measurements agree on their relative rank. That is, if $f(\hat{y_{1}},y_{1}^{*})>f(\hat{y_{2}},y_{2}^{*})$ , the reference measurement also yields $f^{*}(\hat{y_{1}},y_{1}^{*})>f^{*}(\hat{y_{2}},y_{2}^{*})$ . Similarly, $|\text{discordant}|$ represents the number of pairs where two measurements yield opposite ranks. Notably, in our experiments, we restrict the comparisons of ranks within the generations of the same question.

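Pearson ( $r_{p}$ ) measures the linear correlation between a metric and the reference measurement. It is defined as:

$$r_{p}=\frac{\sum_{i=1}^{N}\left(f(\hat{y_{i}},y_{i}^{*})-\bar{f}\right)\left(f^{*}(\hat{y_{i}},y_{i}^{*})-\bar{f^{*}}\right)}{\sqrt{\sum_{i=1}^{N}\left(f(\hat{y_{i}},y_{i}^{*})-\bar{f}\right)^{2}}\sqrt{\sum_{i=1}^{N}\left(f^{*}(\hat{y_{i}},y_{i}^{*})-\bar{f^{*}}\right)^{2}}}$$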

where $N$ is the number of generations in the dataset, $\bar{f}$ is the mean CodeBERTScore of the dataset, and $\bar{f^{*}}$ is the mean similarity score calculated by the reference measurement.

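Spearman ( $r_{s}$ ) measures the Pearson correlation coefficient between the ranks produced by a metric and the reference measurement:

$$r_{s}=\frac{\textrm{cov}\left(R(f(\mathbf{Y})),R(f^{*}(\mathbf{Y}))\right)}{\sigma\left(R(f(\mathbf{Y}))\right)\,\sigma\left(R(f^{*}(\mathbf{Y}))\right)}$$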

where $R$ returns the ranks of code snippets in a collection of code snippets $\mathbf{Y}$ . $\textrm{cov}(\cdot,\cdot)$ is the covariance of two variables and $\sigma(\cdot)$ is the standard deviation.
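The three correlation measures can be sketched in plain Python as follows. This is an illustrative implementation under simplifying assumptions: Kendall-Tau is computed as (concordant − discordant) / (concordant + discordant) with tied pairs skipped, all snippets are compared with each other rather than only within the same question, and the score lists are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau(metric, reference):
    """(concordant - discordant) / (concordant + discordant); ties skipped."""
    concordant = discordant = 0
    for i, j in combinations(range(len(metric)), 2):
        sign = (metric[i] - metric[j]) * (reference[i] - reference[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def pearson(xs, ys):
    """Linear correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            result[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return result

def spearman(xs, ys):
    """Pearson correlation computed on the ranks of the scores."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical metric scores and human grades for four generations.
metric_scores = [0.9, 0.7, 0.8, 0.4]
human_grades = [4, 3, 1, 0]
tau = kendall_tau(metric_scores, human_grades)
r_s = spearman(metric_scores, human_grades)
r_p = pearson(metric_scores, human_grades)
```

In this toy example five of the six pairs are concordant and one is discordant, so $\tau=(5-1)/6$.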

Appendix D Standard Deviation

Table 3 shows the same results as in Table 1, but with standard deviations. Table 4 shows the results from Table 2, with standard deviations.

Appendix E Robustness to adversarial perturbations

We conducted a qualitative evaluation of CodeBERTScore under various perturbations. An example is shown in Figure 7, which shows the CodeBERTScore and chrF rankings of three code snippets based on the similarity to the reference shutil.rmtree(folder). CodeBERTScore gives a higher ranking to the code snippet that employs the appropriate API (os.rmdir) than the trivial (folder) that has the same variable name but without any function call. Contrarily, chrF assigns a higher ranking to (folder) which has a longer common sequence of characters, although semantically inequivalent.

Appendix F Distinguishing Code with Different Semantics

We study how well CodeBERTScore can perform as a generic similarity function that measures the similarity between two arbitrary code snippets $y_{i}$ and $y_{j}$ .

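We evaluate CodeBERTScore using the distinguishability metric $d$ proposed by Eghbali and Pradel (2022) which is calculated as follows:

$$d=\frac{\sum_{(y_{i},y_{j})\in\text{Pair}_{\text{intra}}}f(y_{i},y_{j})\,/\,|\text{Pair}_{\text{intra}}|}{\sum_{(y_{i},y_{j})\in\text{Pair}_{\text{inter}}}f(y_{i},y_{j})\,/\,|\text{Pair}_{\text{inter}}|}$$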

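where $\text{Pair}_{\text{intra}}$ defines a set of code pairs from the same semantically equivalent clusters, and $\text{Pair}_{\text{inter}}$ defines a set of code pairs from two clusters of different functionality. Formally,

$$\text{Pair}_{\text{intra}}=\{(y_{i},y_{j})\mid\exists k\text{ such that }y_{i},y_{j}\in C_{k}\}$$
$$\text{Pair}_{\text{inter}}=\{(y_{i},y_{j})\mid\exists k\text{ such that }y_{i}\in C_{k},\,y_{j}\notin C_{k}\}$$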

where $C_{k}$ is the $k$ -th cluster with semantically equivalent code snippets. Intuitively, a similarity function $f$ that can distinguish between similar and dissimilar code will produce $d$ larger than 1, meaning that a pair of code snippets from the same semantic cluster has a higher similarity score than a pair of snippets from different clusters. Since the number of intra-class and inter-class pairs grows quadratically with the number of code snippets, in our experiments we followed Eghbali and Pradel (2022) to sample $N$ inter- and $N$ intra-class pairs instead.
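The sketch below illustrates the computation, reading distinguishability as the mean similarity over intra-cluster pairs divided by the mean similarity over inter-cluster pairs. It enumerates all pairs instead of sampling, and the token-overlap function `jaccard` is only a hypothetical stand-in for a real similarity metric such as CodeBERTScore.

```python
from itertools import combinations

def distinguishability(clusters, similarity):
    """Mean intra-cluster similarity divided by mean inter-cluster similarity."""
    intra, inter = [], []
    for k, cluster in enumerate(clusters):
        # Pairs of snippets drawn from the same cluster.
        intra.extend(similarity(a, b) for a, b in combinations(cluster, 2))
        # Pairs of snippets drawn from two different clusters.
        for other in clusters[k + 1:]:
            inter.extend(similarity(a, b) for a in cluster for b in other)
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

def jaccard(a, b):
    """Toy stand-in similarity: overlap of whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Two hypothetical clusters of semantically equivalent snippets.
clusters = [["a + b", "b + a"], ["a * b", "b * a"]]
d = distinguishability(clusters, jaccard)
```

Here every intra-cluster pair has similarity 1 and every inter-cluster pair has similarity 0.5, so $d=2$.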

F.2 Dataset with Semantically equivalent clusters

We follow Eghbali and Pradel (2022) to evaluate whether CodeBERTScore can distinguish similar and dissimilar code mined from ShareCode (https://sharecode.io/), an online coding competition platform. Semantically equivalent code snippets are from the same coding problem, and they all pass the unit tests provided by the platform. The dataset consists of 6958 code snippets covering 278 problems in Java and C++. We use CodeBERTScore to calculate the similarity score for code pairs that share the same semantic class and code pairs that do not. We then measure the distinguishability of CodeBERTScore according to Equation 7. The results are shown in Table 5.

Table 5 shows that CodeBERTScore achieves a higher distinguishability than CrystalBLEU, which proposed this meta-metric, in both Java and C++. CodeBERTScore achieves distinguishability scores of 9.56 in Java while CrystalBLEU achieves 5.96; in C++, CodeBERTScore achieves 9.13 while CrystalBLEU achieves only 6.94. This result confirms that CodeBERTScore assigns higher similarity scores to semantically similar code pairs, compared to randomly paired snippets that belong to different semantic classes.

Despite the encouraging results in Table 5, we also found that distinguishability can be easily manipulated since it compares absolute scores across different metrics. For example, while CrystalBLEU achieves a distinguishability score of 5.96, we can craft a variant of CodeBERTScore that achieves a distinguishability score of 120,000 by simple exponentiation of CodeBERTScore’s output score.

As Figure 8 shows, the distinguishability of CodeBERTScore$^{k}$ (CodeBERTScore raised to the power $k$ ) increases almost exponentially as $k$ increases, although the base CodeBERTScore metric has not changed.

We thus argue that distinguishability is not a reliable meta-metric and is no substitute for execution-based- or human-rating. We further suspect that any meta-metric that compares exact, absolute, scores across different metrics is susceptible to such manipulations, and the reliable way to compare metrics is according to the way they rank different examples, rather than the exact scores.
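The manipulation can be reproduced on toy numbers. The sketch below uses hypothetical similarity scores for intra-cluster and inter-cluster pairs and raises every score to the power $k$; the helper names and the values are assumptions for illustration only.

```python
def mean(values):
    return sum(values) / len(values)

def distinguishability_of_power(intra_scores, inter_scores, k):
    """Distinguishability of the metric f**k, given the scores of f."""
    return mean([s ** k for s in intra_scores]) / mean([s ** k for s in inter_scores])

# Hypothetical scores of a base metric on intra- and inter-cluster pairs.
intra = [0.9, 0.8, 0.85]
inter = [0.6, 0.5, 0.55]

d_by_k = {k: distinguishability_of_power(intra, inter, k) for k in (1, 5, 20)}

# Exponentiation is monotonic for positive scores, so rankings do not change.
all_scores = intra + inter
order_base = sorted(range(len(all_scores)), key=lambda i: all_scores[i])
order_pow = sorted(range(len(all_scores)), key=lambda i: all_scores[i] ** 20)
```

The ratio grows quickly with $k$ even though the ordering of all pairs is identical, which is why a rank-based comparison is unaffected by this transformation.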

Appendix G Additional Examples

In this section, we provide additional examples in which CodeBERTScore prefers the functionally correct prediction, while the best baseline metric in each language ranks higher a functionally incorrect prediction, which is inequivalent to the reference. Figure 9 shows an example in Java, and Figure 10 shows a C++ example.