Diet Code Is Healthy: Simplifying Programs for Pre-trained Models of Code

Zhaowei Zhang, Hongyu Zhang, Beijun Shen, Xiaodong Gu

Introduction

Pre-trained models of code such as CodeBERT (Feng et al., 2020) have been the cutting-edge program representation technology, yielding remarkable performance on a variety of software engineering tasks such as code completion (Ciniselli et al., 2021), code search (Feng et al., 2020), and clone detection (Lu et al., 2021). Having pre-trained on large-scale code corpora, they demonstrate a better understanding of the semantics of source code than previous deep learning models such as code2vec (Alon et al., 2019) and ASTNN (Zhang et al., 2019).

Despite making a giant leap in accuracy, pre-trained models are often heavy in computation, which significantly hinders their applications in practice. For example, the standard CodeBERT contains 125 million parameters and takes 28 hours to pre-train on 8.5M code. More critically, pre-trained models usually need to be fine-tuned before use, which is cost inefficient due to the large scale of parameters and training data. As such, it is highly desirable to identify the critical feature learned by pre-trained models and reduce the required computational cost by focusing only on the important information from model inputs (Rabin et al., 2021).

To understand critical information learned by pre-trained models, we conduct an empirical analysis of CodeBERT – a pre-trained model for programming and natural languages. Our study aims to find out (i) what kinds of tokens CodeBERT pays the most attention to; and (ii) what kinds of statements are most important to CodeBERT when learning code representations. To answer these two questions, we categorize tokens and statements into a few classes and summarize the attention weights of each class that the pre-trained CodeBERT assigns to. Our results reveal that keywords and data types are the most critical tokens that CodeBERT focuses on. In terms of statements, CodeBERT pays more attention to method signatures and return statements, which show the overall functionality of a method.

Based on these empirical findings, we propose DietCode, a novel method that aims at lightweight leverage of large pre-trained models for source code. DietCode reduces the computational complexity of pre-trained models by pruning unimportant tokens and statements from the input programs. Three pruning strategies are proposed, including word dropout, frequency filtering, and attention-based pruning. In particular, the attention-based pruning strategy selects tokens and statements that receive the highest attention from pre-trained models. The program simplification algorithm formulates statement selection as a 0-1 knapsack problem where statements are regarded as items, and their attention weights are regarded as values. The algorithm selects statements (items) under the constraint of a given target length (capacity).

We apply DietCode to two downstream tasks, namely, code search and code summarization. We measure the performance with relative length, FLOPs, and time cost, and compare our approach with baseline models, including the original pre-trained code model and SIVAND (Rabin et al., 2021). Experimental results show that DietCode provides comparable results as RoBERTa (code), with nearly 40% less computational cost in fine-tuning and testing.

Our contributions can be summarized as follows:

We conduct an in-depth empirical analysis of critical tokens and statements learned by CodeBERT.

We propose a novel program simplification approach for pre-trained programming language models that can significantly reduce the computation cost while retaining comparable performance.

We extensively evaluate the proposed approach in two downstream tasks and show the effectiveness of our approach.

Background

Pre-trained language models such as BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and T5 (Raffel et al., 2020) have achieved remarkable success in a variety of NLP tasks (Yang et al., 2019; Clark et al., 2019; Conneau and Lample, 2019). They refer to neural network models trained on large text corpora and can be fine-tuned to low-resource downstream tasks.

State-of-the-art pre-trained models are mainly built upon the Transformer architecture (Vaswani et al., 2017). The Transformer is a sequence-to-sequence learning model using the attention mechanism (Bahdanau et al., 2015). It is based on the encoder-decoder architecture, where a source sequence is encoded into hidden states and then taken as input to the decoder for generating a target sequence. Both the encoder and decoder contain multiple identical layers, and each layer consists of a multi-head self-attention network followed by a feed-forward network. Both of the outputs will be normalized before entering the next layer.

The key component of Transformer is the self-attention mechanism that represents a sequence by relating tokens in different positions (Vaswani et al., 2017). The goal of self-attention is to learn the important regions in the input sequence. Unlike traditional recurrent neural networks (Cho et al., 2014), self-attention networks can learn the dependencies from distant tokens in parallel. For an input sequence ( $x_{1}$ ,…, $x_{n}$ ) of length $n$ , the self-attention produces its representation ( $z_{1}$ , $z_{2}$ … $z_{n}$ ) as

where $W^{Q}$ , $W^{K}$ , and $W^{V}$ denote the model’s parameters. $d$ is the dimension of the input matrix. The model involves several attention heads, and each head’s output is concatenated into the final result.

Pre-trained models usually involve large-scale parameters and consume huge computational resources. For example, BERT-base and BERT-large contain 110M and 340M parameters, respectively. As such, the lightweight leverage of pre-trained models is greatly desirable for research and practitioners (Sanh et al., 2019).

2. CodeBERT

Recently, researchers have adopted BERT for software engineering tasks and proposed CodeBERT (Feng et al., 2020). CodeBERT is a bimodal pre-trained model for natural and programming languages, capturing semantic representations from programming languages (Feng et al., 2020). The program representations learned by CodeBERT can be further utilized for downstream tasks such as code search and code summarization.

CodeBERT is built on top of a Transformer encoder (Vaswani et al., 2017). The optimization involves two tasks: masked language modeling (MLM) and replaced token detection (RTD). MLM masks two random tokens from the input pair of code and natural language comments, and aims to predict the original token in an extensive vocabulary. RTD involves two generators and a discriminator. The generators predict the original token for the masked token while the discriminator predicts whether the tokens are original or not. After pre-training, CodeBERT can be adapted to downstream tasks through fine-tuning on the target dataset.

Empirical Analysis

In this section, we describe our study methodology and experimental setup. There are many levels of granularity for code, such as tokens, statements, and functions. As an in-depth study, we begin by investigating the atomic unit of source code, namely, tokens. We next investigate the statement-level knowledge learned by CodeBERT, which contains basic structures and semantic units. We finally explore the function-level knowledge learned by CodeBERT through downstream tasks. In summary, we design our study methodology by addressing the following research questions:

RQ1: What critical tokens does CodeBERT learn about? We study the critical information CodeBERT learned at the token level by analyzing the attention weights assigned to these tokens and visualizing their relative importance.

RQ2: What critical statements does CodeBERT learn about? We further study the statements that CodeBERT assigns the most weights to. We classify code statements into common categories such as initialization, assignment, and return, and present the attention weight of each category assigned by CodeBERT.

To answer these questions, we first need to know how to represent the key information learned by CodeBERT. In other words, how to measure the importance of each token and statement? Since the heart of CodeBERT is the self-attention network, where hidden states of each token are calculated layer-by-layer according to the self-attention weights, we measure the importance of each token using the attention weights in the Transformer layers in CodeBERT after pre-training. Next, we will present the details of how to measure the importance of tokens and statements, respectively.

As introduced in Section 2.2, CodeBERT takes a sequence of source code tokens as input and produces a self-attention weight for each input token. Each weight measures how the corresponding token gains attention from other tokens in the input sequence. The higher the attention weights, the more attention that are paid by other tokens. Therefore, in our study, we measure the importance of each token using the attention weight. CodeBERT has multiple self-attention layers and heads, each producing an attention weight for the same token. We average the attention weights of all layers and heads for each token. In our experiments, we go through the CodeSearchNet corpus (Husain et al., 2019) and take as input each code snippet to the pre-trained CodeBERT. Then, we calculate the average attention weight that CodeBERT assigns to each token.

1.2. Computing Attentions of Statements

Having obtained the importance of each token, we compute the attention weight for each statement. Intuitively, the attention weight for a statement can simply be obtained using the average attention weights of all its tokens. However, different tokens in a statement have different importance. Without considering the global importance of each token, statements that consist of unimportant tokens could obtain even higher attention. Based on this concern, we regularize statement attention by penalizing unimportant tokens that gain fewer attention weights across the whole corpus. More specifically, we calculate the attention weight for a statement $S$ using a weighted average of attention weights of its tokens:

where $a(t)$ represents the attention weight of token $t$ in statement $S$ ; $w(t)$ denotes the normalized attention weight of token $t$ in the whole corpus described in the previous section. The normalization is implemented using the Softmax function, namely,

Unlike tokens, statements are often unique in the corpus and cannot be included in a dictionary. To efficiently analyze statements, we classify statements into 21 categories such as method signature, variable declaration, and if condition, and only present the average attention weight for each category. These categories are mainly guided by Java specification (Gosling et al., 2000). In particular, the logger category contains statements with standard logging functionalities such as log4j, Logger, and println. Table 1 shows the categories of statements we have summarized and the corresponding quantities in the CodeSearchNet corpus. We can see that function invocation and method signature are the most common statements, while loop statements such as while are relatively rare.

2. Results and Analysis

In this section, we present the results of our experiments for each research question.

Figure 2 highlights the critical tokens in Java that gain the most attention by CodeBERT. The graph is summarized from the CodeSearchNet (Husain et al., 2019) corpus which contains more than 100,000 Java functions. For tokens in each function, we calculate their attention weights and insert them into a ¡token, attention¿ map until the keys of the map are stable. For ease of visualization, we remove tokens that appear less than 50 times in the corpus. Among all the Java tokens, coverage gains the lowest attention weight (5.43 $e$ -5) while boolean has the highest weight (2.94 $e$ -2). The standard deviation of these weights is 5.97 $e$ -3.

The results show that CodeBERT assigns more attention to Java keywords such as public and boolean. This is expected since Java keywords frequently appear in Java code and play a dominant role in representing code semantics. Another category of tokens that receives high attention is the data-oriented identifiers such as Map, List, and String, probably because they define key data structures in a function.

Motivated by this finding, we further study the attention of Java keywords. Figure 1 shows the attention weight of each Java keyword. As we can see, the keyword private obtains the highest attention weight of 1.15 $e$ -2 among all keywords, while interface and transient gain the lowest attention weights of 7.14 $e$ -4 and 9.62 $e$ -4, respectively. The standard deviation of these weights is 1.513 $e$ -3.

Among all Java keywords, method modifiers such as public, private, and static receive the highest attention weights We conjecture that these modifiers often denote the beginning of a method signature, which is essential to understanding the entire code. Next to these modifiers, finally and return also receive high attention weights. Finally usually represents the end of a function, and the code inside the finally block will definitely be executed. Return represents the method’s output, which may show the method’s functionality. Surprisingly, branching-related keywords such as if and switch receive lower attention, presumably because CodeBERT considers them unimportant for program understanding despite their frequent occurrence in code.

A similar pattern can be observed in Python. For example, CodeBERT prefers keywords such as return and def that represent the general structure of a function, just like return and public in Java. In particular, among the branching and loop keywords, while ranks higher than if and for.

Figure 3 shows a heatmap of attentions for a Java function. The code is about reading content from a file. The tokens with higher attention weights are marked in deeper colors. The heatmap shows a similar result as discussed above. public and finally receive the highest attention weights in this function, while new and if obtain the lowest attentions. Tokens that are related to the core functionality, such as File and read, have shown high attention weights. By contrast, tokens that are auxiliary to the main functionality, such as null and Exception have received much lower attention. Another interesting observation is that grammatical symbols such as } and; are considered necessary by CodeBERT probably because they mark the end of statements and blocks. Overall, the heatmap shows that CodeBERT mainly pays attention to functional tokens that reflect the overall functionality and does not care much about notional tokens about grammar and special branches.

2.2. What does CodeBERT learn about code statements? (RQ2)

Figure 4 shows the attention weights assigned by CodeBERT for each type of Java statement. As seen, method signature has gained the highest attention (4.11 $e$ -3), while case statement obtains the lowest attention (2.32 $e$ -3). The standard deviation of the attention weights over all categories is 4.88 $e$ -4. Broadly speaking, CodeBERT focuses more on statements indicating the overall functionality of a method, such as method signature and return, probably because they contain dense information of the whole function, such as names and targets.

Interestingly, arithmetic expressions (expressions with mathematical operations) also receive more attention than other statements. Like natural language, arithmetic expression can be viewed as a sequence of operations that tell the functionality of a statement. Function invocations has also been shown to be essential for CodeBERT to understand code. Like arithmetic expressions, function invocations can be regarded as identifiers of other method signatures, which can be literally understood through the function names and parameters.

Surprisingly, CodeBERT does not pay much attention to statements related to control flow structures, such as while, for and case. This indicates that CodeBERT can effectively learn representations of plain texts while being limited in learning structures.

Figure 3 also shows the heatmap of statement attentions for a sample function of Java. Similar to the conclusions above, method signature is the most important, which provides a general introduction to the method’s functionality. At the same time, statements about variable declaration and initialization, such as “String content == null” receive lower attentions, probably because they only create new variables with initial value null without actual operations to them. The attention weight of the if statement in the finally block is also low (2.92 $e$ -3). This function only closes the file reader and may have little effect on CodeBERT’s understanding of the whole method.

We observe a similar trend for Python statements. Method signatures also receive the highest attention weight (6.86 $e$ -3), while branching statements such as break and continue have the lowest weights. The standard deviation of attention weights for Python statements is 1.66 $e$ -3, which is a bit larger than that of Java statements.

Compared to Java, the distributions of Python statements and tokens are a bit more sparse. For example, the method signature statement which ranks the first owns a much higher weight than the second category of statements (i.e., return). The top keyword def also obtains a much higher attention weight than other keywords.

DietCode: Program Simplification for CodeBERT

As Equation 1 indicates, the computational complexity for computing the attention matrix is $\mathcal{O}(d^{2}n+n^{2}d)$ , which quadratically scales with sequence length. As such, the attention operation becomes a bottleneck when applied to long sequences such as source code (Kim et al., 2021).

We wonder whether we can remove unimportant tokens or statements from the input programs of CodeBERT. In this way, the time and space costs for the pre-trained model can be substantially reduced. Based on this idea, we propose DietCode, a lightweight pre-trained model for source code. DietCode reduces the computational complexity of CodeBERT by simplifying input programs.

More specifically, our problem formulation is as follows: given a code snippet $C=\{s_{1};\ldots;s_{|C|}\}$ which consists of $|C|$ statements, we want to output a pruned code snippet $C_{p}$ which contains a maximum length of $L$ tokens. The goal of program simplification is to prune as many tokens as possible without a significant impact on the accuracy of downstream tasks.

One straightforward strategy for program simplification is the word dropout (Iyyer et al., 2015), namely, randomly dropping tokens from the input code to meet a specified sequence length. For each token $w\in S$ , we define a binary variable $r_{w}\in\{0,1\}$ to indicate whether the token will be kept ( $r_{w}=1$ ) or pruned ( $r_{w}=0$ ):

where $p=L/|C|$ denotes the probability of selecting a token. In addition to reducing computational complexity, word dropout has also been shown to improve the robustness of neural networks (Iyyer et al., 2015). Hence, it enhances the performance of many tasks.

2. Frequency Filtering

A key concern with the word dropout strategy is that it can also prune important tokens randomly from the input. Hence, we introduce another frequency-based token pruning strategy that removes uncommon tokens while keeping only the most frequently occurred tokens.

Specifically, for each input code snippet $C=\{w_{1},...,w_{n}\}$ , we keep tokens if their frequency is among the top $k$ ( $k=|C_{p}|$ ) of all tokens in the input, namely,

where $f(w)$ denotes the frequency of $w$ in the whole code corpora.

3. Attention-based Code Pruning

The frequency filtering strategy can roughly distinguish important tokens, but may only bias common tokens. According to our empirical findings above, CodeBERT pays attention to certain types of tokens and statements. Tokens that receive high attention do not coincide with common tokens. In order to provide a more fine-grained selection of important tokens and statements, we propose an attention-based code pruning strategy that selects tokens and statements based on their attention weights.

The simplification procedure is summarized in Algorithm 1. We begin by summarizing the category of statements for a given programming language according to the methodology in our empirical study. Then, we create attention dictionaries for both statements and tokens. The algorithm runs code pruning in two phases: selecting statements and pruning tokens.

In the statement selection phase, we select pivot statements that belong to a category that obtain the highest attention weights in our empirical study. Meanwhile, we want the selected statements not to exceed the length constraint (i.e., $L$ tokens). This can be formulated as a 0-1 knapsack problem (Pisinger, 1995), where statements can be regarded as the items to be collected into a knapsack, with the attention weights being the values, and the statement lengths being the weights. The length constraint can be regarded as the capacity of the knapsack. It is worth noting that we enlarge the capacity by the maximum length of statements for tolerance of selecting one more statement so that the algorithm can further perform token pruning in the next phase.

In the token pruning phase, we greedily remove the tokens that have the lowest attention weights in the lowest weighed statements iteratively until satisfying the maximum number of tokens.

One potential problem is the different scales of statement attentions and token sizes. We find that some shorter sequences are much more likely to be chosen during our experiments even if they receive low attention weights. This is probably because the value ratio (the attention weight divided by the sequence length) of a shorter sequence is much larger than the longer one. We amplify the attention weights using min-max normalization and multiply with the number of tokens in the statement, that is,

where $N$ denotes the number of tokens in the statement.

The time complexity of the simplification algorithm is $\mathcal{O}$ ( $N\cdot(L_{T}+L_{M})$ ), where $L_{T}$ denotes the target number of tokens, and $L_{M}$ denotes the maximum length of statements. The time cost mainly comes from the 0-1 knapsack algorithm. Furthermore, the knapsack algorithm requires a two-dimensional array for dynamic programming, which requires space complexity.

Figure 5 shows an example of program simplification by DietCode. The original code (Figure 5(a)) aims to read content from a file with a given filename. Our goal is to reduce the code to contain only 60 tokens without considerably losing the semantics. In the statement selection phase (Figure 5(b)), the simplification algorithm solves a 0-1 knapsack problem by setting the capacity to 78 (including the target length of 60 and the longest statement length 18). Then, DietCode selects trunk statements such as method signature and return, and removes trivial statements such as variable assignments and catch. The number of tokens in the code is reduced from 118 to 77. In the token pruning phase, DietCode further removes tokens such as variable initialization (Figure 5(c)), and reduces the number of tokens from 77 to 60. Although reduced by around 50%, the main semantic of this function is not destroyed significantly. CodeBERT can still understand the function through key statements such as the method name readFile, the key variable content, and the invoked function read. Although the remaining code is never runnable, the selected critical information can be used for downstream tasks such as code search and summarization.

Evaluation

This section introduces the evaluation of DietCode in two downstream tasks. We aim to answer the following research questions through experiments:

RQ3: How effective is DietCode in program simplification? We evaluate the performance and computation cost of DietCode in two downstream tasks and compare it with baseline models.

RQ4: How effective is DietCode under different relative lengths? We study the effect of different relative lengths on the evaluation scores. Our goal is to find the relative length that leads to the best trade-off between model performance and computational expense.

RQ5: What is the effect of different pruning strategies? The attention-based DietCode performs both statement and token pruning. We conduct an ablation study to identify the effect of pruning each of them.

We test our algorithm in two downstream tasks, namely, code search and code summarization. They are the most widely used software engineering tasks to demonstrate the capacity of NL-PL understanding (Kamiya et al., 2002; White et al., 2016; Haiduc et al., 2010; Jiang et al., 2017; Nie et al., 2016; Stolee et al., 2014).

Code Search. Code search is a typical testbed for pre-trained code models such as CodeBERT (Feng et al., 2020). The task aims to find code snippets from a codebase that is most relevant to a given query.

Code Summarization. Code summarization aims to generate a natural language summary for an input code snippet. It has also been widely used for evaluating pre-trained models for source code (Feng et al., 2020; Wang et al., 2021). Through this task, we want to verify whether program simplification by DietCode is effective in generation tasks.

2. Datasets

We use the CodeSearchNet to fine-tune and test our model (Husain et al., 2019), which contains ¡comment, code¿ pairs collected from open source projects. code means the code snippet of a method, while comment is the description of the code that was mainly collected from comments for the whole functions such as Javadocs or docstrings in Python. The statistics of CodeSearchNet is shown in Table 2.

The original data of CodeSearchNet are in the format of token sequences. We preprocess the data by parsing them into statements. We parse statements by splitting brackets and semicolons according to the common guidance of Java (Gosling et al., 2000). For Python code, there are no delimiters because the indentation is missing in the dataset. Besides, Python does not use brackets and semicolons. Hence, we split statements using other hint symbols such as “def”, “=”, “:”, and “)”.

To reduce the vocabulary size of code tokens, we delexicalize all string constants to a generic token string and all numerical constants to the same token 10. We conserve special numbers such as 0, 1, and -1, which can belong to boolean values.

3. Evaluation Metrics

First, we want to measure the maximum extent of code reduction that DietCode can achieve without losing much accuracy. We define Relative Length (RL) to measure how much of the code is remained after simplification. It is computed as the percentage of the length of the simplified code snippet $|C_{p}|$ with respect to the length of the original code $|C|$ :

where — $\cdot$ — denotes the length of a code snippet (i.e., number of tokens). The smaller the relative length, the greater the simplification to the original code.

We also use FLOPs (floating point operations) (Hunger, 2005) to measure the effect of model reduction. FLOPs is a widely used metric to measure the complexity of a machine learning model. The higher the FLOPs, the slower the processing of the model.

The third metric is the time cost, including fine-tuning time (FT Time) and testing time. We calculate the execution time of DietCode for fine-tuning and testing, measured by the number of hours from the program starting to the termination.

Besides the metrics for complexity, we use two standard metrics to evaluate the effectiveness of DietCode in two downstream tasks respectively. We measure the performance of code search using MRR (mean reciprocal rank), which refers to the average of the multiplicative inverse of the position for the first correct answer for the query (Husain et al., 2019).

We measure the performance of code summarization using the BLEU-4 (bilingual evaluation understudy) score (Lewis et al., 2020), which calculates the average of n-gram precision on a couple of sequences with a penalty for short sequences.

4. Experimental Setup

The demonstrate the strength of our approach, we simplify the input programs for pre-training based models and compare the performance to that of the original pre-trained model. Specifically, we compare our technique to the vanilla versions of three pre-trained models:

CodeBERT (Feng et al., 2020): the original CodeBERT without code simplification. We follow the experimental setup in the CodeBERT paper.

CodeT5 (Wang et al., 2021): a widely used pre-trained programming language model based on the sequence-to-sequence architecture. CodeT5 has demonstrated better performance on generative tasks (Wang et al., 2021).

RoBERTa (Liu et al., 2019): a popular extension of BERT that has demonstrated significant improvement. The original RoBERTa was pre-trained in natural languages. So we also report the results of RoBERTa (code) (Feng et al., 2020), a variant that was pre-trained on source code.

Besides the pre-trained models, we also compare the accuracy of DietCode with that of traditional deep learning approaches, including BiRNN (Cho et al., 2014), SelfAttn (Vaswani et al., 2017), Seq2Seq (Cho et al., 2014), and Transformer (Vaswani et al., 2017). As they are also baseline models for CodeBERT, we directly take the results from the CodeBERT paper (Feng et al., 2020).

Finally, we compare DietCode with SIVAND (Rabin et al., 2021), which is also a program simplification method proposed by Rabin et al. (2021). SIVAND https://github.com/mdrafiqulrabin/SIVAND is a machine-learning-based algorithm that recursively splits source code into fragments and measures its importance through the method name prediction task. The algorithm selects important fragments as the reduced code. In our experiments, we directly use the code provided by the authors.

We implement our model based on the GitHub repository of CodeBERT https://github.com/microsoft/CodeBERT and CodeT5 https://github.com/salesforce/CodeT5 using the default hyperparameter settings. All models are optimized with the Adam algorithm (Kingma and Ba, 2017) with learning rates of 1 $e$ -5 and 5 $e$ -5 for the code search and code summarization tasks respectively.

We run all models on a machine with a CPU of Intel(R) Xeon(R) Silver 4214R 2.40GHz and a GPU of Nvidia Tesla P40.

5. Experimental Results (RQ3)

Table 3 and 4 show the performance of DietCode in two downstream tasks. Overall, DietCode reduces the computational cost by around 40% while keeping a comparable accuracy to the original pre-trained models. The original CodeBERT takes nearly one day for fine-tuning and more than 3 hours for code search on the Java test set to achieve an MRR score of 0.74. In contrast, by reducing the length of the input code to 60%, the fine-tuning and testing time can be reduced to 11.08 hours and 1.83 hours, respectively. The FLOPs also drops from 16.99G to 10.19G. At the same time, the accuracy (MRR = 0.71) is comparable to the original CodeBERT (MRR=0.74). It is worth noting that although the performance drops slightly, it still significantly outperforms traditional non-pretraining based methods such as BiRNN (MRR=0.29) and SelfAttn (MRR=0.59).

DietCode on CodeT5 demonstrates a similar efficacy. The vanilla CodeT5 requires more than 16 hours for fine-tuning and around 3 hours for testing on the Java code search task. By dropping out 40% input tokens using DietCode, the fine-tuning and test time is reduced to around 9 and 1.6 hours, respectively. Meanwhile, DietCode (attention) achieves a comparable performance (MRR=0.71) to the original CodeT5 (MRR=0.72) with only a reduction of 1.5%.

DietCode is also effective in the code summarization task. Taking Java as an example, the vanilla CodeBERT achieves a BLEU-4 score of 18.95 at the expense of 13.82 hours of fine-tuning and 4.8 hours of testing. When the target length is set to 150 tokens (i.e., 60% of the original length), the fine-tuning time is reduced to about 60% and the FLOPs drops from 24.328G to only 15.325G. This does not sacrifice accuracy (BLEU-4=17.29).

Comparing the three pruning strategies, DietCode with attention shows strength over the two other strategies. For example, DietCode with attention-based pruning achieves an MRR of 0.71 in the Java code search task, which is better than dropout (MRR=0.68) and frequency filtering (MRR=0.66). The results suggest that by finer-grained pruning of tokens based on attention weights, DietCode can achieve more effective program simplification. DietCode on the CodeT5 demonstrates a similar trend in both tasks, where the attention-based pruning achieves better performance (BLEU=19.25) than dropout (BLEU=17.65) and frequency filtering (BLEU=18.74). We observe that this strength becomes less significant in the Python language. We conjecture that the Python language contains fewer auxiliary tokens (such as brackets), resulting in a more uniform distribution of attention weights over all tokens. Therefore, removing tokens with smaller attention is closer to removing random or frequent tokens in the Python language.

As a noteworthy point, it is inapplicable to test the baseline model SIVAND in our datasets, as it takes a massive amount of time (¿30,000 hours according to our estimation) to process more than 1 million functions in the CodeSearchNet benchmarks. This is mainly because the ddmin algorithm used by SIVAND is based on backtracking, where each step the algorithm should invoke the deep learning model for a prediction task. For this reason, we estimate the time cost of SIVAND by sampling 50 functions from CodeSearchNet. Results show that it costs 104 and 77 minutes respectively in the CodeSearchNet (Java) and CodeSearchNet (Python) datasets. Based on this observation, we estimate the testing time in the entire dataset. Since the data we sample is small in size, it is insufficient to fine-tune the model to validate the accuracy.

6. Effect of Relative Length (RQ4)

Figure 6 shows the performance of DietCode under different relative lengths and pruning strategies. We can see that the performance of all pruning strategies drops sharply as the relative length decreases. In both two tasks, DietCode with attention performs better than the other two pruning strategies under all relative lengths. When the relative length is above 70%, the performance is comparable to that of the original CodeBERT but the reduction of computational cost can be modest. When the relative length drops below 50%, the performance is deficient since most of the code has been pruned. A relative length of around 60% is probably the best trade-off for these two tasks.

7. Effect of Token and Statement Pruning (RQ5)

Figure 7 shows the performance of DietCode with token and statement pruning on the two tasks. In order to verify the effect of statement pruning, we only run the statement selection phase in Algorithm 1. To test the effect of token pruning, we drop the lowest attention tokens directly until meeting the target length. As we can see from the results, a significant drop in performance is observed for both strategies as the relative length decreases. When 90% code remains, the MRR scores of the two pruning strategies are close. As the relative length decreases, the score of the token pruning strategy drops much faster than that of the statement pruning strategy. When the relative length decreases further, their MRR scores drop dramatically. The results suggest that statements are more critical than tokens in learning code representations by CodeBERT. One possible reason could be that token attention usually measures local information, while statement attention considers the relationship between tokens. Therefore, statement attention could keep more semantic of the functions. The code summarization task shares a similar result as the code search task.

Discussion

One debatable question is how can pre-trained models understand simplified code which is not compilable and runnable. Source code involves two channels of information: formal & natural (Casalnuovo et al., 2020; Chakraborty et al., 2020). It can either be considered as programs for computers to execute or as a specific natural language for humans to communicate (Hindle et al., 2012). The former requires the source code to be strictly compilable and runnable, while the latter focuses more on conveying information to humans. We believe that CodeBERT considers more about the natural channel of source code (Hindle et al., 2012; Chakraborty et al., 2020). As such, it pays more attention to keywords without necessarily having to know how the program actually behaves during execution.

Our empirical findings are in agreement with a previous human study on how programmers summarize source code (Rodeghero et al., 2014), which shows that developers usually start by reading the function name. They may skip some structural information such as For and If conditions while paying special attention to literal information such as method invocations and variable names, which literally show the intention of the code (Rodeghero et al., 2014). The pruning algorithms described in our paper make use of this finding and simplify source code by only selecting critical information (e.g., reflected by the attentions of CodeBERT) in the source code. Therefore, they do not have much impact on the understanding of source code.

2. What users may adopt the techniques?

One debatable question is what users may adopt the technique since it has a precision loss. From the results in Table 3 and 4, although the performance drops slightly, it still significantly outperforms traditional non-pretraining based methods such as BiRNN and SelfAttn. Meanwhile, DietCode shows an advantage of computational efficiency (e.g., saving 40% of fine-tuning time). Hence, DietCode provides a practical option for users with limited computational resources but still want quick leverage of large pre-trained models.

3. The role of the 0-1 knapsack formulation

One may question whether we really need to transform the problem into the 0-1 knapsack. In our preliminary studies, simply pruning statements often results in sequences that are either too long or too short. Token pruning provides a finer-grained pruning to the input program. This amounts to selecting statements and tokens to obtain the maximum attention while satisfying the length restriction. Hence, we formulate it as a 0-1 knapsack problem and solve it using a greedy strategy. According to the experimental results in Figure 7, a simple statement pruning archives an MRR of around 0.68 on Java code search when the relative length is 60%, which is evidently lower than that of DietCode (MRR=0.71). The same comparison can be observed on the Java code summarization task (i.e., 16.8 vs. 17.3 in terms of BLEU). This suggests the effectiveness of the 0-1 knapsack formulation.

4. Implications for Future Research

Through our empirical study, we find that CodeBERT does not recognize grammar and structural information very well. For example, the if condition, for condition, and while condition are three typical statements representing code structures, but they are rarely considered by pre-trained models when learning program representations. This motivates future research on better incorporating code structures into pre-trained code models, for example, converting structural symbols (e.g., the bracket } ) to a more verbal style (e.g., ‘the end of a for statement’) before feeding into pre-trained models.

Besides DietCode, there is a broader class of techniques, including model compression (Shen et al., 2020), model distillation (Sanh et al., 2019), and data reduction (Ye et al., 2021), which aim to reduce the computational complexity of large pre-trained models at the cost of a slight decrease in performance. DietCode provides an easy, alternative solution by only pruning input programs. We believe this is more practical for developers as they only need to process the data without modifying the model. In the future, we can investigate more techniques such as model compression and distillation.

5. Threats to Validity

Internal Threats. When calculating the attention weights for statements, we encounter the out-of-vocabulary (OOV) issue. We resolve this issue by excluding rare categories of statements in our experiments. Nevertheless, some categories of statements could still be necessary though having a low quantity in the corpus. Furthermore, the raw data provided by CodeSearchNet is in the format of plain text. We parse statements using text processing rules designed by ourselves. However, there could be irregular patterns that our parser can not recognize. This could cause noise in our dataset and affect the results of our research.

External Threats. We have just conducted our experiments in Java and Python. Though the conclusions from both languages are similar, other programming languages such as LISP and Erlang could have different attention patterns. In addition, DietCode is evaluated in two downstream tasks: code search and code summarization. However, these tasks rely much on the literal information provided in the source code. Thus, it remains to be verified whether or not the proposed simplification algorithm is applicable to other software engineering tasks.

Related Work

Besides our work, there have been other studies that also try to explain the mechanisms of pre-trained models for code (Ahmad et al., 2021; Mastropaolo et al., 2021; Rogers et al., 2020; Paltenghi and Pradel, 2021). Karmakar and Robbes (2021) applied four probing tasks on pre-trained code models to investigate whether pre-trained models can learn different aspects of source code such as syntactic, structural, surface-level, and semantic information. Different from our work, they only empirically study the whole pre-trained model, whereas we focus on more specific (critical tokens and statements) knowledge learned by pre-trained models.

There have also been many works that investigate the attention weights of pre-trained models for source code. For example, Wan et al. (2022) performed a structural analysis of pre-trained language models for source code. They analyzed the self-attention weights and found that the Transformer attention can capture high-level structural information of source code. AutoFocus (Bui et al., 2019) aims to reveal the most relevant part of the code to programmers. They measure the relevance of statements using attention weights from a GGNN (Allamanis et al., 2018). Different from our approach, they capture structural knowledge learned by pre-trained models, while we dig more into critical statements and tokens learned by pre-trained models. Furthermore, we propose simplifying programs for pre-trained language models based on the findings.

2. Program Simplification

Program simplification has gained increasing attention recently. The state-of-the-art methods such as SIVAND (Rabin et al., 2021) and P2IM (Suneja et al., 2021) a based on the delta debugging prototype (Zeller and Hildebrandt, 2002). The delta debugging mechanism requires an input code snippet and an auxiliary deep learning model such as the code2vec (Alon et al., 2019). The deep learning model takes as input the code snippet and splits it into fragments. Each fragment is then taken as input to the neural network model to perform testing tasks such as method name prediction and misused variables detection. If a fragment obtains a satisfying score, the program will split it further. The process continues until the performance of the subset does not satisfy the target score. Finally, the algorithm produces the smallest code snippet that satisfies the objective of the deep model.

Compared to DietCode, their methods are computationally inefficient because they need to run an auxiliary deep learning model and evaluate the performance at each iteration. For example, it costs hundreds of hours for SIVAND to process 10,000 functions, while DietCode can complete it within two minutes.

Conclusion

This paper empirically analyzes the critical information learned by CodeBERT, including important tokens and statements in both Java and Python. Our results show that CodeBERT focuses on keywords and data-relevant statements such as method signatures and return statements. Based on our empirical findings, we propose a novel approach named DietCode for lightweight leverage of pre-trained models. DietCode simplifies the input code to a target length by selecting important statements and tokens based on their attention weights using a 0-1 Knapsack algorithm. Experiments on two tasks have shown that DietCode provides comparable results as CodeBERT, with the advantage of computational efficiency.

In the future, we will consider more aspects that CodeBERT learns about code, such as syntax rules and semantic relations, to further reduce the size of source code for pre-trained models. We will also investigate model compression (Cheng et al., 2017, 2018) and distillation (Sanh et al., 2019) techniques to further reduce the size of pre-trained programming language models.

Our source code and experimental data are publicly available at https://github.com/zhangzwwww/DietCode

Acknowledge

This work was sponsored by NSFC No. 62102244, CCF-Tencent Open Research Fund (RAGR20220129), and CCF-Baidu Open Fund (NO. 2021PP15002000).