Retrieval Augmented Code Generation and Summarization
Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang
Introduction
In recent years, automating source code generation and summarization is receiving significant attention due to its potential in increasing programmers’ productivity and reducing developers’ tedious workload. Consequently, various approaches have been explored in the literature to facilitate code generation Yin and Neubig (2017); Gu et al. (2016) and code documentation/summarization Ahmad et al. (2020); Wei et al. (2019); Allamanis et al. (2018). Despite initial success, most of the generated code still suffers from poor code quality Xu et al. (2021). Therefore, the question remains—how to generate better code from a given summary and vice versa.
Source code generation and summarization, however, are intrinsically complex and challenging. They involve generating diverse token sequences such as different variables, operators, keywords, classes, and method names Parvez et al. (2018), which requires understanding the programming languages at lexical, syntax, and semantics levels. To combat these issues, recent studies (e.g., Ahmad et al. (2021); Guo et al. (2021); Xu et al. (2020); Feng et al. (2020a); Xu et al. (2020)) take a learning-based approach—they train representations of code and the associated text by leveraging existing high-quality source code and short text descriptions available in open-source repositories and question answering forums such as GitHub and Stack Overflow. Then fine-tune the representation models on the downstream tasks. Although these dataset contains high-quality human-written code and text, since the existing approaches do not directly leverage them during the generation process, the gain achieved by these approaches is still limited, especially when the source code is long.
To overcome this, we take advantage of the existing high-quality source code and their description by including them directly in the generation process that are retrieved via information retrieval technique. In this work, we present REDCODER, a Retrieval augmentED CODe gEneration and summaRization framework. While designing REDCODER, we take motivation from how developers take advantage of existing resources. For example, developers often search for relevant code in the code repository, and if found, adapt the retrieved code in their own context. Similarly, when an API usage is unclear, they search in question answering forums (e.g., StackOverflow) Brandt et al. (2010); Sadowski et al. (2015). Such an additional resource helps developers to increase their development productivity Li et al. (2013).
We design REDCODER as a two-step process (see Figure 1). In the first step, given the input (nl text for code generation, or code snippet for summarization) a retriever module retrieves relevant source code (for code generation) or summaries (for code summarization) from a database.The database could be open source repositories (e.g., GitHub) or developers’ forums (e.g., Stack Overflow). In the second step, a generator processes the retrieved code/summary along with the original input to generate the target output. In this way, REDCODER enhances the generation capability by augmenting the input through retrieval. The two-step process allows us to design a modular and configurable framework for source code and summary generation. Various designs of retriever and generator models can be incorporated into this framework.
Existing cross-encoder code retrievers being computationally expensive, their applicability to retrieve from a large database is limited Humeau et al. (2020). A natural choice would be to use sparse term based retrievers such as TF-IDF or BM25 Robertson and Zaragoza (2009). However, the retriever module in REDCODER should exhibit a good understanding of source code and programmers’ natural language, which is a non-trivial task due to the syntactic and semantic structure of the source code Guo et al. (2021); Ahmad et al. (2021). Such an expectation of searching for semantically similar code and summary may not be attainable by a sparse token level code retriever (e.g., BM25). To that end, we design the retriever module in REDCODER based on programming languages (PL) and natural languages (NL) understanding models (e.g., GraphCodeBERT Guo et al. (2021)). This retriever module extends the state-of-the-art dense retrieval technique Karpukhin et al. (2020) using two different encoders for encoding the query and document.
As for the generator, REDCODER can handle retrieval databases consisting of both unimodal (only code or natural language description) and bi-modal instances (code-description pairs) and makes the best usage of all the auxiliary information that are available. Yet, to incorporate information, we augment the retrieved information only in the input level. It does not modify the underlying architecture of the generator module —preserving its model agnostic characteristics.
We evaluate the effectiveness of REDCODER on two popular programming languages (Java and Python) on both code generation and code summarization tasks. The empirical results show that, REDCODER’s concept of retrieval augmented generation elevates the state-of-the-art code generation from an Exact Match score of 18.6 to 23.4 and the summary generation BLEU-4 score from 18.45 to 22.95 even when we forcefully remove the target candidate from the retrieved code or summary. With further experiments, we establish the importance of both the retrieved code and retrieves summary in the generation process. The source code for reproducing our experiments are at https://github.com/rizwan09/REDCODER.
Background
We first introduce the problem formulation and discuss the fundamentals of the retriever and generator components that REDCODER is built upon.
Our goal is two folds: (i) code generation: Generating source code (), given their natural language description, such as code summaries, code comments or code intents (); (ii) code summarization: Generating natural language summaries , given source code snippets . Fig 2 shows an example.
Let and denote a collection of input and output sequences (, in code generation, , in summary generation ). We assume that we have access to a retrieval database consisting of an extensive collection of source code (e.g., aggregated from GitHub or Stack Overflow) or summaries (e.g., docstrings, code comments) (). Note that, target sequences () may or may not be present in the retrieval database (). Now, given an input , a retriever retrieves the top- relevant output sequences from the database: . Then the input sequence is augmented with the retrieved sequences to form , where denote the concatenation operation. Finally, a generator generates the target output given . In the following, we first discuss the base retriever and generator modules used in REDCODER and then how we improve these components is in Section 3.
2 Retriever: DPR
Information retrieval (IR) systems or retriever models are designed to retrieve the top- relevant documents that presumably best provide the desired information Manning et al. (2008). Term-based retrieval methods, a.k.a. sparse retrieval models, such as TF-IDF or BM25 Robertson and Zaragoza (2009) use sparse vector representations to perform lexical matching and compute relevance scores to rank the documents based on a query.
On the other hand, dense retrieval methods encode documents into a fixed-size representations and retrieve documents via maximum inner product search Sutskever et al. (2014); Guo et al. (2016). Particularly of interests, Karpukhin et al. (2020) propose a Dense Passage Retriever (DPR) model for open-domain question answering (QA). It consists of two encoders (Q(.) and P(.)) that encode queries and passages, respectively. The similarity of a query and a passage is defined by the inner product of their encoded vectors . Given a query , a positive (relevant) passage , and a set of irrelevant passages , DPR optimizes the classification loss:
Karpukhin et al. (2020) propose to fine-tune DPR using in-batch negatives Gillick et al. (2019); Yih et al. (2011) with curated “hard” negatives using BM25 (candidates with high BM25 scores but contain no sub-string that match the target). We refer to Karpukhin et al. (2020) for details.
3 Generator: PLBART
PLBART Ahmad et al. (2021) is a sequence-to-sequence Transformer model Vaswani et al. (2017) that is pre-trained on a huge collection of source code and natural language descriptions via denoising autoencoding. PLBART has shown promise in several software engineering applications, including code generation and summarization. We adopt PLBART as the generator module in our proposed framework, REDCODER.
Proposed Framework: REDCODER
Our proposed code generation and summarization framework, REDCODER generates the target code or summary by augmenting the input with relevant code snippets or summaries. We build our retriever module by training a DPR model differently from Karpukhin et al. (2020). With an intelligent scheme, we then augment the retrieved candidates and their pairs (if available) to provide auxiliary supervision to the generator. We briefly describe the model components in this section.
The retriever module of REDCODER is built upon the DPR model Karpukhin et al. (2020) and we call it SCODE-R (Summary and CODE Retriever). SCODE-R composed of two encoders that encode source code and natural language summary. We use bidirectional Transformer encoders Vaswani et al. (2017) that are pre-trained on source code and natural language summaries. Specifically, we explore CodeBERT Feng et al. (2020b) and GraphCodeBERT Guo et al. (2021) as the code and summary encoders for SCODE-R.
Input/Output
SCODE-R takes an input sequence (code or summary) and retrieves a set of relevant documents from a database of output sequences (if the input is code, then the output is summary and vice versa). SCODE-R returns the the top- output sequences , where .
Training
We fine-tune SCODE-R using a set of parallel examples () of code and summaries. As mentioned in Section 2.2, DPR originally proposed to be fine-tuned using in-batch negatives and curated “hard” negatives from BM25 retrieved passages for open-domain QA. The key idea behind “hard” negatives is to fine-tune DPR to distinguish the target passage from relevant passages that do not contain the target answer. However, unlike open-domain QA, a retrieved code or summary that is not the target could still benefit code generation or summarization (verified in Section 6). We provide an example in Figure 3; although the retrieved code does not match the target one but can facilitate generating it. Therefore, we fine-tune SCODE-R without any “hard” negatives. Specifically, for each training instance (), the corresponding output is considered as positive and the other in-batch outputs (i.e., the outputs of other instances in the same batch - ) as negatives. Figure 4 shows an example of SCODE-R fine-tuning for code generation task.
2 Generator: SCODE-G
We adopt PLBART as discussed in Section 2.3 as the generator module of REDCODER and call it SCODE-G (Summary and CODE Generator). The input sequence is concatenated with the top- retrieved sequences to form the augmented input sequence, . The augmented input is fed to PLBART to estimate .
Note that a source code often consists of docstrings, comments that can be extracted to form code – summary pairs. In the retrieval databases, code and summaries are either singleton (e.g., code without a description or a problem statement without any code) or parallel. Therefore, we consider two retrieval settings that require separate modeling consideration for the generator.
In this case, we concatenate the original input sequence and the top- retrieved candidates with a special separator token.
This is our default setting and we refer this as REDCODER in this work.
Case 2: Retrieve candidates are pairs
In this case, retrieved candidates are pair of code and natural language (NL) summary. We augment the input sequence using both of them as follows.
where and are parallel sequences (e.g., is a piece of code and is its corresponding summary for the code generation task) retrieved from the database. We conjecture that the additional information complements the input sequence and verify its effectiveness in the experiments.
Note that retrieve candidates could be a mix of singleton and pairs. In case of a singleton candidate, we simply replace or with an empty string. We refer this setting as REDCODER-EXT. Although, REDCODER-EXT is a more general setting which includes “Case 1”, we study them separately to understand how these two retrieval settings benefit the target tasks. We illustrate an example on code generation in Figure 5. In both cases, the augmented input is truncated to match PLBART’s maximum input length 512.
Experiment Setup
In order to investigate the effectiveness of our framework, we perform a comprehensive study and analysis on code generation and summarization in two programming languages, Java and Python.
Datasets We perform evaluation on both the tasks using the code summarization dataset from CodeXGLUE Lu et al. (2021). It is curated from CodeSearchNet Husain et al. (2019) by filtering noisy examples. In addition, we conduct code generation experiments in Java using the Concode benchmark Iyer et al. (2018). The dataset statistics are summarized in Table 1.
Retrieval Databases To generate a source code given its natural language description or a summary given the code, our proposed approach REDCODER first retrieves prospective candidates from an existing code or summary database. We form the code retrieval database using the deduplicated source code (on average 1.4M functions in Java and Python) that consists of both paired (59%) and monolingual code, released in CodeSearchNET Husain et al. (2019). As for building the summary retrieval database, we extract the high quality natural language summaries from the paired instances in the training sets of CodeSearchNET. As many of the summaries are duplicated, we also consider the training sets in the other four available languages Ruby, Javascript, Go, and PHP. We then further enlarge it by aggregating the additional summaries from the CCSD corpus Liu et al. (2021). After performing deduplication, we retain 1.1M unique code summaries and for evaluating REDCODER-EXT, 20% of them can be used as pairs with the corresponding Java and Python source code. We provide the statistics of the retrieval databases in Appendix. Note that the retrieval databases contain code and summaries that are curated from real developers’ open sourced repositories on GitHub. By default, we exclude the target code/summary from the retrieval database.
Implementations As mentioned in Section 3, REDCODER has two disjoint components. First, the dense retriever SCODE-R is implemented adopting DPR Karpukhin et al. (2020) and the encoders in DPR are initialized from GrpahCodeBERT available in the Huggingface API Wolf et al. (2020). In addition, we implement a baseline BM25 retriever. We use the official codebase of PLBART Ahmad et al. (2021) and set max epoch to 15, patience to 5, learning rate to . We tune the batch size in {8, 16, 32, 64, 72} and the value for top- retrieval up to 10 for code generation and in range {10, 30, 50, 100} for code summarization. As some candidate code and summaries are short in length, we tune with this upper bound of to accommodate as many candidates as possible within PLBART’s maximum input length.
2 Evaluation Metrics
BLEU Following prior works Ahmad et al. (2021); Feng et al. (2020a), we compute the corpus level BLEU Papineni et al. (2002) and the smoothed BLEU-4 Lin and Och (2004) scores for code generation and summarization tasks.
CodeBLEU To demonstrate syntactic and semantic data flow correctness of code generation models, we report CodeBLEU Ren et al. (2020). CodeBLEU is a weighted average of lexical, abstract syntax tree, and data flow match.
Exact Match (EM) indicates the percentage of output sequences that exactly match the references.
3 Baseline Methods
We compare REDCODER w.r.t. a number of state-of-the-art code models. We classify them into two categories: (i) retrieval based models and (ii) generative models. We study both generative models that are trained from scratch and are pre-trained on programming and natural languages.
We examine two retriever baselines and consider the top-1 retrieved candidate as the prediction.
Dense Retriever We consider DPR as the dense retriever baseline. We evaluate both the officially released models trained on the natural language open-domain QA task and a variant called DPR (code) that we fine-tune on the evaluation datasets.
Sparse Retriever The second baseline is a sparse retriever that uses the BM25 algorithm to compute relevance scores.
Generative models
The generative models work in a sequence-to-sequence (Seq2Seq) fashion.
RoBERTa, RoBERTa (code) RoBERTa models Liu et al. (2019) pre-trained on natural language corpora, and source code from CodeSearchNet Husain et al. (2019) respectively.
CodeBERT Feng et al. (2020a) is pretrained with a hybrid objective incorporating masked language modeling Devlin et al. (2019) and replaced token detection Clark et al. (2020).
GraphCodeBERT Guo et al. (2021) is pre-trained by modeling the data flow graph of source code. GraphCodeBERT holds the state-of-the-art results on code search using CodeSearchNet.
GPT-2, CodeGPT-2, and CodeGPT-adapted are GPT-style models that are pre-trained on natural language Radford et al. (2019) and code corpora CodeXGLUE Lu et al. (2021).
PLBART Ahmad et al. (2021) is the generator module of our proposed framework.
In addition, we train an LSTM based Seq2Seq model with attention mechanism Luong et al. (2015) and a Transformer model Vaswani et al. (2017) on the benchmark datasets.
Results
Table 2 and Table 3 show the evaluation results on code generation from summary descriptions on CodeXGLUE, and Concode datasets, respectively. First, we compare REDCODER with the state-of-the-art code generation models. They are transformers models pre-trained with different objectives using external resources of different sizes. Among them, the relatively strong baseline PLBART has an EM score of 18 on the Concode dataset while it rarely generates any code that matches the real target code in CodeXGLUE (See Table 2) (more discussion on this is in Appendix). The BLEU and CodeBLEU scores are also low. Such result indicates that automated code lacks quality and correctness without the proper supervision in the input to the generator.
Among the retriever-only models, SCODE-R significantly outperforms BM25 (more comparison is in § 6). As expected, the EM is zero as targets are filtered from the retrieval, and CodeBLEU scores are high as they are real code. However, although the retrieved code does not exactly match the target code, they are quite relevant (e.g., Figure 3; more in Appendix). When comparing retrieval-only models to generative models, it is interesting to note that SCODE-R surpasses PLBART by a large margin on CodeXGLUE (Table 2), suggesting that retrieved code has high overlapping with target code that can benefit the generation.
Overall, the retrieval augmented generative models excel in code generation. Our proposed framework REDCODER outperforms PLBART by a large margin, validating the advantage of reusing existing codebases to help code generation. The REDCODER-EXT gains are even higher. For CodeXGLUE (Java, Python) and Concode, the gains in BLEU are 18.88, 19.54, and 5.8. Comparing REDCODER to REDCODER-EXT shows that BLEU scores on Concode and all metrics on CodeXGLUE are improved by 1%. These results confirm our conjecture that complementing input with paired summaries of the retrieved code help code generation. We provide a qualitative example in the Appendix to explain how the retrieved information helps PLBART in generation.
2 Code Summarization
We compare REDCODER with three sets of baseline methods for code summarization, and Table 4 shows the results. Among the two retrieval base methods, SCODE-R performs significantly well, confirming the advantages of dense retrieval over its sparse counterpart. Out of the generative methods, PLBART excels on code summarization as it leverages an extensive collection of natural language descriptions during pre-training. As anticipated, retrieval augmented generative methods outperform the other two sets of models. We see that the “BM25 + PLBART” model improves over PLBART, confirming our conjecture that retrieval augmented techniques have the promise to improve code summarization. Our proposed framework REDCODER and its variant REDCODER-EXT outshine “BM25 + PLBART”, surpassing its performance by 1.5 and 3.2 points for Python and Java languages, respectively.
Analysis
In this Section, we analyze REDCODER’s performance on the following points.
Retrieval database includes the target sequence Table 5 shows the code generation results when we did not filter the target from the retrieval (summarization results are in Appendix). As expected, SCODE-R performances are much better than those in Table 2, 3, and 4. In all cases, REDCODER gets more enhanced when target is present in the retrieval database. For the code generation task, we plot the recall@k curve for upto 10 for both Java and Python on CodeXGLUE dataset when the retrieval contains the target in Figure 6. As we can see, SCODE-R significantly outperforms in both languages and for all values.
Bi-encoder SCODE-R vs cross-encoder retrievers Table 6 shows the retrieval performance of different alternative retrieval techniques that we considered in REDCODER. SCODE-R performs comparably well with GraphCodeBERT while being significantly faster and scalable Humeau et al. (2020). Note that, SCODE-R also uses GraphCodeBERT to initialize its encoders (see Figure 4). However, SCODE-R’s design of using different encoders for query and documents enables pre-indexing of database and faster retrieval in practice.
Performance vs target length Figure 7 shows the code generation performances of different models w.r.t. the target code length for Python. While the generator model (PLBART)’s performance consistently decreases with increasing code size, the retriever (SCODE-R) performs consistently well. Such consistent performance from SCODE-R boosts performance of REDCODER (and also REDCODER-EXT) significantly higher than the generative model counterpart. For Java, we find similar results (details in Appendix).
Performance vs #retrievals Figure 8 shows that typically the performance improves more with more retrievals on both tasks. However, roughly 5 code and 30 summaries work sufficiently well.
Human evaluation Finally, we evaluate the quality of code generated by SCODE-G using human evaluation. In Table 7, we perform a human evaluation for code generation task on a subset of the test set in CodeXGLUE (Python). In this study, we compare REDCODER generated code with the code retrieved by SCODE-R. Note that both REDCODER and SCODE-R using the same retrievers, but REDCODER generates code using SCODE-G, while SCODE-R outputs code written by real programmers. We sample 30 instances where REDCODER generated code has a lower BLEU score than that of the SCODE-R and investigate whether the quality of code generated by them are significantly different on these cases.
As programming requires a specific skill, we do not evaluate the quality of the code generation using the mass crowd workers. We recruit 7 Ph.D. students studying in computer science as volunteersBefore participating in the evaluation process, all the participants are informed that it is a voluntary task and it may take roughly 30 minutes to perform the evaluation. to score (1 to 5) code based on three criteria (i) similarity, and (ii) relevance w.r.t. the target code; (iii) the compilability of the generated code.
The ratings show that both models receive similar scores, with a slightly higher score for SCODE-R in terms of similarity to the target code, relevancy, and compilability. This shows that the quality of the code generated by SCODE-G are competitive with real code from programmers’ perspective. Interestingly, REDCODER achieves higher scores than SCODE-R in CodeBLEU and Exact Match even on the cases where its BLEU score is lower.
Related Works
Code Summarization. In recent years, source code summarization attracted a lot of attention Iyer et al. (2016); Liang and Zhu (2018); Allamanis et al. (2016); Hu et al. (2018b); Ahmad et al. (2020). Many of these works view code as a sequence of token. Other approaches leverage the structural properties of code using Tree based model Shido et al. (2019); Harer et al. (2019); Hu et al. (2018a); LeClair et al. (2019). In literature, several retrieval-based methods were proposed that leverage retrieved information along with the input code. For example, Zhang et al. (2020) retrieves similar code snippet and use those as an auxiliary input for summarization. On the other hand, Hayati et al. (2018) retrieves related summaries for augmenting summarization input. Different from these approaches, REDCODER leverages both the retrieved code and its summary to augment the input.
Code Generation. Generating source code is a major stepping stone towards automated programming. Yin and Neubig (2017), and Rabinovich et al. (2017) proposed code generation as abstract syntax tree generation to ensure its syntactic correctness. Recent advancements in pre-training language models on unlabeled source code data Lu et al. (2021); Ahmad et al. (2021) showed colossal promise towards learning code syntax and semantics, resulting in improved code generation models.
Code Retrieval and Others. Numerous software engineering applications require information retrieval. Sadowski et al. (2015); Xia et al. (2017); Stolee et al. (2014); Sim et al. (2011) show that developers search for related code, API examples for implementing or adapting new APIs. Design of REDCODER is inspired by developers’ behavior while writing code. Developers use search engines for retrieving off-the-shelf libraries Hucka and Graham (2018), or “usable” source code Rahman et al. (2018) for adapting in the development process Nasehi et al. (2012); Arwan et al. (2015); Ponzanelli et al. (2014). Similarly, REDCODER retrieves existing code or summaries and adapts them to generate the target code or summary. In contrast, Hashimoto et al. (2018) optimizes a joint objective; Zhang et al. (2020); Liu et al. (2021) do not consider any decoder pre-training, Lewis et al. (2020) fine-tunes both of the retriever and the generator end-to-end. For open domain QA, Izacard and Grave (2021) propose a similar model of alternative generator (multi-encoder uni-decoder).
Conclusion
We propose REDCODER to automate developers’ writing of code and documentation by reusing what they have written previously. We evaluate REDCODER on two benchmark datasets and the results demonstrate a significant performance boost with the help of the retrieved information. In the future, we want to extend REDCODER to support other code automation tasks such as code translation.
Acknowledgments
We thank anonymous reviewers for their helpful feedback. We also thank the UCLA NLP group for helpful discussions, comments, and participating voluntarily in the human evaluation. This work was supported in part by NSF OAC-1920462, SHF-2107405, SHF-1845893, IIS-2040961, IBM, and VMWare. Any opinions, findings, and conclusions expressed herein are those of the authors and do not necessarily reflect those of the US Government.
References
Appendix A Qualitative Example
In Figure 11, we show an example of generated code by a baseline and different modules of REDCODER. The input summary asks to write a code (in Java) to get a MuxerStream given a position.
We show two of the corresponding retrieved code, their summaries (for bimodal instances), generated code of PLBART, REDCODER, and REDCODER-EXT. As can be seen, PLBART generates a basic but relevant code; both retrieved code (rank-1 and rank-3) contains the statements with variable cPtr one of them is of MuxerStream class, and another is from DeMuxerStream class. REDCODER generates a somewhat correct code of MuxerStream class and it takes the position argument too. Seemingly, while fusing the retrieved code, we suspect that as the tentative function name MuxerStream mentioned in the input summary does not match the function name DeMuxerStream of the rank-3 retrieved code, it only adapts one line containing cPtr from rank-3 retrieved code (line #3) and takes the rests including the function definition (i.e., line #1) from the rank-1 retrieved code. Now when REDCODER-EXT is allowed to leverage the summaries of the retrieved code, it can match the summary of the rank-3 retrieved code with the input, and that is why it produces the MuxerStream class object but with the throw exceptions from the rank-3 retrieved code.
Appendix B Performance Difference of PLBART on CodeXGLUE and Concode
Concode is a relatively easier dataset for code generation and retrieval due to several pre-processing steps taken by its authors. Along with additional contexts (environment variables and methods) in the input summary, Concode artifacts the target code by replacing the specific variable names with generic tokens.
Therefore, we suspect that due to this, PLBART achieves good EM score for Concode but not for the generation of real code in CodeXGLUE.
Analogously for the retrieval models, code retrieved by BM25 have also a large word overlapping with the targets in Concode in contrast to CodeXGLUE (1st row in Table 2 and 3). Consequently, BM25 retrieval boosts PLBART (i.e., BM25 + PLBART) more in Concode than that in CodeXGLUE (3rd row for the bottom in Table 2 and 3). Overall, we anticipate all these skewness in model performances are due to the dataset characteristics.