Unified Pre-training for Program Understanding and Generation

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Introduction

Engineers and developers write software programs in a programming language (PL) such as Java or Python, and often use natural language (NL) to communicate with each other. The use of NL in software engineering ranges from writing documentation, commit messages, and bug reports to seeking help in forums (e.g., Stack Overflow). Automating software engineering applications, such as source code summarization, generation, and translation, relies heavily on the understanding of PL and NL; we collectively refer to them as PLUG (Program and Language Understanding and Generation) applications or tasks. Note that the use of NL in software development is quite different from colloquially written and spoken language. For example, NL in software development often contains domain-specific jargon: when software engineers use the term Code Smell (https://en.wikipedia.org/wiki/Code_smell), they mean a potential problem in code, not smell in the ordinary English sense.

In this work, our goal is to develop a general-purpose model that can be used in various PLUG applications. Recent advancements in deep learning and the availability of large-scale PL and developers’ NL data ushered in the automation of PLUG applications. One important aspect of PLUG applications is that they demand a profound understanding of program syntax and semantics and of the mutual dependencies between PL and NL. For example, Figure 1 shows two implementations of the same algorithm (sorting) in two PLs and the corresponding NL summary. An automatic translation tool must understand that the function sorted in Python acts similarly to Arrays.sort in Java and that the lambda operation in Python is equivalent to instantiating a Comparator object in Java. Similarly, a tool that summarizes either of these code snippets must understand that x in Python or Tuple.get(0) in Java refers to the first element in the tuple list.
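As a rough illustration of the Python side of such an example, the sketch below sorts a list of tuples by their first element. The data and variable names are hypothetical and are not taken from Figure 1.

```python
# Hypothetical data: a list of tuples to be sorted by their first element.
records = [(3, "c"), (1, "a"), (2, "b")]

# `sorted` plays the role that Arrays.sort plays in Java, and the lambda
# stands in for a Comparator that compares the first tuple element.
sorted_records = sorted(records, key=lambda x: x[0])

print(sorted_records)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

A summarization model would need to recognize that `x[0]` denotes the first element of each tuple in order to describe this code correctly.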

Most of the available data in PL and NL are unlabeled and cannot be trivially used to acquire PLUG task-specific supervision. However, PLUG tasks have a common prerequisite: understanding PL and NL syntax and semantics. A model pre-trained on unlabeled data to learn PL and NL representations can be transferred across PLUG tasks. This approach reduces the need for large-scale annotations for task-specific fine-tuning. In recent years we have seen a colossal effort to pretrain models on a massive amount of unlabeled data (e.g., text, images, videos) Devlin et al. (2019); Liu et al. (2019); Conneau and Lample (2019); Conneau et al. (2020); Li et al. (2019); Sun et al. (2019) to transfer representation encoders across a wide variety of applications. There are a few research efforts in learning general-purpose PL-NL representation encoders, such as CodeBERT Feng et al. (2020) and GraphCodeBERT Guo et al. (2021), that are pretrained on small-scale bimodal data (code-text pairs). Such models have been found effective for PLUG tasks, including code search and code completion.

Language generation tasks such as code summarization are modeled as sequence-to-sequence learning, where an encoder learns to encode the input code and a decoder generates the target summary. Despite their effectiveness, existing methods do not have a pretrained decoder for language generation. Therefore, they still require a large amount of parallel data to train the decoder. To overcome this limitation, Lewis et al. (2020) proposed denoising sequence-to-sequence pre-training where a Transformer Vaswani et al. (2017) learns to reconstruct an original text that is corrupted using an arbitrary noise function. Very recently, Lachaux et al. (2020) studied denoising pre-training using a large-scale source code collection aiming at unsupervised program translation and found the approach useful. This raises a natural question: can we unify pre-training for programming and natural language? Presumably, to facilitate such pre-training, we need unlabeled NL text that is relevant to software development. Note that unlike other bimodal scenarios (e.g., vision and language), PL and associated NL text share the same alphabet or use anchor tokens (e.g., “sort”, “list”, “tuple” as shown in Figure 1) that can help to learn alignment between semantic spaces across languages.

We introduce PLBART (Program and Language BART), a bidirectional and autoregressive transformer pre-trained on unlabeled data across PL and NL to learn multilingual representations applicable to a broad spectrum of PLUG applications. We evaluate PLBART on code summarization, generation, translation, program repair, clone detection, and vulnerability detection tasks. Experiment results show that PLBART outperforms or rivals state-of-the-art methods, e.g., CodeBERT and GraphCodeBERT, demonstrating its promise on program understanding and generation. We perform a thorough analysis to demonstrate that PLBART learns program syntax and logical data flow, which is indispensable to program semantics, and excels even when limited annotations are available. We release our code (https://github.com/wasiahmad/PLBART) to foster future research.

PLBART

PLBART uses denoising sequence-to-sequence pre-training to utilize unlabeled data in PL and NL. Such pre-training lets PLBART reason about language syntax and semantics. At the same time, PLBART learns to generate language coherently.

We pre-train PLBART on a large collection of Java and Python functions and natural language descriptions from GitHub and StackOverflow, respectively. We download all the GitHub repositories associated with the Java and Python languages available on Google BigQuery (https://console.cloud.google.com/marketplace/details/github/github-repos). We extract the Java and Python functions following the pre-processing pipeline from Lachaux et al. (2020). We collect the StackOverflow posts (including both questions and answers, excluding code snippets) by downloading the data dump (dated 7th September 2020) from stackexchange (https://archive.org/download/stackexchange). Statistics of the pre-training dataset are presented in Table 1. We tokenize all the data with a sentencepiece model Kudo and Richardson (2018) learned on one-fifth of the pre-training data. We train sentencepiece to learn 50,000 subword tokens.

One key challenge in aggregating data from different modalities is that some modalities may have more data than others; for instance, we have 14 times more data in PL than in NL. Therefore, we mix and up/down sample the data following Conneau and Lample (2019) to alleviate the bias towards PL. We sample instances for pre-training according to a multinomial distribution with probabilities $(q_{1},q_{2},\ldots,q_{N})$:

where $N$ is the total number of languages and $n_{i}$ is the total number of instances in language $i$ . We set the smoothing parameter $\alpha$ to 0.3.
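As a rough illustration of this smoothing, the sketch below assumes the multinomial takes the form used by Conneau and Lample (2019), $q_{i}=p_{i}^{\alpha}/\sum_{j}p_{j}^{\alpha}$ with $p_{i}=n_{i}/\sum_{k}n_{k}$. The function name and the instance counts are hypothetical; the counts are chosen only to mirror the 14:1 imbalance between PL and NL.

```python
def sampling_probabilities(counts, alpha=0.3):
    """Smoothed multinomial sampling probabilities per language.

    Assumed form: q_i = p_i**alpha / sum_j p_j**alpha,
    where p_i = n_i / sum_k n_k is the raw data share of language i.
    """
    total = sum(counts.values())
    shares = {lang: n / total for lang, n in counts.items()}
    normalizer = sum(p ** alpha for p in shares.values())
    return {lang: (p ** alpha) / normalizer for lang, p in shares.items()}


# Hypothetical instance counts with 14 times more PL than NL data.
counts = {"pl": 14_000_000, "nl": 1_000_000}
q = sampling_probabilities(counts, alpha=0.3)
print(q)
```

Under these assumptions the NL share of sampled instances rises from about 7% of the raw data to roughly 31%, which is the intended effect of choosing $\alpha<1$: smaller values flatten the distribution and reduce the bias towards the larger modality.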

PLBART uses the same architecture as BART$_{base}$ Lewis et al. (2020): the sequence-to-sequence Transformer architecture Vaswani et al. (2017), with 6 encoder layers and 6 decoder layers, a model dimension of 768, and 12 attention heads ($\sim$140M parameters). The only exception is that we include an additional layer-normalization layer on top of both the encoder and decoder following Liu et al. (2020), which is found to stabilize training with FP16 precision.
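The parameter count can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not an exact accounting: it assumes a feed-forward dimension of 3072 and a single embedding matrix shared by the encoder, decoder, and output layer, and it ignores biases, layer normalization, and positional embeddings. The function name is hypothetical.

```python
def transformer_param_estimate(vocab=50_000, d_model=768, d_ffn=3072,
                               enc_layers=6, dec_layers=6):
    """Approximate weight count of an encoder-decoder Transformer."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ffn       # two feed-forward projections
    encoder = enc_layers * (attention + feed_forward)
    # Each decoder layer has self-attention and cross-attention.
    decoder = dec_layers * (2 * attention + feed_forward)
    embeddings = vocab * d_model             # assumed shared embedding matrix
    return embeddings + encoder + decoder


total = transformer_param_estimate()
print(f"{total / 1e6:.1f}M")  # 137.5M
```

The estimate lands close to the reported figure of about 140M, with the terms left out of the sketch accounting for the remainder.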

In denoising autoencoding, a model learns to reconstruct an input text that is corrupted by a noise function. Reconstruction of the original input requires the model to learn language syntax and semantics. In this work, we use three noising strategies: token masking, token deletion, and token infilling Lewis et al. (2020). According to the first two strategies, random tokens are sampled and replaced with a mask token or deleted from the input sequence. In token infilling, a number of text spans are sampled and replaced with a single mask token. The span lengths are drawn from a Poisson distribution ( $\lambda=3.5$ ). We mask 35% of the tokens in each instance.
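The three noising strategies can be sketched as follows. This is a simplified illustration rather than the Fairseq implementation: the mask symbol, the function names, the rule of noising exactly `round(ratio * length)` tokens, and the choice to redraw zero-length spans are all assumptions made for the example.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask symbol; the real vocabulary entry may differ


def _pick(n, ratio, rng):
    """Choose round(n * ratio) distinct positions out of n."""
    return set(rng.sample(range(n), round(n * ratio)))


def token_masking(tokens, ratio, rng):
    """Replace randomly chosen tokens with the mask symbol."""
    chosen = _pick(len(tokens), ratio, rng)
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]


def token_deletion(tokens, ratio, rng):
    """Remove randomly chosen tokens from the sequence."""
    chosen = _pick(len(tokens), ratio, rng)
    return [t for i, t in enumerate(tokens) if i not in chosen]


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def token_infilling(tokens, ratio, rng, lam=3.5):
    """Replace sampled spans with one mask each until the budget is used."""
    out, budget = list(tokens), round(len(tokens) * ratio)
    while budget > 0:
        span = min(sample_poisson(lam, rng), budget)
        if span == 0:
            continue  # simplification: redraw zero-length spans
        starts = [i for i in range(len(out) - span + 1)
                  if MASK not in out[i:i + span]]
        if not starts:
            break
        start = rng.choice(starts)
        out[start:start + span] = [MASK]  # whole span becomes a single mask
        budget -= span
    return out


rng = random.Random(0)
tokens = "def add ( a , b ) : return a + b".split()
print(token_masking(tokens, 0.35, rng))
print(token_deletion(tokens, 0.35, rng))
print(token_infilling(tokens, 0.35, rng))
```

Masking preserves the sequence length, whereas deletion and infilling shorten it, so the model must also infer how many tokens are missing and where.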

The input to the encoder is a noisy text sequence, while the input to the decoder is the original text offset by one position. A language id symbol (a special token that identifies the language of the instance, e.g., Java or Python) is appended to the encoder input and prepended to the decoder input. We provide a few examples in Table 2. Input instances are truncated if they exceed a maximum sequence length of 512.
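A minimal sketch of this input layout is given below. The token strings (`<python>`, `<mask>`), the function name, and the choice to reserve one position for the language id when truncating are assumptions for illustration only.

```python
def build_pretraining_inputs(original, noisy, lang_id, max_len=512):
    """Lay out encoder and decoder inputs for one denoising instance.

    `original` and `noisy` are lists of subword tokens; `lang_id` is the
    language id symbol. One position is reserved for the language id when
    truncating, which is an assumption of this sketch.
    """
    original = original[: max_len - 1]
    noisy = noisy[: max_len - 1]
    encoder_input = noisy + [lang_id]      # language id appended
    decoder_input = [lang_id] + original   # language id prepended (one-position offset)
    return encoder_input, decoder_input


original = ["return", "a", "+", "b"]
noisy = ["return", "<mask>", "b"]          # hypothetical corrupted version
enc, dec = build_pretraining_inputs(original, noisy, "<python>")
print(enc)  # ['return', '<mask>', 'b', '<python>']
print(dec)  # ['<python>', 'return', 'a', '+', 'b']
```

Because the decoder input starts with the language id, the decoder at each position sees the language and all earlier original tokens when predicting the next one.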

PLBART is pre-trained on $N$ languages (in our case, $N=3$), where each language $N_{i}$ has a collection of unlabeled instances $\mathcal{D}_{i}=\{x_{1},\ldots,x_{n_{i}}\}$. Each instance is corrupted using the noise function $f$, and we train PLBART to predict the original instance $x$ from $f(x)$. Formally, PLBART is trained to maximize $\mathcal{L}_{\theta}$:

$$\mathcal{L}_{\theta}=\sum_{i=1}^{N}\sum_{j=1}^{m_{i}}\log P(x_{j}\mid f(x_{j});\theta)$$

where $m_{i}$ is the number of sampled instances in language $i$ and the likelihood $P$ is estimated following the standard sequence-to-sequence decoding.

We train PLBART on 8 Nvidia GeForce RTX 2080 Ti GPUs for 100K steps. The effective batch size is maintained at 2048 instances. We use Adam ( $\epsilon$ = 1e-6, $\beta_{2}$ = 0.98) with a linear learning rate decay schedule for optimization. We started the training with dropout 0.1 and reduced it to 0.05 at 50K steps and 0 at 80K steps. This is done to help the model better fit the data Liu et al. (2020). The total training time was approximately 276 hours (11.5 days). All experiments are done using the Fairseq library Ott et al. (2019).
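The dropout schedule described above can be written as a simple step function. The function name is hypothetical, and treating each change as taking effect exactly at the stated step is an assumption.

```python
def dropout_at(step):
    """Dropout rate as a function of the pre-training step."""
    if step < 50_000:
        return 0.1
    if step < 80_000:
        return 0.05
    return 0.0


for step in (0, 49_999, 50_000, 80_000, 100_000):
    print(step, dropout_at(step))
```

Lowering dropout late in training removes regularization noise once the model has seen most of the data, which is consistent with the stated aim of fitting the data better.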

Fine-tuning PLBART

We fine-tune PLBART for two broad categories of downstream applications.

PLBART has an encoder-decoder architecture where the decoder is capable of generating target sequences autoregressively. Therefore, we can directly fine-tune PLBART on sequence generation tasks, such as code summarization, generation, and translation. Unlike denoising pre-training, the source sequence is given as input to the encoder during fine-tuning, and the decoder generates the target sequence. The source and target sequences can each be a piece of code or a text sequence. Table 3 shows a few examples of inputs to and outputs from PLBART for different generation tasks. Note that PLBART prepends a language id to the decoded sequence; this enables fine-tuning PLBART in a multilingual setting (e.g., code generation in multiple languages). We do not perform multilingual fine-tuning in this work.

We fine-tune PLBART on sequence classification tasks following Lewis et al. (2020). The input sequence is fed into both the encoder and decoder. For a pair of inputs, we concatenate them and insert a special separator token between them. A special token is added at the end of the input sequence. The representation of this last token from the final decoder layer is fed into a linear classifier for prediction.
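The input layout for classification can be sketched as below. The separator and end-of-sequence strings and the function name are hypothetical placeholders; the sketch only shows how the sequence is assembled and which position would be read out by the classifier.

```python
SEP = "</s>"  # hypothetical separator token between paired inputs
END = "</s>"  # hypothetical special token added at the end of the sequence


def build_classification_input(first, second=None):
    """Return the token sequence and the index of its final special token.

    The final decoder-layer representation at that index is what would be
    passed to the linear classifier.
    """
    tokens = list(first)
    if second is not None:
        tokens += [SEP] + list(second)
    tokens.append(END)
    return tokens, len(tokens) - 1


pair, idx = build_classification_input(["int", "a", ";"], ["int", "b", ";"])
print(pair, idx)
```

For single-input tasks such as vulnerability detection the second argument is omitted; for pairwise tasks such as clone detection both code fragments are supplied.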

We fine-tune PLBART for a maximum of 100K steps on all the downstream tasks with 2500 warm-up steps. We set the maximum learning rate, effective batch size, and dropout rate to 3e-5, 32, and 0.1, respectively. The final models are selected based on the validation BLEU (in generation tasks) or accuracy (in classification tasks). Fine-tuning PLBART is carried out on one Nvidia GeForce RTX 2080 Ti GPU.

Experiment Setup

To understand PLBART’s performance in a broader context, we evaluate PLBART on several tasks. Our evaluation focuses on assessing PLBART’s ability to capture rich semantics in source code and associated natural language text.

We divide the evaluation tasks into four categories. The evaluation task datasets are summarized in Table 4. We use the public datasets provided by CodeXGLUE Lu et al. (2021) and the corresponding train-validation-test splits for all the tasks.

Code summarization refers to the task of generating a natural language (English) summary from a piece of code. We fine-tune PLBART on summarizing source code written in six different programming languages, namely, Ruby, Javascript, Go, Python, Java, and PHP.

Code generation is exactly the opposite of code summarization. It refers to the task of generating code (in a target PL) from its NL description. We fine-tune PLBART on the Concode dataset Iyer et al. (2018), where the input is a text describing class member functions in Java together with the class environment, and the output is the target function.

Code translation requires a model to generate equivalent code in the target PL from the input code written in the source PL. Note that the source and target PL can be the same. Hence, we consider two types of tasks in this category.

The first task is a typical PL translation task, i.e., translating code from Java to C# and vice versa. In this task, the semantic meaning of the translated code should exactly match the input code. Thus, this task evaluates PLBART’s understanding of program semantics and syntax across PLs. The second task we consider is program repair. In this task, the input is buggy code, and the output is a modified version of the same code that fixes the bug. This task helps us understand PLBART’s ability to understand code semantics and apply semantic changes in the code.

Code classification aims at predicting the target label given a single piece or a pair of source code. We evaluate PLBART on two classification tasks. The first task is clone detection, where given a pair of code fragments, the goal is to determine whether they are clones of each other (similar to paraphrasing in NLP). The second task is detecting whether a piece of code is vulnerable. This task helps us gauge PLBART’s effectiveness in program understanding in an unseen PL, since the code examples in this task are written in C/C++.

Evaluation Metrics

BLEU computes the n-gram overlap between a generated sequence and a collection of references. We use the corpus-level BLEU Papineni et al. (2002) score for all the generation tasks, except code summarization, where we use the smoothed BLEU-4 score Lin and Och (2004) following Feng et al. (2020).

CodeBLEU is a metric for measuring the quality of the synthesized code Ren et al. (2020). Unlike BLEU, CodeBLEU also considers grammatical and logical correctness based on the abstract syntax tree and the data-flow structure.

Exact match (EM) evaluates whether a generated sequence exactly matches the reference.
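To make the first and last of these metrics concrete, the sketch below computes exact match and a clipped n-gram precision, the basic quantity that BLEU aggregates over several values of n. It is a simplified illustration with hypothetical function names: it compares whitespace-separated tokens, uses a single reference, and omits BLEU's brevity penalty, smoothing, and corpus-level aggregation.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True if the token sequences are identical (whitespace tokenization)."""
    return prediction.split() == reference.split()


def ngram_precision(prediction, reference, n):
    """Clipped n-gram precision of a prediction against one reference."""
    pred, ref = prediction.split(), reference.split()
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram counts at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in pred_ngrams.items())
    return overlap / total


print(exact_match("return a + b ;", "return a + b ;"))    # True
print(ngram_precision("a b c d", "a b d c", 1))           # 1.0
print(ngram_precision("a b c d", "a b d c", 2))           # 0.3333333333333333
```

The last two lines show why exact match is much stricter than n-gram overlap: a prediction can share every token with the reference and still differ in order.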

Baseline Methods

We compare PLBART with several state-of-the-art models and broadly divide them into two categories. First, the models that are trained on the evaluation tasks from scratch, and second, the models that are pre-trained on unlabeled corpora and then fine-tuned on the evaluation tasks.

Seq2Seq Luong et al. (2015) is an LSTM based Seq2Seq model with attention mechanism. Vocabulary is constructed using byte-pair encoding.

Transformer Vaswani et al. (2017) is the base architecture of PLBART and other pre-trained models. The Transformer baseline has the same number of parameters as PLBART. Hence, a comparison with this baseline directly demonstrates the usefulness of pre-training PLBART.

Pre-trained Models

As described in Section 2, PLBART consists of an encoder and an autoregressive decoder. We compare PLBART with two categories of pre-trained models. The first category consists of encoder-only models (e.g., RoBERTa, CodeBERT, and GraphCodeBERT) that are combined with a randomly initialized decoder for task-specific fine-tuning. The second category includes decoder-only models (CodeGPT) that can perform generation autoregressively.

RoBERTa, RoBERTa (code) are RoBERTa Liu et al. (2019) model variants. While RoBERTa is pre-trained on natural language, RoBERTa (code) is pre-trained on source code from CodeSearchNet Husain et al. (2019).

CodeBERT Feng et al. (2020) combines masked language modeling (MLM) Devlin et al. (2019) with replaced token detection objective Clark et al. (2020) to pretrain a Transformer encoder.

GraphCodeBERT Guo et al. (2021) is a concurrent work that improves CodeBERT by modeling the data flow edges between code tokens. We report GraphCodeBERT’s performance directly from the paper since its implementation is not publicly available yet.

GPT-2, CodeGPT-2, and CodeGPT-adapted are GPT-style models. While GPT-2 Radford et al. (2019) is pretrained on NL corpora, CodeGPT-2 and CodeGPT-adapted are pretrained on CodeSearchNet Lu et al. (2021). Note that CodeGPT-adapted starts from the GPT-2 checkpoint for pre-training.

Results & Analysis

We aim to address the following questions.

Does PLBART learn strong program and language representations from unlabeled data?

Does PLBART learn program characteristics, e.g., syntax, style, and logical data flow?

How does PLBART perform in an unseen language with limited annotations?

Table 5 shows the results of code summarization. PLBART outperforms the baseline methods in five out of the six programming languages with an overall average improvement of 0.49 BLEU-4 over CodeBERT. The highest improvement ($\sim$16%) is in the Ruby language, which has the fewest training examples. Unlike CodeBERT, PLBART is not pretrained on the Ruby language; however, the significant performance improvement indicates that PLBART learns better generic program semantics. In contrast, PLBART performs poorly in the PHP language. The potential reason is a syntax mismatch between the pre-trained languages and PHP. Surprisingly, RoBERTa performs better than PLBART on the PHP language. We suspect that since RoBERTa is pre-trained on natural language only, it does not suffer from the syntax mismatch issue. Overall, in comparison to the Transformer baseline, PLBART improves by an average of 2.76 BLEU-4, and we credit this improvement to the pre-training step.

Code Generation

Table 6 shows the evaluation result on code generation from NL description. PLBART outperforms all the baselines in terms of BLEU and CodeBLEU. While CodeGPT-adapted Lu et al. (2021) achieves the best Exact Match (EM) score, PLBART outperforms CodeGPT-adapted by a large margin in terms of CodeBLEU. This result implies that PLBART generates significantly more syntactically and logically correct code than all the baselines.

Figure 2 shows an example of code generated by PLBART. The difference between the reference code and the generated code is from line 6 onward. In the reference code, loc0 is returned directly, whereas in the generated code the same loc0 is returned in an else block. Looking closely, in the reference code, line 6 is executed only if the condition in line 3 (i.e., loc0 == null) is false. In the generated code, loc0 is likewise returned only if the condition in line 3 is false, making the generated code semantically equivalent to the reference code.

To study whether PLBART learns code syntax and logical flow during pre-training or fine-tuning, we perform an ablation study where we use subsets of the training examples (10K, 20K, and 50K) to fine-tune PLBART on this task. As Table 6 shows, with only 10K examples, PLBART outperforms all baselines in terms of CodeBLEU. This ablation shows that PLBART learns program syntax and data flow during pre-training, resulting in effective performance on downstream tasks even when fine-tuned on a small number of examples.

As shown in prior works Yin and Neubig (2017); Chakraborty et al. (2020), generating syntactically and logically correct code has been a big challenge in program generation. We conjecture that PLBART’s large-scale denoising sequence-to-sequence pre-training helps it understand program syntax and logical flow, and therefore enables PLBART to generate syntactically and logically valid code.

Code Translation

Table 7 presents the evaluation results on code translation. PLBART outperforms all the baselines w.r.t. EM, BLEU, and CodeBLEU. PLBART improves over CodeBERT by 9.5% and 10.5% when translating from Java to C# and C# to Java, respectively. Although PLBART is not pretrained on the C# language, there is significant syntactic and semantic similarity between Java and C#. Thus PLBART understands C# language syntax and semantics. However, such similarities are non-trivial, making the Naive copy and PBSMT baselines perform very poorly in both translation tasks.

Figure 3 shows an example where PLBART’s generated C# code does not exactly match the reference; however, they are semantically equivalent. In the reference, the else block (lines 4-9) is equivalent to the else if block (lines 4-7) in the generated code. In addition, start is generated as a function parameter and used in the function body, equivalent to start_1 in the reference code. This further corroborates the syntactic understanding of PLBART and its ability to reason about the data flow in source code. We present more qualitative examples in the Appendix.

In the program repair task, both the input and the output are in the same language. While the input is buggy code, the output should be the target bug-free code. Thus, in this task, exact match is the critical metric. As shown in Table 8, PLBART generates 17.13% and 74.03% more correct bug fixes than CodeBERT on the Java$_{small}$ and Java$_{medium}$ datasets, respectively. On the other hand, PLBART performs comparably to GraphCodeBERT, which uses structure-aware pre-training to learn program syntax and semantics.

Classification

In both the clone detection and the vulnerability detection tasks, PLBART outperforms CodeBERT. We present the results in Table 9. In the vulnerability detection task, code semantics is the most critical feature Zhou et al. (2019); Chakraborty et al. (2020). Since PLBART is not pretrained on the C/C++ language, its improved performance compared to the Transformer baseline is a testament that PLBART can identify semantics beyond the specifics of a language’s syntax. Moreover, PLBART’s improved performance over CodeBERT and GraphCodeBERT confirms its effectiveness in program understanding in addition to its generation ability.

We acknowledge that neither PLBART nor CodeBERT is state-of-the-art in vulnerability detection, as graph-based models perform best in this task. In this evaluation, our goal is to study how well PLBART understands program semantics in an unseen language for a different type of task (classification rather than generation).

Related Work

Transformer Vaswani et al. (2017), a sequence-to-sequence architecture that includes an encoder and decoder, has shown tremendous promise in natural language processing (NLP), computer vision, software engineering, and more. Devlin et al. (2019) first proposed to pre-train a large Transformer architecture, called BERT, to learn representations of natural language using large-scale unlabeled data in a self-supervised fashion. Later, BERT’s task-independent pre-training approach was rigorously studied Devlin et al. (2019); Liu et al. (2019); Solaiman et al. (2019); Feng et al. (2020); Sun et al. (2019); Li et al. (2020). While BERT-like models have shown effectiveness in learning contextualized representations, they are not very useful in generation tasks. GPT Radford et al. (2018) style models improve upon BERT for generative tasks with autoregressive pre-training; however, unlike BERT, they are not bidirectional. Lewis et al. (2020) introduced BART, a denoising autoencoder that uses a bidirectional encoder and an autoregressive decoder. Similar to BART, PLBART uses denoising pre-training to cope with generative tasks and learns multilingual representations of programming and natural language jointly.

Deep Learning in Software Engineering. There has been growing interest in automating software engineering (SE) using deep learning in the last few years. Vast sources of code in open source repositories and forums make deep learning feasible for SE tasks. Code Summarization Movshovitz-Attias and Cohen (2013); Allamanis et al. (2016); Iyer et al. (2016); Alon et al. (2019a); Hu et al. (2018); Harer et al. (2019); Ahmad et al. (2020), Bug Detection Ray et al. (2016); Li et al. (2018b); Russell et al. (2018); Zhou et al. (2019); Chakraborty et al. (2020), Program Repair Chen et al. (2019); Chakraborty et al. (2020); Lutellier et al. (2020), Code Translation Chen et al. (2018); Drissi et al. (2018); Xu et al. (2020), Clone Detection Zhang et al. (2019); Yu et al. (2019); Wang et al. (2020), and Code Completion Li et al. (2018a); Hellendoorn and Devanbu (2017); Parvez et al. (2018) are some of the tasks that are addressed with deep neural solutions. While most of the prior approaches use task-specific representation learning, a few works Alon et al. (2019b); Feng et al. (2020); Guo et al. (2021); Lachaux et al. (2020); Clement et al. (2020) attempted to learn transferable representations in an unsupervised fashion. Closest to our work, CodeBERT Feng et al. (2020) is pre-trained on bimodal data to capture the semantic interaction between the input modalities (i.e., program and natural languages). More recently, GraphCodeBERT Guo et al. (2021) improves upon CodeBERT by leveraging data flow in source code. In contrast, PLBART is pre-trained on large-scale data using denoising autoencoding to learn the program and natural language representations that make it effective for a broad spectrum of software engineering tasks.

Conclusion

This paper presents PLBART, a sizeable pre-trained sequence-to-sequence model that can perform program and language understanding and generation tasks. PLBART achieves state-of-the-art performance on various downstream software engineering tasks, including code summarization, code generation, and code translation. Furthermore, experiments on discriminative tasks establish PLBART’s effectiveness on program understanding. We also show that, due to pre-training, PLBART learns crucial program characteristics, such as syntax, identifier naming conventions, and data flow. In the future, we want to explore ways to fine-tune PLBART on all the downstream tasks jointly.

Broader Impact

Automation in software engineering is paramount in increasing programmers’ productivity. Reducing the tedious work in developers’ daily routines would give them more time to solve significant problems for society’s wellbeing. There are numerous program-and-language applications in the software development lifecycle, such as code documentation/summarization, code synthesis, and translating code across languages, that can be automated to facilitate software engineering. The availability of large-scale data (thanks to open source repositories, forums, and millions of contributors worldwide) opens up the opportunity to solve many of those problems in a data-driven fashion. PLBART aims at program-and-language applications that demand a complete syntactic and semantic understanding of source code and associated textual data. For the tasks we have evaluated, PLBART will serve as a solid and replicable baseline to guide future research. We also believe our work could be an excellent starting point for future works that aim at solving a variety of software engineering problems.

Acknowledgments

We thank the anonymous reviewers for their helpful feedback. We also thank the UCLA-NLP group for helpful discussions and comments. This work was supported in part by National Science Foundation Grants OAC 1920462, CCF 1845893, CCF 1822965, and CNS 1842456. Any opinions, findings, conclusions, or recommendations expressed herein are those of the authors, and do not necessarily reflect those of the US Government or NSF.