PyMT5: multi-mode translation of natural language and Python code with transformers

Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan

Introduction

Software is a keystone of modern society, touching billions of people through services and devices daily. Writing and documenting the source code of this software are challenging and labor-intensive tasks; software developers need to repeatedly refer to online documentation resources in order to understand existing code bases to make progress. Developer productivity can be improved by the presence of source code documentation and a development environment featuring intelligent, machine-learning-based code completion and analysis tools.

Recent progress in natural language processing (NLP), especially encoder/decoder-based transformer models Vaswani et al. (2017) and pre-training Radford et al. (2018); Lewis et al. (2019), has led to state-of-the-art performance on language modeling, classification Devlin et al. (2018), translation Raffel et al. (2019), summarization Liu and Lapata (2019), grammar correction Bryant et al. (2017), entity recognition, dialogue generation Budzianowski and Vulić (2019), and more. Along with these quantitative advances has come a deeper understanding of the learned hidden representations which power transformers Kovaleva et al. (2019); Voita et al. (2019); Clark et al. (2019); Ethayarajh (2019). While they are arguably not ‘natural,’ programming languages are increasingly becoming playgrounds for NLP modeling. Since these languages by definition have a grammar, syntax, and known relationships between entities, they offer enticing opportunities for an even deeper probing of NLP models and tasks. Beyond theoretical importance, many NLP tasks have practical utility in software development environments: language modeling or generation can be used for code completion Raychev et al. (2014); Bruch et al. (2009); Svyatkovskiy et al. (2019, 2020), translation/summarization to generate documentation or natural language summaries Moreno et al. (2013); Scalabrino et al. (2017); Wan et al. (2018); Alon et al. (2018) or even summarize a set of code changes Moreno et al. (2014), translation and grammar error correction to patch and detect bugs Zhai et al. (2019), and joint embedding of code and natural language for code search Husain et al. (2019); Gu et al. (2018).

In this work we focus on jointly modeling both source code (Python) and concomitant natural language documentation (docstrings) with transformers, through the study of dual tasks: generating method code bodies from signatures and docstrings, and generating docstrings from signatures and method code bodies. While previous work Allamanis et al. (2015); Yin and Neubig (2017) has leveraged the grammar of code to extract features like the Abstract Syntax Tree for modeling (treating code and natural language as separate modalities), we follow examples like Barone and Sennrich (2017) and treat Python and its docstrings as fundamentally no different than other ‘natural’ languages, representing both source code and natural language docstrings as sequences of tokens sharing the same vocabulary. Here we present a multi-mode translation method resulting in PyMT5, the Python method text-to-text transfer transformer (inspired by the text-to-text transfer transformer T5 Raffel et al. (2019)). Our single model can both learn code/language generation and understand the relationships between them.

The paper is organized as follows: we begin in sec. 2 by presenting examples of the performance of our novel multi-mode PyMT5 (the Python method text-to-text transfer transformer model), which we trained to translate between all pairs of combinations of method signatures, docstrings, and bodies which do not have the same feature in both the source and target. In sec. 2.1 we describe our training data and the pre-processing steps we followed for source code and natural language, and compare it to existing parallel docstring-method corpora like CodeSearchNet (CSN) Husain et al. (2019) and that presented by Barone and Sennrich (2017). In sec. 2.2 we explain our BART-like Lewis et al. (2019) pre-training scheme, demonstrating a 25 $\times$ speed-up in training time for docstring generation. Next, in sec. 2.3 we analyze and classify Python docstrings, enabling style-conditioned docstring generation in PyMT5. In sections 3 and 4, we discuss PyMT5 results on method generation and docstring generation respectively and compare it to two GPT2 models, one randomly initialized and one pre-trained on English.

Multi-mode training

Figure 1 shows examples of inputs and outputs of our model PyMT5 for 3 example tasks: (top, blue) predicting a body from a method signature, (middle, red) predicting a whole method from a natural language docstring, and (bottom, green) predicting a body from a signature and docstring. Note that the comment ‘# target ’ instructs the model to choose a particular form of output. Further note that PyMT5 correctly learns to interpret natural language: it interprets ‘even’ as being related to ‘(example %2) == 0’, and ‘greater than 1000’ as ‘number > 1000’. The model also produces syntactically correct code (as we will discuss later, we never show the model syntactically incorrect code), and correctly infers the types of ‘lst’ and ‘numbers’ to be iterables containing numbers.

PyMT5 can also be prompted with source code to produce a docstring summary in various styles. Figure 2 shows the model prompted with one of the methods generated by PyMT5 in Fig. 1 (top, blue), in both a ‘one line’ (top, blue) style and a ‘Numpydoc’ (bottom, red) style. It infers the intent from the signature name and code, and even infers that the type of the argument is a list and the return type is int. It produces the same terse one-sentence summary of the function in both cases.

In order to teach PyMT5 to maximally relate the separate method features (signatures, docstrings, bodies), we trained it to translate between all pairs of feature combinations in which the same feature does not appear in both the source and target. This scheme is also advantageous as our corpus is unbalanced, with only 1/5 of methods featuring docstrings, and so the model can learn to leverage all the features whether they are present or not. Additionally, it has been shown that code is more ‘predictable’ than natural language Hindle et al. (2012). If the method and argument names are a dominating signal due to their relatively rigid structure, the model may learn to ignore the content of docstrings. The multi-mode method overcomes this by also training the model to generate method bodies from docstrings alone. See the appendix for a more detailed description of the multi-mode training scheme.
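To make the pairing rule concrete, the following minimal sketch enumerates every (source, target) pair of non-empty feature combinations that share no feature. It is a literal reading of the rule stated above, not the authors' code; the names `FEATURES`, `nonempty_subsets`, and `translation_pairs` are hypothetical, and the combinations actually used in training are those given in the paper's appendix.

```python
from itertools import combinations

# The three method features named in the text.
FEATURES = ("signature", "docstring", "body")


def nonempty_subsets(items):
    """Yield every non-empty combination of items, preserving their order."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def translation_pairs(features=FEATURES):
    """List (source, target) feature combinations that share no feature."""
    pairs = []
    for source in nonempty_subsets(features):
        # Only features absent from the source may appear in the target.
        remaining = tuple(f for f in features if f not in source)
        for target in nonempty_subsets(remaining):
            pairs.append((source, target))
    return pairs


PAIRS = translation_pairs()
for source, target in PAIRS:
    print(" + ".join(source), "->", " + ".join(target))
```

Under this reading, three features give 12 ordered pairs, including docstring-only sources, which is the case that forces the model to attend to natural language rather than to identifier names.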

Our data consists of 118k GitHub repositories, which include all public repositories labelled as containing primarily Python source code, featuring at least 10 stars, and which have had a commit in the past 5 years. We successfully cloned 112k of these repositories, extracting 5.3 million Python files from the default HEAD state of each repository. We then removed literal duplicate files, resulting in 2.3 million unique files, but did not remove finer-grained clones. After removing license text from the files, the literal contents were used in the pre-training step, comprising about 27GB of raw text.
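Removing literal duplicates can be sketched as hashing each file's bytes and keeping the first file seen per digest. The paper does not say how duplicates were detected, so the use of SHA-256 and the function name `deduplicate_files` are assumptions made for illustration.

```python
import hashlib


def deduplicate_files(files):
    """Keep the first file seen for each distinct byte-for-byte content.

    `files` maps a path to the file's raw bytes; the result maps the kept
    paths to their contents, in the original iteration order.
    """
    seen = set()
    unique = {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[path] = content
    return unique


example = {
    "repo_a/util.py": b"x = 1\n",
    "repo_b/util.py": b"x = 1\n",   # literal duplicate of the first file
    "repo_c/util.py": b"x=1\n",     # near-clone: different bytes, so it is kept
}
unique_files = deduplicate_files(example)
```

Because only exact byte matches collapse, near-clones that differ in whitespace survive, which matches the statement that finer-grained clones were not removed.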

In order to extract method-level information for fine-tuning, we used the Python 3.7 standard library module ast to produce the file-level Abstract Syntax Tree (AST) for each Python file, extracting every individual function and class method. For each file which failed to parse, we used 2to3 and autopep8 to overcome the issue of different styles and white space or tab conventions, successfully parsing 97.3% of the 2.3 million unique Python files. We used the Python module astunparse to take the AST for each method and unparse it back into source code, so that our fine-tuned model was never trained on syntactically incorrect code. The statistics of our method-docstring corpus are summarized in Table 1. Our parallel method-docstring corpus is twice as large as the next largest irrespective of language and over 15 $\times$ as large as the next largest Python parallel corpus, both in CSN.
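The extraction step can be illustrated with a short sketch that parses a file, walks the AST, and unparses each function back into source. It is not the authors' pipeline: it substitutes the standard-library `ast.unparse` (available from Python 3.9) for the third-party astunparse module named above, it ignores decorators and return annotations, and the record layout and the name `extract_methods` are hypothetical.

```python
import ast


def extract_methods(source):
    """Return a record for every function or method defined in `source`.

    Raises SyntaxError if the source does not parse.
    """
    records = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        docstring = ast.get_docstring(node)
        body = node.body
        # Drop the leading docstring expression so the body holds only code.
        if docstring is not None:
            body = body[1:]
        keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
        records.append({
            "name": node.name,
            "signature": f"{keyword} {node.name}({ast.unparse(node.args)}):",
            "docstring": docstring,
            # Unparsing normalizes formatting and drops comments.
            "body": "\n".join(ast.unparse(stmt) for stmt in body),
        })
    return records


SAMPLE = '''
class Counter:
    def count_even(self, lst):
        """Count the even numbers in lst."""
        return len([x for x in lst if x % 2 == 0])

def helper(a, b=1):
    return a + b  # comments do not survive unparsing
'''

RECORDS = {record["name"]: record for record in extract_methods(SAMPLE)}
```

Since every body is regenerated from a successfully parsed tree, anything that reaches the training set is syntactically valid by construction.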

For each method, we ignored comments as they generally represent trivia and are not part of the normal language syntax. We cleaned the docstrings by removing non-ASCII characters, normalizing Unicode, and replacing commit hashes, file paths, and URLs with placeholder tokens. In all studies here, we randomly split the files at the repository level (to prevent data leakage) with 90% for training, 5% for validation, and 5% for a test set.
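A repository-level split can be sketched as shuffling the list of repositories and assigning each whole repository to exactly one set. The seed, the rounding, and the function name `split_repositories` are assumptions; the paper states only the 90/5/5 proportions and that the split was random and made at the repository level.

```python
import random


def split_repositories(repo_names, seed=0, train_frac=0.90, valid_frac=0.05):
    """Assign whole repositories to train/valid/test so none spans two sets."""
    repos = sorted(set(repo_names))
    random.Random(seed).shuffle(repos)
    n_train = round(len(repos) * train_frac)
    n_valid = round(len(repos) * valid_frac)
    return {
        "train": repos[:n_train],
        "valid": repos[n_train:n_train + n_valid],
        "test": repos[n_train + n_valid:],
    }


ALL_REPOS = [f"org/repo{i}" for i in range(100)]
SPLIT = split_repositories(ALL_REPOS)
```

Every file and method then inherits the set of its repository, so code copied between files of one project cannot appear on both sides of the train/test boundary.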

2.2 Pre-training

The majority of our Python methods (over 20 million) do not possess docstrings. This imbalance is, in fact, an opportunity in light of the recent trend in NLP: unsupervised pre-training of language models on vast amounts of raw text Devlin et al. (2018). Using these pre-trained models as starting points for downstream tasks (like classification, translation, summarization, and question answering) consistently yields state-of-the-art results Lewis et al. (2019); Raffel et al. (2019).

Following this trend, we use a span-masking objective similar to that used by the recent text-to-text transfer transformer (T5) Raffel et al. (2019). As shown in Figure 3, after tokenizing the inputs, we sample a random subset of token spans of up to length 3, replace each with a numbered mask token (e.g. [MASK0]), and then teach the sequence-to-sequence model to fill in the missing tokens. The training target consists of the numbered mask tokens, each followed by the tokens that mask represents.
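The masking scheme can be sketched as follows. The span-start probability `mask_prob`, the uniform choice of span length, and the function name `mask_spans` are assumptions, since the text specifies only the maximum span length of 3 and the numbered mask tokens.

```python
import random


def mask_spans(tokens, mask_prob=0.15, max_span=3, seed=None):
    """Replace random spans (length 1..max_span) with numbered mask tokens.

    Returns (source, target): the source has each chosen span collapsed into
    one [MASKi] token, and the target lists each [MASKi] followed by the
    tokens it replaced.
    """
    rng = random.Random(seed)
    source, target = [], []
    position, mask_id = 0, 0
    while position < len(tokens):
        if rng.random() < mask_prob:
            span = rng.randint(1, max_span)
            mask = f"[MASK{mask_id}]"
            source.append(mask)
            target.append(mask)
            # A span starting near the end is simply cut at the last token.
            target.extend(tokens[position:position + span])
            position += span
            mask_id += 1
        else:
            source.append(tokens[position])
            position += 1
    return source, target


TOKENS = "def count_even ( lst ) : return sum ( x % 2 == 0 for x in lst )".split()
SOURCE, TARGET = mask_spans(TOKENS, mask_prob=0.5, seed=1)
```

Substituting each mask in the source with its span from the target recovers the original token sequence, so the target carries exactly the information the source is missing.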

The architecture of PyMT5 is an encoder-decoder transformer with a vocabulary of 50181 (byte-pair BPE encoder trained on raw Python files), 6 self-attention layers in each of the encoder and decoder, and a hidden dimension of 1472, totaling 374 million parameters. All the experiments in this paper, including GPT2, were done using this same extended GPT tokenizer. We pre-trained PyMT5 on 27GB of raw source code in total, for 3 weeks on sixteen 32GB Tesla V100 GPUs, or 73 epochs total. When training on docstring generation alone, we observed 25 $\times$ faster convergence to a lower loss when starting with this pre-trained model as compared to a random initialization. See the appendix for details. In all experiments PyMT5 is trained starting with this pre-trained model.

2.3 Docstring analysis

When examining docstring samples from our corpus, one of the most salient features is the different styles of documentation. The Python community has no prescribed or de facto style for docstrings, but Python Enhancement Proposal (PEP) 257 Goodger and van Rossum (2001) does describe one-line and multi-line docstrings, and mandates indentation as well. Most modern large-scale projects utilize docstring styles which are parseable, allowing the automatic creation and synchronization of source code and documentation websites; see, e.g., Sphinx. Therefore, a number of standard styles have evolved in the community.

The currently dominant parseable docstring styles (and the ones supported by sphinx) are reStructuredText (reST) Jones (2013), the official Google style Google (2020), Numpy style (also technically satisfies reST) Maintainers (2020), and javadoc style jav (2011). The difference between each style is mainly in the syntax of denoting sections (if they exist) and the name/type/description annotation of the method arguments and returned/yielded quantities (if they exist). We defined, in addition to these styles, one-line (containing only one line), one-paragraph (containing no empty lines), and ‘other’ to label any docstring not described so far, which includes informal user docstring styles and a few project-specific styles like the SAGE mathematics toolkit library.
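A rule-based classifier for these seven labels might look like the sketch below. The regular expressions are hypothetical heuristics chosen for illustration; the paper does not list the rules it used, and real docstrings would need more careful handling. Numpy style is tested before reST because, as noted above, Numpy docstrings can also satisfy reST.

```python
import re

# Hypothetical section and field markers for the four parseable styles.
_STYLE_PATTERNS = [
    ("numpy", re.compile(r"^\s*(Parameters|Returns|Yields|Raises)\s*\n\s*-{3,}\s*$", re.M)),
    ("google", re.compile(r"^\s*(Args|Arguments|Returns|Yields|Raises):\s*$", re.M)),
    ("rest", re.compile(r"^\s*:(param|type|returns?|rtype|raises?)\b", re.M)),
    ("javadoc", re.compile(r"^\s*@(param|return|returns|throws)\b", re.M)),
]


def classify_docstring(docstring):
    """Label a docstring with one of the seven styles described in the text."""
    text = docstring.strip()
    for style, pattern in _STYLE_PATTERNS:
        if pattern.search(text):
            return style
    lines = text.splitlines()
    if len(lines) <= 1:
        return "one-line"
    # No empty lines means a single paragraph.
    if all(line.strip() for line in lines):
        return "one-paragraph"
    return "other"


EXAMPLES = {
    "one-line": "Count the even numbers.",
    "one-paragraph": "Count the even numbers\nin the given list.",
    "numpy": "Count evens.\n\nParameters\n----------\nlst : list\n",
    "google": "Count evens.\n\nArgs:\n    lst: numbers\n",
    "rest": "Count evens.\n\n:param lst: numbers\n:returns: count",
    "javadoc": "Count evens.\n\n@param lst numbers\n@return count",
    "other": "Count evens.\n\nSee the wiki for details.",
}
LABELS = {name: classify_docstring(text) for name, text in EXAMPLES.items()}
```

A label produced this way could serve as the style annotation for each docstring in the corpus.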

Table 2 shows the breakdown of the fraction of each of these styles in our corpus. The plurality of docstrings (44%) are one-line. The next most common style is one-paragraph at 14%. The next four most-common styles are the machine parseable styles discussed above, comprising 26.2% of the total number of docstrings. The appendix contains detailed distributions of method signature, docstring, and method body character and line lengths.

To visualize the space of these styles, we used FastText vector embeddings of the docstrings, obtaining a 100-dimensional continuous vector representation of each. We then used PCA to reduce the dimensionality to 50 and applied t-distributed stochastic neighbor embedding (t-SNE) to obtain a two-dimensional visualization. Figure 4 shows 1/10th of our corpus (700k docstrings) embedded, colored by docstring style as defined above. We can see clear clustering of styles, indicating that similar docstrings use the same style (for the parseable styles). There is also a natural dichotomy between parseable and non-parseable styles: the left side is dominated by ‘one line,’ ‘one paragraph,’ and ‘other’ styles, and the four parseable styles are largely on the right side. This observation can be used to generate documentation consistent with the style of a given project, or it could be used to translate methods into more informal descriptions useful for search indices.

Method generation

Now we turn our attention to method generation: predicting a whole method code body from either a method signature, a natural language docstring, or both. We first discuss a benchmark of this task using a GPT2-medium model (345 million parameters, see the appendix for details), trained both from scratch and starting from the publicly released OpenAI English pre-trained checkpoint with weights from HuggingFace Wolf et al. (2019). In all experiments we used an extended GPT2 tokenizer, including white-space (one tab, two tabs, etc.) tokens, for a total vocabulary size of 50337, and we used beam decoding with a beam width of 5.

The third row of tab. 3 shows PyMT5 has more than double the BLEU score, overall better recall, and significantly better ROUGE-2 and ROUGE-L F-scores than our GPT2 baselines. Further, 93.6% of the methods generated by PyMT5 were syntactically correct Python 3.7, whereas only 86% of GPT2 methods were syntactically correct. PyMT5 was trained on 16 Tesla V100 16GB GPUs for 62 epochs, or 5 weeks of training time (see the appendix for its hyper-parameters), and the GPT2 baselines were trained on the same hardware for 1 week of training time (achieving the same or better validation loss/perplexity as PyMT5).
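The syntactic-correctness figure can be computed by attempting to parse each generated method. This is a minimal sketch with hypothetical function names; it checks against the grammar of whichever interpreter runs it, whereas the figures above refer to Python 3.7 specifically.

```python
import ast


def is_valid_python(code):
    """Return True if `code` parses as Python under the running interpreter."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes.
        return False
    return True


def syntax_pass_rate(samples):
    """Fraction of generated samples that are syntactically valid."""
    if not samples:
        return 0.0
    return sum(is_valid_python(sample) for sample in samples) / len(samples)


GENERATED = [
    "def is_even(x):\n    return x % 2 == 0\n",
    "def broken(x)\n    return x\n",  # missing colon after the signature
]
RATE = syntax_pass_rate(GENERATED)
```

Parsing checks only syntax, so a method that passes may still be semantically wrong.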

The English pre-trained initialization of GPT2 only slightly beats the random initialization of GPT2, which could indicate that the learned biases of English are not particularly beneficial for writing Python code; the metrics are almost all within our margin of error. Note that Barone and Sennrich (2017) also modeled methods from docstrings, obtaining a similar BLEU score of 10.9 on their own Python parallel corpus. On the Barone et al. test set, PyMT5 obtains nearly double these scores at 20.2; such a large discrepancy could be explained by data leaking from their test set into our training set. Barone’s test set is also 200 $\times$ smaller than ours and may not be a representative sample of the whole Python code domain.

The third and fourth rows of tab. 3 show the performance of PyMT5 using the publicly available CSN Python test set, from which we find notably worse results than on our own test set. CSN curated their whole set by removing any methods with ‘test’ in the name and any methods with fewer than 3 lines of code. We calculated the performance of PyMT5 only on a subset of our test set curated the same way as CSN, observing F-scores for R1, R2, and R-L on our test set of 29.7, 17.2, and 26.1, which is lower than our nominal test set performance of 35.1, 21.5, and 32.2 and closer to the CSN performance of 28.4, 13.5, and 24.8. We believe this curating choice explains the difference between our test set and the CSN test set. We also conclude that tests and short methods are ‘easier’ to complete, which is plausible, and bodes well for automatic code completion applications.
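The curation used for this comparison can be approximated with a simple predicate. The details are assumptions: the sketch matches ‘test’ case-insensitively in the method name and counts non-blank body lines, which may differ from how CSN itself implements the filter, and the function name is hypothetical.

```python
def passes_csn_style_filter(name, body, min_lines=3):
    """Mimic the curation described above: drop 'test' methods and short bodies."""
    if "test" in name.lower():
        return False
    code_lines = [line for line in body.splitlines() if line.strip()]
    return len(code_lines) >= min_lines


METHODS = [
    ("test_parse", "a = 1\nb = 2\nassert a < b\n"),
    ("get_name", "return self.name\n"),
    ("count_even", "total = 0\nfor x in lst:\n    total += x % 2 == 0\nreturn total\n"),
]
KEPT = [name for name, body in METHODS if passes_csn_style_filter(name, body)]
```

Here only `count_even` survives: the first method is excluded by its name and the second by its length.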

Docstring Generation

We now examine results from the docstring generation task, which for evaluation purposes was conditioned on both signatures and method bodies. As in method generation, we set a GPT2 benchmark with random initialization and pre-trained English initialization, using the same hyperparameters. Table 4 shows that the ROUGE scores of the two GPT2 baselines are within the margin of error of each other, a somewhat surprising result given the English domain of docstrings. The third row shows PyMT5 to be superior to GPT2-medium in terms of BLEU and all of the ROUGE metrics.
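For readers unfamiliar with the ROUGE-N scores reported here, the sketch below computes n-gram precision, recall, and F-score between a generated and a reference docstring. It uses whitespace tokenization and the standard clipped-overlap definition; the paper does not state which ROUGE implementation or tokenization it used, so the resulting numbers are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F-score) of clipped n-gram overlap."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection clips each n-gram to its minimum count.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


CANDIDATE = "count the even numbers"
REFERENCE = "count even numbers in the list"
R1 = rouge_n(CANDIDATE, REFERENCE, n=1)
R2 = rouge_n(CANDIDATE, REFERENCE, n=2)
```

In this example all four candidate unigrams occur in the six-word reference, giving a ROUGE-1 precision of 1, recall of 2/3, and F-score of 0.8, while only one of the three candidate bigrams matches.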

We again present the results from the publicly available CSN test set. Similar to the method generation task, PyMT5 performs worse on the CSN data than our own, likely for the same reasons we discussed in sec. 3. We also evaluated PyMT5 on the Barone et al. parallel test set, as shown in the second to last row of tab. 4, and find PyMT5 performs notably worse on Barone’s test set than our own test set, contradicting the hypothesis that our doubling of the method generation BLEU score is due to data leakage. PyMT5 has a much higher BLEU score than that reported by Barone et al, perhaps indicating real progress in the code summarization field.

Docstring generation is similar to code summarization, though the domains are different as docstrings also contain structured annotations of arguments, return values, raised exceptions, and even in-line unit tests (doctest). TranS3 by Wang et al. (2020) reports a best ROUGE-L of 51.27 on the same test set for code summarization, but does not specify which statistic is being reported, so we cannot make strong conclusions about the performance of PyMT5 compared to the state of the art.

Conclusion

In this work, we presented a novel multi-mode Python method text-to-text transfer transformer model, PyMT5, as well as the largest parallel corpus of Python source code and docstrings reported in the literature to date. We have trained PyMT5 to translate between all pairs of combinations of method signatures, docstrings, and method bodies which do not have the same feature in both the source and target. Further, we introduced control token prefixes for docstring generation to facilitate docstring generation of various styles. Focusing on two modeling tasks (predicting Python methods from docstrings and summarizing Python source code methods into docstrings of various commonly occurring styles), we have compared this new approach to the auto-regressive GPT2 baselines trained on individual docstring or method generation tasks. On the CodeSearchNet test set PyMT5 achieves a BLEU score of 8.59 for method generation and 16.3 for docstring generation, and a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation. We have demonstrated the effectiveness of dynamic masked pre-training, reducing docstring generation training time by 25 $\times$. Looking forward, we plan to leverage PyMT5 for various downstream automated software engineering tasks, including code documentation and method generation from natural language statements, and to develop more model evaluation criteria that leverage the unique properties of source code.

Acknowledgements

We would like to thank the Microsoft Cloud and AI SmartML engineering team for help in preparing the data, Shao Kun Deng for the development of compelling user experiences leveraging PyMT5, and Christian Bird for useful discussions.

Appendix A

Figure 5 shows the distributions of various features of docstrings in our corpus. The top row is the distribution of total character-level length of the method signatures (left), docstrings (center), and code bodies (right). The blue lines are for methods possessing a docstring, and we can see that the vast majority of these methods have docstrings with more than 10 characters. The bottom row shows the distribution of line lengths of the concomitant features from the top row. While the most common line length of docstrings is 1 (comprising 41%), the majority of docstrings have multiple lines.

A.2 Pre-training details

Figure 7 is the complete training script, using the Facebook AI Research Sequence (FairSeq) modeling library, with which we pre-trained PyMT5. The data was pre-noised and processed using the fairseq-preprocess command, and placed in the directory indicated by $DIR. The architecture and training hyper-parameters are set in this script. PyMT5 was trained with the same hyperparameters, but with the data described in sec. A.4.

Figure 7 shows learning curves of a single seq2seq model of the same architecture as PyMT5 trained only on docstrings, starting from random initializations, and starting from our pre-trained model. As the figure shows, the pre-trained initialization converged to a better validation loss 25 $\times$ faster than the randomly initialized model.

A.3 GPT2 training details

Our GPT2 experiments also used the FairSeq library, with the OpenAI English checkpoint supplied by the HuggingFace library. Figure 8 shows the complete training script, where for the English pre-trained initialization a pre-trained checkpoint was provided. Each model was trained on 4 Tesla V100 GPUs with 16GB of memory each, for 7 days.

A.4 Multi-mode training details

In order to better teach PyMT5 to understand the relationships between all the different features of code (signatures, docstrings, and bodies) we taught it to translate between all pairs of combinations of these features which do not contain the same feature in both the source and target. In this way, the model can learn to produce method bodies using both signatures and docstrings, or one or the other. Table 5 spells out exactly which combinations were provided to the model as a source and target. For each source example the comment string ‘# target (