Code Prediction by Feeding Trees to Transformers

Seohyun Kim, Jinman Zhao, Yuchi Tian, Satish Chandra

I Introduction

The idea of code prediction in general is to predict some code element, given code surrounding the point of prediction. Code prediction is commonly used in an IDE for autocomplete, where based on the code already written up to the developer’s cursor position, the IDE offers the most likely next tokens, perhaps as a drop down list to choose from as shown in IDE views in Fig 1. Other forms of code prediction could predict missing tokens at arbitrary code locations or predict larger units of code; in this paper we will concern ourselves with prediction of the immediate next token at the cursor position.

Consider the Python code fragment shown in Fig 1. Suppose a developer has written code up to string following by a dot. At this point, it will be helpful for the IDE to prompt the developer with attribute names that are likely to follow, preferably, with atoi ranked at the top because in this case that is the correct next token.

Developers have come to rely on autocomplete in their IDEs for multiple reasons. First, and most obviously, it saves the effort of typing in the next token(s) in the IDE. For this reason alone, most modern IDEs come with at least some autocomplete facility for the languages they support. Notice that a top-ranked suggestion is often selectable by hitting a tab—as would be the case in Fig 1(c)—whereas the lower ranked suggestions have to be selected by scrolling (Fig 1(a,b)), which is more effort. Thus, providing the right completion at the top of the list, or if not, among the top few, is important.

Second, autocomplete is also a powerful code discovery mechanism. For instance, a developer might not know the name of an API call they need off the top of their head, but is able to choose among the choices shown by an autocomplete tool. Without assistance from IDE, the developer might need to change their mental context, go to Stack Overflow or some other web site, and come back to the IDE. However, the code discovery assistance from autocomplete works only when a contextually appropriate code suggestion is offered among the top choices in the list, because developers do not have the time to go through a comprehensive list of completions.

I-B Machine Learning for Code Prediction

It is clear that effective autocomplete requires the intended next token to be predicted at the top of the list, or as close to the top as possible. Type-based autocomplete (e.g. Eclipsehttps://www.eclipse.org/pdt/help/html/working_with_code_assist.htm and Jedihttps://github.com/davidhalter/jedi) returns a list of type-compatible names, but does naive ranking: alphabetically or with simple count-based statistics. For instance, the autocomplete model used in the IDE (Jedi, which ranks type-compatible suggestions alphabetically) shown in Figure 1(a) ranks atoi fairly low. The example shows why one of the earliest attempts of code prediction, using type-based methods, is not very effective. Additionally, for dynamic languages, it is extremely difficult to gather an accurate type-compatible list of tokens that could occur in a context.

These limitations have motivated the use of machine learning for code prediction, as machine learning methods are able to base their predictions on the naturalness of code. Early approaches adapted $n$ -gram language models on linearized source code tokens . More recently, deep neural networks have been applied to code prediction, surpassing $n$ -gram models. The most common neural technique for code prediction is Recurrent Neural Networks (RNNs) and their variants , where the code represented as a linear sequence is fed as input to the model. Fig 1(b) shows how an RNN model does better than a non-ML alphabetical ranking Fig 1(a) in showing the expected item closer to the top.

Researchers have also investigated using the syntactic structure of code for prediction, as opposed to seeing code as text: both using probabilistic graphical models (probabilistic context-free grammars and probabilistic higher-order grammars ), as well as using deep learning .

I-C Is this a solved problem?

Although predictive models cannot be expected to be perfect, the accuracy of the current state-of-the-art methods leaves substantial margin to be improved. For instance, a typical RNN-based method provides less than 37% mean reciprocal rank (this equates to the correct answer being in the top $(37\%)^{-1}\approx 2.7$ results, Sec IV-C) on the py150 benchmark. Improving this metric is exactly the goal of this paper. We report that our techniques are able to suggest the correct next tokens at ranks better — showing correct result to the developer by 0.5 to 1 ranks higher — than those achieved by previous methods (MRR increase of 14% to 18%.)

As a concrete example: Table I shows the ranks of the various non-punctuation tokens to be predicted for the code in Figure 4, using various recent methods, as well as for our work. Specifically, the rank of atoi is predicted at rank 1 only by the new methods we propose in this paper.

I-D Feeding Trees to Transformers

To improve the accuracy of next token predicted, a number of alternatives come to mind. First, we could strengthen the neural architecture alone. For instance, researchers have suggested adding attention to RNNs to compensate for loss of signal on long range dependence. Without long-range dependence a model would be making a (potentially sub-optimal) decision only on the most recent tokens. Transformers handle long-range dependencies better. In the NLP community, Transformers have achieved state-of-the-art results , outperforming RNNs, for a variety of NLP tasks such as language modeling, question answering, and sentence entailment. For us, since training Transformers did not take much more resources that training RNNs (Table IV), we decided to proceed with Transformers.

An orthogonal way to improve the accuracy is to enable the machine learning system to “see” more code structure. Raychev et al. had found that—for the code prediction problem—a non-neural but AST-aware engine could outperform RNNs. In the same spirit, Alon et al. had found—for code summarization problem (though not for code prediction in their paper)—that embedding the AST structure of code vastly outperformed purely sequence-based methods.

Inspired by the above, we explore how to leverage code structure while using Transformers on code. This is not obvious; we cannot simply jam an AST into the Transformer, which is a sequence processing model. We explore two models that represent two ways to capture the (partial) structure of an AST: one, based on decomposing the tree into paths (PathTrans), and the other, based on a tree traversal order (TravTrans). We also investigated a variant of TravTrans that takes into account even more tree structure.

I-E Key Results

Tab II compares the new models we introduce in this paper (in bold), with three state-of-the-art models from previous work (Sec II-A discusses these models.) We report results based on training and evaluating on the py150 dataset, and for each, we report accuracy in mean reciprocal rank (MRR) as a percentage (see Sec IV).

Our best model TravTrans, which communicates the tree structure to the Transformer, significantly outperforms all previous models for code prediction, with improvements in reciprocal ranks:

from 43.9% to 58.0% when comparing a non-neural tree based model Deep3 vs. TravTrans;

from 43.6% to 58.0% when comparing Code2Seq vs. TravTrans;

from 36.6% to 54.9% when comparing an RNN implementation SeqRNN vs. TravTrans comparing TravTrans against SeqRNN is slightly different than comparing it against Deep3 or Code2Seq, hence the difference in MRR. We discuss in detail in Sec V.;

We also evaluated our trained model on a dataset selected from a Python code repository internal to a Facebook, and found the relative benefits of the Transformer models to be similar to those on the py150 dataset. This indicates that the relative advantage of Transformer models carries over to other datasets. (Sec V, Table VIII)

Deep learning models can be rather opaque as to their working. To better understand whether TravTrans is taking advantage of the tree structure, we employed saliency maps to examine where the model focuses its attention when making a prediction. We found that indeed, the model tends to focus on the most relevant parts of the tree, starting from the parent node (Sec VI).

I-F Contributions

We move the state-of-the-art forward in accurate next token prediction, a capability increasingly expected in modern IDEs.

We describe ways of using Transformers for the task of next token prediction, especially ways that profitably communicate the syntactic structure of code. (Sec II). Although there have been previous work in applying Transformers in the context of code (code summarization , code correction , and code translation ), this paper is the among the first to explore and evaluate Transformers for code (next token) prediction.

We present a systematic comparison of our proposed models with the most effective models from prior workIncidentally, this paper is also the first to compare the three prior works on a common dataset. that are applicable to next token prediction, on a widely available Python dataset py150. The findings clearly indicate a 14% to 18% gain in accuracy with our best model relative to prior state-of-the-art.

We provide a preliminary attribution study in an attempt to understand the prediction given by our best performing model, TravTrans. This study (Sec VI) indicates that TravTrans indeed conditions its predictions on the most pertinent tokens in the context. To our knowledge, this kind of model interpretability analysis for autocomplete is also a first.

Our overall conclusion is that Transformer based models over ASTs provide the best prediction power for autocomplete.

I-G Outline

Sec II provides background on the previous models that we use in our evaluation, along with a primer on Transformers. Sec III explains the Transformer-based models of our own creation. Sec IV describes our datasets and implementation. Sec V presents our quantitative results. Sec VI takes a closer look into why our models worked well (or did not). Sec VII lists some threats to validity. Sec VIII discusses prior related work in the field of code prediction and Transformers. We conclude the paper with our future work in Sec IX.

II Background

In this section, we define the code prediction task we examine in this work, followed by details of previous state-of-the-art methods of code prediction we use for comparison. We end the section with a brief introduction to the original Transformer model. We will refer to the nodes of the AST for Fig 4, as shown in Fig 3.

Code prediction task studied in this work is to predict the next code unit given the partial program up to the point of prediction. Let $p^{*}(unit\mid ctx)$ be the empirical distribution of code unit given the partial program context $ctx$ . Our task is to learn to approximate $p^{*}$ using a machine learning model $M$ . In our proposals, $M$ will be some Transformer $Trans$ . The learned distribution can be viewed as

where $\boldsymbol{\theta}$ represents the trainable parameters of the model. We train the models by minimizing the KL-divergence between $p$ and $p^{*}$ , or equivalently, minimizing the cross-entropy loss $l$ over all code prediction locations.

We explore several ways of representing partial program and predicting different kinds of code units. When using source code token as code unit and representing partial program as sequence of source code tokens (SeqTrans), the problem aligns with the traditional notion of language modeling – predicting the next token in a sequence given all previous tokens: $p(t_{i}\mid t_{1},\dots,t_{i-1})$ where $t_{i}$ is the $i$ -th token.

More interestingly, we explore various representations of a partial program to better utilize its AST information. The intuition is that the more we can utilize the syntactic information provided by the AST, the better we can predict the next token. The next section discusses three models used in previous work.

II-B SeqRNN

For next token prediction, a popular method is to feed the source sequence tokens into an RNN (or LSTM) . An RNN embeds the input tokens into a vector: $\boldsymbol{x_{t}}=\mathit{emb}(\mathit{{w}_{t}})$ , where $w_{t}$ is the source token seen at the $t$ ’th time step. The hidden state $\boldsymbol{h}_{t+1}$ at the $(t+1)$ -th time step is computed as $\boldsymbol{h}_{t+1}=rnn\left({\boldsymbol{x}_{t},\boldsymbol{h}_{t}}\right)$ , where $rnn$ is a trainable RNN unit. The last hidden state is then fed through a classification layer.

The pertinent point to note is that the hidden state $\boldsymbol{h_{t}}$ encodes the knowledge of not just the current token, but of last several (and theoretically all) previous tokens via the propagation of information in previous hidden states.

In our experiments, we feed the source code tokens into an LSTM and call this model SeqRNN.

II-C Deep3

Raychev et al. presented a system, Deep3, based on a learned decision tree combined with count-based probabilities at the leaves of the decision tree.

Fig 4 shows part of a learned decision tree, written in the form of program in a specialized language called TGEN. Given an AST $t$ and a starting node $n$ , a TGEN program walks certain paths in $t$ starting from $n$ . For example, Up WriteValue (line 1) goes to the parent of $n$ and records the label. At the end of a TGEN program is a probability distribution for the possible values of the starting node. For example, starting with node 29, the TGEN program predicts “atoi” with 60%, “split” with 30%, etc.

A TGEN program is learned—on a specific corpus—by a genetic search procedure that simultaneously selects paths and grows the decision tree from the training data, with an entropy minimization objective. In this paper, we use their pretrained model as well as their Python dataset for our experiments.

II-D Code2Seq

Code2Seq is a model by Alon et al. that embeds code snippets by embedding AST paths in a neural network.

At a high-level, given an AST, Code2Seq creates path representations for all leaf-to-leaf paths. For example, Fig 5 shows three leaf-to-leaf paths for nodes 22-29 from the full AST (Fig 3). For each path, a path representation is created with: 1. the starting leaf value, tokenized by snake and camel case, 2. the path itself, and 3. the ending leaf value, also tokenized. 1 and 3 are embedded using LSTMs, and 2 is embedded using bi-directional LSTMs. These three embeddings are concatenated and then fed through a feed forward network. Finally, all of the path represention embeddings in the AST are combined using a simple attention mechanism.

In Code2Seq, a decoder is then used to solve a code summarization task: given a method body, how well can Code2Seq generate the correct method name? The training proposed in is not well suited for next token prediction. In code summarization, a set of leaf-to-leaf paths needs to be created one time for a method. By contrast, in code prediction, a new set of leaf-to-leaf paths has to be created for each point of prediction.

For example, to predict atoi (node 29) in Fig 3, we must first create an representative embedding for the partially completed AST up to node 29, using all leaf-to-leaf paths available up to node 29. Paths that end in atoi are also used, with atoi replaced with a placeholder token to prevent information leak (e.g. Paths 2 and 3 in Fig 5). The representative embedding is then fed through a classification layer to generate predictions.Note that this is different than the generative decoder that Code2Seq uses.

By treating each point of prediction as a separate data point (compared to a language model, where one sequence is considered one data point), the number of training data points, along with the effort to create them makes Code2Seq computationally very expensive.

II-E A Primer on Transformers

Here we present a brief introduction of Transformers. Readers familiar with Transformers can skip ahead to Section III.

Transformers belong to a class of deep neural networks that are designed for sequence processing. In Transformers, information from any previous location of the sequence can directly affect the encoding of the next token, through a mechanism called self-attention, which helps greatly improve the connectivity in long sequences.

To be precise, a Transformer is a stack of Attention blocks (AttnBlk) preceded by an input embedding layer (Emb) and followed by a classification layer (Clsfr), where AttnBlk is repeated $n_{block}$ times.

See Fig 6 for a schematic of a Transformer with embedding layer, a stack of (here 6) attention blocks, and finally a classification layer.

The self-attention layer—which constitutes the main part of an attention block—is the crux of the model. The intuition here is to attend to the elements in the input sequence in proportion of their relevance to the location being predicted. For example, take an example input token sequence [“map”, “(”, “string”, “.”], and the target next token being “atoi.” It is first fed through the initial embedding layer to give: $\boldsymbol{E}=\left[{e_{\text{map}},e_{\text{(}},e_{\text{string}},e_{\text{.}}}\right]$ . Then, we feed $\boldsymbol{E}$ to three fully-connected networks ( $\boldsymbol{W_{q}},\boldsymbol{W_{k}},\boldsymbol{W_{v}}$ ) to create query, key, and value embeddings:

Fig 6 depicts the vectors $\boldsymbol{Q}$ comprised of its elements $q_{\text{map}},q_{\text{(}},q_{\text{string}},q_{\text{.}}$ (in green), and likewise for $\boldsymbol{K}$ and $\boldsymbol{V}$ .

The self-attention then works by querying keys $k$ using queries $q$ and then using the result to summarize values $v$ through the attention function:

where $d_{k}$ is the dimension of key vectors. Here, $\boldsymbol{Q}\boldsymbol{K}^{\top}$ is used to determine which token relationships are the most important, resulting in a matrix of size $n\times n$ , where $n$ is the length of the input sequence. Each row is then normalized and passed through a softmax layer. Table III shows an example of the self-attention weights. Looking at the last row, we can see that most of the attention is given to “.”, meaning it has a greater factor in predicting the next token “atoi”. Note how the matrix is a lower triangular matrix: this is because self-attention cannot be applied to tokens that have not been seen before. After multiplying this matrix with the value vector, in our example, $\text{Attn}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\left[{0.2*v_{\text{map}},0.1*v_{\text{(}},0.2*v_{\text{string}},0.4*v_{\text{.}}}\right]$ .

Transformers also uses multiple heads of these self-attention blocks, called multi-headed attention, which enables the model to simultaneously consider different ways of attending to previous information within one block and also across other blocks. In our implementation, we omit positional encoding (see Sec IV-B.) For more details, please refer to and .

The next sections discuss various ways of feeding code fragments into this Transformer architecture.

III Our work: Transformer-based Models

The question that interests us is: can Transformer-based models also benefit from syntactic structure, and if so, how can we communicate the syntactic structure to Transformer?

In this section, we first begin with a model SeqTrans that uses a Transformer to take source code tokens as input. Then we introduce two other models, PathTrans and TravTrans, that use more syntactic information obtained from the AST.

Our first model is to apply a Transformer over source token sequences, which can be easily obtained by applying a tokenizer. Here the input is the partial program represented as source token sequences and the output is a source code token. This is a straightforward application of the original Transformer design, and functions as a baseline for our later attempts that take on more AST information. The model can be written as

where $\boldsymbol{o}$ is a distribution over all possible tokens, and $\boldsymbol{e}_{t}$ is the embedding for source token $t$ for every $t$ in the source token sequence. As we show in the experiments, SeqTrans turns out to be an already strong model as a direct comparison to the baseline RNN model SeqRNN.

Since Transformers are originally designed as a sequential model, the challenge becomes finding ways to convey AST information to Transformers. In the next subsection, we will vary the inputs and the outputs to the Transformer, but the principles of operation will remain the same as in SeqTrans.

III-2 PathTrans

PathTrans enhances SeqTrans by exposing tree structure to the Transformer via root-paths. A root-path is the path from the leaf node $t$ to the root of the AST by traversing up its ancestors, recording all the nodes along with it, and thus a sequence of internal AST nodes. Fig 7 shows an example of an input datapoint for predicting node 29 of Fig 3.

The root-paths are first fed into a an LSTMWe could have used a Transformer to embed the path sequences in lieu of an LSTM, but since the path sequences are short (capped at 13 tokens) enough for LSTMs to perform adequately well, we decided to use an LSTM. See Sec IV-B for details. added with the embedding of the leaf node, and is fed through a Transformer:

where $\boldsymbol{o}$ is a distribution over all possible leaf nodes, and $\boldsymbol{e}_{t}$ is the embedding for AST node $t$ and $\boldsymbol{p}_{t}=\text{LSTM}\left({\left({\boldsymbol{e}_{u}}\right)_{u\in Rootpath(t)}}\right)$ is the summarized representation of root-path from $t$ for every leaf node $t$ in the leaf sequence $\mathit{leaf\_seq}$ . $+$ is used here following the convention of Transformer computations and to keep the embedding dimension the same for every component. The hope here is that the root-paths captures the local syntactical information and thus can help the prediction.

Since the points of prediction are the leaf nodes of the AST, the loss is taken over only the leaf AST nodes, and it predicts only leaf tokens.

III-3 TravTrans

As a Transformer naturally only takes a sequence as input, we provide the AST nodes as a sequence in pre-order traversal, or a depth-first-search (DFS) order. For Fig 3, for node 29, the previous nodes in DFS order would be: […, “Call”, “NameLoad”, “map”, “AttributeLoad”, “NameLoad”, “string”, “Attr”].

where $\boldsymbol{o}$ is a distribution over all possible tokens, and $\boldsymbol{e}_{t}$ is the embedding for AST token $t$ for every $t$ in the partial program represented as a AST token sequence $\mathit{AST\_seq}$ in DFS order.

III-4 Capturing even more AST structure?

TravTrans presents the tree nodes in a pre-determined order, but still does not retain detailed structural relationship between nodes. For example, consider the sequence of nodes 26 - 28 in Fig 3. This would be represented as [“NameLoad”, “string”, “attr”], the three nodes appearing consecutively in DFS order. Looking at the AST, we can see that the relations between (“NameLoad” & “string”, and “string” & “attr”) are actually quite different: “NameLoad” is one node up from “string”, while “string” is two nodes up and one node down from “attr”.

We create a new model, TravTrans+ that enhances TravTrans by capturing these richer path-based relations. Similarly to Hellendoorn et al. , we enhance the self attention block of the Transformer with a matrix $\boldsymbol{R}$ that captures the (unique) path needed to reach from $a$ to $b$ . This path is represented abstractly only in terms of up and down moves:

where $i$ , and $j$ are the number of up and down nodes, respectively, node $a$ has to travel to reach node $b$ . For example, UDpath(29, 27) = $U^{2}D^{2}$ for node 29 and 27 in Fig 3. $\boldsymbol{R}$ is introduced to the Transformer by replacing the Attn function with the following $\text{Attn}_{\text{TreeRel}}$ function.

where $\odot$ is element-wise product. This provides a way for the self attention to consider the previous tokens, taking into account the AST relationship between pairs of nodes as well.

IV Implementation and Datasets

We train our models using the py150 dataset used in . The dataset consists of 150k Python 2 source code files from GitHub repositories, along with their parsed ASTs, split into 100k for training and 50k for evaluation. In this work, we slightly modify the AST to ensure that the internal nodes only carry syntactic types and the leaf nodes only carry token values. To incorporate large trees (greater than 1000 nodes, which is the limit we chose for transformer window), we deploy a technique adopted by , which slices a large tree into shorter segments with a sliding window to maintain part of the previous context.

We evaluate our models on two evaluation datasets:

py150: We use the evaluation dataset used in , which consists of 50k Python ASTs. After the modifications mentioned above, there are 16,003,628 leaf nodes.

internal: We also created an evaluation dataset consisting of 5000 Python files from a code repository internal to Facebook. With this dataset, we can evaluate how our trained model can generalize to a different dataset, even if the code comes from disjoint projects. After the modifications, there are 1,669,085 leaf nodes.

IV-B Implementation

For the models that use Transformers (SeqTrans, PathTrans, TravTrans, TravTrans+), we adapt the Pytorch implementation https://github.com/graykode/gpt-2-Pytorch. of GPT-2 small . We use six Transformer blocks, six heads in each block, $n\_ctx=1000$ , and $embedding\_dim=300$ . We borrow other hyperparameters from . We limit the token vocabulary size to 100k, which covers over 90% of the tokens used in the training dataset. For PathTrans, we limit the maximum length of the path from leaf node to root to be 13, which covers over 90% of the nodes. For any path longer than 13, we keep the nodes closest to the leaf, and truncate the nodes near the root.

In our implementation, we did not use positional encoding or positional embedding to provide extra positional information over elements since our early trials with LeafSeq suggested positional embedding is rather hurting than helping. This is also supported by the claims in that positional encoding does not help for language modeling. Recently, tried to introduce tree structures to Transformer models via positional encoding. However, their relative improvement is small compared to what we see with tree-relational prior in Section V.

For SeqRNN, we adapt the PyTorch LSTM example implementation https://github.com/pytorch/examples/tree/master/word_language_model. We use embedding dimension $d_{\text{model}}=300$ , with $dropout=0.5$ and $n\_layers=1$ . We maintain the same vocabulary size at 100k.

For Code2Seq, we used a PyTorch adaptation of the publicly released modelhttps://github.com/tech-srl/code2seq, using the same hyperparameters, except changing the vocab size to 100k. For selecting 200 (max number of paths) paths per AST, we first picked paths that ended with the target (to maximize the amount of local context). Since for each prediction point in the AST, a new set of leaf-to-leaf paths have to be generated, the data processing for Code2Seq takes a substantial amount of time (magnitude of days worth of time).

We trained all models on Nvidia Tesla V100 (using 4 GPUs at a time) until the loss converged, with all of the parameters randomly initialized. We used the Adam optimizer with the learning rate set to 1e-3. Implementation details regarding number of epochs until convergence, training time (minutes per epoch), inference time (to evaluate over the py150 dataset), and model size, are listed in Table IV.

For the Deep3 model, since the authors have shared only the model and not the training algorithm, we used the model pretrained on py150.

IV-C Evaluation Metric

We evaluate the models on next token prediction for the leaf tokens. We report numbers for all leaf token predictions, as well as breaking down into more interesting categories: attribute access, numeric constant, name (variable, module), and function parameter name.

To measure performance on these tasks, we use mean reciprocal rank (MRR). The rank is defined as

where $n$ is the number of predicting locations and $rank_{i}$ is the rank of the correct label given by the model for the $i^{th}$ data point. We present MRR as a percentage, in keeping with prior work .

While Acc@1 only gives score when the correct label is ranked at the top, MRR also give scores when the true label is not ranked as the top, but among top few prediction. Comparing to the hit-or-miss style metric (Acc@1), this is closer to the realistic scenario when completion suggestions are presented to developers. With this practical perspective and for ease of computation, we only consider $rank_{i}\leq 10$ for each location $i$ (all $rank_{i}>10$ will have a score of 0). We share our data processing scripts and model implementations at https://github.com/facebookresearch/code-prediction-transformer.

V Evaluation

RQ1: Given a source token sequence, does a Transformer work better than an RNN, as in previous work? Comparing SeqRNN and TravTrans, we find that a Transformer does work better than an RNN. For the py150 dataset, we can see a significant improvement in MRR for predicting all leaf tokens in Table V, from 36.6% to 50.1% for the SeqRNN and SeqTrans models, respectively. The same holds for comparing on the internal dataset, as shown in Table VI: 23.8% vs 36.5%. Consistent improvements can be seen for specific types of leaf tokens.

RQ2: Do the Transformer models on tree outperform previous work on AST-based prediction?

We compare Deep3 and Code2Seq against PathTrans, and TravTrans (Table VII). Overall, we found that both transformer-based models, achieve better scores than both Deep3 and Code2Seq for all leaf tokens as well as for specific types of leaf tokens. Our best performing model, TravTrans, improves Deep3’s MRR by 14.1% (from 43.9% to 58.0%), and Code2Seq’s MRR by 14.4% (from 43.7% to 58.0%). Similar results can be seen for the internal dataset (Table VIII).

We also compared the accuracy on AST internal predictions, comparison Deep3 and TravTrans, as they are the only models with this capability. In Table IX we see that TravTrans improves accuracy over Deep3 across the board. Table IX show non-terminal value prediction for both py150 and internal dataset, respectively. We see that TravTrans is outperforms Deep3 for all internal node types as well. We do not include Code2Seq comparison for non-terminal node predictions due to the overhead required to prepare and process the dataset. Since the main part of the paper was on leaf token prediction, and we have shown that TravTrans performs significantly better than Code2Seq, we did not deem it essential to include the results on non-terminal value predictions.

RQ3: Does adding syntactic structure help? Comparing SeqTrans and TravTrans, we confirm that adding syntactic structure does help.

Comparing SeqTrans against TravTrans on leaf nodes fairly is a bit tricky. A source code token we are looking at here contains both syntactic type and lexical value, which translates to two nodes in the AST. For example, the source code token “2” is equivalent to an internal node with value “Num” and its child leaf node “2”. Thus, for a fair comparison, TravTrans has to make two predictions – one for the leaf node and one for its parent internal node. For example, given the sequence “y =”, and the next prediction should be “2”, TravTrans must predict both the internal ”Num” node as well as the leaf ”2” node. For evaluation, we implemented a local beam search to choose the pair with the maximum joint probability. Table V shows that TravTrans still performs better than SeqRNN and SeqTrans (except for predicting function parameter name) for predicting the leaf tokens.

It is important to note that even with this correction, it is tricky to completely fairly compare the accuracy of SeqTrans against TravTrans model on leaf nodes. The main issue here is that the contexts used to predict a leaf value is different in the two models, because of different order of seeing information. Consider the case of a code fragment x = y + z, and let us say we need to predict what comes after the “=”. In the case of source token based prediction, the predictor has seen “x =” and would be expected to produce the next token (“y”). In an AST, one would first construct an ”Assign” internal node, with the right child to be filled next, and the next prediction would be actually an interior node ”BinOpPlus”. The “y” would be predicted as the left child of BinOpPlus. In a context in which the AST has been augmented with the BinOpPlus node, the prediction of “y” has more information compared to what a source token model did immediately after the “=”. This makes a direct comparison of token predictions difficult. One way to achieve better parity between the models would be to use an in-order traversal instead of a pre-order traversal; we chose the latter because we want to do top-down generation, which will be useful in future AST generative models.

We also compare TravTrans with its variant TravTrans+ that extends the model to incorporate even more tree structure, as mentioned in Sec III. Table X shows a slight increase in MRR: 58.0% vs 58.8% for all leaf nodes, and 87.3% to 91.9% for internal nodes. Due to mixed benefit on leaf node predictions, we continue to treat TravTrans as our flagship model. But this does hint at the existence of more powerful models, if we incorporate more tree structure.

VI Model Inspection

Despite their wide adoption in many areas of computing, deep learning models have been repeatedly shown susceptible to adversarial examples . Interpretability studies that provide human-understandable justifications, have thus become an important property in building safe and trustworthy AI systems. With the growing adoption of neural models for software engineering tasks, the same concern raises. Recent discoveries of unexpected behaviours of deep learning models of code calls attention for such studies. Besides of helping understand the behaviour of deep learning models, interpretability methods have also been used to directly improve the process of software engineering. For example, used LIME , a model-agnostic explainability method, to pinpoint the defective lines of code from defect detection models.

As a crucial first step towards interpretability, in this section, we reveal what TravTrans has learned that leads to its good predicative power. We found that TravTrans has learned to attribute the prediction to relevant previous tokens. While this is not a principled study, we hope it sheds light on a possible approach to perform a complete model inspection study.

Although our Transformer-based models heavily relies on attentions, direct visualizations of weights in individual attention heads did not yield clear insights as the attentions are stacked across layers and multiple attention heads are in effect in each layer (see Sec II-E). We thus turn to gradient-based saliency maps , for a more comprehensive account of the influence of each input token. Following , the influence of each input token is computed by taking the partial derivative of the loss function with respect to the embedding vector of that token. Fig 8 is the saliency map we got for TravTrans, which visualizes the magnitudes of the gradients fall at each input token ( $x$ -axis) when the model predicts a particular output ( $y$ -axis). Intuitively, the larger the value for a particular token, the more sensitive the output is to the variations at that input token.

We first observe that the parent AST node (the internal node right above the leaf) is generally important in predicting the next leaf node value: the last token usually has a high influence. More generally, we found that TravTrans tends to attribute more towards the internal nodes that are directly followed by relevant leaf nodes. For example, we can see the prediction of atoi is influenced by (the locations previous to) string, map and num_requests. It is not shown in the figure but the predictions of sys and argv before 2 are influenced by the previous occurrence of the same values followed by 0 and 1. The prediction of host, gethostbyname are influenced by (the location previous to) ip. For the position of gethostbyname, TravTrans predicts socket with probability 0.31, with the correct answer ranked second with slightly lower probability 0.24. All above suggest that TravTrans is capable of picking up information from relevant previous code locations. The reason TravTrans tends to focus more on the parent nodes may be that all relevant information for the next token are already summarized in the hidden representation at the location previous to the token (aka parent nodes for the leaf nodes).

On an orthogonal note, we also observe that for many predicting locations, the magnitude of gradients are very small, suggesting the robustness of the model in the sense that it is less sensitive to minor perturbations of the input sequence.

VII Threats to Validity

One of the challenges in next-token prediction is predicting tokens that are rare or unseen in the training corpus.

When trained on py150, 21% and 35% of prediction locations observe OOV tokens for the py150 and the internal test set, respectively. None of the models in our main evaluation handle OOV predictions and thus are unable to correctly predict for such locations.

Handling of OOV tokens, while important, is orthogonal to core model design, and is outside the scope of our current work. Adding OOV handling to a model should further improve its performance, as suggested by previous works . Moreover, a quick comparison (see Sec VIII) against PointerMixture from found that TravTrans already has an edge even without OOV handling for the py150 dataset. We thus do not expect OOV handling to reverse our findings.

While larger Python corpora have appeared, py150 is still sizable at 500MB; we do not expect the larger corpora to reverse our findings.

We have only carried out evaluations on Python, and have not demonstrated that our results would carry over (in trends) to other languages. The Deep3 paper did find their results (in trends) to roughly carry over from Python to JavaScript.

To bound the scope of this paper, we limit our investigation to techniques that do not require any compiler support beyond constructing the AST; thus, we exclude ways to communicate program analysis results such as def-use information, as in . We also limit our study to the Transformer architecture, and purposely exclude the graph neural networks (GNN), because there is already an active debate between GNN and Transformer for NLP applications that transcends code prediction.

VIII Related Work

We compared our models extensively with three baselines: SeqRNN, Deep3 and Code2Seq, both qualitatively in Sec II and quantitatively in Sec V. Before proceeding to a broader discussion, we briefly compare our work with the attentive pointer mixture model (PointerMixture) from Li at al. as another plausible baseline. We are particularly interested in this model, because it (1) works with serialized AST node sequences, (2) uses attention mechanism on top of LSTM, and (3) has reportedly achieved better accuracy than Deep3’s results.

The similarities and differences in relation to our work are as follows. For (1), the preorder traversal of the AST they used is similar to ours. They did not separate types and values into different nodes as we do, which makes their sequences more compact than ours. For (2), their attention is applied within a fixed-sized window on top of LSTM outputs, whereas our Transformer-based models use stacked layers of attentions as the key mechanism for computing throughout the network. also used “parent attention”, which is based on the parent-child relation in the AST. In comparison, our PathTrans and TravTrans+ extend far beyond the direct parent-child relation by making use of paths.

For (3), the accuracy reported in included the predictions of null values for internal nodes, whereas our numbers only consider leaf nodes with non-empty values. To fairly compare the predicative power between our proposal and theirs, we implemented PointerMixture in our setting.To maintain consistency, the hidden size and the embedding sizes were set to 300, and the vocab size was increased to 100k. The model was trained for 8 epochs. Note that PointerMixture uses a pointer network which decides for each point of prediction whether to use the LSTM output or to copy from a previous location, whereas our TravTrans has no means for handling out-of-vocabulary (OOV) tokens (i.e. all predictions requiring an OOV token are incorrect.)

We found on the basis of py150 dataset that PointerMixture outperforms SeqRNN, as well as the other baselines of Deep3 and Code2Seq. However, TravTrans outperforms PointerMixture, even though all the OOV predictions count as wrong for TravTrans! For the internal dataset, PointerMixture model does better than TravTrans for some prediction types. This is expected: we trained the model for illustration on a different dataset (py150) than the one for evaluation (internal), causing a significantly higher percentage of OOV predictions. In actual usage, we would retrain our models on the internal dataset. The results are shown in Table XI, focusing on PointerMixture vs. TravTrans.

Since we have already introduced a brief history of autocomplete in the paper, this section will focus on other explorations in the Transformer and code prediction space.

In this paper, we viewed the context of prediction as code that appears strictly before the cursor . Among other flavors of code prediction are the ones where code after the prediction location, when available, is taken into account , e.g. when completing a “hole” in a program, or correcting a misused local variable instance. Another dimension of work considers prediction at varying granularities of predictions, e.g. from characters to subtokens ) to AST fragments (e.g. sub-ASTs ). In the context of autocomplete, we believe that subtokens or BPE are indeed a promising future direction, as we mentioned in Sec IX.

There have also been work discussing the practical implications of applying a code prediction tool into production. state that synthetic benchmarks are not representative of real-world data, and accuracy of the models drops when evaluated on real-world data. discusses approaches to make models more lightweight to allow for faster computations and less memory usage in an IDE. A user evaluation of an autocomplete tool with stronger ML models is outside the scope of this paper. We would also like to point out that advantages in idealistic settings often transfer to advantages in more practical settings, as confirmed by the results reported in the above two works (Table II and III in , and Table 2 in ).

Other than next token prediction, Transformers have been used recently for code summarization . Furthermore, there has been a surge of interest since 2019 in extending Transformer models to handle beyond sequential structures for NLP . It has been shown that taking tree structure into account helped code correction and code translation .

There is practical interest, outside of academic literature, in the topic of code prediction using Transformers. Galois is an open source project that uses GPT-2 for code prediction. TabNine™ published a blog post in July 2019 mentioning the use of GPT-2 in their code prediction but revealed no technical detail. Recently a team from Microsoft also published their ongoing efforts on applying Transformers for code autocomplete. We believe we are among the first to systematically evaluate Transformers for code prediction and compare them to previous models.

There are many other uses of deep learning techniques for code, beyond code prediction. These include techniques for code summarization , bug finding , repair and many others. An interesting aspect of this body of work is in the different ways in which they represent a program as an input to a neural architecture. These representations have ranged from linear token sequence (as for code prediction ), to paths in an AST , and sometimes even ways to convey static analysis information to the neural network .

IX Conclusion and Future Work

In this paper, we presented ways to using the Transformer for code prediction. We showed that the Transformer outperforms existing models for code prediction, and when supplied with code’s structural information, we are able to get even better predictive power. Attribution study show that our best model tends to focus on relevant code locations for prediction.

In the future, one avenue we wish to continue working on is handling out-of-vocabulary words better. Source code presents a difficulty shared with NLP in handling large vocabularies and rare words. The token/word to be predicted in test data may not appear in the training data. This is even more challenging when predicting identifiers, such as method names, variable names, as developers can come up with arbitrary identifier names. Possible mitigation includes copying mechanism and open-vocabulary models .

This paper focused on predicting the next token, as it is already a challenging task. In future, we also want to explore predicting multiple tokens at a time, i.e. autocompleting entire expressions.