Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges

Triet H. M. Le, Hao Chen, M. Ali Babar

Introduction

Deep Learning (DL) has recently emerged as an important branch of Machine Learning (ML) because of its incredible performance in Computer Vision and Natural Language Processing (NLP) (Goodfellow et al., 2016). In the field of NLP, it has been shown that DL models can greatly improve the performance of many classic NLP tasks such as semantic role labeling (He et al., 2017b), named entity recognition (Peters et al., 2017), machine translation (Vaswani et al., 2017), and question answering (Mikolov et al., 2015). Similarly, source code is a special type of structured natural language written by programmers (Hindle et al., 2012), which can be analyzed by DL models. Machine intelligence that understands and creates complex structure of software has a lot of applications in Software Engineering (SE). Specifically, Big Code is the research area that uses ML and DL for source code modelinghttp://science.dodlive.mil/2014/03/21/darpas-muse-mining-big-code/. To facilitate the building of ML models, software community offers valuable datasets such as online code corpora like GitHub, question answering forums like Stack Overflow as well as documentation of various software and programming tools that are highly structured and rich in content. There have been many applications of ML in SE thesaurus construction (Chen et al., 2017), language models for code (Dam et al., 2016) and information retrieval of answers, projects and documentation (Xia et al., 2015; Ye et al., 2016; Chen et al., 2016). Compared to other domains, DL for Big Code is still growing and requiring more research, tasks and well-established datasets.

Among the Big Code tasks, source code generation is an important field to predict explicit code or program structure from multimodal data sources such as incomplete code, programs in another programming language, natural language descriptions or execution examples (Allamanis et al., 2018a). Code generation tools can assist the development of automatic programming tools to improve programming productivity. Such models can also represent the temporal context and discrete logic of programs. However, traditional source code models were either inflexible (Bielik et al., 2016a; Raychev et al., 2016), time-consuming to design for specific languages/tasks (Gulwani, 2010; Gvero et al., 2013), or unable to capture long-term dependencies of code statements (Raychev et al., 2014a; Allamanis et al., 2015b). Listings 1 and 2 show that the difference of buggy and clean versions of the same code snippet depends on the relative location of the close(file) function with respect to the open(file) function, which is an example of long-term dependency in source code. DL models can help address the aforementioned issues because of their superior ability in extracting features from various data formats (e.g., natural language and symbol sequences) and capturing both syntactic and semantic information at various scales (He et al., 2017b).

There is an extensive review on ML for Big Code focusing on probabilistic models (Allamanis et al., 2018a). Compared to this previous work, here are our major differences:

Extensive coverage of state-of-the-art DL models and their extension to source code modeling and generation

A systematic mapping of Big Code tasks based on their inputs (encoder) and outputs (decoder) for applying DL models

List of datasets for source code modeling and generation tasks

Challenges and future directions for deep source code models

With such significant differences, our paper gives a more holistic view of applying DL for source code modeling and generation as compared to (Allamanis et al., 2018a). Recently, there has also been a review (Pouyanfar et al., 2018) on DL architectures and its applications in various domains such as computer vision, NLP, and social network analysis. Unlike the previous work, our paper focuses more on many practical applications in source code modeling and generation, and then illustrates how DL models and encoder-decoder framework can be used to address such problems. This work will be a useful guide for practitioners and researchers from both DL and SE fields when working with source code.

The remaining of this literature review is organized as follows. Section 2 presents existing language models for source code as well as their limitations, which motivates the use of DL models. Section 3 formulates source code modeling under encoder-decoder framework and describes the important components of such framework. Section 4 highlights the recent practices for building deep source code models. Section 5 reviews DL-based applications for various Big Code tasks. Section 6 presents the available datasets for such tasks. Section 7 discusses the current challenges and proposes some future directions in using DL for source code modeling and generation. Finally, section 8 summarizes the main contributions of this literature review.

Traditional source code modeling approaches and their limitations

We describe four traditional approaches to handle the syntactic and semantic facets of source code including (i) domain-specific language guided models, (ii) probabilistic grammars, (iii) simple probabilistic language models (i.e., $n$ -grams), and (iv) simple neural language models. Several serious issues still associate with these traditional approaches, which can be handled effectively using DL models. More details of each approach and its limitations are covered in this section.

Domain-specific languages (DSLs) are often used to define the parametrization rules and states for generating the structure of a program. DSL-based models create different grammar rules for common code statements (e.g., control flow, comments, and brackets). Compared to a general-purpose programming language, the grammar size of a DSL is usually smaller, which makes it more efficient for specific code generation tasks. DSL-based model has been studied by Gulwani et al. (Gulwani, 2010) and Jha et al. (Jha et al., 2010). Gvero et al. (Gvero et al., 2013) reduced the search space for Scala expression suggestion using a calculus of succinct types and designed a higher-order function for code completion. Since the generation process is human interpretable, this type of model could be a promising approach in SE.

In program induction, a DSL specifies the space of candidate programs (program template), while the example input-output pairs are known as specification. Under this scenario, the problem is known as Inductive Logic Programming (ILP) (Muggleton and De Raedt, 1994). Two classic families of solving ILP are the bottom-up approaches, constructing programs from example features, and the top-down approaches, testing examples from generations and adjusting the rules accordingly (Evans and Grefenstette, 2017). Given the precise and combinatorial nature of induction, the induction is commonly cast as a Constraint Satisfaction Problem (CSP) (Solar-Lezama, 2013). This belongs to the top-down family of ILP (Evans and Grefenstette, 2017). The formal definition of the problem can be found in the work of Pu et al. (Pu et al., 2017). Such a CSP problem can be solved by a constraint solver such as Z3 (De Moura and Bjørner, 2008). The problem with this approach is that the solver is always heuristic and its scalability is poor. Often these systems cannot handle noisy, erroneous, or ambiguous data.

DSL guided models can capture the structure of a specific programming language; however, they require deep domain knowledge to generate the detailed syntactic and semantic rules. One possible way to increase flexibility is to represent a DSL with a probabilistic model.

2. Probabilistic grammars

In formal languages, production rules can be used to define all possible generations of strings (code statements). Context-Free Grammars (CFGs) is a set of production rules that can be applied regardless of context, i.e., the left-hand side of a production rule only contains a single non-terminal symbol. CFG is a common way to define the structure of a programming language, which can then be used to convert its source code into Abstract Syntax Trees (ASTs) (Hopcroft, 2008). Probabilistic Context-Free Grammar (PCFG) is an extension to CFG, in which the production rules of a context-free language are associated with a probabilistic model. Bielik et al. (Bielik et al., 2016a) generalized the idea of PCFG to the task of code completion in other non-context-free languages like JavaScript. Later, Raychev et al. (Raychev et al., 2016) extended the previous work on PCFGs by learning a decision tree to build a probabilistic model using ASTs of a proposed DSL namely TGen. These works achieved strong results for some code completion tasks.

Another extension of CFG, namely Tree Substitution Grammar (TSG), uses tree fragments instead of a sequence of symbols for the production rules. Such rules of TSG are defined by a Tree Adjoining Grammar (Joshi and Rambow, 2003), which can create more flexibility and better represent complex linguistic structures (Cohn et al., 2010). To limit the model complexity and sparse grammar, researchers often use non-parametric Bayes to infer the distributions (Cohn et al., 2010; Allamanis and Sutton, 2014). These models are suitable for pattern mining since their automatic model selection ability allows the discovery of more complex structures. However, non-parametric Bayesian methods are often extremely slow to compute and hard to scale. Although probabilistic grammars achieve high performance for domain-specific languages, they still require manually-designed rules to model code locality and reuse.

3. n𝑛n-gram language models

Besides the two syntactic models above, one simple yet effective statistical language model is $n$ -gram. More specifically, this model assumes each token/word is conditionally dependent on the previous $n-1$ tokens/words as described in the following equation: $P(\mathbf{W})=\prod_{t}P(w_{t}|w_{t-(n-1),\dots,t-1})$ , where $P(w_{t}|w_{t-(n-1),\dots,t-1})$ can be simply computed by counting the occurrences of all $n$ -grams in the training set. A direct advantage of $n$ -grams over syntactic models (e.g., probabilistic grammars and DSL guided models) is that it is easier to generalize since the dependencies and rules of the programming languages are learned automatically from source code. Hindle et al. (Hindle et al., 2012) took the initiative to utilize $n$ -gram to build a language model for source code. Since then, besides code completion (Raychev et al., 2014a; Nguyen et al., 2013a), $n$ -gram models have also been applied to other tasks such as idiom mining (Allamanis and Sutton, 2013), syntax error detection (Campbell et al., 2014), source code analysis (Hsiao et al., 2014) and code obfuscation (Liu, 2016). However, simple language models like $n$ -gram cannot capture high-level programming paradigms.

To address this issue, a line of work has enhanced language models to be more adaptable to local information. Bielik et al. (Bielik et al., 2016b) augmented a DSL-based model with $n$ -gram and showed strong empirical results for programming language modeling. The resulting model was also more efficient in training and inference compared to neural models. Hellendoorn et al. (Hellendoorn and Devanbu, 2017) added a local $n$ -gram cache and merges the predictions of local and global models. Both models were claimed to surpass DL counterparts (e.g., RNN and LSTM) at the time of their publication, thus these two models would be good baselines for source code modeling.

It is noted that $n$ -gram is not good at modeling long-term dependencies (cf. Listings 1 and 2) between tokens. The $n$ -gram truncation discards long-term positional information. Other statistical models, such as Hidden Markov models, also fail to encode long term history (Sutskever et al., 2009), since its state space becomes exponentially large when encoding several previous history tokens into one state. Besides, another limitation with an $n$ -gram model is the sparsity of the vector representation of the word or code token, which is caused by the large vocabulary of source code. This sparsity issue can be solved using the distributed representation of neural language models.

4. Simple neural program models

Instead of incorporating the frequency of each previous token explicitly, neural network embedding models with one hidden layer first convert one-hot encoding of a word into an intermediate word-embedding vector with a much shorter length compared to the vocabulary size (e.g., 100 – 1000). This idea is also known as distributed word representation. In its original work (Schwenk and Gauvain, 2002), the word embeddings of up to $n$ - 1 previous tokens with respect to the current word were fed to a fully-connected neural network with one hidden layer. At the output layer, a softmax function was applied to calculate the probability of the next word. However, a serious drawback of this original model is the high computational cost of the hidden layer. Therefore, log-bilinear was presented (Mnih and Hinton, 2007) to address this challenge by replacing the non-linear activations with a context matrix to determine the context vector with respect to the current word. Then, the similarity between the context vector of the previous tokens and the current word was computed. Later, several methods were proposed to speed up the training and prediction time using hierarchical architecture (Mnih and Hinton, 2009). In this previous work, the log-bilinear model was demonstrated to outperform the traditional fully-connected neural network and $n$ -gram language models for the task of modeling APNews.

Later, the idea of neural network embedding model was adopted by researchers for source code modeling, which can be referred to as simple neural program models. More specifically, Maddison et al. (Maddison and Tarlow, 2014) combined log-bilinear with a tree depth-first search traversal technique (i.e., Log-bilinear Tree-Traversal models) to generate human-understandable source code. Allamanis et al. (Allamanis et al., 2015b) extended Maddison’s approach to retrieve source code snippets from natural language queries and vice versa. Allamanis et al. (Allamanis et al., 2015a) also used a log-bilinear model to recommend method and class names for object-oriented programming in Java and this model outperformed an $n$ -gram model in both of the tasks. Similar to n-gram models, the knowledge of log-bilinear models is limited to the previous $n$ - 1 tokens. Therefore, the above works needed to define the global and local context explicitly for log-bilinear models to capture the short-term, long-term dependencies and sequential property of source code. The list of contexts is still human-crafted and incomplete, thus limiting its applications in new domains. However, with their simplicity, simple neural program models are being used as (pre-trained) input features (cf. section 3.2) for various Big Code tasks.

5. Advantages of deep learning models over traditional approaches

Encoder-decoder framework (Sutskever et al., 2014) with DL models (cf. section 3) can be used to effectively capture the dependencies and sequential property of an input sequence. To be more specific, DL models are suitable for code modeling and generation since they are good at the following four important aspects: (i) automatic feature generation, (ii) capturing long-term dependencies as well as sequential property, (iii) end-to-end learning, and (iv) generalizability. Existing models have to trade-off among these four properties. For example, $n$ -gram models can automatically extract features from source code, but cannot capture long-term dependencies well due to the combinatorial explosion of terms. Similarly, simple neural program models (e.g., fully-connected or log-bilinear models) still require human-designed rules to capture the dependencies and structure of source code, which limits its end-to-end training and generalizability. In contrast, with deep domain knowledge incorporated, DSL guided models and probabilistic grammars can effectively capture the dependencies and sequential properties, but the strictly defined rules make the models harder to generalize and automate the learning process in new domains. Next, encoder-decoder framework using DL models for sequence/code modeling is presented.

Deep sequence modeling with encoder-decoder framework

In NLP, the main objective is to process a large amount of natural language mostly in the form of text or human voice. Code - a means of communication for developers - is similar to natural language (Hindle et al., 2012; Allamanis et al., 2018a), which has syntactic structures and semantic meaning. Specifically, there are various Big Code tasks (cf. section 5) whose input is a sequence (e.g., source code and/or natural language) and prediction target can be either a sequence or just a simple numeric/categorical value. Inspired by the field of NLP, such Big Code tasks can be formulated under an encoder-decoder framework. If both the input and output are sequences, then encoder-decoder framework can also be called seq2seq (Sutskever et al., 2014). The main steps of an encoder-decoder framework are illustrated in Figure 1.

The components for steps 1 and 3 are referred to as encoder and decoder, respectively, which are typically two Deep Learning (DL) models (cf. section 3.1). In the original paper (Sutskever et al., 2014), the final internal state of the encoder is used as the context vector. In more recent works, step 2 is often handled by attention mechanism (Bahdanau et al., 2014) (cf. section 3.1.3) or external memories (Weston et al., 2014) (cf. section 3.1.4) with a richer and position sensitive context sequence. In this section, three main elements of an encoder-decoder learning framework for sequence modeling including (i) DL models, (ii) input embeddings and (iii) stable training of such models are covered. It should be noted that both sequential models (i.e., modeling word-by-word) and structural models (i.e., exploiting syntactic structure of a sentence/code snippet) can be used for sequence modeling. Although section 3 mainly reviews different components of an encoder-decoder framework originally proposed for sequence modeling in NLP, this section is still important because of two reasons: (i) the presented deep models/techniques can be extended to source code, and (ii) only a small portion of such models has been utilized for source code modeling. Based on this section, section 4 subsequently presents the current practices of source code modeling and generation using encoder-decoder framework.

Two main classes of DL models for sequence modeling, namely (i) recurrent and (ii) non-recurrent neural networks are first presented. Then, three techniques to build more robust models including (i) attention mechanism, (ii) external memory and (iii) beam search are covered. It is noted that Multi-Layer Perceptron (MLP) (Rosenblatt, 1958) (a.k.a. fully-connected/feed-forward neural network) is not reviewed in this section since it is an extension of the neural network described in section 2.4 (i.e., with more hidden layers). Thus, MLP is still limited in capturing the dependencies and sequential property of a sequence, and not widely used for sequence modeling unless combined with more advanced techniques such as attention mechanism.

Recurrent Neural Network (RNN) and its variants have been widely used for building language models (Krause et al., 2017; Melis et al., 2017; Merity et al., 2017a). RNN is a special type of deep neural network, in which a block of parameters are shared and repeated many times across different parts of a sequence, resulting in a deep computational graph (Goodfellow et al., 2016). This architecture helps a network to learn with input/output of various lengths that MLPs cannot.

However, vanilla (plain) RNNs are hard to train (Pascanu et al., 2013b) and are not good at keeping past information from different time scales. Gated RNNs, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRUs) (Chung et al., 2014), model the keeping and forgetting mechanisms explicitly with sigmoid activations, namely gates. An LSTM has three gates to control input, output and forgetting, respectively. In addition, there is a memory cell state to generate the hidden states.

RNN units can be made deep to encode more complex transitions (Pascanu et al., 2013a). Highway layers (Srivastava et al., 2015) were introduced to stabilize the training gradients in Recurrent Highway Networks (Zilly et al., 2016). To capture the long-term dependencies in time series and represent hierarchical information, RNNs layers can be stacked with different update frequencies (Koutnik et al., 2014). The gated feedback RNNs (Chung et al., 2015) allow the network to learn its own clock rates by using additional gates. The state-of-the-art RNN for language model is the Fast-Slow RNN (Mujika et al., 2017) that incorporates the strengths of both deep and multiscale RNNs.

1.2. Non-recurrent neural networks

Temporal convolution or one-dimensional convolution across time is another type of neural network that can capture long-term relations with hierarchical architecture (Waibel et al., 1988). It has been applied to sentiment analysis, sentence classification, machine translation and meta-learning (Mishra et al., 2017).

Convolutional Neural Networks (CNNs) have been used in several sentence modeling tasks. In 2013, Kalchbrenner and Blunsom (Kalchbrenner and Blunsom, 2013) used a CNN as the encoder and an RNN as the decoder for dialogue generation. One year later, Blunsom et al. proposed Dynamic CNN (Blunsom et al., 2014) for sentence semantic modeling, where variable length and relation discovery were enabled by max pooling. However, these earlier works failed to achieve similar performance of LSTMs.

Recently, non-recurrent structures have emerged again with similar performance as RNN, but they are faster to compute. Masked convolution layers are used as the decoder in a neural machine translation system (Kalchbrenner et al., 2016a). Gehring et al. (Gehring et al., 2017) proposed ConvS2S that brought skip connection (He et al., 2016) and attention (Bahdanau et al., 2014) (cf. section 3.1.3) to sentence modeling and achieved the state-of-the-art translation performance. Combining recurrent and convolutional units is also useful. He et al. (He et al., 2017a) strengthened the input-to-output correlation by adding cross-layer convolutions to stacked RNNs.

Vaswani et al. (Vaswani et al., 2017) proposed a multi-head attention model called Transformer relying on self-attention and positional encoding to compute the sequence representations. Transformer allows the decoder to attend to information arbitrarily far and reduced training time significantly without quality loss. Like CNN, the causal structure is held by masking later output for the autoregressive factorization. Recently, a deep language representation model, BERT (Devlin et al., 2018), utilizing a novel bidirectional training of a Transformer has achieved the state-of-the-art results in eleven NLP tasks such as sentence classification/tagging, question answering, and named entity recognition. Later, BERT was extended to language modeling (Dai et al., 2019) and language generation (Dong et al., 2019) by addressing its limitations of fixed-length contexts and bidirectional nature, respectively. It is noted that Transformers can be leveraged for contextual (same token with varying usages) embeddings of source code.

Speeding up the sequential generation is another interesting direction. It should be noted that all autoregressive models only generate samples sequentially since they use ancestral sampling. Thus, alternative architectures for rapid, parallel sample generation are required. Gu et al. (Gu et al., 2017a) enabled non-autoregressive learning by sampling a latent variable representing the fertilities, i.e. the usage of each source word in decoding, which required to be supervised by an external alignment system. Inverse-Autoregressive Flows (IAFs) (Kingma et al., 2016) could generate high-dimensional samples in parallel from latent variables. Oord et al. (Oord et al., 2017a) incorporated WaveNet with IAF to create parallel WaveNet, which sampled with higher fidelity, but ran 3000 times faster in audio generation.

1.3. Attention mechanism

One problem of the original encoder-decoder framework is that decoder can only access a single context vector. Human understands text lines by repeatedly attending to different parts of a sequence. To simulate this behavior, Bahdanau et al. (Bahdanau et al., 2014) used a sequence as the context and propose attention mechanism to adapt the weights of context associated with a certain output stage and impose an explicit alignment between input and output tokens. To be more specific, with an encoded sequence $F$ , and at each step $t$ , the hidden state $\mathbf{h}_{t}$ is computed using an RNN model with input source vector $\mathbf{c}_{t}$ generated by attention mechanism as additional input: $\mathbf{h}_{t}=\operatorname{RNN}(\mathbf{h}_{t-1},[\mathbf{e}_{t-1};\mathbf{c}_{t}]).$ This way of incorporating attention is known as early binding. Alternatively, attention can be considered just before generating the output token. A typical soft attention similar to Bahdanau et al.’s (Bahdanau et al., 2014) can be computed as follows:

With the previous hidden state $\mathbf{h}_{t-1}$ , the attention energy is computed with a content-based scoring function $\mathbf{u}_{t}=\mbox{score}(F,\mathbf{h}_{t-1})$ .

Exponentiate and normalize $u_{t}$ to 1: $\mathbf{a}_{t}=\mbox{softmax}(\mathbf{u}_{t})$ .

Compute the input source vector $\mathbf{c}_{t}=F\mathbf{a}_{t}$ .

The simplest way to define the score is dot-product, $\mbox{score}(F,\mathbf{h}_{t-1})=F^{T}\mathbf{h}_{t-1}$ . Or an expected input embedding $V$ can be defined so that $\mbox{score}(F,\mathbf{h}_{t-1})=F^{T}V\mathbf{h}_{t-1}$ . In the original paper (Bahdanau et al., 2014), the attention energy is computed with a Multi-Layer Perceptron (MLP) as follows:

Content-based attention energy is computed by scoring each element separately, which makes it hard to discriminate the elements with similar content from different locations. Location sensitive attention (Chorowski et al., 2015) broke through this limitation by generating attentions in an autoregressive fashion. However, unrolling through another RNN during backpropagation can greatly increase the computational time. Vaswani et al. (Vaswani et al., 2017) presented multi-head self-attention to represent the relevant context of each word in an input sequence at different locations. By combining sequence modeling and attention mechanism, the state-of-the-art results have been achieved for neural machine translation (Bahdanau et al., 2014; Wu et al., 2016; Johnson et al., 2016; Lample et al., 2017).

1.4. Memory-augmented neural networks

Attention is closely related to external memory. Together, they have become important building blocks for a neural network. External memories are used as internal states, which can be updated by attention mechanism for selective reading and updating. The classic Memory Network (MemNN) (Weston et al., 2014) tried to mimic a Random Access Memory (RAM) and use soft attention as a differentiable version of addressing.

A memory network usually takes the following inputs:

A query $q$ is the last utterance the speaker said in a general dialogue setting, or the question in a QA setting.

A memory vector $\mathbf{m}$ is the dialogue history of the model. The knowledge can be as large as the whole codebase or documentation given the model is sufficiently powerful.

And this type of network has the following modules to handle the inputs:

An encoder converts $q$ into a vector using an RNN (Cho et al., 2014) or a simpler word embedding (Mikolov et al., 2013a).

A memory module $M$ finds the best part of $\mathbf{m}$ related to $q$ . This is the addressing stage.

A controller module $C$ sends $q$ to $M$ and reads back the relevant memory, adds that to the current state. In practice, we always cycle this process to enable complicated reasoning.

A decoder generates output from the final states.

The training of MemNN is fully supervised, in which the label of the best part of memory is given at every stage of the memory addressing. Its follow-up End-to-End Memory Network (Sukhbaatar et al., 2015) uses soft attention for memory addressing to train the attention in backpropagation and relax the supervision to only the output. It is noted that using one memory unit to match both the query and the state limits the expressiveness. By splitting the memory into a key-value pair, Millor et al. (Miller et al., 2016) encoded prior knowledge and obtained better results. Recurrent Entity Network (Henaff et al., 2016) demonstrates how to enhance the memory unit by letting the agent learn to read and write the memory to track the facts. Weston (Weston, 2016) generalized this model to unsupervised learning, by adding a new stage to generate answers and predict replies. This kind of model is evaluated on various tasks such as basic reasoning on twenty bAbI tasks (Weston et al., 2015), reading children’s books (Hill et al., 2015), understanding real dialogues from movies (Dodge et al., 2015). All of the tasks can be found on the website of the bAbI projecthttps://research.fb.com/projects/babi/. Recently, many works have tried to make traditional memory paradigms differentiable so the models can be optimized with SGD. Such end-to-end trainable models have been used to handle algorithmic learning and reasoning tasks such as language understanding and program induction, which are covered in more detail in section 4.3.

1.5. Beam search

Searching for the best-decoded result with the highest probability is computationally intractable. In other words, there can be an exponentially large number of generated sentences in NLP or source code in Big Code. One solution would be to choose the word/token with the highest output probability after each time step during the decoding process. However, this greedy process will likely result in a sub-optimum result. Therefore, in machine translation, beam search is widely adopted as the heuristic search technique (Sutskever et al., 2014). Instead of taking the next word with the highest possibility directly, a list of previous most likely partial translations is kept and those selected words/tokens will be extended each translation at the current step and re-rank them. The length of the list at each time step is known as beam size. This method often improves the translation, but the performance is closely dependent on the beam size (Koehn and Knowles, 2017).

2. Input embeddings of deep learning models

In this section, the input representation for sequence modeling is presented. This is also important for code modeling since keyword representations can vary hugely within the scopes of functions, classes and projects. Such large vocabulary makes traditional representations such as one-hot encoding or $n$ -grams form a very sparse embedding vector. As a result, in recent deep language models, input words are often converted into real-valued vectors with distributed representations (Mikolov et al., 2013b). Words in the embedded space demonstrate nice emergent properties such as semantic relationship can be represented as vector arithmetic. Sharing the embedding weights with the softmax layer in the decoding process results in notable improvements for language models (Inan et al., 2016).

For NLP tasks, it is a common practice to use general-purpose pre-trained word vectors on large corpora such as word2vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016). The parameters can also be optimized for specific datasets and tasks together with the model. McCann et al. (McCann et al., 2017) performed the word vector training based on a machine translation task, which helped their word vectors (i.e., CoVe) to capture more complex contextual information. They showed that replacing traditional word vectors with CoVe could improve the performances of many tasks. The pre-trained word embeddings can also be fine-tuned to adapt to the Big Code task of interest. However, in the domain of source code modeling, the vocabulary size is much larger than that of NLP. As a result, re-training the word embeddings using initialization of the pre-trained parameters is also worth considering. Some treatments of source code representation are presented in more detail in Section 4.1.

In Big Code applications, sub-word level inference is also substantially important. For example, code completion algorithms should be able to suggest incomplete function or variable names by seeing only a part of the word. To incorporate character-level information, the characters can be combined into word representation. Ling et al. (Ling et al., 2015) proposed C2W model, which uses Bidirectional LSTM to construct word embedding from character sequences. CNN can also be used for character-level embeddings (Zhang et al., 2015).

3. Stable training of deep learning models

Recurrent models (e.g., RNN) for sequence modeling are hard to train (Pascanu et al., 2013b), and similar to other types of neural network, easily prone to overfitting. Like other DL architectures, RNN and its variants are usually trained using a special form of backpropagation, namely backpropagation through time. There are several techniques for optimizing the loss function of a model. Among them, Stochastic Gradient Descent (SGD) or mini-batch gradient descent is mostly adopted due to its efficient computation and parallelization on Graphics Processing Units (Goodfellow et al., 2016). Merity et al. (Merity et al., 2017a) applied a non-monotonically triggered Averaged SGD (Mandt et al., 2017) for language modeling and achieved superior results. It is noted that most of the recent papers on DL report their models trained by modern versions of SGD namely RMSProp (Tieleman and Hinton, 2012) or Adam (Kingma and Ba, 2014).

Model regularization also has a significant impact on the generalization performance. We review four common types of regularization for training DL models more effectively including: (i) dropout, (ii) normalization, (iii) activation regularization and (iv) structural regularization.

Dropout (Srivastava et al., 2014) randomly turns off several positions of the activation following a Bernoulli distribution. However, applying it to the RNN hidden states disrupts the model’s ability to preserve long-term dependencies (Zaremba et al., 2014). Two ways of retaining the information have been adopted. The first way is to limit the dropout rate of the hidden states by keeping previous information. Zoneout (Krueger et al., 2016) randomly copies previous values of the activations rather than zeroing them out. Semeniuta et al. (Semeniuta et al., 2016) applied dropout on the input gate to prevent memory loss. The second one is locked dropout, i.e., using the same dropout mask for a full forward pass. This method preserves the activation norms instead of gradually dropping information. Gal et al. (Gal and Ghahramani, 2016) linked locked dropout with variational Bayes inference and used it for embedding dropout. In addition, locked DropConnect (Wan et al., 2013) on the hidden weights resulted in substantial improvements (Merity et al., 2017a).

Normalization restricts the activations of different time steps to follow a stable distribution. Inspired by batch normalization (Ioffe and Szegedy, 2015), multiple normalization techniques customized for recurrent structures have been studied, such as recurrent batch normalization (Cooijmans et al., 2016), weight normalization (Salimans and Kingma, 2016) and layer normalization (Ba et al., 2016b). There are also normalization techniques targeting the gradient stability. For example, spectral normalization (Miyato et al., [n. d.]) was designed to keep the gradient bounded by constraining the activations to be Lipschitz continuous.

Regularization can be applied to weights and activations of DL models. $L_{2}$ regularization on the weights is referred to as weight decay. Activation regularization penalizes activations with $\alpha L_{2}(m\cdot h_{t})$ , where $\alpha$ is a regularization term, $m$ is a scaling factor and $h_{t}$ is the hidden state, respectively. Temporal activation regularization (Merity et al., 2017b) penalizes the large changes in the hidden state of a neural model with $\beta L_{2}(h_{t}-h_{t-1})$ , where $\beta$ is a scaling factor, $h_{t}$ and $h_{t-1}$ are the hidden states at time $t$ and $t-1$ , respectively.

Structural regularization prevents exploding or vanishing gradients by restricting the model structure. Model structure restriction can be done by forcing the recurrent matrix to be unitary (Arjovsky et al., 2016) or using element-wise interactions. Strongly-typed RNNs (Balduzzi and Ghifary, 2016) use type-consistent operations for the recurrent units. Other simplifications are Quasi-RNN (Bradbury et al., 2016) and Simple Recurrent Unit (Lei and Zhang, 2017).

These regularization techniques are important to reduce the overfitting and improve the generalization performance of deep source code models, since the learned models can become overly complicated due to the need to represent various types of rules.

Recent practices of building deep learning models for source code modeling and generation

This section focuses on the practices of developing Deep Learning (DL) models for source code under the encoder-decoder framework presented in section 3. Specifically, we present the techniques for the following: (i) deep encoder models, (ii) deep decoder models, and (iii) deep controller models to better generalize the capabilities of DL models to source code modeling and generation.

For many deep source code tasks, the input takes the form of a sequence, such as code snippets, comments or descriptions, where we rely on a deep module to capture the semantic and context of input for further processing. We call this kind of modules deep encoder models. The most widely used encoder models are sequential models such as RNN and its variants. For general-purpose software repository mining, the effectiveness of RNNs has been tested (White et al., 2015). However, sequential models may not be as effective for code modeling and generation due to the following limitations:

Syntactic context is not well represented in a sequential model, which may lead to a violation of the grammar rules of a programming language.

Large vocabulary of code leading to the out-of-vocabulary issue affects the generalizability of a deep code model.

Recurrent models (e.g., RNNs) suffer from the hidden-state bottleneck, in which the size of hidden-state vector limits the information a model can carry through time.

Many methods have been proposed to overcome these shortcomings. These are (i) structural (tree-/graph-based) representation, (ii) open vocabulary model and (iii) attention mechanism.

Structural representation. Abstract Syntax Tree (AST) is a natural way to capture the syntactic structure of a program. In an AST, a program is parsed into a hierarchy of non-terminal and terminal (leaf) nodes based on the syntax of a programming language. To utilize AST for code representation, the simplest way would be to use depth-first search to convert an AST into a sequence (Dam et al., 2016; Amodio et al., 2017; Li et al., 2017b). Other studies proposed DL models (Recursive Neural Networks (White et al., 2016), Tree-LSTM (Wei and Li, 2017) or CNN (Mou et al., 2016)) to work directly on the hierarchical structure of a parse tree. Recently, Zhang et al. (Zhang et al., 2019) have shown that splitting an AST into code-statement subtrees can improve the performance of tree-based representations. Recent works (i.e., code2vec (Alon et al., 2019) and code2seq (Alon et al., 2018)) have also proposed to use AST paths as a representation for code, in which the extracted paths would be aggregated using an attention-based deep neural network. Recently, Allamanis et al. (Allamanis et al., 2018b) have presented novel Gated Graph Neural Networks (Li et al., 2015) to represent source code as a directed graph. Specifically, they incorporated data/control-flow information of variables into AST to capture the syntactic and semantic structures of source code more effectively. It is observed that structural representation of source code has witnessed a growing interest from the Big Code community. There is also a comprehensive review (Chen and Monperrus, 2019) on source code embeddings.

Open vocabulary model. The vocabulary of source code is open, rather than fixed. Thus, it is impractical to train a classifier using the whole vocabulary, so it is more common to truncate it by keeping only the most frequent 1,000 or 10,000 terms and replacing the others with an Out-of-Vocabulary (OoV) token (i.e., ). The drawback of this truncation is that the OoV tokens (neologisms (Allamanis et al., 2015a)) from a testing set cannot be predicted. To solve the OoV problem, Karampatsis et al. (Karampatsis and Sutton, 2019) have proposed a novel open-vocabulary neural language model for source code modeling. This work used GRU to build a neural language model on top of sub-word (character subsequences of code tokens) units generated by the Byte pair encoding algorithm (Gage, 1994). The large-scale experiments showed that the proposed sub-word neural program model was better than the state-of-the-art n-gram model and also more robust against the OoV problem across different programming languages and projects. Character-based DL models (Karpathy, 2016; Kim et al., 2016) are also an alternative to subword for addressing the OoV problem. Recently, Cvitkovic et al. (Cvitkovic et al., 2018) extended the graph-based code representation (Allamanis et al., 2018b) to incorporate open vocabulary using a Graph-Structured Cache, in which novel words/tokens would be added as cached (Grave et al., 2017) nodes into an existing AST.

Attention mechanism. Attention mechanism can be used for addressing both the OoV issue and the hidden-state bottleneck of RNNs. Bhoopchand et al. (Bhoopchand et al., 2016) employed a pointer network to copy OoV tokens from the recent past during code completion. The pointer network is a soft attention over previous input embeddings. A controller produces a scalar to decide whether to select from the copying position or language model distribution. Li et al. (Li et al., 2017b) recently proposed a similar model but with attention over the previous hidden states. Similar to attention layers of decoder networks, the attention output is concatenated into the input. The attention used in these models is computed with a Multi-Layer Perceptron (Bahdanau et al., 2014). The drawback of this approach is that the modification of hidden-state cache and computation of attention are very computationally intensive. We have also observed that adding token copying can lead to a precision drop of token predictions comparing to no pointer network. Splitting the hidden states into separate parts for pointer network and context encoding solves this problem. Attention mechanism is also used by some non-recurrent models, e.g., Das and Shah (Das and Shah, 2015) employed a gated unit over the word embeddings for a feed-forward neural network. This model was used to represent some commonly occurring value types such as iterator variable names for contextual code completion.

2. Deep decoder models

Given the embedded features produced by an encoder model, a decoder model generates output for a target domain (e.g., code and natural language). Unlike natural languages, source code must adhere to the target syntax of a programming language. Xu et al. (Xu et al., 2017) used a sketch for SQL generation template and trained neural networks to copy certain parts to these slots with a column attention mechanism. Furthermore, the decoding template can be enriched by introducing a DSL. By training a model with direct syntactic information such as ASTs, much better results can be obtained for code generation. Dong and Lapata (Dong and Lapata, 2016) proposed the Seq2Tree model for transferring language into logical forms. A tree structure is decoded by predicting a ”nonterminal” token as a root of a subtree. Parent information is fed by incorporating attention mechanism with all previous states. This model makes it easier to keep structural information, but this simple modification only supports limited grammar rules and has no guarantee of syntactic correctness.

Yin et al. (Yin and Neubig, 2017) designed and implemented a syntax-driven neural model to transform natural language statements into corresponding Python ASTs. This model got 10+ BLEU and decent accuracy on several code generation tasks. Instead of directly generating source code $c$ , a probabilistic grammar model $g$ representing the distribution of an AST $y$ given input natural language description $x$ is defined. So that the syntax structure is automatically captured. The task is to select the best possible AST $\hat{y}=\underset{y}{\arg\max}g(y|x).$ Then given the AST, code can be inferred in a definite process.

A finite set of production rules $r\in R$ is the main component of a formal grammar specification. Using a depth-first traversal from left to right, the AST for these rules can be generated with a sequence of two types of actions $a$ on the current AST $y$ :

Now we can map the AST generation to a traditional seq2seq learning task, given an input sequence $\mathbf{x}$ , a sequence of action $\mathbf{a}$ can be generated as follows, $p(\mathbf{a}|\mathbf{x})=\prod_{t=1}^{T}p(a_{t}|\mathbf{x},a_{<t})$ , where $a_{t}$ is the action taken at time $t$ .

In fact, generating highly accurate general-purpose programs is a very challenging task. The performance of program generation relies heavily on the development of the code generative model, which needs to satisfy the following three major requirements:

Input representation of a program should be able to handle OoV tokens, which usually requires learning embedding vector from both word-level and character-level tokens. For example, it is helpful to use compositional word representation such as C2W (Ling et al., 2015) and copying mechanism (Vinyals et al., 2015) to represent newly appeared OoV names in code. Syntactic information is also important as mentioned in section 4.1. To reconstruct grammatically correct programs from generation, encoders and decoders often accept and yield AST tokens.

Addressing mechanisms (e.g., attention (Bahdanau et al., 2014) and memory (Graves et al., 2014)) are important in guiding the decoding process and also used to copy from cache and recent history. However, content-based attention (Bahdanau et al., 2014) and its extension, writable memory (Graves et al., 2014), are both not very suitable for implementing this copying mechanism. Another structure to keep recent history is associative memory (Ba et al., 2016a). The memory is kept in an associative matrix and updated at every time step. Some variants of associative memory are also incorporated with language models (Dangovski et al., 2017).

Discrete structure learning is the key to representing function logic and control flows of a program. It is important for neural program models to learn a discrete structure and make deterministic decisions based on such structure.

We further discuss the solutions and challenges in designing these components in Section 7.

3. Deep controller models

Theoretically, a neural network itself is capable of learning a program (Csáji, 2001; Siegelmann and Sontag, 1992). Therefore, to solve more complex problems, deep neural network can be used as a controller to learn the next instruction/operation to execute directly from the input-output examples without a DSL. This class of models is also referred to as neural abstract machines. Grefenstette et al. (2015) (Grefenstette et al., 2015) connected an LSTM with soft-differentiable stack-, queue- and deque-based memories to generalize the ability of RNN in machine translation. Neural Turing Machine (NTM) (Graves et al., 2014) and Differentiable Neural Computer (DNC) (Graves et al., 2016) use a recurrent model (RNN or LSTM) to read and write to an external memory matrix in a dynamic manner to simulate the execution of Turing computers. NTM uses a content-based soft attention to control reading and writing and enables access to every memory cell and gradient flow from these units. DNC defines a differentiable free list to track the usage of each memory location to address temporal orders of memory. Yang (Yang, 2016) generalized Turing machines to the continuous setting by storing memory on manifolds and controls memory addressing using Lie group actions that are differentiable. Besides NTM and DNC, there are other neural abstract machines:

Neural Programmer (Neelakantan et al., 2015)

Neural Programmer-Interpreter (Reed and De Freitas, 2015)

Neural Program Lattices (Li et al., 2017a)

Neural abstract machine is a flexible and powerful class of DL models that is capable of inferring implicit graph structure, which can be used to simulate modern program executions. Instead of constructing an explicit program representation, they learn the model weights to describe the latent procedure and then simulate the dataflow of program execution. Although neural abstract machines behave similarly to programs and they are also flexible enough to emulate program execution, there is not yet a clear way of extracting program representations directly from these models. Besides program learning, this type of model has also been used in other domains such as question answering (Graves et al., 2016; Rae et al., 2016), finding the shortest path (Graves et al., 2016), navigation in 3D simulation (Parisotto and Salakhutdinov, 2017), querying a database for entity relations (Graves et al., 2016), and one-shot recognition (Rae et al., 2016).

Deep Learning for Big Code Applications

To review Deep Learning (DL) for Big Code applications, we categorize the input and output formats of different tasks into source code analysis and program generation as shown in Figure 2. For source code analysis, the input is source code and the output can take on many forms such as natural language, code fragments/pattern or a whole program. For program generation, all tasks have the same output in the format of source code, but different inputs (e.g., code, natural language or input/output examples). The inputs of both source code analysis and program generation in our taxonomy are sequences, which can be modeled by DL models under encoder-decoder framework as introduced in sections 3 and 4. In addition, this taxonomy covers most of input and output types so any new Big Code tasks can be categorized easily. Then, suitable DL models can be selected accordingly to solve such tasks. DL for Big Code is growing very fast; thus, it is nearly impossible to include every single work for each presented application. Also, direct one-to-one performance comparison between the reviewed models of each application may not be possible due to different datasets used and/or lack of reported results in the original works. However, we still believe that this review is a good starting point for practitioners and researchers working in this emerging area.

Source code analysis tasks take source code as context and generate outputs in another format. Source code analysis utilizes the distribution learned from large code corpus and performs various kinds of predictions. Firstly, the case when the outputs are code patterns/elements is presented.

Idiom mining extracts the code segments that reappear across projects. Most common code idioms often describe important programming concepts and can be reused across projects (Allamanis and Sutton, 2013, 2014). A related task to idiom mining is predicting class/method names based on their context/bodies (Allamanis et al., 2015a), which can be generalized to source code classification. One of the first DL work on programming language processing was proposed by Mou et al. (Mou et al., 2016). This study proposed a tree-based CNN (TBCNN) with dynamic pooling learned directly from an AST of a program. The authors demonstrated that their learned feature vectors of program tokens with TBCNN could be grouped together in terms of functionality. Such representations were also demonstrated to be more effective than n-gram (Bag-of-words) methods for identifying programming tasks and detecting bubble sort patterns. Later, several studies utilizing different structural information of code (AST paths (Alon et al., 2019) or data-/control-flow information (Allamanis et al., 2018b)) achieved strong performance for these tasks as well.

Application Programming Interface (API) mining and Code clone detection witness many uses of DL models. More particularly, DeepAPI (Gu et al., 2016) was devised to learn a distributed representation using a deep RNN seq2seq model for both user queries and associated APIs. This work was found to perform better than the bag-of-word approach for API generation task. After that, many DL works modeling ASTs of source code (e.g., RtNN (White et al., 2016), Tree-LSTM (Wei and Li, 2017), ASTNN (Zhang et al., 2019)) also obtained high detection rates for code clones.

Code-convention mining finds the code practices recommended by a specific programming language but not enforced by compilers. These coding conventions (e.g., indentation, naming conventions, white space) help improve the readability and maintainability of source code. A pioneering work using machine learning in this direction was proposed by Allamanis et al. (Allamanis et al., 2014) in 2014 using $n$ -gram features combined with Support Vector Machine model. Nevertheless, after that, no other work has been reported to directly use DL for this code-convention mining. However, it is observed that recent DL models that effectively capture the usage contexts of source code (e.g., (Allamanis et al., 2018b; Cvitkovic et al., 2018)) can be investigated for this task.

Information extraction aims to identify the existence of some code snippets, program or software-related artifacts from natural language (Cerulo et al., 2013; Sharma et al., 2015), images or videos (Yadid and Yahav, 2016; Ponzanelli et al., 2016). With recent advances of DL models (e.g., CNN) in computer vision, more works (Ott et al., 2018b, a) are emerging in this direction to detect code/software elements from image and videos. Parallel to CNN, RNN has also been utilized to separate code fragments from natural language on Stack Overflow forum (Yin et al., 2018).

Program verification predicts whether there exist bugs or security issues in a program. Bug localization is closely related to program verification, except that besides code, the locations of specific types of bugs are also identified. Besides formal methods (Madsen et al., 2017) and traditional ML approaches (Chakkrit, 2016; Ray et al., 2016; Le et al., 2015; Wang et al., 2014), DL has been shown to perform quite well for this task. DeepBugs (Pradel and Sen, 2018) represented more than 150,000 JavaScript source code files using word2vec and then trained a feed-forward neural network to distinguish between buggy and non-buggy code. This approach was found to achieve an accuracy of more than 90% and have the potential to perform in real-time. Another study (Xiao et al., 2019) using a word-embedding CNN architecture for both bug reports and source code files outperformed the existing DL works (Xiao et al., 2017; Lam et al., 2017) for bug localization. Additionally, VulDeePecker (Li et al., 2018) utilized a bidirectional LSTM (bi-LSTM) model with code gadgets as input to identify multiple types of vulnerabilities in source code. There is also a comprehensive review (Ghaffarian and Shahriari, 2017) on vulnerability analysis and discovery from source code using ML and DL techniques.

The reviewed DL models for this case have either used sequential and/or structural (i.e., ASTs) code representation. Most of the DL models were better the non-DL ones, while the structural models seemed to have stronger performance than the sequential counterparts.

In other scenarios of source code analysis, the outputs can be natural language, which leads to the following tasks:

Documentation is an important artifact embedded in a software program to help annotate the requirements, operation, and uses of source code as well as make software maintenance easier. Inspired by Neural Machine Translation (Bahdanau et al., 2014), Barone et al. (Barone and Sennrich, 2017) utilized the same model to create the baseline for the task of documentation generation. Later, Hu et al. (Hu et al., 2018a) proposed an LSTM model with attention, DeepCom, to automatically generate the documentation for Java code. Recently, a novel model, code2seq (Alon et al., 2018), with an attentional decoder to select optimal compositional paths of an AST has been proposed to represent code snippets. Code2seq has achieved better performance than DeepCom for code documentation.

Summarization is a sub-task of documentation, in which the main functionality of source code or a function is briefly described. In 2015, Rush et al. (Rush et al., 2015) proposed SUM-NN (i.e., a neural attention model) for code summarization, and outperformed existing information retrieval method (i.e., minimizing cosine distance between code and the corresponding summary) and phrase-based systems (Koehn et al., 2007). However, this model tended to generate short descriptions. To overcome this issue, Iyer et al. (Iyer et al., 2016) introduced an improved neural attention model namely CODE-NN by replacing feed-forward neural network with LSTM for the decoder in a seq2seq model. Chen et al. (Chen and Zhou, 2018) presented a bimodal using two Variational AutoEncoders (i.e., one for natural language and one for source code) to support code retrieval and summarization for C# and SQL languages. To leverage the knowledge of API for summarizing source code, Hu et al. (Hu et al., 2018b) combined the representation of API sequences learned from related API summarization tasks with code sequences in a seq2seq model with attention. Like code documentation, the code2seq structural DL model (Alon et al., 2018) has recently outperformed other existing works for code summarization task.

Code review is an important step in software development since it helps identify bad coding practices that may lead to software defects. However, this step is mostly carried out manually, which is time-consuming and prone to error. To automate this process, DeepCodeReviewer (Gupta and Sundaresan, 2018) aimed to find relevant reviews for the current code snippets/program. Specifically, four separate LSTM models with word2vec embeddings were trained for different parts of source code and reviews. Then, the results were combined using a feed-forward neural network to determine the relevancy of the current review. We found that there is still not much DL work in this area, which can stimulate future efforts to improve the performance of deep code-review models.

We see many modern DL algorithms from neural language processing applied to these tasks, which demonstrates the similarity between these two fields. Some recent techniques such as structural embeddings and attention mechanisms (cf. section 4.1) have also been integrated into DL models, which enables the models to learn from massive previous knowledge and to be more expressive to capture the underlying structure of the given data. The case when both input and output are code is discussed in the next section along with other program generation tasks.

2. Program generation

Program generation tasks require the inference about the program code or code structure. We classify program generation applications into three categories based on their inputs: (i) unconditional program generation, (ii) program transduction and (iii) multimodal program generation.

In unconditional program generation, the input consists of only code corpus. The goal is to generate the next most likely tokens or draw samples similar to the current input. Code completion/suggestion is a typical task in this category.

Code completion is a useful and common feature that many code editors offer. Although most code completion tools incorporated into the Integrated Development Environments (IDEs) are mostly rule-basedhttps://www.jetbrains.com/help/idea/auto-completing-code.html, learning-based code completion has been an active field of study. Completion algorithms have been developed for different languages with high accuracy and flexibility (Gvero et al., 2013; Raychev et al., 2014b; Bielik et al., 2016a). These algorithms help enhance IDEs with better code suggestion, API calls or components based on the current context. Program generative models is an extreme case for completion where no input is given and tokens are sampled from the first one to the end as a Markov chain (Maddison and Tarlow, 2014; Nguyen et al., 2013b; Nguyen and Nguyen, 2015). Table 1 presents several representative works for neural code completion. The deep code completion models have utilized the recent practices presented in section 4.1 including (i) structural information (e.g., AST (Liu et al., 2016c; Li et al., 2017b)), (ii) open vocabulary (e.g., character-level features (Karpathy, 2016)), and (iii) attention mechanism (e.g., pointer network (Bhoopchand et al., 2016)). It is noted that these models can also be used as deep encoder models (cf. section 4.1) for other program generation tasks.

In program transduction, the goal is to convert source code into another form of code. The following applications fall into this category.

Code migration helps developers to port projects in one language to another (Aggarwal et al., 2015). It is a common case to upgrade source code for a higher version of language or framework. Automatic transducers to update APIs and code structure are very useful for development and deployment. One such API migration from Java to C# was carried out using a seq2seq model namely DeepAM (Gu et al., 2017b). A similar motivation was proposed by Nguyen et al. (Nguyen et al., 2017) to migrate APIs from Java to C#, yet still preserving the semantic information by learning the word2vec embeddings of pair-wise APIs.

Code fixing or Code repair is the next step after program verification and bug localization, in which bugs need to be fixed. In 2016, sk_p model using LSTM-based seq2seq framework was proposed to correct seven Python assignments in Massive open online courses, namely MITx, and this model achieved an average 29% of accuracy for error correcting. SynFix (Bhatia and Singh, 2016) and DeepFix (Gupta et al., 2017) with LSTM and GRUs, respectively, were trained on student assignments to fix syntax errors, which obtained roughly a complete fix rate of 30%. Another work (Santos et al., 2018) trained a combined model of LSTM and $n$ -gram on a large Java corpus from GitHub, which interestingly reported that 10-gram model slightly outperformed the LSTM counterpart for fixing syntax errors. This finding suggests that a more sophisticated (e.g., incorporating syntactic code structure) and better fine-tuned DL model can be utilized to further improve the result. There is also an extensive review on this topic (Monperrus, 2018).

Code obfuscation prevents unauthorized people from analyzing and stealing source code, and thus protects the intellectual properties. There are some off-the-shelf obfuscated code generators such as Allatorihttp://www.allatori.com/ and ProGuardhttps://www.guardsquare.com/en/products/proguard. Statistical ML language model was also used for this task (Liu, 2016), but DL models have not been much explored, except for some simple cases (Ma et al., 2014). However, this task is a good fit for encoder-decoder framework since the input is code sequence and the output is obfuscated text sequence of such input code. A different but related task, identifying obfuscated code, has witnessed more applications of DL models (Skolka et al., 2019; Wang et al., 2016). The automatically generated features of such works can give some insights into designing effective (DL) methods for code obfuscation.

Code deobfuscation is opposite to obfuscation, in which deobfuscation recovers the original version of source code from the obfuscated one. There are deobfuscation tools for JavaScript (Raychev et al., 2015; Vasilescu et al., 2017; Aebersold et al., 2016) and Android appshttp://apk-deguard.com/. A DL-based approach (Bavishi et al., 2018) namely Context2Name was proposed to recover natural variable names for minified code. It first used a deep sequential auto-encoder to extract embeddings for identifiers from their usage contexts. Such embeddings were then fed into an RNN to infer natural variable names. This approach was demonstrated to outperform the state-of-the-art Javascript deobfuscation tools like JSNicehttp://www.jsnice.org/ and JSNaughtyhttp://jsnaughty.org/. Further research on DL models for code deobfuscation can be done since DL for image deblurring, demosaicing and inpainting is being actively investigated (McPherson et al., 2016).

As mentioned in section 4.1, to get more accurate generation of programs, code completion usually takes AST node sequences as input. Attention and copying mechanism also improve the performance of code generative models. However, high-quality code transduction is hard to achieve by sentence-by-sentence transferring, since the differences in programming language properties and designs may require code structure to be changed. Back-translation and generative models can be helpful in these cases, which is similar to image transferring (Zhu et al., 2017).

A more challenging category is multimodal program generation, where the input type is not restricted. Natural language (e.g., documentation and comments), GUI screenshots (Beltramelli, 2017) and speech can all be used. Related applications are listed as follows.

Code search returns the best matching snippets of existing source code based on natural language queries. A log-bilinear neural language model was used to enable retrieving source code with natural language and vice versa (Allamanis et al., 2015b). As explained in Section 2.4, log-bilinear model alone, however, is unable to capture long code dependencies, which is essential for code search, especially for long code segments. Later, a deep recurrent model, CODEnn (i.e., bi-LSTM units combined with max-pooling layer), was proposed by Gu et al. (Gu et al., 2018). This model was shown to outperform the previous state-of-the-art results of CodeHow (Lv et al., 2015) (an extended boolean model) for code search.

Program synthesis extends code completion by generating code based on many forms of information such as natural language, images, and speech. Previously, applications were limited to DSL search (Allamanis et al., 2015b). With DL, general-purpose program synthesis for various tasks can be now tackled as shown in Table 2. As mentioned in section 4.2, to generate syntactically correct programs, many studies (e.g., (Parisotto et al., 2016; Dong and Lapata, 2016; Rabinovich et al., 2017; Yin and Neubig, 2017)) have proposed to utilize AST-based instead of sequential decoder. Such decoders predicted AST nodes sequentially using RNNs (e.g., LSTM), which could be computationally expensive. Later, Sun et al. (Sun et al., 2019) proposed a grammar-based structural CNN with attention mechanisms to replace RNN in the decoder for generating code from natural language description. The structural CNN-based decoder generated a grammar rule (sequence of tokens) per step instead of token-by-token, which makes the decoding process more compact and efficient. This approach was also shown to achieve the state-of-the-art results for Python code generation using the HeartStone benchmark dataset (cf. section 6.1). Recently, Brockschmidt et al. (Brockschmidt et al., 2018) extended the graph-based code representation (Allamanis et al., 2018b) to code generation by augmenting the AST with inherited and synthesized nodes to capture the attribute grammars (Knuth, 1968) during the decoding process. The Graph2Graph model of this work generated more accurate C# code samples compared to existing methods. With CNNs, graphics programs can be inferred from their output drawings. Ellis et al. (Ellis et al., 2017) trained a hierarchical neural net to convert simple hand drawings into a DSL, which is then converted into LaTeX code with a bias-optimal search algorithm (Schmidhuber, 2004). Beltramelli et al. (Beltramelli, 2017) used an encoder-decoder architecture to generate front-end code from GUI screenshots.

Program induction seeks to fit a given pair of input/output examples to mimic program execution, in which the execution correctness is more important than the readability of source code. Because of the complexity of the problem space, targeted programs are often limited to certain forms of DSL or only some simple problems. Balog et al. (Balog et al., 2016) used a seq2seq model, namely DeepCoder, to predict the probable DSL functions required to map the given inputs to outputs, which reduced the search space and made the program induction process 10x faster than the corresponding search-based counterparts. Another class of models is differentiable interpreters, in which pre-defined grammars of a DSL are continuously parameterized with neural networks. Such parametrization allows a model to search the program for a given input-output pair more efficiently. Evans and Grefenstette (Evans and Grefenstette, 2017) made inductive logic programming differentiable by using a neural network to perform inference given generated clauses, their weights and valuation of the axions. The model is end-to-end trainable and generalizable well on small datasets. However, this method is very memory-intensive. Riedel et al. (Riedel et al., 2016) proposed a differentiable interpreter for Forth programming language to use input-output examples to create a complete program. Differentiable approaches often perform worse than the search-based methods for low-level programming languages (e.g., Assembly) (Gaunt et al., 2016). Also, differentiable interpreters are still limited to solving only simple problems (e.g., accessing, sorting or copying array elements) (Feser et al., 2016). Neural abstract machines (cf. section 4.3) are also suitable for program induction. For example, by learning directly from recursions, Cai et al. (Cai et al., 2017) extended the ability of Neural Programmer-Interpreter (Reed and De Freitas, 2015) and provided a generalization proof about the overall system behavior. There is also a recent review (Neel, 2018) on program induction.

We observe that most recent works claiming the state-of-the-art results on various Big Code generation tasks have used some variants of RNN (e.g., LSTM/bi-LSTM/GRU with attention mechanism). Also, embeddings extracted from code tokens and ASTs are common choices for the input of these DL models. However, the proposed DL models have only been investigated for a few applications. There is still lack of extensive ablation studies to test the generalizability of these models for many different Big Code tasks. Next, in section 6, various datasets are presented to facilitate the building of deep source code models.

Datasets for Big Code Applications

There is a great need of large datasets for Deep Learning (DL) in general and deep code models in particular to exhibit their power (Goodfellow et al., 2016). In this section, two types of corpora are presented for source code analysis and program generation (cf. Figure 2). This section is mainly devoted to highlight the potential, but not yet established development of datasets for Big Code tasks.

There is a growing curated list for various code analysis tasks and datasetshttp://learnbigcode.github.io/. Currently, there are many unlabeled and large-scale open-source code corpora such as SourceForgehttps://sourceforge.net/ and GitHub. Such datasets are widely used for building a probabilistic model on source code. Some large code corpora and their usefulness for various tasks including code completion and code pattern mining are presented hereafter. These datasets can be utilized for other source code analysis tasks as well.

Karpathy et al.http://karpathy.github.io/2015/05/21/rnn-effectiveness/ used the Linux Kernel repository on GitHubhttps://github.com/torvalds/linux to demonstrate the expressive ability of RNNs. Later, this dataset has also been used to evaluate code completion tasks.

Java projects on SourceForge and GitHub are utilized for various code mining tasks (Nguyen and Nguyen, 2015).

Another large Java corpus containing all forked projects crawled from GitHub with more than one billion code tokens was also used to design new code complexity metrics and enhance the performance of code suggestion tasks (Allamanis and Sutton, 2013).

Instead of working directly on source code, code can be first converted to ASTs, upon which a code language model is built. More particularly, Bielik et al. (Bielik et al., 2016a) and Raychev et al. (Raychev et al., 2016) have used 100,000+ JavaScripthttp://www.srl.inf.ethz.ch/js150.php and Pythonhttp://www.srl.inf.ethz.ch/py150 programs, respectively, to build their language models for code completion tasks.

Although the aforementioned corpora are useful for building a robust probabilistic model on source code, there are two major problems with these unlabeled datasets. Firstly, without further annotations, only unsupervised learning can be applied. Secondly, there is no official benchmark dedicated to these tasks, researchers often train their models on their own datasets, which leads to a challenge when comparing the performance between approaches.

There have been some efforts to create datasets with labels for bug identification and code clone detection tasks. For these tasks, issue tracking systems like JIRA or BugZilla and version control systems like Git provide a huge amount of information about software development activities.

Lamkanfi et al. (Lamkanfi et al., 2013) proposed a bug dataset that combined data spanning over the whole bug-triage cycle from both defect tracking systems JIRA and BugZilla into a structured format. Such dataset motivates not only bug analyses but also reproducibility and comparison of the bug detection models in Eclipse and Mozilla.

Later, Just et al. (Just et al., 2014) prepared a bug dataset namely Defects4Jhttps://github.com/rjust/defects4j, which contains both buggy and fixed source code along with their commit messages for 395 bugs in six Java projects.

A large-scale and continuously growing dataset (Tomassi et al., 2019) of failed builds and successful fixes was automatically curated from GitHub and Travis-CIhttps://travis-ci.org/.

BigCloneBench (Svajlenko et al., 2014) is a similar effort for code clone detection tasks. This datasethttps://github.com/clonebench/BigCloneBench consists of the clones detected from IJaDataset containing 25,000 open-source Java projects. BigCloneBench has been carefully verified by experts to provide a reliable evaluation benchmark for future research on detecting various types of code clones.

2. Datasets for program generation

For code completion, there have been several useful datasets mentioned in the previous section. For other program generation tasks, natural language annotation (or other forms of information) is also required as input. Existing works (Allamanis et al., 2015b; Iyer et al., 2016; Quirk et al., 2015; Liu et al., 2016a) have shown that online forums such as Stack Overflow, IFTTThttps://ifttt.com can provide an abundant amount of code and related discussions. However, since these sources are user-generated, they are noisy and unaligned with any format. Other existing datasets are carefully annotated manually, but most of them are still limited in quantity. Some examples of such datasets for program synthesis are reviewed here.

Project Eulerhttps://projecteuler.net/ is an open platform containing over 600 programming and mathematical problems. New problem is constantly added every two weeks. The description, pseudo-code and some sample inputs and outputs for each problem are included. This dataset can serve as a great source for program synthesis. Although most of the problems have been solved, there is still no public repository that contains completely tested and well-documented code for those problems. Oda et al. (Oda et al., 2015) utilized this dataset to perform translation between Python and Japanese languages, but unfortunately, the detailed code was not disclosed.

Another large-scale and perpetual dataset (Brown et al., 2014), namely Blackbox, of code edits and IDE usages have been collected from BlueJ Java IDE. This dataset can be used for various program generation tasks such as code completion, code fixing and program synthesis.

Card2codehttps://github.com/deepmind/card2code is another high-quality dataset for program generation (Ling et al., 2016). The authors collected the card descriptions of two trading card games (i.e., Magic the Gathering and Hearthstone) and their corresponding open-source Python implementations to perform program generation. However, this dataset is too domain-specific and thus limits its applicability.

Later, Oda et al. (Barone and Sennrich, 2017) presented a parallel Django dataset to address the limitations of the previous datasets. This dataset includes all source code of Django web framework with line-by-line English annotation, including functions for a wide variety of real-world use cases such as I/O operations, string manipulation and exception handling.

Barone et al. (Barone and Sennrich, 2017) presented a large-scale datasethttps://github.com/EdinburghNLP/code-docstring-corpus containing two main corpora. The first corpus contains more than 150,370 triplets of Python function declarations, bodies and their corresponding docstrings for code documentation generation. The second corpus consists of than 161,630 pairs of function declarations and their bodies to facilitate code generation.

WikiSQL (Zhong et al., 2017) is another crowd-sourced datasethttps://github.com/salesforce/WikiSQL for developing natural language interfaces for relational databases from user’s queries.

Yin et al. (Yin et al., 2018) introduced a dataset using Stack Overflow posts, namely CoNaLa, which maps natural language intents of developers and the corresponding code snippets.

Besides program synthesis, program induction is another (special) form of program generation, in which the program datasets can always be generative with predefined rules and languages such as arithmetic, lists, group-theory and tree relations. It is noted that the dataset for program induction is still very limited. One of the few related works is the question datasethttps://github.com/anselmrothe/question_dataset collected by Rothe et al. (Rothe et al., 2017). Such dataset contains 605 natural language questions querying about a board game situation, from which the algorithm/instructions can be learned by a machine to recover the original input-output pairs.

Challenges and Directions of Deep Code Modeling and Generation

Most source code models are limited in their applications since they usually perform significantly worse on corpora different from the training set, especially for complex program generation problems. The models are usually constrained to solve very specific tasks, and thus are not flexible enough to generate code distribution for general purposes. Moreover, a real-world programming problem is often specified in a multimodal way, i.e., described with natural language and demonstrated with example inputs and outputs in text, graphics or action sequences. In this case, a model has to solve program synthesis and induction problems simultaneously. Although it is natural for a human to consider both aspects while programming, there is no existing study using examples and descriptions together to build a neural program model that generates a program non-existent in the database. Tackling a real programming challenge requires a combination of program synthesis and induction models and depends on the development of representation learning and goal-oriented dialogue. However, a proper architecture for this task has yet to come. A model that includes such complex logic and power is hard to create or train because of the following challenges:

Collecting labeled data for this complex task requires massive annotation work. Therefore, the data must be used effectively during the training process.

Deep Learning (DL) models with differentiable variables are not very good at representing discrete features, which is crucial to learn the programming rules and control flows.

A proper evaluation metric that can incorporate both semantic meaning, grammatical and execution correctness is important. However, there is little work in that direction.

Deep source code models tend to be overly complex since a large proportion of parameters are dedicated to solving problems in which real programmers have no difficulty. For example, extensive computation has to be done in memory networks to learn how to copy variable names from previous occurrences. Therefore, we should learn from human experience to devise ways to simplify such procedures.

Real-world programs have a wide range of functionalities, which is difficult to cover with sets of annotated examples. Program context also changes rapidly, e.g., the meaning of local variables changes between function scopes. Thus, we have to find ways to make the best use of data, which in turn asks for a good model and training method that can learn fast and generalize well.

To learn translation without access to a parallel corpus, Lample et al. (Lample et al., 2017) tried to circumvent the lack of parallel datasets in machine translation with adversarial training. They used Denoising Auto-encoder to learn sentence embedding and adversarial training (Ganin et al., 2016) on the encoded domain to classify source and target embeddings. We have seen some early signs of using adversarial learning for program repair (Harer et al., 2018). However, extraction of generic representations from unsupervised learning is still not the dominant approach (Oord et al., 2017b).

Reinforcement Learning (RL) is also helpful when the labeled data is scarce. It has been adopted in image captioning to generate more natural photo descriptions. An interesting crossover field for RL and programming is natural language interfaces for system control (Laengle et al., 1995; Perzanowski et al., 2001). Automatic analysis of multimodal instructions has the potential to change the way how we develop and use software. Instead of using a rule-based control system, we can use ML and DL to train agents on examples or simulations. Guu et al. (Guu et al., 2017) trained an agent to parse natural instructions to generate code in a stack-based language in SCONE (Dong and Lapata, 2016). RL also enables intelligent systems to model continuous rewards or policies and interact with its environment to gain new knowledge, which enables grammar induction on even infinite search space.

The optimization target for an RL agent is very flexible. It can be integrated into the state-of-the-art conversation (Zhou et al., 2017) and captioning models (Ren et al., 2017). This learning scheme has the potential to be used in source code modeling to improve the generation quality or even guide a model to explore the context of the program. Misra et al. (Misra et al., 2017) designed a learning model by asking the system for image understanding. By swapping the question selection module with a test selection, an automatic testing framework can be built. A few investigations of RL with DL models have also been conducted for code summarization (Wan et al., 2018) and code fixing (Gupta et al., 2018).

1.2. Weakly and semi-supervised training

The problem of utilizing partially labeled data is called semi-supervised learning. A very closely related concept is weakly supervised learning, which originally means using fewer training samples by self-training. Recently, this term often refers to the cases of using noisy or partially labeled data, for example, enhancing semantic segmentation with only image-level labeled samples (Shen et al., 2017). A widely adopted semi-supervised learning architecture is student-teacher framework (Rasmus et al., 2015). In the state-of-the-art method, Tarvainen et al. (Tarvainen and Valpola, 2017) constructed a stronger teacher model by taking the exponential moving average of the students. Such models can also be customized for program generation tasks when the labeled dataset is limited.

1.3. Active learning

Another way of dealing with a large quantity of unlabeled data is active learning, which aims to design an oracle that lets the agent select which samples should be labeled (Settles, 2010). Konyushkova and Raphael (Konyushkova et al., 2017) demonstrated that their Learning Active Learning algorithm is capable of solving real-world problems in a wide range of domains. Active learning problem can also be considered as a special case of multi-armed bandit (Ganti and Gray, 2013). The restriction is that the labeling process totally cannot be abandoned completely. These active learning schemes can be applied in Big Code data collections.

2. Discrete and symbolic representations

Different from conventional DL models for sequence modeling, a program generation model should pay more attention to representing control flows and discrete rules.

Many Big Code tasks require a rapid understanding of source code context and generalization to other source code projects given the limited training data. Attempts to extract structural information from programs date back to grammatical inference in the 90s. Grammatical inference is the process of inducing, learning or inferring grammars (De la Higuera, 2010), which generates Finite-State Automata (FSA) of various types. In this field, researchers try to associate complexity theory, formal logic and discrete structures with learning. This field has been connected with many scientific disciplines, including bioinformatics, computational linguistics and lossless compression. Different from the statistical construction of language models (Goodman, 2001; Bengio et al., 2003), where the probability distribution of transition states of a Markov model (Gagniuc, 2017) is learned, the goal of grammar inference is to determine an explicit representation of a programming language.

However, these methods have limited expressive ability and difficulty handling real-world continuous space. RNNs are powerful enough to learn very complex cases. There are many FSA extraction algorithms from RNN, called RNN Rule Extraction (RNN-RE) based on searching and sampling (Jacobsson, 2005). Sodsong et al. (Sodsong et al., 2017) trained an RNN with program paths from OpenJDK to detect caller-sensitive method vulnerabilities (Cifuentes et al., 2015) and showed that grammar could be learned with CrySSMEx (Jacobsson, 2006). But the generation process relied on very aggressive quantization and the trained complex RNN structure was hard to understand. RNN-RE has also been criticized to be ”Fool’s gold” (Kolen, 1994). Recent deep code models (Allamanis et al., 2018b; Cvitkovic et al., 2018) have captured control-/data-flow information of code using graph-based neural networks; however, their usage contexts are still pre-defined and limited to statically typed programming languages.

Deep generative models such as Generative Adversarial Nets (Goodfellow et al., 2014), Variational AutoEncoders (VAEs) (Kingma and Welling, 2013) and Autoregressive models (Oord et al., 2016) have been widely used to learn this kind of representations. Recent advances in generative modeling of images (van den Oord et al., 2016; Goodfellow et al., 2014; Gregor et al., 2016; Kingma et al., 2016), audio (Oord et al., 2016; Mehri et al., 2016) and videos (Kalchbrenner et al., 2016b; Finn et al., 2016) have yielded impressive samples and applications (Ledig et al., 2016; Isola et al., 2016; Karras et al., 2017). Such generative models can be utilized to address the representation learning issues in program generation. Deep generative models can be used to learn the probabilistic distributions of unlabeled code corpus. Vector representations can then be extracted from the distributions of these models.

2.2. Discrete representation and addressing

For language and code, more concise and discrete representations are required to conduct complex reasoning or predictions. Discrete learning can also simulate computer operations and improve model interpretability (Oord et al., 2017b). Recent advances in discrete representation techniques are presented in this section for future adaptation to program modeling and generation.

Sometimes the representation has a desired explicit symbolic form. In this case, we can define the target of representation learning as predicting attributes to fill in the predefined structure. For example, Yang et al. (Yang et al., 2017) extracted knowledge tuples from question answering datasets with a model similar to Neural Symbolic Machine (Liang et al., 2016). However, it is more common that the structure is complex and unknown, where the algorithm has to infer the structure by itself.

We have seen some exciting signs of progress in discrete representation in other domains. For example, learning discrete hidden latent vectors in generative models enables interesting applications such as speaker transferring and video prediction. Oord et al. (Oord et al., 2017b) proposed VQ-VAE that used vector quantization to prevent posterior collapse and model discrete representation. A strong autoregressive model is used as the decoder for unsupervised speaker conversion tasks. VQ-VAE gives a great example of how to represent discrete rules with only continuous models and some non-linearity. Also, this model is trained self-supervised and has many potential applications in source code modeling and generation.

Besides representation, deterministic programming logic can be simulated by discrete addressing in memory-augmented networks. Xu et al. 2015 (Xu et al., 2015) proposed stochastic hard attention for image captioning by sampling just a column. During training, one would sample a set of sequences and compute gradient with an estimation similar to REINFORCE (Williams, 1992) used for policy-gradient RL. This technique is also used for other memory-augmented models. Zaremba et al. (Zaremba and Sutskever, 2015) modified Neural Turing Machine to use discrete addressing and trained the model with REINFORCE. Bornschein et al. (Bornschein et al., 2017) proposed a memory-augmented generative model that used a variational approximation for discrete addressing. They combined discrete memory addressing with continuous latent variables for generative few-shot learning. Discrete addressing has the advantage of definite logic representation; thus, it can model the logic of source code better.

2.3. Training with discrete variables

Training a discrete latent variable model turns out to be not easy. The gradient estimation based on random sampling such as REINFORCE always introduces high variance. In general, many approaches rely on control variates to reduce the variance.

There are several gradient estimators for discrete latent variable models such as IWAE (Burda et al., 2015), NVIL (Mnih and Gregor, 2014), RWS (Bornschein and Bengio, 2014). Tucker et al. (Tucker et al., 2017) proposed a low-variance, unbiased gradient estimates for discrete variables using control variates and Gumbel-Softmax (Maddison et al., 2016; Jang et al., 2016).

3. Real-world application and evaluation

One major limitation of previous neural code completion models is that they did not provide a complete solution for real-world problems. More details are covered hereafter.

The model is trained on a fixed vocabulary. Thus, to generate new OoV values, the model has to be retrained. In addition, word-level code completion can only predict the next token after the typing of a complete word.

Incomplete code leads to ambiguous parsing results for languages with complex grammars rather than LL(1). Thus, the current state cannot be mapped to a definite training input and this can make the prediction inaccurate.

More practical and engineering-friendly frameworks for rapid new concept learning and code generation are required. The following techniques can be adopted to deal with these limitations:

Open-vocabulary learning presented in section 4.1 can address the OoV issue and partial-word code completion. There is a large vocabulary in code and much less regularity compared to natural language. However, code tokens still share similarities on the character and subword levels since variable/function names can contain the same parts of existing words in different order.

PCFG is a natural fit for ambiguous parsing, and it could be integrated into a system. The evaluation process should be changed accordingly since the probabilistic parsing also introduces errors. To better account for such error in training, one can connect PCFG parameters to the learning objective and train the model end-to-end.

Furthermore, the evaluation methods being used for the code generation tasks are not perfect. AST node prediction accuracy is not a proper evaluation metric. Although syntactic information is better represented by AST node sequences, this is not the natural order of typing and the precision does not directly reflect the productivity gain of a tool. BLEU (Papineni et al., 2002) scores are very sensitive to tokenization. To deal with this problem, there is a standard evaluation code for comparable BLEU scoreshttps://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu. In addition, ROUGE (Lin, 2004) and BLEU often do not capture linguistic fluency and coherence (Liu et al., 2016b). This problem gets even worse when evaluating code generations. Bielik et al. (Bielik et al., 2016a) used precision and log probability for their probabilistic models. Practically, under a code completion setting, developers would find a tool useful if some of the predicted top- $k$ items hit (Nguyen and Nguyen, 2015).

A code-in-code-out system for generation and evaluation is more suitable for this situation. Even predicting AST sequences, code completion algorithms should include a converter between AST sequence and incomplete code. The evaluations then can be carried out by measuring the code-level accuracy. Ideally, execution correctness should be used for evaluating code generation, but it can be hard to quantify in most real cases. And the resulting error cannot be directly passed through gradient to the model parameters. Static code analysis and RL techniques can also be adopted.

4. Human-like programming

Current neural program models are still quite distant from mimicking real programmers. More specifically, developers do not usually implement everything from scratch, but they try to adapt existing code to suit their needs. One way is to divide complicated tasks into simple ones for neural program models to speed up the training and improve performance. For example, Hashimoto et al. (Hashimoto et al., 2018) proposed to solve code generation by retrieving related examples and applying modifications. Other ways to achieve human-like programming would be to use copying mechanism (Vinyals et al., 2015) to reuse code as well as utilize both description and examples for program generation.

Training a model to attend to the right part of information could be tricky. Sometimes, attention tends to repeat itself and thus generates long and meaningless sequences. To deal with this problem, we have to add a structural constraint on the attention mechanism. In most cases, an almost diagonal attention matrix is what we want. On the other hand, complex addressing mechanisms like external memory networks do not perform well on language modeling tasks. Also, these models usually cannot be efficiently trained. The complexity of program generation can be greatly reduced if human-like behavior is considered. See et al. (See et al., 2017) proposed a technique to copy words from input sequences to generate rare words. Such copying mechanism was adopted by a recent work (Chen et al., 2018) to better handle OoV code tokens to enhance the robustness of automated program repair.

4.2. Programming by description and examples

In many real-world coding scenarios such as competitive codinghttps://code.google.com/codejam/, programmers write code by comprehending descriptions in natural language and then test their code with some input/output pairs. There are some search-based methods for program generation with both description and examples as context (Polosukhin and Skidanov, 2018). However, most models are not end-to-end trainable and they also have very limited capacity, which requires further research.

Conclusions

Artificial Intelligence (AI) in general and Deep Learning (DL) in particular have been increasingly leveraged for source code modeling. To facilitate more DL uses for practitioners and researchers in this area, our literature review first presented the limitations of existing source code models. We highlighted the potential of DL models to address these challenges and provided a more general solution for a wide range of problems. We then described important elements of encoder-decoder framework using DL models. We gave some recommendations on applying encoder-decoder framework with DL models to source code modeling and generation. Various Big Code applications (i.e., source code analysis and program generation) following encoder-decoder framework were then presented. We also identified the gaps between the state-of-the-art DL models and their applicability in source code modeling and generation. For some of these gaps, we proposed several suggestions on how to address them in future research.

We also want to provide some final thoughts about AI safety for source code modeling and generation. Recently, despite the impressive performance of DL in various domains, another trend of research is emerging to point out the vulnerability in the DL models that may result in bad consequences, namely adversarial DL. For instance, an image can be perturbed to change the output of a DL model completely (Nguyen et al., 2015; Moosavi-Dezfooli et al., 2016), but a human cannot distinguish such a subtle change. Later, the idea of adversarial DL was also adapted to models used for sequence modeling (e.g., RNN) (Papernot et al., 2016). Such finding indicates that it is totally possible to fool source code models, e.g., turning a predicted vulnerable code snippet into a benign one. Therefore, besides aiming for high performance, practitioners and researchers in this area should also be aware of the robustness of a DL source code model to input data changes.