StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

Ziyu Yao, Daniel S. Weld, Wei-Peng Chen, Huan Sun

Introduction

Online forums such as Stack Overflow (SO) (Overflow, 2017e) have contributed a huge number of code snippets, the understanding and reuse of which can greatly speed up software development. Toward this goal, much research has been conducted recently, such as retrieving or generating code snippets based on a natural language query, and annotating code snippets using natural language (Allamanis et al., 2015; Iyer et al., 2016; Zilberstein and Yahav, 2016; Yin and Neubig, 2017; Rabinovich et al., 2017; Loyola et al., 2017; Su et al., 2017). At the core of this work are machine learning models that map between natural language and programming language, which are typically data-hungry (Krizhevsky et al., 2012; Ratner et al., 2016; Goodfellow et al., 2016) and require large-scale, high-quality $<$ natural language question, code solution $>$ pairs (i.e., question-code or QC pairs).

In our work, we define a code snippet as a code solution when the questioner can solve the problem solely based on it (also called a “standalone” solution). Take Figure 1 as an example, which shows the accepted answer post to the question “Elegant Python function to convert CamelCase to snake_case”. (In SO, an accepted answer post is marked with a green check by the questioner if he/she thinks it solves the problem. Following previous work (Iyer et al., 2016; Yang et al., 2016a), although there can be multiple answer posts to a question, we only consider the accepted one because of its verified quality, and use “accepted answer post” and “answer post” interchangeably.) Among the four code snippets { $C_{1}$ , $C_{2}$ , $C_{3}$ , $C_{4}$ }, only $C_{1}$ and $C_{3}$ are standalone code solutions to the question while the rest are not, because $C_{2}$ only gives an input-output demo of the “convert” function without its definition and $C_{4}$ is a reminder of an additional detail. Given an answer post with multiple code snippets (i.e., a multi-code answer post) like Figure 1, previous work usually collected question-code pairs in heuristic ways: simply pairing the question title with the first code snippet, or with each code snippet, or with the concatenation of all code snippets in the post (Allamanis et al., 2015; Zilberstein and Yahav, 2016). Iyer et al. (2016) merely employed accepted answer posts that contain exactly one code snippet, and discarded all others with multiple code snippets. Such heuristic question-code collection methods suffer from at least one of the following weaknesses: (1) Low precision: Questions do not match their paired code snippets when the latter serve as background, explanation, or input-output demo rather than as a solution (e.g., $C_{2}$ in Figure 1); (2) Low recall: If one only selects the first code snippet to pair with a question, other code solutions in an answer post (e.g., $C_{3}$ ) will go unused.

In fact, multi-code answer posts are very common in SO, which makes the low-precision and low-recall issues even more prominent. In the Stack Exchange data dump (Stack Exchange, 2017), among all accepted answer posts for Python and SQL “how-to-do-it” questions (to be introduced in Section 2), 44.66% and 34.35% contain more than one code snippet, respectively. Note that an accepted answer post was verified only as a whole by the questioner, and labels on whether each individual code snippet serves as a standalone solution or not are not readily available. Moreover, it is not feasible to obtain such labels by simply running each code snippet in a programming environment for two reasons: (1) A runnable code snippet is not necessarily a code solution (e.g., $C_{4}$ in Figure 1); (2) It was reported that around 74% of Python and 88% of SQL code snippets in SO are not directly parsable or runnable (Iyer et al., 2016; Yang et al., 2016a). Nevertheless, many of them contain critical information to answer a question. Therefore, they can still be used in semantic analysis for downstream tasks (Allamanis et al., 2015; Iyer et al., 2016; Zilberstein and Yahav, 2016; Yang et al., 2016a) once paired with natural language questions.

To systematically mine question-code pairs with high precision and recall, we propose a novel task: Given a question in SO and its accepted answer post with multiple code snippets, how can we predict whether each code snippet is a standalone solution or not? (Following previous work (Iyer et al., 2016; Allamanis et al., 2015; Campbell and Treude, 2017), we only use the title of a question post in this work, and leave incorporating the question post content for future work.) In this paper, we focus on “how-to-do-it” questions, which ask how to implement a certain task as in Figure 1, since answers to such questions are most likely to be standalone code solutions. The definition and classification of different types of questions will be discussed in Section 2. We identify two challenges in our task: (1) As shown in Figure 1, code snippets in an answer post can play many non-solution roles such as serving as an input-output demo or reminder (e.g., $C_{2}$ and $C_{4}$ ), which calls for a statistical learning model to make accurate predictions. (2) Both the textual context and the programming content of a code snippet can be predictive, but an effective model to jointly utilize them needs careful design. Intuitively, a text block with patterns like “you can do …” and “this is one thorough solution …” is more likely to be followed by a code solution. For example, given $S_{1}$ and $S_{3}$ in Figure 1, a code solution is likely to be introduced after them. On the other hand, by inspecting the code content, $C_{2}$ is probably not a code solution to the question, since it contains special Python console patterns like “ $>>>...>>>$ ” and no particular definition of “convert”.

To tackle these challenges, we explore a series of models including traditional classifiers and deep learning models, and propose a novel model, named Bi-View Hierarchical Neural Network (BiV-HNN), to capture both the textual context and the programming content of each code snippet (which constitute the two views). In BiV-HNN, we design two different modules to learn features from text and code respectively, and combine them into a deep neural network architecture, which finally predicts whether a code snippet is a standalone solution or not. To summarize, our contributions are threefold:

First, to the best of our knowledge, we are the first to investigate systematically mining large-scale high-quality question-code pairs, which are critical for developing learning-based models aiming to map between natural language and programming language.

Second, we extensively explore various models including traditional classifiers and deep learning models to predict whether a code snippet is a solution or not, and propose a novel Bi-View Hierarchical Neural Network which considers both text- and code-based views. On two manually labeled datasets in the Python and SQL domains, BiV-HNN outperforms both the widely adopted heuristic methods and traditional classifiers by a large margin in terms of $F_{1}$ and accuracy. Moreover, BiV-HNN does not rely on any prior knowledge and can be easily applied to other programming domains.

Last but not least, we present StaQC, the largest dataset to date of $\sim$ 148K Python and $\sim$ 120K SQL question-code pairs, systematically mined by our framework. Using multiple case studies, we show that (1) StaQC is rich in surface variation: A question can be paired with multiple code solutions, and semantically the same code snippets can have different/paraphrased natural language descriptions. (2) Owing to such diversity as well as its large scale, StaQC is a much better data resource than existing ones for constructing models to map between natural language and programming language. In addition, we can continue to grow StaQC in both size and diversity, by regularly applying our framework to the fast-growing SO. Question-code pairs in other programming languages can also be mined similarly and included in StaQC.

Preliminaries

In this section, we first clarify our task definition, and then describe how we annotated datasets for model development.

Given a question and its accepted answer post which contains multiple code snippets in Stack Overflow, we aim at predicting whether each code snippet in the answer post is a standalone solution to the question or not. As explained in Section 1, we focus on “accepted” answer posts and “standalone” solutions.

Users can ask different types of questions in SO such as “how to implement X” and “what/why is Y”. Following previous work (Nasehi et al., 2012; de Souza et al., 2014; Delfim et al., 2016), we divide questions into five types: “How-to-do-it”, “Debug/corrective”, “Conceptual”, “Seeking something, e.g., advice, tutorial”, and their combinations. In particular, a question is of type “how-to-do-it” when the questioner provides a scenario and asks how to implement it like in Figure 1.

For collecting question-code pairs, we target “how-to-do-it” questions, because answers to other types of questions are not very likely to be standalone code solutions (e.g., answers to “Conceptual” questions are usually text descriptions). Next, we describe how to distinguish “how-to-do-it” questions from others.

2. “How-to-do-it” Question Collection

At a high level, we combined the other four question types apart from “how-to-do-it” into one category named “non-how-to” and built a binary question type classifier.

We first collected Python and SQL questions from SO based on their tags, which are available for all question posts. Specifically, we considered questions whose tags contain the keyword “python” to be in the Python domain and questions tagged with “sql”, “database” or “oracle” to be in the SQL domain. For each domain, we randomly sampled and labeled 250 questions for training (150), validating (20) and testing (80) the classifier. (Despite the small amount of training data, no overfitting was observed in our experiments, partly because the features are very simple.) Among the 250 questions, around 45% in Python and 57% in SQL are “how-to-do-it” questions. We built one Logistic Regression classifier for each domain, based on simple features extracted from question and answer posts as in (Delfim et al., 2016), such as keyword-occurrence features, the number of code blocks in question/answer posts, the maximum length of code blocks, etc. Hyperparameters of the classifiers were tuned on the validation sets.
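As a rough illustration of the kind of simple features described above, the following minimal sketch builds a feature dictionary for one question. The keyword list, the function name `question_type_features`, and the choice to measure code block length in lines are all assumptions made for illustration; the actual feature set follows (Delfim et al., 2016) and is not enumerated here.

```python
# Hypothetical keyword list; the actual keywords are not listed in the text.
KEYWORDS = ("how to", "how do", "error", "why", "difference")


def question_type_features(title, question_code_blocks, answer_code_blocks):
    """Build a small feature dict for one question and its accepted answer."""
    title_lower = title.lower()
    features = {}
    # Keyword-occurrence features (1 if the keyword appears in the title).
    for keyword in KEYWORDS:
        features["kw=" + keyword] = int(keyword in title_lower)
    # Number of code blocks in the question post and in the answer post.
    features["n_question_code"] = len(question_code_blocks)
    features["n_answer_code"] = len(answer_code_blocks)
    # Maximum code block length, measured in lines (an assumption).
    all_blocks = list(question_code_blocks) + list(answer_code_blocks)
    features["max_code_len"] = max(
        (len(block.splitlines()) for block in all_blocks), default=0
    )
    return features


example = question_type_features(
    "How to convert CamelCase to snake_case?", [], ["a = 1\nb = 2", "c = 3"]
)
```

Such a dictionary could then be vectorized and passed to any off-the-shelf Logistic Regression implementation.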

Finally, we obtained a question-type classification accuracy of 0.738 (precision: 0.653, recall: 0.889, and $F_{1}$ : 0.753) for Python and an accuracy of 0.713 (precision: 0.625, recall: 0.946, and $F_{1}$ : 0.753) for SQL. The classification of question types may be further improved with more advanced features and algorithms, which is not the focus of this paper.
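The reported $F_{1}$ values can be checked against the reported precision and recall using the standard harmonic-mean definition. The short sketch below does this; the function name `f1_score` is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the question-type classifiers.
python_f1 = f1_score(0.653, 0.889)
sql_f1 = f1_score(0.625, 0.946)
print(round(python_f1, 3), round(sql_f1, 3))
```

Both values round to 0.753, matching the figures above.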

2.2. “How-to-do-it” Question Set Collection

Using the above classifiers, we classified all Python and SQL questions in SO whose accepted answer post contains code blocks and collected a large set of “how-to-do-it” questions in each domain. Among these “how-to-do-it” questions, around 44.66% ( $68,839$ ) Python questions and 34.45% ( $39,752$ ) SQL questions have an accepted answer post with more than one code snippet, from which we will systematically mine question-code pairs.

3. Annotating QC Pairs for Model Training

To construct training/validation/testing datasets for our task, we hired four undergraduate students familiar with Python and SQL to annotate answer posts in these two domains. For each code snippet in an answer post, annotators assign “1” to it if they think they can solve the problem based on the code snippet alone (i.e., it is a standalone code solution), and “0” otherwise. We ensured that each code snippet was annotated by two annotators and adopted the label only when both annotators agreed on it. For each programming language, around 85% of the code snippets were labeled. The average Cohen’s kappa agreement (Cohen, 1960) is around 0.658 for Python and 0.691 for SQL. The statistics of our manually annotated datasets are summarized in Table 1, which will be used to develop our models.
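The agreement-based labeling and the kappa statistic can be sketched as follows. This is a minimal illustration using the standard definition of Cohen's kappa for two annotators; the function names and the toy labels are ours, not part of the annotated data.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


def adopt_agreed_labels(labels_a, labels_b):
    """Keep a label only when both annotators agree; None means discarded."""
    return [a if a == b else None for a, b in zip(labels_a, labels_b)]


annotator_1 = [1, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
adopted = adopt_agreed_labels(annotator_1, annotator_2)
```

In this toy example four of six snippets receive an adopted label, analogous to the roughly 85% of snippets labeled in each language.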

Bi-View Hierarchical NN

Without loss of generality, let us assume an answer post of a given question has a sequence of blocks $\{S_{1},C_{1},S_{2},...,S_{i},C_{i},S_{i+1},...,S_{L-1},C_{L-1},S_{L}\}$ with $L$ text blocks ( $S_{i}$ ’s) and $L-1$ code blocks ( $C_{i}$ ’s) interleaving with each other. Our task is to automatically assign a binary label to each code snippet $C_{i}$ , where 1 means a standalone solution and 0 otherwise. In this work, we model each code snippet independently and predict the label of $C_{i}$ based on its textual context (i.e., $S_{i}$ , $S_{i+1}$ ) and programming content. If either $S_{i}$ or $S_{i+1}$ is empty, we insert an empty dummy text block to make our model applicable. One can extend our formulation to a more complicated sequence labeling problem where a sequence of code snippets can be modeled simultaneously, which we leave for future work.
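To make the input format concrete, the sketch below extracts one (preceding text, code, following text) triple per code block from an answer post, inserting an empty dummy text block where a neighbor is missing. The `(kind, content)` tuple representation of a post and the function name are assumptions for illustration.

```python
def context_triples(blocks):
    """Return one (S_i, C_i, S_{i+1}) triple per code block.

    `blocks` is a list of (kind, content) tuples with kind "text" or "code".
    A missing neighboring text block is replaced by an empty string.
    """
    triples = []
    for idx, (kind, content) in enumerate(blocks):
        if kind != "code":
            continue
        has_prev_text = idx > 0 and blocks[idx - 1][0] == "text"
        has_next_text = idx + 1 < len(blocks) and blocks[idx + 1][0] == "text"
        prev_text = blocks[idx - 1][1] if has_prev_text else ""
        next_text = blocks[idx + 1][1] if has_next_text else ""
        triples.append((prev_text, content, next_text))
    return triples


post = [
    ("text", "This is one thorough solution:"),
    ("code", "def convert(name): ..."),
    ("text", "It works with these examples:"),
    ("code", ">>> convert('CamelCase')"),
    ("code", "x = 1"),
]
triples = context_triples(post)
```

Note that a text block lying between two code blocks serves as the following context of the first and the preceding context of the second.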

We first analyze at a high level how each individual block contributes to elaborating the entire answer fluently. For example, in Figure 1, the first text block $S_{1}$ suggests that the code block following it, $C_{1}$ (which implements a function), is “thorough” and thus might be a solution. $S_{2}$ subsequently connects $C_{1}$ to examples it can work with in $C_{2}$ . In contrast, $S_{3}$ starts with the conjunction word “Or” and possibly will introduce an alternative solution (e.g., $C_{3}$ ). This observation inspires us to first model the meaning of each block separately using a token-level sequence encoder, then model the block sequence $S_{i}$ - $C_{i}$ - $S_{i+1}$ using a block-level encoder, from which we finally obtain the semantic representation of $C_{i}$ .

Figure 2 shows our model, named Bi-View Hierarchical Neural Network (BiV-HNN). It progressively learns the semantic representation of a code block from token level to block level, based on which we predict it to be a standalone solution or not. On the other hand, BiV-HNN naturally incorporates two views, i.e., textual context and code content, into the model structure. We detail each component as follows.

2. Token-level Sequence Encoder

In our work, the bidirectional GRU (i.e., Bi-GRU) contains a forward GRU reading a text block $S_{i}$ from $w_{i1}$ to $w_{iT_{i}}$ and a backward GRU which reads from $w_{iT_{i}}$ to $w_{i1}$ :

where the hidden states in both directions are initialized with zero vectors. Since the forward and backward GRU summarize the context information from different perspectives, we concatenate their last hidden states (i.e., $\overrightarrow{h}_{iT_{i}}$ , $\overleftarrow{h}_{i1}$ ) to represent the meaning of the text block $S_{i}$ :

Code block. Similarly, we employ another Bi-GRU RNN module to learn a vector representation $v_{c}$ for code block $C_{i}$ based on its code token sequence. One may directly take this code vector $v_{c}$ as the token-level representation of a code block. However, since the goal of our model is to decide whether a code snippet answers a certain question, we associate $C_{i}$ with the question title $q$ to capture their semantic correspondences in the learnt vector representation $c_{i}$ . Specifically, we first learn the question vector $v_{q}$ by applying the token-level text encoder to the word sequence in $q$ . The concatenation of $v_{q}$ and $v_{c}$ is then fed into a feedforward tanh layer (i.e., “concat feedforward” in Figure 2) for generating $c_{i}$ :

We will verify the effect of incorporating $q$ in our experiments.

Unlike modeling a code block, we do not associate a text block with question $q$ when learning its representation, because we observed no direct semantic matching between the two. For example, in Figure 1, a text block can hardly match the question by its content. However, as we discussed in Section 1, a text block with patterns like “you can do …” or “This is one thorough solution …” can imply that a code solution will be introduced after it. Therefore, we model each text block per se, without incorporating question information.

3. Block-level Sequence Encoder

Given the sequence of token-level representations $s_{i}$ - $c_{i}$ - $s_{i+1}$ , we use a bidirectional GRU-based RNN to build a block-level sequence encoder and finally obtain the code block representation:

where the encoder is initialized with zero vectors (i.e., $\overrightarrow{0}$ and $\overleftarrow{0}$ ) in both directions. We concatenate the forward state $\overrightarrow{ch_{i}}$ and the backward state $\overleftarrow{ch_{i}}$ of the code block as its semantic representation: $z_{i}=[\overrightarrow{ch_{i}},\overleftarrow{ch_{i}}]$ .

4. Code Label Prediction

The representation $z_{i}$ of code block $C_{i}$ is then used for prediction:

where $y_{i}=[y_{i0},y_{i1}]$ represents the probability of predicting $C_{i}$ to have label 0 or 1 respectively.

We define the loss function using cross entropy (Goodfellow et al., 2016), which is averaged over all the $N$ code snippets during training: $\mathcal{L}=-\frac{1}{N}\sum_{i=1}^{N}\left(p_{i0}\log y_{i0}+p_{i1}\log y_{i1}\right)$ ,

where $p_{i0}=0$ and $p_{i1}=1$ if the i-th code snippet is manually annotated as a solution; otherwise, $p_{i0}=1$ and $p_{i1}=0$ .
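The prediction and loss computation can be sketched in a few lines. We assume here that $y_{i}$ is obtained by a softmax over two scores, which is a common choice but is our assumption; the scores themselves would come from a learned layer applied to $z_{i}$ , which is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_cross_entropy(predictions, gold_labels):
    """Cross entropy averaged over N code snippets.

    predictions: list of [y_i0, y_i1] probability pairs.
    gold_labels: list of 0/1 annotations (1 means standalone solution).
    """
    total = 0.0
    for y, label in zip(predictions, gold_labels):
        # One-hot target: [p_i0, p_i1] = [0, 1] for a solution, else [1, 0].
        p = [0.0, 1.0] if label == 1 else [1.0, 0.0]
        total -= sum(p_k * math.log(y_k) for p_k, y_k in zip(p, y) if p_k > 0)
    return total / len(gold_labels)


uniform = softmax([0.0, 0.0])
loss = average_cross_entropy([uniform, uniform], [1, 0])
```

With uniform predictions the loss equals $\log 2$ per snippet, which is a useful sanity check for an untrained model.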

Traditional Classifiers with Feature Engineering

In addition to neural network based models like BiV-HNN, we also explore traditional classifiers like Logistic Regression (LR) (Cox, 1958) and Support Vector Machine (SVM) (Cortes and Vapnik, 1995) for our task. Features are manually crafted from both text- and code-based views:

Textual Context. (1) Token: The unigrams and bigrams in the context. (2) FirstToken: If a sentence starts with phrases like “try this” or “use it”, then the following code snippet is very likely to be the solution. Inspired by this idea, we discriminate the first token from others in the context. (3) Conn: Boolean features indicating whether a connective word/phrase (e.g., “alternatively”) occurs in the context. We used the common connective words and phrases from Penn Discourse Tree Bank (Prasad et al., 2008).

Code Content. (1) CodeToken: All code tokens in a code snippet. (2) CodeClass: To discriminate code snippets that function and can be considered for learning and pragmatic reuse (i.e., “working code” (Keivanloo et al., 2014)) from input-output demos, we introduce CodeClass, which is the probability of a code snippet being working code. Specifically, from all the “how-to-do-it” Python questions in SO, we first collected a total of 850 code snippets following text blocks such as “output:” and “output is:” as input-output code snippets. We further randomly selected 850 accepted answer posts containing exactly one code snippet and took their code snippets as the working code. We then extracted a set of features such as the proportion of numbers and parentheses and constructed a binary Logistic Regression classifier, which obtains 0.804 accuracy and 0.891 $F_{1}$ on a manually labeled testing set. Finally, the trained classifier outputs the probability of each Python code snippet being “working code” as the CodeClass feature. For SQL, working code can usually be detected by keywords like “SELECT” and “DELETE”, which have been included in the CodeToken feature. Thus, we did not design the CodeClass feature for it.
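The hand-crafted features above could be extracted roughly as follows. This is a minimal sketch under several assumptions: whitespace tokenization, a tiny illustrative connective list rather than the full Penn Discourse Tree Bank inventory, FirstToken taken as the first token of the whole context, and character-level proportions for the CodeClass inputs. All function and feature names are ours.

```python
# Illustrative subset only; the full list comes from the Penn Discourse Tree Bank.
CONNECTIVES = ("alternatively", "or", "however", "for example")


def text_features(context):
    """Token, FirstToken and Conn features for one textual context."""
    tokens = context.lower().split()
    feats = {}
    for tok in tokens:  # unigrams
        feats["uni=" + tok] = 1
    for left, right in zip(tokens, tokens[1:]):  # bigrams
        feats["bi=" + left + "_" + right] = 1
    if tokens:  # first token kept distinct from the other tokens
        feats["first=" + tokens[0]] = 1
    padded = " " + " ".join(tokens) + " "
    for conn in CONNECTIVES:  # Boolean connective indicators
        feats["conn=" + conn] = int(" " + conn + " " in padded)
    return feats


def code_class_features(code):
    """Proportions of digit and parenthesis characters in a snippet."""
    if not code:
        return {"digit_ratio": 0.0, "paren_ratio": 0.0}
    digits = sum(ch.isdigit() for ch in code)
    parens = sum(ch in "()" for ch in code)
    return {"digit_ratio": digits / len(code), "paren_ratio": parens / len(code)}


text_feats = text_features("Or try this")
code_feats = code_class_features("f(1)")
```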

There could be other features to incorporate into traditional classifiers. However, coming up with useful features is anything but an easy task. In contrast, neural network models can automatically learn advanced features from raw data and have been broadly and successfully applied in different areas (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; Mikolov et al., 2013; Szegedy et al., 2013; Cho et al., 2014). Therefore, in our work, we choose to design the neural network based model BiV-HNN. We will compare different models in experiments.

Experiments

In this section, we conduct extensive experiments to compare various models and show the advantages of our proposed BiV-HNN.

Dataset Summarization. Section 2 discussed how we manually annotated question-code pairs for training, validation and testing. Statistics were summarized in Table 1. To evaluate different models, we adopt precision, recall, $F_{1}$ , and accuracy, which are defined in the same way as in a typical binary classification setting.

Data Preprocessing. We tokenized Python code snippets on a best-effort basis: We first applied Python’s built-in tokenizer, and for code lines that remained untokenized after that, we adopted the “wordpunct_tokenizer” in the NLTK toolkit (Loper and Bird, 2002) to separate tokens and symbols (e.g., “.” and “ $=$ ”). In addition, we detected variables, numbers and strings in a code snippet by traversing its Abstract Syntax Tree (AST) parsed with Python’s built-in AST parser, and replaced them with special tokens “VAR”, “NUMBER” and “STRING” respectively, to alleviate data sparsity. For SQL, we followed (Iyer et al., 2016) to perform the tokenization, which replaced table/column names with placeholder tokens and numbered them to preserve their dependencies. Finally, we collected 4,557 (3,191) word tokens and 6,581 (1,200) code tokens from the Python (SQL) training set.
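A simplified version of this best-effort tokenization is sketched below using only the Python standard library. It differs from the procedure described above in three labeled ways: the fallback applies to the whole snippet rather than to individual lines, a regular expression with the same word/punctuation pattern stands in for the NLTK tokenizer, and variable replacement with “VAR” (which requires AST traversal) is omitted.

```python
import io
import re
import tokenize

# Word/punctuation pattern used as a stand-in for the NLTK fallback tokenizer.
FALLBACK = re.compile(r"\w+|[^\w\s]+")


def tokenize_python(code):
    """Tokenize a snippet, replacing numbers and strings with placeholders."""
    tokens = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type == tokenize.NUMBER:
                tokens.append("NUMBER")
            elif tok.type == tokenize.STRING:
                tokens.append("STRING")
            elif tok.type in (tokenize.NAME, tokenize.OP):
                tokens.append(tok.string)
            # Layout tokens (NEWLINE, INDENT, COMMENT, ...) are skipped.
    except (tokenize.TokenError, SyntaxError):
        # Snippet is not tokenizable as Python: fall back to the regex.
        tokens = FALLBACK.findall(code)
    return tokens


tokens = tokenize_python("x = foo(1, 'a')")
broken = tokenize_python("foo(1,")
```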

Implementation Details. We used TensorFlow (TensorFlow, 2017) to implement our BiV-HNN and its variants to be introduced in Section 5.2. The embedding size of word and code tokens was set to 150. The embedding vectors were pre-trained using GloVe (Pennington et al., 2014) on all Python or SQL posts in SO. Parameters were randomly initialized following (Glorot and Bengio, 2010). We started the learning rate at 0.001 and trained neural network models in mini-batches of size 100 with the Adam optimizer (Kingma and Ba, 2014). The size of the GRU units was chosen from {64, 128} for token-level encoders and from {128, 256} for block-level encoders. Following convention (Hermann et al., 2015; Luong et al., 2015; Iyer et al., 2016), we selected model parameters based on their performance on validation sets. The Logistic Regression and Support Vector Machine models were implemented with the Python scikit-learn library (Pedregosa et al., 2011).

2. Baselines and Variants of BiV-HNN

Baselines. We compare our proposed model with two commonly used heuristics for collecting QC pairs: (1) Select-First: Only treat the first code snippet in an answer post as a solution; (2) Select-All: Treat every code snippet in an answer post as a solution and pair each of them with the question. In addition, we compare our model with traditional classifiers like LR and SVM based on hand-crafted features (Section 4).
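The two heuristic baselines are simple enough to state directly in code. The sketch below applies them to the answer post of Figure 1, whose gold labels for $C_{1}$ to $C_{4}$ are 1, 0, 1, 0 as discussed in Section 1; the function names are ours.

```python
def select_first(num_snippets):
    """Select-First: only the first code snippet is treated as a solution."""
    return [1] + [0] * (num_snippets - 1)


def select_all(num_snippets):
    """Select-All: every code snippet is treated as a solution."""
    return [1] * num_snippets


def precision_recall_f1(gold, predicted):
    """Standard binary precision, recall and F1 for the positive class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [1, 0, 1, 0]  # C1 and C3 are standalone solutions in Figure 1
first_scores = precision_recall_f1(gold, select_first(len(gold)))
all_scores = precision_recall_f1(gold, select_all(len(gold)))
```

On this single post, Select-First reaches full precision but only half the recall, while Select-All does the opposite, mirroring the low-recall and low-precision weaknesses described in Section 1.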

Variants of BiV-HNN. First, to evaluate the effectiveness of combining two views (i.e., textual context and code content), we adapt BiV-HNN to consider only one single view: (1) Text-HNN (Figure 3(a)): In this model, we only utilize textual contexts of a code snippet. We mask all code blocks with a special token CodeBlock and represent them with a unified vector. (2) Code-HNN (Figure 3(b)): We only feed the output of the token-level code encoder (i.e., $c_{i}$ ) into the “code label prediction” layer in Section 3, and do not model textual contexts. In addition, to evaluate the effect of question $q$ when encoding a code block, we compare BiV-HNN with BiV-HNN-nq, which directly takes the code vector $v_{c}$ as the code block representation $c_{i}$ , without associating question $q$ , for further learning. These three models are all input-level variants of BiV-HNN.

Second, to evaluate the hierarchical structure in BiV-HNN, we compare it with “flat” RNN models, which model word and code tokens as a single sequence. The comparison is conducted in both text-only and bi-view settings: (1) Text-RNN (Figure 4(a)): Compared with Text-HNN, we concatenate all words in context blocks $S_{i}$ and $S_{i+1}$ as well as the unified code vector CodeBlock as a single sequence, i.e., $\{w_{i1},...,w_{i,T_{i}},\textsc{CodeBlock},w_{i+1,1},...,w_{i+1,T_{i+1}}\}$ , using Bi-GRU RNN. The concatenation of the forward and backward hidden states of CodeBlock is considered as its final semantic vector $z_{i}$ , which is then fed into the code label prediction layer. (2) BiV-RNN (Figure 4(b)): In contrast to BiV-HNN, BiV-RNN models all word and code tokens in $S_{i}$ - $C_{i}$ - $S_{i+1}$ as a single sequence, i.e., $\{w_{i1},...,w_{iT_{i}},co_{i1},...,co_{ij},...,co_{i,|C_{i}|},w_{i+1,1},...,w_{i+1,T_{i+1}}\}$ , where $co_{ij}$ denotes the $j$ -th token in code $C_{i}$ and $|C_{i}|$ is the number of code tokens in $C_{i}$ . BiV-RNN concatenates the last hidden states in two directions as the final semantic vector $z_{i}$ for prediction. We also tried directly “flattening” BiV-HNN by concatenating tokens in $S_{i}$ - $q$ - $C_{i}$ - $S_{i+1}$ , but observed worse performance, perhaps because transitioning from $S_{i}$ to question $q$ is less natural.

Finally, at the block level, instead of using an RNN, one may apply a feedforward neural network (Rumelhart et al., 1988) to the concatenated token-level output $[s_{i},c_{i},s_{i+1}]$ . Specifically, the block-level Bi-GRU in BiV-HNN can be replaced with a one-layer feedforward neural network, denoted as BiV-HFF (for fair comparison, we only use one layer since the Bi-GRU in BiV-HNN only has one hidden layer). Intuitively, modeling the three blocks as a sequence is more consistent with the way humans read a post. We will verify this intuition in experiments.

While there could be other variants of our model, the above ones are related to the most critical designs in BiV-HNN. We only show their performance due to space constraints.

3. Results

Our experimental results in Table 2 show the effectiveness of our BiV-HNN. On both datasets, BiV-HNN substantially outperforms heuristic baselines Select-First and Select-All by more than 15% in $F_{1}$ and accuracy. This demonstrates that our model can collect QC pairs with much higher quality than heuristic methods used in existing research. In addition, when compared with LR and SVM, BiV-HNN achieves $7\%\sim 9\%$ higher $F_{1}$ and accuracy on the Python dataset, and $3\%\sim 5\%$ better $F_{1}$ and accuracy on the SQL dataset. The gain on SQL data is relatively smaller, probably because interpreting SQL programs is a relatively easier task, as suggested by the observation that both simple classifiers and BiV-HNN achieve around $85\%$ $F_{1}$ .

Results in Table 3 show the effect of key components in BiV-HNN in comparison with alternatives. Due to space constraints, we do not show the accuracy of each model, which has roughly the same pattern as $F_{1}$ . We have made the following observations: (1) Single-view variants. BiV-HNN outperforms Text-HNN and Code-HNN by a large margin on both datasets, showing that both views are critical for our task. In particular, by incorporating code content information, BiV-HNN is able to improve on Text-HNN by 7% on the Python dataset and around 5% on the SQL dataset in $F_{1}$ . (2) No-query variant. On the Python dataset, the integration of the question information in BiV-HNN brings a 3% $F_{1}$ improvement over BiV-HNN-nq, which shows the effectiveness of associating the question with the code snippet for identifying code answers. For the SQL dataset, adding the question gives no obvious benefit, possibly because the code content in each SQL program already carries critical information for making a prediction (e.g., a SQL program containing the command keyword “SELECT” is very likely to be a solution to the given question, regardless of the question content). (3) “Flat”-structure variants. On both datasets, the hierarchical structure leads to $1\%\sim 2\%$ improvements over the “flat” structure in both the bi-view (BiV-HNN vs. BiV-RNN) and single-view (Text-HNN vs. Text-RNN) settings. (4) Non-sequence variant. On the Python dataset, BiV-HNN outperforms BiV-HFF by around 2%, showing that the block-level Bi-GRU is preferable to feedforward neural networks. The two models get roughly the same performance on SQL, probably because our task is easier in the SQL domain than in the Python domain, as we mentioned earlier.

In summary, our BiV-HNN is much more effective than widely-adopted heuristic baselines and traditional classifiers. The key components in BiV-HNN, such as bi-view inputs, hierarchical structure and block-level sequence encoding, are also empirically justified.

Error Analysis. There are a variety of non-solution roles that a code snippet can play, such as being only one step of a multi-step solution, an input-output example, etc. We observe that more than half of the wrong predictions were false positives (i.e., predicting a non-solution code snippet as a solution), correcting which usually requires integrating information from the entire answer post. For example, when a code snippet is the first step of a multi-step solution, BiV-HNN may mistakenly take it as a complete and standalone solution, since BiV-HNN does not simultaneously take into account follow-up code snippets and their context to make predictions. In addition, BiV-HNN may make mistakes when a correct prediction requires a close examination of the content of a question post (besides its title). Exploring these directions in the future may lead to further improved model performance on this task.

Model Combination. When experimenting with the single-view variants of BiV-HNN, i.e., Text-HNN and Code-HNN, we observed that the three models complement each other in making accurate predictions. For example, on the Python validation set, around 70% of the mistakes made by Text-HNN or Code-HNN can be corrected by considering predictions from the other two models. Although BiV-HNN is built on both text- and code-based views, 60% of its wrong predictions can be remedied by Text-HNN and Code-HNN. The same pattern was also observed on the SQL dataset.

Therefore, we further tested the effect of combining the three models via a simple heuristic: The label of a code snippet is predicted only when the three models agree on it. Using this heuristic, 69.2% of the code blocks on the annotated Python testing set are labeled with 0.916 $F_{1}$ and 0.911 accuracy. Similarly, on the SQL testing set, 78.7% of the code blocks are labeled with 0.943 $F_{1}$ and 0.926 accuracy. The combined model further improves on BiV-HNN by around $6\%$ while still being able to label a large portion of the code snippets. Thus, we apply this combined model to those SO answer posts that are not yet manually annotated to obtain large-scale QC pairs, to be discussed next.
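The agreement heuristic can be sketched as follows, with toy predictions standing in for the outputs of the three models. The function names and the use of `None` to mark unlabeled snippets are our choices for illustration.

```python
def combine_by_agreement(biv_preds, text_preds, code_preds):
    """Keep a predicted label only when all three models agree on it."""
    return [
        b if b == t == c else None
        for b, t, c in zip(biv_preds, text_preds, code_preds)
    ]


def coverage(combined):
    """Fraction of code snippets that receive a label."""
    return sum(label is not None for label in combined) / len(combined)


# Toy predictions from BiV-HNN, Text-HNN and Code-HNN for four snippets.
combined = combine_by_agreement([1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 1, 0])
labeled_fraction = coverage(combined)
```

The heuristic trades coverage for reliability: snippets on which the models disagree are left unlabeled rather than risk a wrong label.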

StaQC: A Systematically Mined Dataset of Question-Code Pairs

In this section, we present StaQC (Stack Overflow Question-Code pairs), a large-scale and diverse set of question-code pairs automatically mined using our framework. Under various case studies, we demonstrate that StaQC can greatly help tasks aiming to associate natural language with programming language.

In Section 5, we showed that a combination of BiV-HNN and its variants can reliably identify standalone code solutions with $>90\%$ $F_{1}$ and accuracy from a large portion of the testing set. Thus we applied this combined model to all unlabeled multi-code answer posts that correspond to “how-to-do-it” questions in the Python and SQL domains, and finally collected 60,083 and 41,826 question-code pairs respectively. Additionally, there are 85,294 Python and 75,637 SQL “how-to-do-it” questions whose answer post contains exactly one code snippet. For them, as in (Iyer et al., 2016), we paired the question title with the single code snippet as a question-code pair. Together with 2,169 and 2,056 manually annotated QC pairs with label “1” for each domain (Table 1), we collected a dataset of 147,546 Python and 119,519 SQL QC pairs, named StaQC. Table 4 shows its statistics.
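The dataset sizes follow from summing the three sources of QC pairs in each domain, which the short check below reproduces. The dictionary keys are our own labels for the three sources.

```python
# QC pair counts per source, as reported in the paragraph above.
sources = {
    "Python": {"multi_code_mined": 60083, "single_code": 85294, "manual": 2169},
    "SQL": {"multi_code_mined": 41826, "single_code": 75637, "manual": 2056},
}

# Total StaQC size per domain is the sum over the three sources.
totals = {domain: sum(parts.values()) for domain, parts in sources.items()}
print(totals)
```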

Note that we can continue to expand StaQC with minimal effort, since it is automatically mined by our framework and more and more posts will be created in SO as time goes by. QC pairs in other programming languages can also be mined similarly to further enrich StaQC beyond the Python and SQL domains.

2. Diversity of StaQC

Besides its large scale, StaQC also enjoys great diversity in the sense that it contains multiple textual descriptions for semantically similar code snippets and multiple code solutions to a question. For example, for the question “How to limit a number to be within a specified range? (Python)”, whose answer post contains five code snippets (Figure 5), our framework is able to correctly mine four alternative code answers. Heuristic methods may either miss some of them or mistakenly include a false solution (i.e., the 3rd code snippet). Therefore, our framework is able to obtain more alternative solutions for the same question more accurately. Moreover, Figure 6 shows two question-code pairs included in StaQC, which we easily located by comparing code solutions of relevant questions in SO (i.e., questions manually linked by SO users). Note that the two code snippets have very similar functionality but two different text descriptions.

Figures 5 and 6 show that StaQC is highly diverse and rich in surface variation. Such a dataset is beneficial for model development. Intuitively, when certain data patterns are not observed in the training phase, a model is less capable of predicting them during testing. StaQC can alleviate this issue by enabling a model to learn from alternative code solutions to the same question or from different text descriptions of similar code snippets. Next we demonstrate this benefit using an exemplar downstream task.

3. Usage Demo of StaQC on Code Retrieval

To further demonstrate the usage of StaQC, we employ it to train a deep learning model for the code retrieval task (Keivanloo et al., 2014; Allamanis et al., 2015; Iyer et al., 2016). Given a natural language description and a set of code snippet candidates, the task is to retrieve the code snippets that match the description. In particular, an effective model should rank matched code snippets as high as possible. Models are evaluated by Mean Reciprocal Rank (MRR) (Voorhees et al., 1999). In (Iyer et al., 2016), the authors proposed a neural network based model, CODE-NN, which outputs a matching score between a natural language question and a code snippet. We choose CODE-NN as it is one of the state-of-the-art models for code retrieval and improved over previous work by a large margin. For training, the authors collected around 25,870 SQL QC pairs from answer posts containing exactly one code snippet (which is paired with the question title). They manually annotated two datasets, DEV and EVAL, for choosing the best model parameters and for final evaluation respectively, both containing around 100 QC pairs. The final evaluation is conducted in 20 runs. In each run, for every QC pair in DEV or EVAL, (Iyer et al., 2016) randomly selected 49 code snippets from SO as non-answer candidates, and ranked all 50 code snippets based on their scores output by CODE-NN. The averaged MRR is computed as the final result.
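The evaluation protocol described above can be sketched as follows. This is a minimal illustration and not the original evaluation script of (Iyer et al., 2016): `score_fn` is a hypothetical stand-in for a trained model such as CODE-NN, `pool` is an assumed list of candidate snippets from which distractors are drawn, and ties are broken in favor of the true answer, which is an assumption of this sketch.

```python
import random
from typing import Callable, List, Sequence, Tuple


def reciprocal_rank(answer_score: float, distractor_scores: Sequence[float]) -> float:
    """1 / rank of the true answer; distractors must score strictly higher to outrank it."""
    rank = 1 + sum(score > answer_score for score in distractor_scores)
    return 1.0 / rank


def evaluate_mrr(
    pairs: Sequence[Tuple[str, str]],        # (question, answer code) pairs, e.g. EVAL
    pool: Sequence[str],                     # snippets from which distractors are sampled
    score_fn: Callable[[str, str], float],   # hypothetical matching-score function
    n_distractors: int = 49,
    n_runs: int = 20,
    seed: int = 0,
) -> float:
    """Average MRR over several runs, each with freshly sampled distractors."""
    rng = random.Random(seed)
    run_mrrs: List[float] = []
    for _ in range(n_runs):
        rr_values: List[float] = []
        for question, answer in pairs:
            # Sample non-answer candidates for this question.
            candidates = [code for code in pool if code != answer]
            distractors = rng.sample(candidates, n_distractors)
            rr_values.append(
                reciprocal_rank(
                    score_fn(question, answer),
                    [score_fn(question, d) for d in distractors],
                )
            )
        run_mrrs.append(sum(rr_values) / len(rr_values))
    return sum(run_mrrs) / len(run_mrrs)
```

With 49 distractors per question, a model that scores candidates at random would place the answer uniformly among 50 positions, so MRR values should be read against that baseline.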

Improved Retrieval Performance. We first trained CODE-NN using the original training set in (Iyer et al., 2016). We denote this setting as CODE-NN (original). Then we used StaQC to upgrade the training data in the two most straightforward ways: (1) We directly took all the 119,519 SQL QC pairs in StaQC to train CODE-NN, denoted as CODE-NN (StaQC). (2) To emphasize the effect of our framework, we added only the 41,826 QC pairs automatically mined from SO multi-code answer posts to the original training set and retrained the model, which is denoted as CODE-NN (original + StaQC-multi). In both (1) and (2), questions and code snippets occurring in the DEV/EVAL set were removed from training.
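Removing DEV/EVAL questions and code snippets from the training data can be sketched as a simple filter. The matching criterion here (exact match after lowercasing and collapsing whitespace) is an assumption for illustration; the text above does not specify how overlap was detected, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

QCPair = Tuple[str, str]  # (question title, code snippet)


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple criterion)."""
    return " ".join(text.lower().split())


def remove_heldout_overlap(train: Sequence[QCPair], heldout: Sequence[QCPair]) -> List[QCPair]:
    """Drop training pairs whose question OR code also appears in the held-out pairs."""
    heldout_questions = {_normalize(q) for q, _ in heldout}
    heldout_code = {_normalize(c) for _, c in heldout}
    return [
        (q, c)
        for q, c in train
        if _normalize(q) not in heldout_questions and _normalize(c) not in heldout_code
    ]
```

Filtering on either side of the pair is the stricter choice: it prevents the model from seeing a held-out question with a different snippet, or a held-out snippet under a different title.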

In all three settings, we used the same DEV/EVAL set and the same hyper-parameters as in (Iyer et al., 2016) except for the dropout rate, which was chosen from {0.5, 0.7} for each model to obtain better performance. Like (Iyer et al., 2016), we decayed the learning rate in each epoch and terminated the training when it fell below 0.001. The best model was selected as the one achieving the highest average MRR on the DEV set. When using this strategy, we observed better results on the EVAL set than those reported in (Iyer et al., 2016) (around 0.44).

Table 5 shows the average MRR score and standard deviation of each model on the EVAL set. We can see that directly using StaQC for training leads to a substantial 6% improvement over using the original dataset in (Iyer et al., 2016). By adding the QC pairs we mined from multi-code posts to the original training data, CODE-NN can be significantly improved by 3%. Note that the performance gains shown here are still conservative, since we adopted the same hyper-parameters and a small evaluation set in order to see the direct impact of StaQC. With more challenging evaluation sets and systematic hyper-parameter selection, we expect models trained on StaQC to be more advantageous. StaQC can also be used to train other code retrieval models besides CODE-NN, as well as models for other related tasks like code generation or annotation.

Discussion and Future Work

Besides boosting relevant tasks using StaQC, future work includes: (1) We currently only consider whether a code snippet is a standalone solution or not. In many cases, code snippets in an answer post serve as multiple steps and should be merged to form a complete solution (Overflow, 2017d). This is a more challenging task and we leave it to the future. (2) In our experiments, we combined BiV-HNN and its two variants using a simple heuristic to achieve better performance. In the future, one can also use StaQC to retrain the three models, similar to self-training (Nigam and Ghani, 2000), or jointly train the three models in a tri-training framework (Zhou and Li, 2005). (3) One may also employ Convolutional Neural Networks (Shen et al., 2014; Krizhevsky et al., 2012; Allamanis et al., 2016), which have shown great power in representation learning, to encode text and code blocks. Moreover, we can consider encoders similar to (Nguyen and Nguyen, 2015; Mou et al., 2016) for capturing the intrinsic structure of programming language.

Related Work

Language + Code Tasks and Datasets. Tasks that map between natural language and programming language, referred to as Language + Code tasks here, such as code annotation and code retrieval/generation, have been widely studied in recent years (Giordani and Moschitti, 2009; Keivanloo et al., 2014; Oda et al., 2015; Allamanis et al., 2015; Iyer et al., 2016; Raghothaman et al., 2016; Zilberstein and Yahav, 2016; Ling et al., 2016; Vinayakarao et al., 2017). In order to train more advanced yet data-hungry models, researchers have collected data either automatically from online communities (Keivanloo et al., 2014; Allamanis et al., 2015; Iyer et al., 2016; Zilberstein and Yahav, 2016; Raghothaman et al., 2016; Ling et al., 2016; Vinayakarao et al., 2017; Barone and Sennrich, 2017) or with human intervention (Giordani and Moschitti, 2010; Oda et al., 2015). Like our work, (Allamanis et al., 2015; Iyer et al., 2016; Zilberstein and Yahav, 2016; Vinayakarao et al., 2017) utilized SO to collect data. In particular, (Allamanis et al., 2015) merged the code snippets in an answer post as the target source code and paired it with the question title. (Iyer et al., 2016) only employed accepted answer posts containing exactly one code snippet. Other interesting datasets include $\sim$ 19K $<$ English pseudo-code, Python code snippet $>$ pairs manually annotated by (Oda et al., 2015), and $\sim$ 114K pairs of Python functions and their documentation strings heuristically collected by (Barone and Sennrich, 2017) from GitHub (GitHub, 2017). Unlike their work, we systematically mine high-quality question-code pairs from SO using advanced machine learning models. Our mined dataset StaQC, the largest to date with around 148K Python and 120K SQL question-code pairs, has been shown to be a better resource. Moreover, StaQC is easily expandable in terms of both scale and programming language types.

Recurrent Neural Networks for Sequential Data. Recurrent Neural Networks have shown great success in various natural language tasks (Bahdanau et al., 2014; Cho et al., 2014; Luong et al., 2015; Hermann et al., 2015). In an RNN, terms are modeled sequentially without discrimination. Recently, in order to handle information at different levels, (Li et al., 2015; Serban et al., 2016; Tang et al., 2015; Yang et al., 2016b) stack multiple RNNs into a hierarchical structure. For example, (Yang et al., 2016b) incorporates the attention mechanism in a hierarchical RNN model to pick up important words and sentences. Their model finally aggregates all sentence vectors to learn the document representation. In comparison, we utilize the hierarchical structure to first learn the semantic meaning of each block individually, and then predict the label of a code snippet by combining two views: textual context and programming content.

Mining Stack Overflow. Stack Overflow has been the focus of the Mining Software Repositories (MSR) challenge for years (Bacchelli, 2013; Ying, 2015). A lot of work (Treude et al., 2011; Nasehi et al., 2012; de Souza et al., 2014; Duijn et al., 2015; Yang et al., 2016a; Delfim et al., 2016) has been done on exploring the categories of questions, mining source code, etc. We follow (Nasehi et al., 2012; de Souza et al., 2014; Delfim et al., 2016) to categorize SO questions into 5 classes but only focus on the “how-to-do-it” type (Section 2). (Duijn et al., 2015; Yang et al., 2016a) analyze the quality of code snippets (e.g., readability) or explore “usable” code snippets that could be parsed, compiled and run. Different from their work, we are interested in finding standalone code solutions, which are not necessarily directly parsable, compilable or runnable, but can be semantically paired with questions. To the best of our knowledge, we are the first to study the problem of systematically mining high-quality question-code pairs.

Conclusion

This paper explores systematically mining question-code pairs from Stack Overflow, in contrast to heuristically collecting them. We focus on the “how-to-do-it” questions since their answers are more likely to be code solutions. We present StaQC, the largest-to-date dataset of diversified question-code pairs in the Python and SQL domains, systematically collected by our framework. StaQC can greatly help downstream tasks aiming to associate natural language with programming language. We will release it together with our source code for future research.

Acknowledgments

This research was sponsored in part by the Army Research Office under cooperative agreements W911NF-17-1-0412, Fujitsu gift grant, DARPA contract FA8750-13-2-0019, the University of Washington WRF/Cable Professorship, Ohio Supercomputer Center (Center, 1987), and NSF Grant CNS-1513120. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.