Deep Learning based Vulnerability Detection: Are We There Yet?
Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, Baishakhi Ray
Introduction
Automated detection of security vulnerabilities is a fundamental problem in systems security. Traditional techniques are known to suffer from high false-positive/false-negative rates . For example, static analysis-based tools typically result in high false positives, i.e., detect non-vulnerable cases as vulnerable, and dynamic analysis suffers from high false negatives, i.e., cannot detect many real vulnerabilities. After prolonged effort, these tools remain unreliable, leaving significant manual overhead for developers .
Recent progress in Deep Learning (DL), especially in domains like computer vision and natural language processing, has sparked interest in using DL to detect security vulnerabilities automatically with high accuracy. According to Google scholar, 92 papers appeared in popular security and software engineering venues between 2019 and 2020 that apply learning techniques to detect different types of bugspublished in TSE, ICSE, FSE, ASE, S&P Oakland, CCS, USENIX Security, etc.. In fact, several recent studies have demonstrated very promising results achieving high accuracy (up to ) at detecting vulnerabilities .
Given such remarkable reported success of DL models at detecting vulnerabilities, it is natural to ask why they are performing so well, what kind of features these models are learning, and most importantly, whether they can be used effectively and reliably in detecting real-world vulnerabilities. Understanding such explainability and generalizability of the DL models is pertinent as it may help solve similar problems in other domains like computer vision .
For instance, the generalizability of a DL model is limited by implicit biases in the dataset, which are often introduced during the dataset generation/curation/labeling process and therefore affect both the testing and training data equally (assuming that they are drawn from the same dataset). These biases tend to allow DL models to achieve high accuracy in the test data by learning highly idiosyncratic features specific to that dataset instead of generalizable features. For example, Yudkowsky et al. described an instance where US Army found out that a neural network for detecting camouflaged tanks did not generalize well due to dataset bias even though the model achieved very high accuracy in the testing data. They found that all the photos with the camouflaged tanks in the dataset were shot in cloudy days, and the model simply learned to classify lighter and darker images instead of detecting tanks.
In this paper, we systematically measure the generalizability of four state-of-the-art Deep Learning-based Vulnerability Prediction (hereafter DLVP) techniques that have been reported to detect security vulnerabilities with high accuracy (up to ) in the existing literature. We primarily focus on the Deep Neural Network (DNN) models that take source code as input and detect vulnerabilities at function granularity. These models operate on a wide range of datasets that are either generated synthetically or adapted from real-world code to fit in simplified vulnerability prediction settings.
First, we curate a new vulnerability dataset from two large-scale popular real-world projects (Chromium and Debian) to evaluate the performance of existing techniques in the real-world vulnerability prediction setting. The code samples are annotated as vulnerable/non-vulnerable, leveraging their issue tracking systems. Since both the code and annotations come from the real-world, detecting vulnerabilities using such a dataset reflects a realistic vulnerability prediction scenario. We also use FFMPeg+Qemu dataset proposed by Zhou et al. .
To our surprise, we find that none of the existing models perform well in real-world settings. If we directly use a pre-trained model to detect the real-world vulnerabilities, the performance drops by 73%, on average. Even if we retrain these models with real-world data, their performance drops by 54% from the reported results. For example, VulDeePecker reported a precision of 86.9% in their paper. However, when we use VulDeePecker’s pre-trained model in real world datasets, its precision reduced to 11.12%, and after retraining, the precision becomes 17.68%. A thorough investigation of such poor performance reveals several problems:
Inadequate Model. The most popular models are token-based, which treat code as a sequence of tokens and do not take into account semantic dependencies that play a vital role in vulnerability predictions. Even when a graph-based model is used, it does not focus on increasing the class-separation between vulnerable and non-vulnerable categories. Thus, in realistic scenarios, they suffer from low precision and recall.
Learning Irrelevant Features. While looking at the features that the existing techniques are picking up (using state-of-the-art explanation techniques ), we find that the state-of-the-art models are essentially picking up irrelevant features that are not related to vulnerabilities and are likely artifacts of the training datasets.
Data Duplication. The training and testing data in most existing approaches contain duplicates (up to 68%); thus, artificially inflating the reported results.
Data Imbalance. Existing approaches do not alleviate the class imbalance problem of real-world vulnerability distribution as non-vulnerable code is much more frequent than the vulnerable ones.
Having established these concerns empirically, we propose a road-map that we hope will help the DL-based vulnerability prediction researchers to avoid such pitfalls in the future. To this end, we demonstrate how a more principled approach to data collection and model design, based on our empirical findings, can lead to better solutions. For data collection, we discuss how to curate real-world vulnerability prediction data incorporating both static and evolutionary (i.e., bug-fix) nature of the vulnerabilities. For model building, we show representation learning can be used on top of traditional DL methods to increase the class separation between vulnerable and non-vulnerable samples. Representation learning is a popular class of machine learning techniques that automatically discovers the input representations needed for improving classification, and thus, replaces the need for manual feature engineering. Our key insight is as follows: distinguishing features of vulnerable and benign code is complex; thus, the model must learn to represent them automatically in the feature space.
We further empirically establish that using semantic information (with graph-based models), data de-duplication, and balancing training data to address the class imbalance of vulnerable/non-vulnerable samples can significantly improve vulnerability prediction. Following these steps, we can boost precision and recall of the best performing model in the literature by up to 33.57% and 128.38% respectively over current baselines.
In summary, our contributions in this paper are:
We systematically study existing approaches in DLVP task and identify several problems with the current dataset and modeling practices.
Leveraging the empirical results, we propose a summary of best practices that can help future DLVP research and experimentally validate these suggestions.
We curated a real-world dataset from developer/user reported vulnerabilities of Chromium and Debian projects. We release our dataset in this anonymous directory https://bit.ly/3bX30ai.
We also open source all our code and data we used in this study for broader dissemination. Our code and replication data are available in https://git.io/Jf6IA.
To this end, we argue that DL-based vulnerability detection is still very much an open problem and requires a well-thought-out data collection and model design framework guided by real-word vulnerability detection settings.
Background and Challenges
DLVP methods aim to detect unknown vulnerabilities in target software by learning different vulnerability patterns from a training dataset. Most popular DLVP approaches consist of three steps: data collection, model building, and evaluation. First, data is collected for training, and an appropriate model is chosen as per design goal and resource constraints. The training data is preprocessed according to the format preferred by the chosen model. Then the model is trained to minimize a loss function. The trained model is intended to be used in the real world. To assess the effectiveness of the model performance of the model is evaluated on unseen test examples.
This section describes the theory of DL-based vulnerability prediction approaches (§2.1), existing datasets (§2.2), existing modeling techniques (§2.3), and evaluation procedure (§2.4). Therein, we discuss the challenges that potentially limit the applicability of existing DLVP techniques.
DL-based vulnerability predictors learn the vulnerable code patterns from a training data () set where code elements are labeled as vulnerable or non-vulnerable. Given a code element and corresponding vulnerable/non-vulnerable label , the goal of the model is to learn features that maximize the probability with respect to the model parameters (). Formally, training a model is learning the optimal parameter settings () such that,
2 Existing Dataset
To train a vulnerability prediction model, we need a set of annotated code with labels vulnerable or benign. The number of vulnerable code should be large enough to allow the model to learn from it. Researchers used a wide spectrum of data sources to collect data for DLVP (see Figure 1). Depending on how the code samples are collected and how they are annotated, we classify them as:
Synthetic data: The vulnerable code example and the annotations are artificially created. SATE IV Juliet dataset and SARD fall in this category. Here the examples are synthesized using known vulnerable patterns. These datasets were originally designed for evaluating traditional static and dynamic analysis based vulnerability prediction tools.
Semi-synthetic data: Here either the code or the annotation is derived artificially. For example, Draper dataset, proposed by Russell et al. , contains functions that are collected from open source repositories but are annotated using static analyzers. Examples of SARD and National Vulnerability Database (NVD ) dataset are also taken from production code; however, they are often modified in a way to demonstrate the vulnerability isolating them from their original context. Although these datasets are more complex than synthetic ones, they do not fully capture the complexities of the real-world vulnerabilities due to simplifications and isolations.
Real data: Here both the code and the corresponding vulnerability annotations are derived from real-world sources. For instance, Zhou et al. curated Devign dataset, which consists of past vulnerabilities and their fixes from four open-source projects, two of which are publicly available.
Limitations. The problems with the dataset lie in how realistic the data source is and how they are annotated (see Figure 1). A model trained on a synthetic dataset comprising of simple patterns will be limited to detecting only those simple pattern which seldom occur in real life. For instance, consider an atypical buffer overflow example in Figure 2 used by VulDeePecker and SySeVR. Albeit a good pedagogical example, real world vulnerabilities are not as simple or as isolated. Figure 3 shows another buffer overflow example from linux kernel. Though the fix is very simple, finding the vulnerability itself requires an in-depth reasoning about the semantics of different components (i.e., variables, functions etc.) of the code. A model is trained to reason about simpler examples as in Figure 2 will fail to reason about Figure 3 code. Further, any model annotated by a static analyzer inherits all the drawbacks, e.g.,, high false positive rate .
In the most realistic dataset, FFMPeg+Qemu , the ratio of vulnerable and non-vulnerable examples is approximately 45%-55%, which does not reflect the real world distribution of vulnerable code. Further, the dataset only contains function that annotates functions that went through vulnerability-fix commits as vulnerable. When a model is trained on such dataset, the model is not presented with other functions from a vulnerable functions’ context, thus will not be as effective in differentiating vulnerable functions from other non-vulnerable functions from the context.
3 Existing Modeling Approaches
Model selection depends primarily on the information that one wants to incorporate. The popular choices for DLVP are token-based or graph-based models, and the input data (code) is preprocessed accordingly .
Token-based models: In the token-based models, code is considered as a sequence of tokens. Existing token-based models used different Neural Network architectures. For instance, Li et al. proposed a Bidirectional Long Short Term Memory (BSLTM) based model, Russell et al. proposed a Convolutional Neural Network (CNN) and Radom Forest-based model and compared against Recurrent Neural Network (RNN) and CNN based baseline models for vulnerability prediction. For these relatively simple token-based models, token sequence length is an important factor to impact performance as it is difficult for the models to reason about long sequences. To address this problem, VulDeePecker and SySeVR extract code slices. The motivation behind slicing is that not every line in the code is equally important for vulnerability prediction. Therefore, instead of considering the whole code, only slices extracted from “interesting points” in code (e.g., API calls, array indexing, pointer usage, etc.) are considered for vulnerability prediction and rest are omitted.
Graph-based models: These models consider code as graphs and incorporate different syntactic and semantic dependencies. Different type of syntactic graph (Abstract Syntax Tree) and semantic graph (Control Flow graph, Data Flow graph, Program Dependency graph, Def-Use chain graph etc.) can be used for vulnerability prediction. For example, Devign leverage code property graph (CPG) proposed by Yamaguchi et al. to build their graph based vulnerability prediction model. CPG is constructed by augmenting different dependency edges (i.e., control flow, data flow, def-use, etc.) to the code’s Abstract Syntax Tree (AST) (see §4 for details).
Both graph and token-based models have to deal with vocabulary explosion problem—the number of possible identifiers (variable, function name, constants) in code can be virtually infinite, and the models have to reason about such identifiers. A common way to address this issue is to replace the tokens with abstract names . For instance, VulDeePecker replaces most of the variable and function names with symbolic names (VAR1, VAR2, FUNC1, FUNC2 etc.).
Expected input for all the models are real valued vectors commonly known as embeddings. There are several ways to embed tokens to vectors. One such way is to use an embedding layer that is jointly trained with the vulnerability prediction task . Another option is to use external word embedding tool(e.g., Word2Vec ) to create vector representation of every token. VulDeePecker and SySeVR uses Word2Vec to transform their symbolic tokens into vectors. Devign , in contrast, uses Word2Vec to transform the concrete code tokens to real vectors.
Once a model is chosen and appropriate preprocessing is done on the training dataset, the model is ready to be trained by minimizing a loss function. Most of the existing approaches optimize the model by minimizing some variation of cross-entropy loss. For instance, Russell et al. optimized their model using cross-entropy loss, Zhou et al. used regularized cross entropy loss.
Limitations. Token based models assume that tokens are linearly dependent on each other, and thus, only lexical dependencies between the tokens are present, while the semantic dependencies are lost, which often play important roles in vulnerability prediction . To incorporate some semantic information, VulDeePecker and SySeVR extracted program slices of a potentially interesting point. For example, consider the code in Figure 4. A slice w.r.t.free function call at line 10 gives us all the lines except lines 6 and 7. The token sequence of the slice are: void action ( char * data ) const { for ( data ; * data != ‘0’ ; data ++ ) { foo ( data ) ; bar ( data ) ; if ( * data == SEARCH_CHAR ) { free ( data ) ;. In this examples, while the two main components for this code being vulnerable, i.e. data ++ (line 2) and free ( data ) (line 10) are present in the token sequence, they are far apart from each other without explicitly maintaining any dependencies.
In contrast, as a graph based model can consider the data dependency edges (red edge), we see that there is a direct edge between those lines making those lines closer to each other making it easier for the model to reason about that connection. Note that this is a simple CWE example (CWE 761), which requires only the data dependency graph to reason about. Real-world vulnerabilities are much more complex and require reasoning about control flow, data flow, dominance relationship, and other kinds of dependencies between code elements . However, graph-based models, in general, are much more expensive than their token-based counterparts and do not perform well in a resource-constrained environment.
One problem with the existing approaches is that although the trained models learn to discriminate vulnerable and non-vulnerable code samples, the training paradigm does not explicitly focus on increasing the separation between the vulnerable and non-vulnerable examples. Thus, with slight variations the classifications become brittle.
Another problem pertains to data imbalance between vulnerable and benign code as the proportion of vulnerable examples in comparison to the non-vulnerable one in real world dataset is extremely low . When a model is trained on such imbalanced dataset, models tend to be biased by the non-vulnerable examples.
4 Existing Evaluation Approaches
To understand the applicability of a trained model for detecting vulnerability in the real-world, it must first be evaluated. In most cases, a trained model is evaluated on held out test set. Test examples go through the same pre-processing technique as the training and then the model predicts the vulnerability of those pre-processed test examples. This evaluation approach gives an estimate of how the model may perform when used to detect vulnerabilities in the real-world.
Limitations. Although all the existing approaches report their performances using their own evaluation dataset, it does not give a comprehensive overview of the applicability of the model in the real-world. All we can learn from such intra-dataset evaluation is how well their approach fits their own dataset. Although there are some limited case studies on such models finding vulnerabilities in real-world projects, those case studies do not shed light on the false positives and false negatives . The number of false positives and false negatives are directly correlated to the developer effort in vulnerability prediction and too much of any would hold the developer from using the model .
To address the limitations with the existing data sets, we curate a more robust and comprehensive real world dataset, ReVeal, by tracking the past vulnerabilities from two open-source projects: Linux Debian Kernel and Chromium (open source project of Chrome). We select these projects because: (i) these are two popular and well-maintained public projects with large evolutionary history, (ii) the two projects represent two important program domains (OS and browsers) that exhibit diverse security issues, and (iii) both the projects have plenty of publicly available vulnerability reports.
To curate our data, we first collect already fixed issues with publicly available patches. For Chromium, we scraped its bug repository Bugzillahttps://bugs.chromium.org/p/chromium/issues/list. For Linux Debian Kernel, we collected the issues from Debian security trackerhttps://security-tracker.debian.org/tracker/. We then identify vulnerability related issues, i.e., we choose those patches that are labeled with “security”. This identification mechanism is inspired by the security issue identification techniques proposed by Zhou et al. , where they filter out commits that do not have security related keywords.
For each patch, we extracted the corresponding vulnerable and fixed versions (i.e., old and new version) of C/C++ source and header files that are changed in the patch. We annotate the previous versions of all changed functions (i.e., the versions prior to the patch) as vulnerable and the fixed version of all the changed functions (i.e., the version after patch) as ‘clean’. Additionally, other functions that were not involved in the patch (i.e., those that remained unchanged) are all annotated as ‘clean’.
A contrived example of our data collection strategy is illustrated in Figure 5. Here, we have two versions of a file file.c. The previous version of the file (version ) has a vulnerability which is fixed in the subsequent version (version ) by patching the function ham_0() to ham_1(). In our dataset, ham_0() would be included and labeled ‘vulnerable’ and ham_1() would be included and labeled ‘clean’. The other two functions (spam() and egg) remained unchanged in the patch. Our dataset would include a copy of these two functions and label them as ‘clean’.
Annotating code in this way simulates real-world vulnerability prediction scenario, where a DL model would learn to inspect the vulnerable function in the context of all the other functions in its scope. Further, by retaining the fixed variant of the vulnerable function, the DL model may learn the nature of patch. We make available our data collection framework and the curated vulnerability data for Chromium and DebianChromium and Debian dataset: https://bit.ly/3bX30ai for broader dissemination.
In this section, we present a brief overview of the ReVeal pipeline that aims to more accurately detect the presence of real-world vulnerabilities. Figure 6 illustrates the ReVeal pipeline. It operates in two phases namely, feature extraction (Phase-I) and training (Phase-II). In the first phase we translate real-world code into a graph-embedding (§4.1). In the second phase, we train a representation learner on the extracted features to learn a representation that most ideally demarcates the vulnerable examples from non-vulnerable examples (§4.2). Algorithm 1 shows the full training procedure for ReVeal. This algorithm expects Training data – a list of tuples, where each tuple contains a code () and corresponding vulnerability annotation ().
The goal of this phase is to convert code into a compact and a uniform length feature vector while maintaining the semantic and syntactic information. Note that, the feature extraction scheme presented below represents the most commonly used series of steps for extracting features from a graph representation . ReVeal uses Algorithm 2 to extract graph embedding (graph based feature vector that represent the entirety of a function in a code).
To extract the syntax and semantics in the code, we generate a code property graph (hereafter, CPG) . The CPG is a particularly useful representation of the original code since it offers a combined and a succinct representation of the code consisting of elements from the control-flow and data-flow graph in addition to the AST and program dependency graph (or PDG). Each of the above elements offer additional context about the overall semantic structure of the code .
The current vertex embedding is not adequate since it considers each vertex in isolation. It therefore lacks information about its adjacent vertices and, as a result, the overall graph structure. This may be addressed by ensuring that each vertex embedding reflects both its information and those of its neighbors. We use gated graph neural networks (hereafter GGNN) for this purpose.
Where, is a Gated Recurrent Function, is the embedding of the current vertex , and is a transformation function that assimilates the embeddings of all of vertex ’s neighbors . is the GGNN-transformed representation of the vertex ’s original embedding . now incorporates ’s original embedding as well as the embedding of its neighbors.
The final step in preprocessing is to aggregate all the vertex embedding to create a single vector representing the whole CPG denoted by , i.e.:
Note that ReVeal uses a simple element-wise summation as the aggregation function, but in practice it is a configurable parameter in the pipeline. The result of the pipeline presented so far is an dimensional feature vector representation of the original source code. To pre-train the GGNN, we augment a classification layer on top of the GGNN feature extraction. This training mechanism is similar to Devign . Such pre-training deconstructs the task of “learning code representation”, and “learning vulnerability”, and is also used by Russell et al. . While, we pre-train GGNN in a supervised fashion, unsupervised program representation learning can also be done to learn better program presentation. However, such learning is beyond the scope of this research and we leave that for future research.
2 Training (Phase-II)
In real-world data, the number of non-vulnerable samples (i.e., negative examples) far outnumbers the vulnerable examples (i.e., positive examples) as shown in Table 2.4. If left unaddressed, this introduces an undesirable bias in the model limiting its predictive performance. Further, extracted feature vectors of the vulnerable and non-vulnerable examples exhibit a significant overlap in the feature space. This makes it difficult to demarcate the vulnerable examples from the non-vulnerable ones. Training a DL model without accounting for the overlap makes it susceptible to poor predictive performance.
To mitigate the above problems, we propose a two step approach. First, we use re-sampling to balance the ratio of vulnerable and non-vulnerable examples in the training data. Next, we train a representation learning model on the re-balanced data to learn a representation that can most optimally distinguish vulnerable and non-vulnerable examples.
In order to handle imbalance in the number of vulnerable and non-vulnerable classes, we use the “synthetic minority over-sampling technique” (for short, SMOTE) . It operates by changing the frequency of the different classes in the data. Specifically, SMOTE sub-samples the majority class (i.e., randomly deleting some examples) while super-sampling the minority class (by creating synthetic examples) until all classes have the same frequency. In the case of vulnerability prediction, the minority class is usually the vulnerable examples. SMOTE has shown to be effective in a number of domains with imbalanced datasets .
During super-sampling, SMOTE picks a vulnerable example and finds nearest vulnerable neighbors. It then builds a synthetic member of the minority class by interpolating between itself and one of its random nearest neighbors. During under-sampling, SMOTE randomly removes non-vulnerable examples from the training set. This process is repeated until a balance is reached between the vulnerable and non-vulnerable examples. We present the pseudo-code of SMOTE in Algorithm 3.
2.2 Representation Learning Model
The graph embedding of the vulnerable and non-vulnerable code samples at the end of Phase-I tend to exhibit a high degree of overlap in feature space. This effect is illustrated by the t-SNE plot of the feature space in Figure 8(a)–(d). In these examples, there are no clear distinctions between the vulnerable (denoted by ) and the non-vulnerable samples (denoted by ). This lack of separation makes it particularly difficult to train an ML model to learn the distinction between the vulnerable and the non-vulnerable samples.
To improve the predictive performance, we seek a model that can project the features from the original non-separable space into a latent space which offers a better separability between vulnerable and non-vulnerable samples. For this, we use a multi-layer perceptron (MLP) , designed to transform input feature vector () to a latent representation denoted by . The MLP consists of three groups of layers namely, the input layer (), a set of intermediate layers which are parameterized by (denoted by , and a final output layer denoted by .
To maximize the separation between the vulnerable and the non-vulnerable examples in the latent space, we adopt the triplet loss as our loss function. Triplet loss has been widely used in machine learning, specifically in representation learning, to create a maximal separation between classes . The triplet loss is comprised of three individual loss functions: (a) cross entropy loss (); (b) projection loss (); and (c) regularization loss (). It is given by:
and are two hyperparameters indicating the contribution of projection loss and regularization loss respectively. The first component of the triplet loss is to measure the cross-entropy loss to penalize miss-classifications. Cross-entropy loss increases as the predicted probability diverges from the actual label. It is given by,
Here, is the true label and represents the predicted label.The second component of the triplet loss is used the quantify how well the latent representation can separate the vulnerable and non-vulnerable examples. A latent representation is considered useful if all the vulnerable examples in the latent space are close to each other while simultaneous being farther away from all the non-vulnerable examples, i.e., examples from same class are very close (i.e., similar) to each other and examples from different class are far away from each other. Accordingly, we define a loss function which is defined by.
The final component of the triplet loss is the regularization loss () that is used to limit the magnitude of latent representation (). It has been observed that, over several iterations, the latent representation of the input tend to increase in magnitude arbitrarily . Such arbitrary increase in prevents the model from converging . Therefore, we use a regularization loss () to penalize latent representations () that are larger in magnitude. The regularization loss is given by:
With the triplet loss function, ReVeal trains the model to optimize for it parameters (i.e., ) by minimizing equation 2. The effect of using representation learning can be observed by the better separability of the vulnerable and non-vulnerable examples in Figure 8(b). Algorithm 4 shows the detailed algorithm for calculating the loss.
We use Pytorch 1.4.0 with Cuda version 10.1 to implement our method. For GGNN, we use tensorflow 1.15. We ran our experiments on single Nvidia Geforce 1080Ti GPU, Intel(R) Xeon(R) 2.60GHz 16 CPU with 252 GB ram. Neither Devign’s implementation, nor their hyperparameters ate not publicly availavle. We followed their paper and re-implemented to our best ability. For the GGNN, maximum iteration number is set to be 500. For the representation learner maximum iteration is 100. We stop the training procedure if F1-score on validation set does not increase in for 50 consecutive training iteration for GGNN and 5 for Representation Learning. Hyper-parameters for different components in ReVeal are shown in Section 5.1.
Repr-model = Representation Learning Model used in ReVeal.
2 Study Subject
Section 2.4 summarizes all the vulnerability prediction approaches and datasets studied in this paper. We evaluate the existing methods (i.e., VulDeePecker , SySeVR , Russell et al. , and Devign ) and ReVeal’s performance on two real world datasets (i.e., ReVeal dataset, and FFMPeg+Qemu). FFMPeg+Qemu was shared by Zhou et al. who also proposed the Devign model in the same work. Their implementation of Devign was not publicly available. We re-implement their method to report our results. We ensure that our results closely match their reported results in identical settings.
3 Evaluation
To understand a model’s performance, researchers and model developers need to understand the performance of a model against a known set of examples. There are two important aspect to note here, (a) the evaluation metric, and (b) the evaluation procedure.
Problem Formulation and Evaluation Metric: Most of the approaches formulate the problem as a classification problem, where given a code example, the model will provide a binary prediction indicating whether the code is vulnerable or not. This prediction formulation relies on the fact that there are sufficient number of examples (both vulnerable and non-vulnerable) to train on. In this study, we are focusing on the similar formulation. While both VulDeePecker and SySeVR formulate the problem as classification of code slices, we followed the problem formulation used by Russell et al. , and Devign , where we classify the function. This is the most suitable model working with the graph, since slices are paths in the graph.
We study approaches based on four popular evaluation metrics for classification task – Accuracy, Precision, Recall, and F1-score. Precision, also known as Positive Predictive rate, is calculated as true positive / (true positive + false positive), indicates correctness of predicted vulnerable samples. Recall, on the other hand, indicates the effectiveness of vulnerability prediction and is calculated as true positive / (true positive + false negative). F1-score is defined as the geometric mean of precision and recall and indicates balance between those.
Evaluation Procedure: Since DL models highly depend on the randomness , to remove any bias created due to the randomness, we run 30 trials of the same experiment. At every run, we randomly split the dataset into disjoint train, validation, and test sets with 80%, 10%, and 20% of the dataset respectively. We report the median performance and the inter-quartile range (IQR) of the performance. When comparing the results to baselines, we use statistical significance test and effect size test . Significance test tells us whether two series of samples differ merely by random noises. Effect sizes tells us whether two series of samples differ by more than just a trivial amount. To assert statistically sound comparisons, following previous approaches , we use a non-parametric bootstrap hypothesis test in conjunction with the A12 effect size test . We distinguish results from different experiments if both significance test and effect size test agreed that the division was statistically significant (99% confidence) and is not result of a “small” effect (A12 60%) (similar to Agrawal et al. ).
We present our empirical results as answers to the following research questions:
RQ1: How effective are existing approaches for real-world vulnerability prediction? (§6.1)
RQ2: What are the limitations of existing approaches? (§6.2)
RQ3: How to improve DLVP approaches? (§6.3)
Motivation. The goal of any DLVP approaches is to be able to predict vulnerabilities in the real-world. The datasets that the existing models are trained on contain simplistic examples that are representative of real-world vulnerabilities. Therefore, we ought to, in theory, be able to use these models to detect vulnerabilities in the real-world.
Approach. There are two possible scenarios under which these models may be used:
Scenario-A (pre-trained models): We may reuse the existing pre-trained models as it is to predict real-world vulnerabilities. To determine how they perform in such a setting, we first train the baseline models with their respective datasets as per Section 2.4. Next, we use those pre-trained models to detect vulnerabilities in the real-world (i.e., on FFMPeg+Qemu, and ReVeal dataset).
Scenario-B (re-trained models): We may rebuild the existing models first by training them on the real-world datasets, and then use those models to detect the vulnerabilities. To assess the performance of baseline approaches in this setting, we first use one portion of the FFMPeg+Qemu and ReVeal dataset to train each model. Then, we use those models to predict for vulnerabilities in the remainder of the FFMPeg+Qemu and ReVeal. We repeat the process 30 times, each time training and testing on different portions of the dataset.
Observations. Table 6(b) tabulates the performance of existing pre-trained models on predicting vulnerabilities in real-world data (i.e., Scenario-A). We observe a precipitous drop in performance when pre-trained models are used for real-world vulnerability prediction.
For example, In ReVeal dataset, VulDeePecker achieves an F1-score of only and in FFMPeg+Qemu, VulDeePecker achieves an F1-score of , while in the baseline case (see Section 5.3), the F1-score of VulDeePecker was as high as . Even the sophisticated graph-based Devign model produced an F1-score of only and precision as low as on ReVeal dataset. Similar performance drops are observed for all the other baselines. On average, we observe a 73% drop of F1-score across all the models in this setting.
For scenario-B, Table 6(c) tabulates our findings for re-trained models. Here, we also observe a significant performance drop from the baseline results. In ReVeal dataset, both Russell et al. and VulDeePecker achieve an F1-score of roughly (in contrast to their baseline performances of ). SySeVR achieved an F1-score of on ReVeal dataset. We observed similar trends in other settings, with an average F1 score drop of 54%.
Result: Existing approaches fail to generalize to real-world vulnerability prediction. If we directly use a pre-trained model to detect the real-world vulnerabilities, the f1-score drops by 73%, on average. Even if we retrain these models with real-world data, their performance drops by 54% from the reported results.
2 Key limitations of existing DLVP approaches (RQ2)
Motivation. In RQ1, we showed that existing approaches are not effective in detecting real-world vulnerabilities. In this RQ, we investigate the reasons behind their failure. We find that the baseline methods suffer from a number of problems, as listed below:
Preprocessing techniques such as slicing used by VulDeePecker and SySeVR and tokenization used by Russell et al. introduce a large number of duplicates in both the training and testing data. There are several ways duplication can be introduced by these preprocessing techniques – e.g., same slice can be extracted from different entry points, different code can have same tokens due to the abstract tokenization, etc.
Approach. We apply each preprocessing technique to its respective dataset (see §2) and also to the real-world datasets.
Observations. Table IV tabulates the number of duplicates introduced by some of the vulnerability prediction approaches. We observe that the preprocessing technique of SySeVR and VulDeePecker (i.e., slicing followed by tokenization) introduces a significant amount of () duplicate samples. Further, semi-synthetic datasets like NVD, SARD, and Juliet (comprised of much simpler code snippets) result in a large number of duplicates. In contrast, real-world datasets are much more complex and therefore have far fewer duplicates. In our case, the two real-world data contain little to no duplicates prior to preprocessing (ReVeal dataset had only 0.6%, and FFMPeg+Qemu had 0.2%). After preprocessing, although some duplicates are introduced (e.g., SySeVR’s preprocessing technique introduces 25.56% duplicates in ReVeal dataset and 22.10% duplicates in FFMPeg+Qemu), they are much lesser than baseline datasets. While duplicates created by slicing and pre-processing techniques do favor vulnerability prediction in general , it seriously undermines the capability of a DL model to extract patterns. In fact, prevalence of such duplicates in training set might lead a DL model to learn irrelevant features. Common examples between train and test sets hampers fair comparison of different DL models for vulnerability prediction task.
Ideally, a DL based model should be trained and tested on a dataset where 100% examples are unique. Duplication tends to artificially inflate the overall performance of a method , as evidenced by the discrepancy of the baseline results and results of the pre-trained models in Scenario-A of RQ1 (see Table 6(b)).
2.2 Data Imbalance
Real world data often contains significantly more non-vulnerable examples than vulnerable ones. A model trained on such skewed dataset is susceptible to being considerably biased toward the majority class.
Approach. We compute percentage on vulnerable samples w.r.t. total number of samples from different datasets used in this paper as shown in Table 2.4.
Observations. We notice that several datasets exhibit a notable imbalance in the fraction of vulnerable and non-vulnerable examples; . the percentage vulnerability is sometimes as low as . The ratio of vulnerable and non-vulnerable examples varies depending on the project and the data collection strategy employed. Existing methods fail to adequately address the data imbalance during training. This causes two problems: (1) When pre-trained models are used (i.e., Scenario-A in RQ1) to predict vulnerabilities in the real world, the ratios of vulnerable and non-vulnerable examples differ significantly in training and testing datasets. This explains why pretrained models perform poorly (as seen in Table 6(b)). (2) When the models are re-trained, they tend to be biased towards the class with the most examples (i.e., the majority class). This results in poor recall values (i.e., they miss a lot of true vulnerabilities) and hence, also the F1-score (as seen in Table 6(c)).
2.3 Learning Irrelevant Features
In order to choose a good DL model for vulnerability prediction, it is important to understand what features the model uses to make its predictions. A good model should assign greater importance to the vulnerability related code features.
Approach. To understand what features a model uses for its prediction, we find the feature importance assigned to the predicted code by the existing approaches. For token-based models such as VulDeePecker, SySeVR, and Russell et al., we use Lemna to identify feature importance . Lemna assigns each token in the input with a value , representing the contribution of that token for prediction. A higher value of indicates a larger contribution of token towards the prediction and vice versa. For graph-based models, such as Devign, Lemna is not applicable . In this case, we use the activation value of each vertex in the graph to obtain the feature importance. The larger the activation, the more critical the vertex is.
Observations. To visualize the feature importances, we use a heatmap to highlight the most to least important segments of the code. Figure 7 shows two examples of correct predictions. Figure 6(d) shows an instance where Russell et al.’s token-based method accurately predicted a vulnerability. But, the features that were considered most important for the prediction (lines 2 and 3) are not related to the actual vulnerability that appears in buggy sprintf lines (lines 6, 8, and 10). We observe similar behavior in other token based methods.
In contrast, Figure 6(e) shows an example that was misclassified as non-vulnerable by token-based methods, but graph-based models accurately predict them as vulnerable. Here we note that the vulnerability is on line 20, and graph-based models use lines 3, 7, 19 to make the prediction, i.e. mark the corresponding function as vulnerable. We observe that each of these lines shares a data dependency with line 20 (through pb and st). Since graph-based models learn the semantic dependencies between each of the vertices in the graph through the code property graph, a series of connected vertices, each with high feature importance, causes the graph-based model to make the accurate prediction. Token-based models lack the requisite semantic information and therefore fail to make accurate predictions.
2.4 Model Selection: Lack of Class Separation
Existing approaches translate source code into a numeric feature vector that can be used to train a vulnerability prediction model. The efficacy of the vulnerability prediction model depends on how separable the feature vectors of the two classes (i.e., vulnerable examples and non-vulnerable examples) are. The greater the separability of the classes, the easier it is for a model to distinguish between them.
Approach. We use t-SNE plots to inspect the separability of the existing models. t-SNE is a popular dimensionality reduction technique that is particularly well suited for visualizing how high-dimensional datasets look in a feature space . A clear separation in the t-SNE space indicates that the classes are distinguishable from one another. In order to numerically quantify the separability of the classes, we use the centroid distance proposed by Mao et al. . We first find the centroids of each of the two classes. Next, we compute the euclidean distance between the the centroids. Models that have larger the euclidean distances are preferable since they exhibit greater class separation.
Observations. Figure 8 illustrates the t-SNE plots of the existing approaches. All the existing approaches (Figure 7(a)–7(d)) exhibit a significant degree of overlap in the feature space between the two classes. This is also reflected by the relatively low distance between the centroids in each of the existing methods. Among exiting methods, Devign (Figure 7(d)) has the least centroid distance (around ); this is much lower than any other existing approach. This lack of separation explains why Devign, in spite of being a graph-based model, has poor real-world performance (see Table 5.3).
Result: Existing approaches have several limitations: they (a) introduce data duplication, (b) don’t handle data imbalance, (c) don’t learn semantic information, (d) lack class separability. DLVP may be improved by addressing these limitations.
Legends: Vul=VulDeePecker , Sys=SySeVR , Rus=Russell et al. , Dev=Devign , Rev=ReVeal.
3 How to improve DLVP approaches? (RQ3)
Motivation. In RQ2, we highlighted a number of challenges that limit the performance of existing DLVP on real-world datasets. To address these challenges, we offer ReVeal— a roadmap to help avoid some of the common problems that current state-of-the-art vulnerability prediction methods face when exposed to real-world datasets.
Approach. A detailed description of ReVeal is presented in §4. Briefly, it works as follows: (i) input code fragment is converted to a feature vector with the help of a code property graph and GGNN (§4.1); (ii) the feature vectors are re-sampled using SMOTE (§4.2.1) that addresses potential data imbalance; and finally, (iii) a multi layer perceptron based representation learner is trained to learn a representation of the feature vectors that maximally separates the positive and negative classes (§4.2.2). This pipeline offers the following benefits over the current state-of-the-art:
Addressing duplication: ReVeal does not suffer from data duplication. During pre-processing, input samples are converted to their corresponding code property graphs whose vertices are embedded with a GGNN and aggregated with an aggregation function. This pre-processing approach tends to create a unique feature for every input samples. So long as the inputs are not exactly the same, the feature vector will also not be the same.
Addressing data imbalance: ReVeal makes use of synthetic minority oversampling technique (SMOTE) to re-balance the distribution of vulnerable and non-vulnerable examples in the training data. This ensures that the trained model would be distribution agnostic and, therefore, better suited for real-world vulnerability prediction where the distribution of vulnerable and non-vulnerable examples is unknown.
Addressing model choice: ReVeal extracts semantic as well as syntactic information from the source code using code property graphs. Using GGNN, each vertex embedding is updated with the embeddings of all its neighboring vertices. This further increases the semantic richness of the embeddings. This represents a considerable improvement to the current token-based and slicing-based models. As shown in Figure 6(e), ReVeal can accurately predict the vulnerability here.
Addressing the lack of separability: As shown in Figure 7(a)–7(d), the vulnerability class is almost inseparable from the non-vulnerability class in the feature space. To address this problem, ReVeal uses a representation learner that automatically learns how to re-balance the input feature vectors such that the vulnerable and non-vulnerable classes are maximally separated . This offers significant improvements over the current state-of-the-art as shown in Figure 7(e). Compared to the other approaches of Figure 7(a)–7(d), ReVeal exhibits the highest separation between the vulnerable and non-vulnerable classes (roughly higher than other GGNN based vulnerability prediction).
We compare performance of ReVeal with existing vulnerability prediction approaches of two real-world datasets, i.e., FFMPeg+Qemu and ReVeal data.
Observations. Figures 10 and 10 compare the performance of ReVeal tool with other approaches. We observe that ReVeal offers noticeable improvements in all the metrics:
ReVeal dataset: ReVeal performs best in terms of F1-score and recall. The median recall is ( more than that of SySeVR, the next best model) and median F1-score is ( more than SySeVR). This represents a and improvement in recall and F1 over SySeVR respectively. While Devign (another GGNN based vulnerability prediction) produces a better precision, Devign’s median recall less than that of ReVeal. This indicates that, compared to Devign, ReVeal can find larger number of true-positive vulnerabilities (resulting in a better recall) at the cost slightly more false-positives (resulting in a slightly lower precision). Overall, ReVeal’s median F1-score is more than Devign, i.e., a improvement.
FFMPeg+Qemu: ReVeal outperforms other approaches in all performance metrics. ReVeal’s median accuracy, precision, recall, and F1-scores are 5.01%, 5.19%, 13.11%, and 12.64% higher respectively than the next best approach.
In the rest of this research question, we investigate contribution of each component of ReVeal. Specifically, we study what improvements are offered by the use of (a) Graph neural network (§6.3.1); (b) re-balancing training data with (§6.3.2); and finally (c) representation learning (§6.3.3).
To understand the contribution of GGNN, we create a variant of ReVeal without GGNN. In this setup, we bypass the use GGNN and aggregate the initial vertex features to create the graph features. Further, we create another variant of ReVeal that uses only GGNN without re-sampling or representation learning.
Figure 11 shows the F1-scores for the above setup. We observe that, in both ReVeal dataset and FFMPeg+Qemu, F1-score increases when we use GGNN in ReVeal’s pipeline. We observe that the improvements offered by the use of GGNN is statistically significant (with a p-value of 0.0002 in ReVeal dataset, and 0.001 in FFMPeg+Qemu). Further, when we perform the A12 effect size with 30 independent experiment runs in each case, we found that the the effect size is 81% for ReVeal dataset and 73% for FFMPeg+Qemu. This means that 81% of the times ReVeal performs better with GGNN than it does without GGNN in ReVeal dataset and 73% in FFMPeg+Qemu. Both of those effect sizes are considered large indicating ReVeal with GGNN’s f1-score distribution is better than ReVeal without GGNN.
We contend that, since GGNN embeds the neighbors’ information in every vertex, vertices have richer information about the graph. Thus ReVeal’s classification model have more information at its disposal to reason about. The result indicates that when vertices assimilate information from neighboring vertices, vulnerability prediction performance increases.
3.2 Effect of Training Data Balancing
To understand the contribution of SMOTE, we deploy two variants of ReVeal one with SMOTE and one without. Note that, ReVeal uses SMOTE as an off-the shelf data balancing tool. Choice of which data-balancing tool should be used is a configurable parameter in ReVeal’s pipeline.
Figure 12 illustrates the effect of using data re-sampling in ReVeal’s pipeline. We observe that re-balancing training data improves ReVeal’s performance in general. The more skewed the dataset, the larger the improvement. In FFMPeg+Qemu, non-vulnerable examples populates roughly of the data. There, using SMOTE offers only a improvement in F1-score (see Figure 11(b)). However, in ReVeal dataset, non-vulnerable examples populates of the data, there we obtain more than improvement in F1-score compared to not using SMOTE (see Figure 11(a)). Without SMOTE, the precision of ReVeal tool improves and reaches up to 46.23% (see Table VI in Appendix), highest achieved precision among all the experimental settings. However, this setting suffers from low recall due to data imbalance. Thus, if an user cares more about precision over recall, SMOTE can be turned off, and vice versa.
3.3 Effect of Representation Learning
In order to understand the contribution of representation learning, we replace representation learning with three other learners: (a) Random Forest (a popular decision tree based classifier used by other vulnerability prediction approaches like Russell et al. ); (b) SVM with an RBF kernel which also attempts to maximize the margin between vulnerable and non-vulnerable instances ; and (c) An off-the-shelf Multi-Layer Perceptron which uses a log-Likelihood loss .
Figure 13 shows the ReVeal’s performance with different classification models. In both ReVeal dataset and FFMPeg+Qemu, our representation learner with triplet loss achieves the best performance. ReVeal’s median F1-score is 62.8%, 13.3%, 3.5% higher than that of RF, MLP, and SVM baselines respectively in ReVeal dataset. For FFMPeg+Qemu improvement in median F1-score is 23.33%, 7.5%, 6.5% over RF, MLP, and SVM respectively.
Max-margin models results in better performance in classifying vulnerable code in general. ReVeal with the representation learner performs statistically and significantly better than SVM in both ReVeal dataset and FFMPeg+Qemu (with and ). This is likely because SVM is a shallower than a representation learning model that propagates losses across several perceptron layers.
Result: The performance of DLVP approaches can be significantly improved using the ReVeal pipeline. The use of GGNN based feature embedding along with SMOTE and representation learning remedies data-duplication, data imbalance, and lack of separability. ReVeal produces improvements of up to 33.57% in precision and 128.38% in recall over state-of-the-art methods.
The usefulness of a source code vulnerability detection tool depends on its use case scenario. Ideally, in a real-world scenario, developers would deploy a trained vulnerability prediction model to identify vulnerable functions from a codebase. In simple terms, given all the functions in the code base, developers would want to locate the vulnerable function. As discussed in Section 2, evaluating such a scenario is paramount in understanding the usefulness of an approach. Existing approaches show very little to no evaluation of how respective approaches perform in such a real-world scenario. For instance, Devign showed a simulated real imbalanced evaluation (we refer to Table 3 in Devign’s original paper). However, for that simulation, they randomly sampled their version of the test data to contain 10% vulnerable example. Their reported results also show the drop in performance in real world imbalanced settings. Nevertheless, we hypothesize that their evaluation method does not truly reflect how a method will perform in the real world since the imbalance in their dataset is artificial. In this study, we propose a method to simulate such an evaluation scenario. Therefore, we hope such evaluation settings help drive new research in vulnerability detection.
2 Vulnerability Data and Tangled commits
Tangled commits have long been studied in software engineering and a major setback for software evolution history driven research . Developers often combine more than one unrelated or weakly related changes in code in one commit causing such a commit to be entanglement of more than one changes. Our collected ReVeal data is also subject to such a threat of containing tangled code changes. Thus, we investigate the characteristics of number of function changes in vulnerability-fix patches. Figure 14 shows a histogram of number of functions that are changed per Vulnerability-fix patches. Most of the patches change very small number of functions. 80% of the changes account for 12 or fewer number of function changes.
To validate that the empirical finding in the paper are not biased by the tangled commits, we created an alternate version of ReVeal data, where we removed any patch that changes more than one function from consideration. In that version of ReVeal data, we find that ReVeal achieves 26.33% f1 score. In contrast, if we do not use representation learning, ReVeal’s f1 score drops to 22.95%. If we do not use the data balancing, ReVeal’s performance drops to 13.13%. When we remove GGNN from ReVeal’s pipeline, f1 score drops to 22.82%. These results corroborates the importance of GGNN, data balancing and representation learning in ReVeal’s pipeline irrespective of existence of tangled code changes.
There have been a wide array of ML-based vulnerability prediction research . Yamaguchi et al. applied anomaly detection techniques on embeddings produced from static tainting to discover missing conditions such as input validation. Perl et al. used commit messages to detect the vulnerabilities of a program. These work leveraged Support Vector Machine (SVM) for VP. Li et al. used multi-class SVM to detect different class of vulnerabilities. Recently, DLVP has been subject to much research in both static- and dynamic -analysis scenario. However, static settings are more popular using code slices , trees , graphs etc. Although most prominent approaches use token based representation of code, recent graph based modeling showed success in vulnerability prediction .
Code Property Graph (CPG), introduced by Yamaguchi et al. , models the combined semantic and syntactic information of a program. The CPG is a joint data structure that leverages the information from abstract syntax trees, control flow graphs and program dependency graphs. CPG has shown to be robust in reasoning about vulnerabilities . Thus, ReVeal uses CPG to extract graph based features. In addition, ReVeal reduces data imbalance bias (through re-sampling in feature space) and learns to maximize separation between vulnerable and non-vulnerable examples.
There are several choices of techniques for reducing data imbalance , all of which use different strategies for balancing any imbalanced datasets. We choose SMOTE in ReVeal’s pipeline as it has shown to be successful in other software engineering related tasks .
In this paper, we systematically study different aspects of Deep Learning based Vulnerability Detection to effectively find real world vulnerabilities. We empirically show different shortcomings of existing datasets and models that potentially limits the usability of those techniques in practice. Our investigation found that existing datasets are too simple to represent real world vulnerabilities and existing modeling techniques do not completely address code semantics and data imbalance in vulnerability detection. Following these empirical findings, we propose a framework for collecting real world vulnerability dataset. We propose ReVeal as a configurable vulnerability prediction tool that addresses the concerns we discovered in existing systems and demonstrate its potential towards a better vulnerability prediction tool.
We would like to thank Yufan Zhuang for initial help in data collection. We also thank Dr. Suman Jana for their extensive feedback on this paper.