Detecting Adversarial Samples from Artifacts

Reuben Feinman, Ryan R. Curtin, Saurabh Shintre, Andrew B. Gardner

Introduction

Deep neural networks (DNNs) are machine learning techniques that impose a hierarchical architecture consisting of multiple layers of nonlinear processing units. In practice, DNNs achieve state-of-the-art performance for a variety of generative and discriminative learning tasks from domains including image processing, speech recognition, drug discovery and genomics (LeCun et al., 2015).

Although DNNs are known to be robust to noisy inputs (Fawzi et al., 2016), they have been shown to be vulnerable to specially-crafted adversarial samples (Szegedy et al., 2014; Goodfellow et al., 2015). These samples are constructed by taking a normal sample and perturbing it, either at once or iteratively, in a direction that maximizes the chance of misclassification. Figure 1 shows some examples of adversarial MNIST images alongside noisy images of equivalent perturbation size. Adversarial attacks which require only small perturbations to the original inputs can induce high-efficacy DNNs to misclassify at a high rate. Some adversarial samples can also induce a DNN to output a specific target class (Papernot et al., 2016b). The vulnerability of DNNs to such adversarial attacks highlights important security and performance implications for these models (Papernot et al., 2016b). Consequently, significant effort is ongoing to understand and explain adversarial samples and to design defenses against them (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016c; Tanay & Griffin, 2016; Metzen et al., 2017).

Using the intuition that adversarial samples lie off the true data manifold, we devise two novel features that can be used to detect adversarial samples:

Density estimates, calculated with the training set in the feature space of the last hidden layer. These are meant to detect points that lie far from the data manifold.

Bayesian uncertainty estimates, available in dropout neural networks. These are meant to detect when points lie in low-confidence regions of the input space, and can detect adversarial samples in situations where density estimates cannot.

When both of these features are used as inputs to a simple logistic regression model, we observe effective detection of adversarial samples, achieving an ROC-AUC of $92.6\%$ on the MNIST dataset with both noisy and normal samples as the negative class. In Section 2 we provide the relevant background information for our approach, and in Section 3 we briefly review a few state-of-the-art adversarial attacks. Then, we introduce the intuition for our approach in Section 4, with a discussion of manifolds and Bayesian uncertainty. This leads us to our results and conclusions in Sections 5 and 6.

Background

While neural networks are known to be robust to random noise (Fawzi et al., 2016), they have been shown to be vulnerable to adversarially-crafted perturbations (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016b). Specifically, an adversary can use information about the model to craft small perturbations that fool the network into misclassifying their inputs. In the context of object classification, these perturbations are often imperceptible to the human eye, yet they can force the model to misclassify with high model confidence.

A number of works have attempted to explain the vulnerability of DNNs to adversarial samples. Szegedy et al. (2014) offered a simple preliminary explanation for the phenomenon, arguing that low-probability adversarial “pockets” are densely distributed in input space. As a result, they argued, every point in image space is close to a vast number of adversarial points and can be easily manipulated to achieve a desired model outcome. Goodfellow et al. (2015) argued that it is a result of the linear nature of deep classifiers. Although this explanation has been the most widely accepted in the field, it was recently weakened by counterexamples (Tanay & Griffin, 2016). Tanay & Griffin (2016) introduced the ‘boundary tilting’ perspective, suggesting instead that adversarial samples lie in regions where the classification boundary is close to the manifold of training data.

Research in adversarial attack defense generally falls within two categories: first, methods for improving the robustness of classifiers to current attacks, and second, methods for detecting adversarial samples in the wild. Goodfellow et al. (2015) proposed augmenting the training loss function with an additional adversarial term to improve the robustness of these models to a specific adversarial attack. Defensive distillation (Papernot et al., 2016c) is another recently-introduced technique which involves training a DNN with the softmax outputs of another neural network that was trained on the training data, and can be seen as a way of preventing the network from fitting too tightly to the data. Defensive distillation is effective against the attack of Papernot et al. (2016b). However, Carlini & Wagner (2016) showed that defensive distillation is easily broken with a modified attack.

On the detection of adversarial samples, Metzen et al. (2017) proposed augmenting a DNN with an additional “detector” subnetwork, trained on normal and adversarial samples. Although the authors show compelling performance results on a number of state-of-the-art adversarial attacks, one major drawback is that the detector subnetwork must be trained on generated adversarial samples. This implicitly trains the detector on a subset of all possible adversarial attacks; we do not know how comprehensive this subset is, and future attack modifications may be able to surmount the system. The robustness of this technique to random noise is not currently known.

Adversarial Attacks

Fast Gradient Sign Method (FGSM): Goodfellow et al. (2015) introduced the Fast Gradient Sign Method for crafting adversarial perturbations using the derivative of the model’s loss function with respect to the input feature vector. Given a base input, the approach is to perturb each feature in the direction of the sign of the gradient by magnitude $\epsilon$, where $\epsilon$ is a parameter that determines perturbation size. For a model with loss $J(\Theta,x,y)$, where $\Theta$ represents the model parameters, $x$ is the model input, and $y$ is the label of $x$, the adversarial sample is generated as

$$x^{*}=x+\epsilon\,\operatorname{sign}\left(\nabla_{x}J(\Theta,x,y)\right).$$

With small $\epsilon$ , it is possible to fool DNNs trained for the MNIST and CIFAR-10 classification tasks with high success rate (Goodfellow et al., 2015).
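The FGSM update (one step of size $\epsilon$ along the sign of the input gradient) can be sketched on a toy model. The example below is only an illustration under stated assumptions: it uses a two-feature logistic regression classifier instead of a DNN, and the names `loss_grad_wrt_input` and `fgsm` are hypothetical helpers, not part of any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss of a logistic model w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Perturb every feature by eps in the direction of the sign of the loss gradient."""
    grad = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Toy (assumed) model and input: the clean point is classified as class 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.1], 1
x_adv = fgsm(w, b, x, y, eps=0.25)
p_clean = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
p_adv = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
```

In this toy setting the single step moves the point across the decision boundary: `p_clean` is above 0.5 while `p_adv` falls below it.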

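Basic Iterative Method (BIM): Kurakin et al. (2017) proposed an iterative version of FGSM called the Basic Iterative Method. This is a straightforward extension; instead of merely applying adversarial noise $\eta$ once with one parameter $\epsilon$, apply it many times iteratively with small $\epsilon$. This gives a recursive formula:

$$x^{*}_{0}=x,\qquad x^{*}_{i}=\operatorname{clip}_{x,\epsilon}\left(x^{*}_{i-1}+\epsilon\,\operatorname{sign}\left(\nabla_{x^{*}_{i-1}}J(\Theta,x^{*}_{i-1},y)\right)\right).$$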

Here, $\operatorname{clip}_{x,\epsilon}(\cdot)$ represents a clipping of the values of the adversarial sample such that they are within an $\epsilon$ -neighborhood of the original sample $x$ . This approach is convenient because it allows extra control over the attack. For instance, one can control how far past the classification boundary a sample is pushed: one can terminate the loop on the iteration when $x^{*}_{i}$ is first misclassified, or add additional noise beyond that point.

The basic iterative method was shown to be typically more effective than the FGSM attack on ImageNet images (Kurakin et al., 2017).
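A minimal sketch of the iterative procedure follows, again on a toy logistic regression model. It assumes a separate per-iteration step size `step` and neighborhood radius `eps` (both hypothetical parameter names), and it exposes the two stopping rules described above through the assumed flag `stop_at_boundary`.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic model."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def bim(w, b, x, y, eps, step, n_iter, stop_at_boundary=True):
    """Repeat small signed-gradient steps, clipping into the eps-neighborhood of x."""
    x_adv = list(x)
    for _ in range(n_iter):
        p = predict_proba(w, b, x_adv)
        # Stop as soon as the predicted label differs from the true label.
        if stop_at_boundary and (p >= 0.5) != bool(y):
            break
        grad = [(p - y) * wi for wi in w]
        stepped = [xi + step * ((g > 0) - (g < 0)) for xi, g in zip(x_adv, grad)]
        # clip_{x, eps}: keep every feature within eps of the original value.
        x_adv = [min(max(s, x0 - eps), x0 + eps) for s, x0 in zip(stepped, x)]
    return x_adv

w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.1], 1
x_at_boundary = bim(w, b, x, y, eps=0.25, step=0.05, n_iter=20)
x_beyond = bim(w, b, x, y, eps=0.25, step=0.05, n_iter=20, stop_at_boundary=False)
```

With early stopping the sample ends just past the decision boundary, whereas the fixed-iteration variant keeps pushing until the clipping constraint is active in every feature.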

Jacobian-based Saliency Map Attack (JSMA): Papernot et al. (2016b) proposed a simple iterative method for targeted misclassification. By exploiting the forward derivative of a DNN, one can find an adversarial perturbation that will force the model to misclassify into a specific target class. For an input $X$ and a neural network $F$, the output for class $j$ is denoted $F_{j}(X)$. To achieve a target class $t$, $F_{t}(X)$ must be increased while the probabilities $F_{j}(X)$ of all other classes $j\neq t$ decrease, until $t=\operatorname{arg\,max}_{j}F_{j}(X)$. This is accomplished by exploiting the adversarial saliency map, which is defined as

$$S(X,t)[i]=\begin{cases}0, & \text{if } \dfrac{\partial F_{t}(X)}{\partial X_{i}}<0 \text{ or } \sum_{j\neq t}\dfrac{\partial F_{j}(X)}{\partial X_{i}}>0,\\[2ex] \left(\dfrac{\partial F_{t}(X)}{\partial X_{i}}\right)\left|\sum_{j\neq t}\dfrac{\partial F_{j}(X)}{\partial X_{i}}\right|, & \text{otherwise,}\end{cases}$$

for an input feature $i$ . Starting with a normal sample $x$ , we locate the pair of features $\{i,j\}$ that maximize $S(X,t)[i]+S(X,t)[j]$ , and perturb each feature by a constant offset $\epsilon$ . This process is repeated iteratively until the target misclassification is achieved. This method can effectively produce MNIST samples that are correctly classified by human subjects but misclassified into a specific target class by a DNN with high success rate.
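The saliency computation can be illustrated with a short sketch. It assumes the forward derivative is already available as a nested list `jacobian[j][i]` holding $\partial F_{j}(X)/\partial X_{i}$; the function names and the toy Jacobian are hypothetical. A feature receives zero saliency when increasing it would lower the target output or raise the other outputs in aggregate.

```python
def saliency_map(jacobian, t):
    """Adversarial saliency of each input feature for target class t."""
    n_features = len(jacobian[0])
    saliency = []
    for i in range(n_features):
        d_target = jacobian[t][i]
        d_others = sum(row[i] for j, row in enumerate(jacobian) if j != t)
        if d_target < 0 or d_others > 0:
            saliency.append(0.0)
        else:
            saliency.append(d_target * abs(d_others))
    return saliency

def best_feature_pair(saliency):
    """Indices of the two features whose saliency values have the largest sum."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return order[0], order[1]

# Toy Jacobian: 3 classes (rows) by 3 input features (columns).
jacobian = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.4, 0.2],
    [-0.1, -0.1, -0.4],
]
s = saliency_map(jacobian, t=0)
pair = best_feature_pair(s)
```

Here feature 1 is excluded because it decreases the target output, and the selected pair consists of features 0 and 2.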

Carlini & Wagner (C&W): Carlini & Wagner (2016) recently introduced a technique that is able to overcome defensive distillation. In fact, their technique encompasses a range of attacks, all cast through the same optimization framework. This results in three powerful attacks, each for a different distance metric: an $L_{2}$ attack, an $L_{0}$ attack, and an $L_{\infty}$ attack. For the $L_{0}$ attack, which we will consider in this paper, the perturbation $\delta$ is defined in terms of an auxiliary variable $\omega$ as

$$\delta_{i}=\frac{1}{2}\left(\tanh(\omega_{i})+1\right)-x_{i}.$$

Then, to find $\delta^{*}$ (an ‘unrestricted perturbation’), we optimize over $\omega$:

$$\min_{\omega}\;\|\delta\|_{2}^{2}+c\,f(x+\delta),$$

where $c>0$ is a constant that balances the two terms and $f(\cdot)$ is an objective function based on the hinge loss:

$$f(x)=\max\left(\max\{Z(x)_{i}:i\neq t\}-Z(x)_{t},\,-\kappa\right).$$

Here, $Z(x)_{i}$ is the pre-softmax output for class $i$ , $t$ is the target class, and $\kappa$ is a parameter that controls the confidence with which the misclassification occurs.

Finally, to produce the adversarial sample $x^{*}=x+\delta$ , we convert the unrestricted perturbation $\delta^{*}$ to a restricted perturbation $\delta$ , in order to reduce the number of changed pixels. By calculating the gradient $\nabla f(x+\delta^{*})$ , we may identify those pixels $\delta^{*}_{i}$ with little importance (small gradient values) and take $\delta_{i}=0$ ; otherwise, for larger gradient values we take $\delta_{i}=\delta^{*}_{i}$ . This allows an effective attack with few modified pixels, thus helping keep the norm of $\delta$ low.

These three attacks were shown to be particularly effective in comparison to other attacks against networks trained with defensive distillation, achieving adversarial sample generation success rates of 100% where other techniques were not able to top 1%.
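The building blocks of this attack can be sketched in a few lines. The code below is an illustration only: it assumes pixel values in $[0,1]$, takes the pre-softmax outputs as a plain list, and uses a hypothetical fixed `threshold` on the gradient magnitude to decide which pixels are reset, which simplifies the pixel-selection step described above. The optimizer over $\omega$ is not shown.

```python
import math

def delta_from_omega(omega, x):
    """Change of variables: x + delta = (tanh(omega) + 1) / 2 always lies in [0, 1]."""
    return [0.5 * (math.tanh(w) + 1.0) - xi for w, xi in zip(omega, x)]

def hinge_objective(logits, t, kappa=0.0):
    """Hinge-style objective: reaches -kappa once the target logit leads by at least kappa."""
    best_other = max(z for i, z in enumerate(logits) if i != t)
    return max(best_other - logits[t], -kappa)

def restrict_perturbation(delta_star, grad, threshold):
    """Zero the perturbation on pixels whose gradient magnitude is below threshold."""
    return [d if abs(g) >= threshold else 0.0 for d, g in zip(delta_star, grad)]

delta = delta_from_omega([0.0, 2.0], [0.2, 0.9])
f_not_yet_fooled = hinge_objective([1.0, 3.0, 0.5], t=0)
f_confident = hinge_objective([5.0, 3.0, 0.5], t=0, kappa=1.0)
restricted = restrict_perturbation([0.3, -0.2, 0.1], [0.9, 0.01, 0.5], threshold=0.1)
```

The objective is positive while another class still has the largest pre-softmax output, and it saturates at $-\kappa$ once the target class leads by the requested margin.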

Artifacts of Adversarial Samples

Each of these adversarial sample generation algorithms is able to change the predicted label of a point without changing the underlying true label: humans will still correctly classify an adversarial sample, but models will not. This can be understood from the perspective of the manifold of training data. Many high-dimensional datasets, such as images, are believed to lie on a low-dimensional manifold (Lee & Verleysen, 2007). Gardner et al. (2015) recently showed that by carefully traversing the data manifold, one can change the underlying true label of an image. The intuition is that adversarial perturbations, which do not constitute meaningful changes to the input, must push samples off of the data manifold. Tanay & Griffin (2016) base their investigation of adversarial samples on the assumption that adversarial samples lie near class boundaries that are close to the edge of a data submanifold. Similarly, Goodfellow et al. (2015) demonstrate that DNNs perform correctly only near the small manifold of training data. Therefore, we base our work here on the assumption that adversarial samples do not lie on the data manifold.

If we accept that adversarial samples are points that would not arise naturally, then we can assume that a technique to generate adversarial samples will, from a source point $x$ with class $c_{x}$ , typically generate an adversarial sample $x^{*}$ that does not lie on the manifold and is classified incorrectly as $c_{x^{*}}$ . If $x^{*}$ lies off of the data manifold, we may split into three possible situations:

$x^{*}$ is far away from the submanifold of $c_{x^{*}}$ .

$x^{*}$ is near the submanifold of $c_{x^{*}}$ but not on it, and $x^{*}$ is far from the classification boundary separating classes $c_{x}$ and $c_{x^{*}}$ .

$x^{*}$ is near the submanifold of $c_{x^{*}}$ but not on it, and $x^{*}$ is near the classification boundary separating classes $c_{x}$ and $c_{x^{*}}$ .

Figures 2(a) through 2(c) show simplified example illustrations for each of these three situations in a two-dimensional binary classification setting.

If we have an estimate of what the submanifold corresponding to data with class $c_{x^{*}}$ is, then we can determine whether $x^{*}$ falls near this submanifold after observing the prediction $c_{x^{*}}$ . Following the intuition of Gardner et al. (2015) and hypotheses of Bengio et al. (2013), the deeper layers of a DNN provide more linear and ‘unwrapped’ manifolds to work with than input space; therefore, we may use this idea to model the submanifolds of each class by performing kernel density estimation in the feature space of the last hidden layer.

The standard technique of kernel density estimation can, given the point $x$ and the set $X_{t}$ of training points with label $t$, provide a density estimate $\hat{f}(x)$ that can be used as a measure of how far $x$ is from the submanifold for $t$. Specifically,

$$\hat{f}(x)=\frac{1}{|X_{t}|}\sum_{x_{i}\in X_{t}}k(x_{i},x),$$

where $k(\cdot,\cdot)$ is the kernel function, often chosen as a Gaussian with bandwidth $\sigma$:

$$k_{\sigma}(x,y)\sim\exp\left(-\|x-y\|^{2}/\sigma^{2}\right).$$

The bandwidth may typically be chosen as a value that maximizes the log-likelihood of the training data (Jones et al., 1996). A value too small will give rise to a ‘spiky’ density estimate with too many gaps (see Figure 3), but a value too large will give rise to an overly-smooth density estimate (see Figure 4). This also implies that the estimate is improved as the training set size $|X_{t}|$ increases, since we are able to use smaller bandwidths without the estimate becoming too ‘spiky.’

For the manifold estimate, we operate in the space of the last hidden layer. This layer provides a space of reasonable dimensionality in which we expect the manifold of our data to be simplified. If $\phi(x)$ is the last hidden layer activation vector for point $x$, then our density estimate for a point $x$ with predicted class $t$ is defined as

$$\hat{K}(x,X_{t})=\sum_{x_{i}\in X_{t}}k_{\sigma}(\phi(x),\phi(x_{i})),$$

where $X_{t}$ is the set of training points of class $t$ , and $\sigma$ is the tuned bandwidth.
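As a rough illustration, the density feature can be computed as below. The sketch assumes that the last-hidden-layer activations have already been extracted into plain lists (`phi_x` for the query point and `phi_train_t` for the training points of the predicted class, both hypothetical names), and it averages the kernel values; dropping the division yields an unnormalized score with the same ordering within a class.

```python
import math

def gaussian_kernel(a, b, sigma):
    """Unnormalized Gaussian kernel with bandwidth sigma."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / sigma ** 2)

def density_estimate(phi_x, phi_train_t, sigma):
    """Average kernel value between a point and the training points of one class."""
    total = sum(gaussian_kernel(phi_x, p, sigma) for p in phi_train_t)
    return total / len(phi_train_t)

# Toy feature vectors standing in for hidden-layer activations of one class.
phi_train_t = [[0.0, 0.0], [1.0, 0.0]]
near = density_estimate([0.0, 0.0], phi_train_t, sigma=1.0)
far = density_estimate([5.0, 5.0], phi_train_t, sigma=1.0)
```

A point close to the class's training features receives a much larger value than a distant one, which is the behavior the detector relies on.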

To validate our intuition about the utility of this density estimate, we perform a toy experiment using the BIM attack with a convnet trained on MNIST data. In Figure 5, we plot the density estimate of the source class and the final predicted class for each iteration of BIM. One can see that the adversarial sample moves away from a high density estimate region for the correct class, and towards a high density estimate region for the incorrect class. This matches our intuition: we expect the adversarial sample to leave the correct class manifold and move towards (but not onto) the incorrect class manifold.

While a density estimation approach can easily detect an adversarial point that is far from the $c_{x^{*}}$ submanifold, this strategy may not work well when $x^{*}$ is very near the $c_{x^{*}}$ submanifold. Therefore, we must investigate alternative approaches for those cases.

Bayesian Neural Network Uncertainty

Beyond distance-based metrics, another powerful tool to identify low-confidence regions of the input space is the uncertainty output of Bayesian models, e.g., the Gaussian process (Rasmussen & Williams, 2005). Gaussian processes assume a Gaussian prior over the set of all functions, $\mathcal{F}$, that can be used to map the input space to the output space. As observations $(x,y)$ are made, only those functions $f\in\mathcal{F}$ are retained for which $f(x)=y$. For a new test point $x^{*}$, the prediction for each function $f$, $y^{*}=f(x^{*})$, is computed and the expected value over $y^{*}$ is then used as the final prediction. Simultaneously, the variance of the output values $y^{*}$ is also used as an indicator of the model’s uncertainty. Figure 6 illustrates how in simple cases, Bayesian uncertainty can provide additional information about model confidence not conveyed by distance metrics like a density estimate.

Recently, Gal & Ghahramani (2015) proved that DNNs trained with dropout are equivalent to an approximation of the deep Gaussian process. As a result, we can extract Bayesian uncertainty estimates from a wide range of DNN architectures without modification. Dropout, first introduced as a method to reduce overfitting when training DNNs (Srivastava et al., 2014), works by dropping hidden nodes from the network randomly with some probability $p$ during the training phase. During the testing phase, all nodes are kept, but the weights are scaled by the retention probability $1-p$. Gal & Ghahramani (2015) showed that the dropout training objective converges to a minimization of the Kullback-Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process marginalized over its covariance function parameters. After iterating to convergence, uncertainty estimates can be extracted from dropout DNNs in the following manner.

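We sample $T$ times from our distribution of network configurations, typically i.i.d. $\mathrm{Bernoulli}(o_{l})$ for each layer $l$, and obtain parameters $\{W^{1},\ldots,W^{T}\}$. Here $W^{t}=\{W_{1}^{t},\ldots,W_{L}^{t}\}$ are the $L$ weight matrices sampled at iteration $t$. Thereafter, we can evaluate a Monte Carlo estimate of the output, i.e. the first moment, as:

$$\mathbb{E}_{q(y^{*}|x^{*})}[y^{*}]\approx\frac{1}{T}\sum_{t=1}^{T}\hat{y}^{*}(x^{*},W^{t}).$$

Similarly, we can evaluate the second moment with Monte Carlo estimation, leading us to an estimate of model variance

$$\mathbb{V}_{q(y^{*}|x^{*})}[y^{*}]\approx\tau^{-1}I_{D}+\frac{1}{T}\sum_{t=1}^{T}\hat{y}^{*}(x^{*},W^{t})^{\top}\hat{y}^{*}(x^{*},W^{t})-\mathbb{E}_{q(y^{*}|x^{*})}[y^{*}]^{\top}\,\mathbb{E}_{q(y^{*}|x^{*})}[y^{*}],$$

where $\tau$ is the model precision.

Relying on the intuition that Bayesian uncertainty can be useful to identify adversarial samples, we make use of dropout variance values in this paper, setting $T=50$. As dropout on its own is known to be a powerful regularization technique (Srivastava et al., 2014), we use neural network models without weight decay, leaving $\tau^{-1}=0$. Thus, for a test sample $x^{*}$ and stochastic predictions $\{\hat{y}^{*}_{1},\ldots,\hat{y}^{*}_{T}\}$, our uncertainty estimate $U(x^{*})$ can be computed as

$$U(x^{*})=\frac{1}{T}\sum_{i=1}^{T}\hat{y}^{*\top}_{i}\hat{y}^{*}_{i}-\left(\frac{1}{T}\sum_{i=1}^{T}\hat{y}^{*}_{i}\right)^{\top}\left(\frac{1}{T}\sum_{i=1}^{T}\hat{y}^{*}_{i}\right).$$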

Because we use DNNs with one output node per class, we look at the mean of the uncertainty vector as a scalar representation of model uncertainty.
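A small sketch of this scalar uncertainty follows. It assumes the $T$ stochastic forward passes (dropout kept active at test time) have already been run and collected in `mc_predictions`, a hypothetical list of per-class output vectors; it then takes the variance of each output node across passes and averages over classes, which is one way to read the "mean of the uncertainty vector".

```python
def dropout_uncertainty(mc_predictions):
    """Mean over classes of the per-class variance across stochastic forward passes."""
    n_passes = len(mc_predictions)
    n_classes = len(mc_predictions[0])
    per_class_var = []
    for c in range(n_classes):
        outputs = [pred[c] for pred in mc_predictions]
        mean = sum(outputs) / n_passes
        second_moment = sum(o * o for o in outputs) / n_passes
        # Variance = second moment minus squared first moment.
        per_class_var.append(second_moment - mean * mean)
    return sum(per_class_var) / n_classes

# Three toy passes over a two-class model (a real run would use T = 50).
unstable = dropout_uncertainty([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
stable = dropout_uncertainty([[0.5, 0.5]] * 4)
```

Predictions that change between passes give a positive value, while identical predictions give zero.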

To demonstrate the efficacy of our uncertainty estimates in detecting adversarial samples, we trained the LeNet convnet (LeCun et al., 1989) with a dropout rate of 0.5 applied after the last pooling layer and after the inner-product layer for MNIST classification. Figures 7(a) and 7(b) compare the distribution of Bayesian uncertainty for adversarial samples to those of normal samples and of noisy samples with equivalent perturbation size; both the BIM and JSMA cases are shown. Clearly, uncertainty distributions for adversarial samples are statistically distinct from normal and noisy samples, verifying our intuition.

Experiments

In order to evaluate the proficiency of our density and uncertainty features for adversarial detection, we test these features on MNIST, CIFAR-10, and SVHN. All pixels are scaled to floats in the range of $[0,1]$. Our models achieve near state-of-the-art accuracy on the normal holdout sets for each dataset and are described in Section 5.1. In order to properly evaluate our method, we only perturb those test samples which were correctly classified by our models in their original states. An adversary would have no reason to perturb samples that are already misclassified.

We implement each of the four attacks (FGSM, BIM, JSMA, and C&W) described in Section 3 in TensorFlow, using the cleverhans library for FGSM and JSMA (Papernot et al., 2016a). For the BIM attack, we implement two versions: BIM-A, which stops iterating as soon as misclassification is achieved (‘at the decision boundary’), and BIM-B, which runs for a fixed number of iterations that is well beyond the average misclassification point (‘beyond the decision boundary’). For each attack type, we also craft an equal number of noisy test samples as a benchmark. For FGSM and BIM, these are crafted by adding Gaussian noise to each pixel with a scale set so that the mean $L_{2}$-norm of the perturbation matches that of the adversarial samples. For JSMA and C&W, which flip pixels to their min or max values, these are crafted by observing the number of pixels that were altered in the adversarial case and flipping an equal number of pixels randomly. Details about model accuracies on the adversarial sets and average perturbation sizes are provided in Table 1. Some examples of normal, noisy and adversarial samples are displayed in Figure 8.
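The two noise benchmarks can be sketched as follows. This is an illustration under assumptions rather than the code used for the experiments: images are flat lists of floats in $[0,1]$, the Gaussian scale is chosen as `target_l2 / sqrt(d)` so that the expected perturbation norm is close to `target_l2`, no clipping back to $[0,1]$ is applied, and all function names are hypothetical.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def gaussian_noisy_copy(x, target_l2, rng):
    """Add i.i.d. Gaussian noise whose expected L2 norm is roughly target_l2."""
    scale = target_l2 / math.sqrt(len(x))
    return [xi + rng.gauss(0.0, scale) for xi in x]

def flipped_copy(x, n_flips, rng):
    """Flip n_flips randomly chosen pixels to the opposite extreme (0.0 or 1.0)."""
    noisy = list(x)
    for idx in rng.sample(range(len(x)), n_flips):
        noisy[idx] = 1.0 if noisy[idx] < 0.5 else 0.0
    return noisy

rng = random.Random(0)
x = [rng.random() for _ in range(784)]  # stand-in for a flattened 28x28 image
noisy = gaussian_noisy_copy(x, target_l2=3.0, rng=rng)
flipped = flipped_copy(x, n_flips=20, rng=rng)
noise_norm = l2_norm([a - b for a, b in zip(noisy, x)])
n_changed = sum(1 for a, b in zip(flipped, x) if a != b)
```

In practice `target_l2` would be set to the mean perturbation norm measured on the adversarial samples, and `n_flips` to the number of pixels the attack altered.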

Here, we briefly describe the models used for each dataset and their accuracies on normal and noisy test samples.

MNIST: We use the LeNet (LeCun et al., 1989) convnet architecture with a dropout rate of 0.5 after the last pooling layer and after the inner-product layer. This model reports $98.7\%$ accuracy on normal samples and $97.2\%$ accuracy on noisy samples.

SVHN: We use the LeNet architecture with an extra intermediate inner-product layer to assist with higher dimensionality. We used a dropout rate of 0.5 after the last pooling layer and after each inner-product layer. This model reports $92.2\%$ accuracy on normal samples and $79.2\%$ accuracy on noisy samples.

CIFAR-10: We use a deep 12-layer convnet with a dropout rate of 0.5 applied after the last pooling layer and after each of the two inner-product layers. This model reports $82.6\%$ accuracy on normal samples and $79.2\%$ accuracy on noisy samples.

Training was done using the Adadelta optimizer with cross-entropy loss and a batch size of 256.

Feature Values

When we generate adversarial samples, the uncertainty typically grows larger than that of the original sample, and the density estimate typically grows smaller. This makes sense: the adversarial sample is likely to be in a region of higher uncertainty, lower density estimates, or both. In addition, the change is far more pronounced than if we simply perturb the sample with random noise.

In order to demonstrate this phenomenon, we generate adversarial samples and randomly perturbed (noisy) samples from the test data points for MNIST and CIFAR-10. For each attack, we calculate the percentage of points with higher uncertainty values than the corresponding original unperturbed samples, and the percentage of points with lower density estimates than the corresponding original unperturbed samples. The results are shown in Table 2.

We can see clearly that uncertainty is generally increased when adversarial samples are generated, and density estimates are generally decreased. These results suggest that our two features are reliable indicators of adversarial samples. Therefore, we next move on to the task of detecting adversarial samples.

Adversarial Sample Classifiers

To evaluate our adversarial detection method, we aggregate all adversarial samples of different types into a unified set, and do the same with the noisy and normal samples. For each dataset and attack, we have built three binary classifiers:

uncertainty: this classifier simply thresholds on the uncertainty value of a sample.

density: this classifier simply thresholds on the negative log kernel density of a sample.

combined: this is a simple logistic regression classifier with two features as input: the uncertainty and the density estimate.

These detection models are used to distinguish adversarial samples (the positive class) from normal and noisy samples, which jointly constitute the negative class. The logistic regression model is trained by generating adversarial samples for every correctly-classified training point using each of the four adversarial attacks, and then using the uncertainty values and density estimates for the original and adversarial samples as a labeled training set. The two features are z-scored before training.
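The combined detector can be sketched with a small hand-written logistic regression. Everything below is illustrative: the feature values are made-up toy numbers, plain batch gradient descent stands in for whatever solver was actually used, and `zscore`, `fit_logistic`, and `predict` are hypothetical helper names.

```python
import math

def zscore(column):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def predict(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def fit_logistic(features, labels, lr=0.5, n_epochs=500):
    """Fit logistic regression by batch gradient descent on the cross-entropy loss."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = predict(w, b, x) - y
            for k in range(d):
                grad_w[k] += err * x[k] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy features: (uncertainty, negative log density); label 1 = adversarial.
uncertainty = [0.01, 0.02, 0.015, 0.08, 0.09, 0.07]
neg_log_density = [1.0, 1.2, 0.9, 3.0, 2.8, 3.5]
labels = [0, 0, 0, 1, 1, 1]
features = list(zip(zscore(uncertainty), zscore(neg_log_density)))
w, b = fit_logistic(features, labels)
scores = [predict(w, b, x) for x in features]
```

The resulting `scores` can be thresholded to flag a sample as adversarial, and sweeping the threshold traces out an ROC curve.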

Because these are all threshold-based classifiers, we may generate an ROC for each method. Figure 9 shows ROCs for each classifier with a couple of datasets. We see that the performance of the combined classifier is better than either the uncertainty or density classifiers, demonstrating that each feature is able to detect different qualities of adversarial samples. Further, the ROCs demonstrate that the uncertainty and density estimates are effective indicators that can be used to detect if a sample is adversarial. Figure 10 shows the ROCs for each individual attack; the combined classifier is able to most easily handle the JSMA, BIM-A and C&W attacks.

In Table 3, the ROC-AUC measures are shown, for each of the three classifiers, on each dataset, for each attack. The performance is quite good, suggesting that the combined classifier is able to effectively detect adversarial samples from a wide range of attacks on a wide range of datasets.
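For readers unfamiliar with the metric, ROC-AUC equals the probability that a randomly chosen positive (adversarial) sample receives a higher detector score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions for illustration, and the quadratic loop is only suitable for small sets.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy detector scores; label 1 = adversarial, 0 = normal or noisy.
auc_partial = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
auc_perfect = roc_auc([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1])
```

In the first toy case three of the four positive-negative pairs are ordered correctly, so the value is 0.75; a detector that ranks every adversarial sample above every negative sample reaches 1.0.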

Conclusions

We have shown that adversarial samples crafted to fool DNNs can be effectively detected with two new features: kernel density estimates in the subspace of the last hidden layer, and Bayesian neural network uncertainty estimates. These two features handle complementary situations, and can be combined as an effective defense mechanism against adversarial samples. Our results show that we can, in some cases, obtain an ROC-AUC of 90% or more for an adversarial sample detector when both normal and noisy samples constitute the negative class. The performance is good on a wide variety of attacks and a range of image datasets.

In our work here, we have only considered convolutional neural networks. However, we believe that this approach can be extended to other neural network architectures. Gal (2015) showed that the idea of dropout as a Bayesian approximation could also be applied to RNNs, allowing for robust uncertainty estimation. In future work, we aim to apply our features to RNNs and other network architectures.

Acknowledgements

We thank Nikolaos Vasiloglou and Nicolas Papernot for useful discussions that helped to shape the direction of this paper. We also thank Symantec Corporation for providing us with the resources used to conduct this research.