On Detecting Adversarial Perturbations

Jan Hendrik Metzen, Tim Genewein, Volker Fischer, Bastian Bischoff

Introduction

In the last years, machine learning and in particular deep learning methods have led to impressive performance on various challenging perceptual tasks, such as image classification (Russakovsky et al., 2015; He et al., 2016) and speech recognition (Amodei et al., 2016). Despite these advances, perceptual systems of humans and machines still differ significantly. As Szegedy et al. (2014) have shown, small but carefully directed perturbations of images can lead to incorrect classification with high confidence on artificial systems. Yet, for humans these perturbations are often visually imperceptible and do not stir any doubt about the correct classification. In fact, so called adversarial examples are crucially characterized by requiring minimal perturbations that are quasi-imperceptible to a human observer. For computer vision tasks, multiple techniques to create such adversarial examples have been developed recently. Perhaps most strikingly, adversarial examples have been shown to transfer between different network architectures, and networks trained on disjoint subsets of data (Szegedy et al., 2014). Adversarial examples have also been shown to translate to the real world (Kurakin et al., 2016), e.g., adversarial images can remain adversarial even after being printed and recaptured with a cell phone camera. Moreover, Papernot et al. (2016a) have shown that a potential attacker can construct adversarial examples for a network of unknown architecture by training an auxiliary network on similar data and exploiting the transferability of adversarial inputs.

The vulnerability to adversarial inputs can be problematic and even prevent the application of deep learning methods in safety- and security-critical applications. The problem is particularly severe when human safety is involved, for example in the case of perceptual tasks for autonomous driving. Methods to increase robustness against adversarial attacks have been proposed and range from augmenting the training data (Goodfellow et al., 2015) over applying JPEG compression to the input (Dziugaite et al., 2016) to distilling a hardened network from the original classifier network (Papernot et al., 2016b). However, for some recently published attacks (Carlini & Wagner, 2016), no effective counter-measures are known yet.

In this paper, we propose to train a binary detector network, which obtains inputs from intermediate feature representations of a classifier, to discriminate between samples from the original data set and adversarial examples. Being able to detect adversarial perturbations might help in safety- and security-critical semi-autonomous systems as it would allow disabling autonomous operation and requesting human intervention (along with a warning that someone might be manipulating the system). However, it might intuitively seem very difficult to train such a detector since adversarial inputs are generated by tiny, sometimes visually imperceptible, perturbations of genuine examples. Despite this intuition, our results on CIFAR10 and a 10-class subset of ImageNet show that a detector network that achieves high accuracy in detection of adversarial inputs can be trained successfully. Moreover, while we train a detector network to detect perturbations of a specific adversary, our experiments show that detectors generalize to similar and weaker adversaries. An obvious attack against our approach would be to develop adversaries that take into account both networks, the classification and the adversarial detection network. We present one such adversary and show that we can harden the detector against such an adversary using a novel training procedure.

Background

Since their discovery by Szegedy et al. (2014), several methods to generate adversarial examples have been proposed. Most of these methods generate adversarial examples by optimizing an image w.r.t. the linearized classification cost function of the classification network by maximizing the probability for all but the true class or minimizing the probability of the true class (e.g., (Goodfellow et al., 2015), (Kurakin et al., 2016)). The method introduced by Moosavi-Dezfooli et al. (2016b) estimates a linearization of decision boundaries between classes in image space and iteratively shifts an image towards the closest of these linearized boundaries. For more details about these methods, please refer to Section 3.1.

Several approaches exist to increase a model’s robustness against adversarial attacks. Goodfellow et al. (2015) propose to augment the training set with adversarial examples. At training time, they minimize the loss for real and adversarial examples, while adversarial examples are chosen to fool the current version of the model. In contrast, Zheng et al. (2016) propose to append a stability term to the objective function, which forces the model to have similar outputs for samples of the training set and their perturbed versions. This differs from data augmentation since it encourages smoothness of the model output between original and distorted samples instead of minimizing the original objective on the adversarial examples directly. Another defense-measure against certain adversarial attack methods is defensive distillation (Papernot et al., 2016b), a special form of network distillation, to train a network that becomes almost completely resistant against attacks such as the L-BFGS attack (Szegedy et al., 2014) and the fast gradient sign attack (Goodfellow et al., 2015). However, Carlini & Wagner (2016) recently introduced a novel method for constructing adversarial examples that manages to (very successfully) break many defense methods, including defensive distillation. In fact, the authors find that previous attacks were very fragile and could easily fail to find adversarial examples even when they existed. An experiment on the cross-model adversarial portability (Rozsa et al., 2016) has shown that models with higher accuracies tend to be more robust against adversarial examples, while examples that fool them are more portable to less accurate models.

Even though the existence of adversarial examples has been demonstrated several times on many different classification tasks, the question of why adversarial examples exist in the first place and whether they are sufficiently regular to be detectable, which is studied in this paper, has remained open. Szegedy et al. (2014) speculated that the data-manifold is filled with “pockets” of adversarial inputs that occur with very low probability and thus are almost never observed in the test set. Yet, these pockets are dense and so an adversarial example is found virtually near every test case. The authors further speculated that the high non-linearity of deep networks might be the cause for the existence of these low-probability pockets. Later, Goodfellow et al. (2015) introduced the linear explanation: Given an input and some adversarial noise $\eta$ (subject to: $||\eta||_{\infty}<\epsilon$ ), the dot product between a weight vector $w$ and an adversarial input $x^{\text{adv}}=x+\eta$ is given by $w^{\text{T}}x^{\text{adv}}=w^{\text{T}}x+w^{\text{T}}\eta$ . The adversarial noise $\eta$ causes a neuron’s activation to grow by $w^{\text{T}}\eta$ . The max-norm constraint on $\eta$ does not allow for large values in one dimension, but if $x$ and thus $\eta$ are high-dimensional, many small changes in each dimension of $\eta$ can accumulate to a large change in a neuron’s activation. The conclusion was that “linear behavior in high-dimensional spaces is sufficient to cause adversarial examples”.

Tanay & Griffin (2016) challenged the linear-explanation hypothesis by constructing classes of images that do not suffer from adversarial examples under a linear classifier. They also point out that if the change in activation $w^{\text{T}}\eta$ grows linearly with the dimensionality of the problem, so does the activation $w^{\text{T}}x$ . Instead of the linear explanation, Tanay et al. provide a different explanation for the existence of adversarial examples, including a strict condition for the non-existence of adversarial inputs, a novel measure for the strength of adversarial examples and a taxonomy of different classes of adversarial inputs. Their main argument is that if a learned class boundary lies close to the data manifold, but the boundary is (slightly) tilted with respect to the manifoldIt is easier to imagine a linear decision-boundary - for neural networks this argument must be translated into a non-linear equivalent of boundary tilting., then adversarial examples can be found by perturbing points from the data manifold towards the classification boundary until the perturbed input crosses the boundary. If the boundary is only slightly tilted, the distance required by the perturbation to cross the decision-boundary is very small, leading to strong adversarial examples that are visually almost imperceptibly close to the data. Tanay et. al further argue that such situations are particularly likely to occur along directions of low variance in the data and thus speculate that adversarial examples can be considered an effect of an over-fitting phenomenon that could be alleviated by proper regularization, though it is completely unclear how to regularize neural networks accordingly.

Recently, Moosavi-Dezfooli et al. (2016a) demonstrated that there even exist universal, image-agnostic perturbations which, when added to all data points, fool deep nets on a large fraction of ImageNet validation images. Moreover, they showed that these universal perturbations are to a certain extent also transferable between different network architectures. While this observation raises interesting questions about geometric properties and correlations of different parts of the decision boundary of deep nets, potential regularities in adversarial perturbations may also help detecting them. However, the existence of universal perturbations does not necessarily imply that the adversarial examples generated by data-dependent adversaries will be regular. Actually, Moosavi-Dezfooli et al. (2016a) show that universal perturbations are not unique and that there even exist many different universal perturbations which have little in common. This paper studies if data-dependent adversarial perturbations can nevertheless be detected reliably and answers this question affirmatively.

Methods

In this section, we introduce the adversarial attacks used in the experiments, propose an approach for detecting adversarial perturbations, introduce a novel adversary that aims at fooling both the classification network and the detector, and propose a training method for the detector that aims at counteracting this novel adversary.

Here, $\varepsilon$ is a hyper-parameter governing the distance between adversarial and original image. As suggested in Kurakin et al. (2016) we also refer to this as the fast method due to its non-iterative and hence fast computation.

As an extension, Kurakin et al. (2016) introduced an iterative version of the fast method, by applying it several times with a smaller step size $\alpha$ and clipping all pixels after each iteration to ensure results stay in the $\varepsilon$ -neighborhood of the original image:

2 Detecting Adversarial Examples

We augment classification networks by (relatively small) subnetworks, which branch off the main network at some layer and produce an output $p_{adv}\in$ which is interpreted as the probability of the input being adversarial. We call this subnetwork “adversary detection network” (or “detector” for short) and train it to classify network inputs into being regular examples or examples generated by a specific adversary. For this, we first train the classification networks on the regular (non-adversarial) dataset as usual and subsequently generate adversarial examples for each data point of the train set using one of the methods discussed in Section 3.1. We thus obtain a balanced, binary classification dataset of twice the size of the original dataset consisting of the original data (label zero) and the corresponding adversarial examples (label one). Thereupon, we freeze the weights of the classification network and train the detector such that it minimizes the cross-entropy of $p_{adv}$ and the labels. The details of the adversary detection subnetwork and how it is attached to the classification network are specific for datasets and classification networks. Thus, evaluation and discussion of various design choices of the detector network are provided in the respective section of the experimental results.

3 Dynamic Adversaries and Detectors

Note that we found a smaller $\alpha$ to be essential for this method to work; more specifically, we use $\alpha=0.25$ . Since such an adversary can adapt to the detector, we call it a dynamic adversary. To counteract dynamic adversaries, we propose dynamic adversary training, a method for hardening detectors against dynamic adversaries. Based on the approach proposed by Goodfellow et al. (2015), instead of precomputing a dataset of adversarial examples, we compute the adversarial examples on-the-fly for each mini-batch and let the adversary modify each data point with probability 0.5. Note that a dynamic adversary will modify a data point differently every time it encounters the data point since it depends on the detector’s gradient and the detector changes over time. We extend this approach to dynamic adversaries by employing a dynamic adversary, whose parameter $\sigma$ is selected uniform randomly from $ $, for generating the adversarial data points during training. By training the detector in this way, we implicitly train it to resist dynamic adversaries for various values of$ \sigma $. In principle, this approach bears the risk of oscillation and unlearning for$ \sigma>0$ since both, the detector and adversary, adapt to each other (i.e., there is no fixed data distribution). In practice, however, we found this approach to converge stably without requiring careful tuning of hyperparameters.

Experimental Results

In this section, we present results on the detectability of adversarial perturbations on the CIFAR10 dataset (Krizhevsky, 2009), both for static and dynamic adversaries. Moreover, we investigate whether adversarial perturbations are also detectable in higher-resolution images based on a subset of the ImageNet dataset (Russakovsky et al., 2015).

We use a 32-layer Residual Network (He et al., 2016, ResNet) as classifier. The structure of the network is shown in Figure 1. The network has been trained for 100 epochs with stochastic gradient descent and momentum on 45000 data points from the train set. The momentum term was set to $0.9$ and the initial learning rate was set to $0.1$ , reduced to $0.01$ after 41 epochs, and further reduced to $0.001$ after $61$ epochs. After each epoch, the network’s performance on the validation data (the remaining 5000 data points from the train set) was determined. The network with maximal performance on the validation data was used in the subsequent experiments (with all tunable weights being fixed). This network’s accuracy on non-adversarial test data is $91.3$ %. We attach an adversary detection subnetwork (called “detector” below) to the ResNet. The detector is a convolutional neural network using batch normalization (Ioffe & Szegedy, 2015) and rectified linear units. In the experiments, we investigate different positions where the detector can be attached (see also Figure 1).

Figure 2 (right) compares the detectability of different adversaries for detectors attached at different points to the classification network. $\varepsilon$ was chosen minimal under the constraint that the classification accuracy is below 30%. For the “Fast” and “Iterative” adversaries, the attachment position AD(2) works best, i.e., attaching to a middle layer where more abstract features are already extracted but still the full spatial resolution is maintained. For the DeepFool methods, the general pattern is similar except for AD(4), which works best for these adversaries.

1.2 Dynamic Adversaries

In this section, we evaluate the robustness of detector networks to dynamic adversaries (see Section 3.3). For this, we evaluate the detectability of dynamic adversaries for $\sigma\in\{0.0,0.1,\dots,1.0\}$ . We use the same optimizer and detector network as in Section 4.1.1. When evaluating the detectability of dynamic adversaries with $\sigma$ close to $1$ , we need to take into account that the adversary might choose to solely focus on fooling the detector, which is trivially achieved by leaving the input unmodified. Thus, we ignore adversarial examples that do not cause a misclassification in the evaluation of the detector and evaluate the detector’s accuracy on regular data versus the successful adversarial examples. Figure 5 shows the results of a dynamic adversary with $\varepsilon=1$ against a static detector, which was trained to only detect static adversaries, and a dynamic detector, which was explicitly trained to resist dynamic adversaries. As can be seen, the static detector is not robust to dynamic adversaries since for certain values of $\sigma$ , namely $\sigma=0.3$ and $\sigma=0.4$ , the detectability is close to chance level while the predictive performance of the classifier is severely reduced to less than 30% accuracy. A dynamic detector is considerably more robust and achieves a detectability of more than 70% for any choice of $\sigma$ .

2 10-class ImageNet

Discussion

Why can tiny adversarial perturbations be detected that well? Adopting the boundary tilting perspective of Tanay & Griffin (2016), strong adversarial examples occur in situations in which classification boundaries are tilted against the data manifold such that they lie close and nearly parallel to the data manifold. A detector could (potentially) identify adversarial examples by detecting inputs which are slightly off the data manifold’s center in the direction of a nearby class boundary. Thus, the detector can focus on detecting inputs which move away from the data manifold in a certain direction, namely one of the directions to a nearby class boundary (the detector does not have explicit knowledge of class boundaries but it might learn about their direction implicitly from the adversarial training data). However, training a detector which captures these directions in a model with small capacity and generalizes to unseen data requires certain regularities in adversarial perturbations. The results of Moosavi-Dezfooli et al. (2016a) suggest that there may exist regularities in the adversarial perturbations since universal perturbations exist. However, these perturbations are not unique and data-dependent adversaries might potentially choose among many different possible perturbations in a non-regular way, which would be hard to detect. Our positive results on detectability suggest that this is not the case for the tested adversaries. Thus, our results are somewhat complementary to Moosavi-Dezfooli et al. (2016a): while they show that universal, image-agnostic perturbations exist, we show that image-dependent perturbations are sufficiently regular to be detectable. Whether a detector generalizes over different adversaries depends mainly on whether the adversaries choose among many different possible perturbations in a consistent way.

Why is the joint classifier/detector system harder to fool? For a static detector, there might be areas which are adversarial to both classifier and detector; however, this will be a (small) subset of the areas which are adversarial to the classifier alone. Nevertheless, results in Section 4.1.2 show that such a static detector can be fooled along with the classifier. However, a dynamic detector is considerably harder to fool: on the one hand, it might further reduce the number of areas which are both adversarial to classifier and detector. On the other hand, the areas which are adversarial to the detector might become increasingly non-regular and difficult to find by gradient descent-based adversaries.

Conclusion and Outlook

In this paper, we have shown empirically that adversarial examples can be detected surprisingly well using a detector subnetwork attached to the main classification network. While this does not directly allow classifying adversarial examples correctly, it allows mitigating adversarial attacks against machine learning systems by resorting to fallback solutions, e.g., a face recognition might request human intervention when verifying a person’s identity and detecting a potential adversarial attack. Moreover, being able to detect adversarial perturbations may in the future enable a better understanding of adversarial examples by applying network introspection to the detector network. Furthermore, the gradient propagated back through the detector may be used as a source of regularization of the classifier against adversarial examples. We leave this to future work. Additional future work will be developing stronger adversaries that are harder to detect by adding effective randomization which would make selection of adversarial perturbations less regular. Finally, developing methods for training detectors explicitly such that they can detect many different kinds of attacks reliably at the same time would be essential for safety- and security-related applications.

We would like to thank Michael Herman and Michael Pfeiffer for helpful discussions and their feedback on drafts of this article. Moreover, we would like to thank the developers of Theano (The Theano Development Team, 2016), keras (https://keras.io), and seaborn (http://seaborn.pydata.org/).