DeepFool: a simple and accurate method to fool deep neural networks

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Pascal Frossard

Introduction

Deep neural networks are powerful learning models that achieve state-of-the-art pattern recognition performance in many research areas such as bioinformatics , speech , and computer vision . Though deep networks have exhibited very good performance in classification tasks, they have recently been shown to be particularly unstable to adversarial perturbations of the data . In fact, very small and often imperceptible perturbations of the data samples are sufficient to fool state-of-the-art classifiers and result in incorrect classification. (e.g., Figure 1). Formally, for a given classifier, we define an adversarial perturbation as the minimal perturbation $\bm{r}$ that is sufficient to change the estimated label $\hat{k}(\bm{x})$ :

where $\bm{x}$ is an image and $\hat{k}(\bm{x})$ is the estimated label. We call $\Delta(\bm{x};\hat{k})$ the robustness of $\hat{k}$ at point $\bm{x}$ . The robustness of classifier $\hat{k}$ is then defined as

An accurate method for finding the adversarial perturbations is thus necessary to study and compare the robustness of different classifiers to adversarial perturbations. It might be the key to a better understanding of the limits of current architectures and to design methods to increase robustness. Despite the importance of the vulnerability of state-of-the-art classifiers to adversarial instability, no well-founded method has been proposed to compute adversarial perturbations and we fill this gap in this paper.

Our main contributions are the following:

We propose a simple yet accurate method for computing and comparing the robustness of different classifiers to adversarial perturbations.

We perform an extensive experimental comparison, and show that 1) our method computes adversarial perturbations more reliably and efficiently than existing methods 2) augmenting training data with adversarial examples significantly increases the robustness to adversarial perturbations.

We show that using imprecise approaches for the computation of adversarial perturbations could lead to different and sometimes misleading conclusions about the robustness. Hence, our method provides a better understanding of this intriguing phenomenon and of its influence factors.

We now review some of the relevant work. The phenomenon of adversarial instability was first introduced and studied in . The authors estimated adversarial examples by solving penalized optimization problems and presented an analysis showing that the high complexity of neural networks might be a reason explaining the presence of adversarial examples. Unfortunately, the optimization method employed in is time-consuming and therefore does not scale to large datasets. In , the authors showed that convolutional networks are not invariant to some sort of transformations based on the experiments done on Pascal3D+ annotations. Recently, Tsai et al. provided a software to misclassify a given image in a specified class, without necessarily finding the smallest perturbation. Nguyen et al. generated synthetic unrecognizable images, which are classified with high confidence. The authors of also studied a related problem of finding the minimal geometric transformation that fools image classifiers, and provided quantitative measure of the robustness of classifiers to geometric transformations. Closer to our work, the authors of introduced the “fast gradient sign” method, which computes the adversarial perturbations for a given classifier very efficiently. Despite its efficiency, this method provides only a coarse approximation of the optimal perturbation vectors. In fact, it performs a unique gradient step, which often leads to sub-optimal solutions. Then in an attempt to build more robust classifiers to adversarial perturbations, introduced a smoothness penalty in the training procedure that allows to boost the robustness of the classifier. Notably, the method in was applied in order to generate adversarial perturbations. We should finally mention that the phenomenon of adversarial instability also led to theoretical work in that studied the problem of adversarial perturbations on some families of classifiers, and provided upper bounds on the robustness of these classifiers. A deeper understanding of the phenomenon of adversarial instability for more complex classifiers is however needed; the method proposed in this work can be seen as a baseline to efficiently and accurately generate adversarial perturbations in order to better understand this phenomenon.

The rest of paper is organized as follows. In Section 2, we introduce an efficient algorithm to find adversarial perturbations in a binary classifier. The extension to the multiclass problem is provided in Section 3. In Section 4, we propose extensive experiments that confirm the accuracy of our method and outline its benefits in building more robust classifiers.

DeepFool for binary classifiers

In the case where the classifier $f$ is affine, it can easily be seen that the robustness of $f$ at point $\bm{x}_{0}$ , $\Delta(\bm{x}_{0};f)$ From now on, we refer to a classifier either by $f$ or its corresponding discrete mapping $\hat{k}$ . Therefore, $\rho_{\text{adv}}(\hat{k})=\rho_{\text{adv}}(f)$ and $\Delta(\bm{x};\hat{k})=\Delta(\bm{x};f)$ ., is equal to the distance from $\bm{x}_{0}$ to the separating affine hyperplane $\mathscr{F}=\{\bm{x}:\bm{w}^{T}\bm{x}+b=0\}$ (Figure 2). The minimal perturbation to change the classifier’s decision corresponds to the orthogonal projection of $\bm{x}_{0}$ onto $\mathscr{F}$ . It is given by the closed-form formula:

Assuming now that $f$ is a general binary differentiable classifier, we adopt an iterative procedure to estimate the robustness $\Delta(\bm{x}_{0};f)$ . Specifically, at each iteration, $f$ is linearized around the current point $\bm{x}_{i}$ and the minimal perturbation of the linearized classifier is computed as

The perturbation $\bm{r}_{i}$ at iteration $i$ of the algorithm is computed using the closed form solution in Eq. (3), and the next iterate $\bm{x}_{i+1}$ is updated. The algorithm stops when $\bm{x}_{i+1}$ changes sign of the classifier. The DeepFool algorithm for binary classifiers is summarized in Algorithm 1 and a geometric illustration of the method is shown in Figure 3.

In practice, the above algorithm can often converge to a point on the zero level set $\mathscr{F}$ . In order to reach the other side of the classification boundary, the final perturbation vector $\hat{\bm{r}}$ is multiplied by a constant $1+\eta$ , with $\eta\ll 1$ . In our experiments, we have used $\eta=0.02$ .

DeepFool for multiclass classifiers

where $f_{k}(\bm{x})$ is the output of $f(\bm{x})$ that corresponds to the $k^{\text{th}}$ class. Similarly to the binary case, we first present the proposed approach for the linear case and then we generalize it to other classifiers.

Let $f(\bm{x})$ be an affine classifier, i.e., $f(\bm{x})=\mathbf{W}^{\top}\bm{x}+\bm{b}$ for a given $\mathbf{W}$ and $\bm{b}$ . Since the mapping $\hat{k}$ is the outcome of a one-vs-all classification scheme, the minimal perturbation to fool the classifier can be rewritten as follows

where $\bm{w}_{k}$ is the $k^{\text{th}}$ column of $\mathbf{W}$ . Geometrically, the above problem corresponds to the computation of the distance between $\bm{x}_{0}$ and the complement of the convex polyhedron $P$ ,

where $\bm{x}_{0}$ is located inside $P$ . We denote this distance by $\text{{dist}}(\bm{x}_{0},P^{c})$ . The polyhedron $P$ defines the region of the space where $f$ outputs the label $\hat{k}(\bm{x}_{0})$ . This setting is depicted in Figure 4. The solution to the problem in Eq. (6) can be computed in closed form as follows. Define $\hat{l}(\bm{x}_{0})$ to be the closest hyperplane of the boundary of $P$ (e.g. $\hat{l}(\bm{x}_{0})=3$ in Figure 4). Formally, $\hat{l}(\bm{x}_{0})$ can be computed as follows

The minimum perturbation $\bm{r}_{*}(\bm{x}_{0})$ is the vector that projects $\bm{x}_{0}$ on the hyperplane indexed by $\hat{l}(\bm{x}_{0})$ , i.e.,

In other words, we find the closest projection of $\bm{x}_{0}$ on faces of $P$ .

2 General classifier

It should be noted that the optimization strategy of DeepFool is strongly tied to existing optimization techniques. In the binary case, it can be seen as Newton’s iterative algorithm for finding roots of a nonlinear system of equations in the underdetermined case . This algorithm is known as the normal flow method. The convergence analysis of this optimization technique can be found for example in . Our algorithm in the binary case can alternatively be seen as a gradient descent algorithm with an adaptive step size that is automatically chosen at each iteration. The linearization in Algorithm 2 is also similar to a sequential convex programming where the constraints are linearized at each step.

Experimental results

We now test our DeepFool algorithm on deep convolutional neural networks architectures applied to MNIST, CIFAR-10, and ImageNet image classification datasets. We consider the following deep neural network architectures:

MNIST: A two-layer fully connected network, and a two-layer LeNet convoluational neural network architecture . Both networks are trained with SGD with momentum using the MatConvNet package.

CIFAR-10: We trained a three-layer LeNet architecture, as well as a Network In Network (NIN) architecture .

ILSVRC 2012: We used CaffeNet and GoogLeNet pre-trained models.

In order to evaluate the robustness to adversarial perturbations of a classifier $f$ , we compute the average robustness $\hat{\rho}_{\text{adv}}(f)$ , defined by

where $\hat{\bm{r}}(\bm{x})$ is the estimated minimal perturbation obtained using DeepFool, and $\mathscr{D}$ denotes the test setFor ILSVRC2012, we used the validation data..

We compare the proposed DeepFool approach to state-of-the-art techniques to compute adversarial perturbations in and . The method in solves a series of penalized optimization problems to find the minimal perturbation, whereas estimates the minimal perturbation by taking the sign of the gradient

with $J$ the cost used to train the neural network, $\bm{\theta}$ is the model parameters, and $y$ is the label of $\bm{x}$ . The method is called fast gradient sign method. In practice, in the absence of general rules to choose the parameter $\epsilon$ , we chose the smallest $\epsilon$ such that $90\%$ of the data are misclassified after perturbation.Using this method, we observed empirically that one cannot reach $100\%$ misclassification rate on some datasets. In fact, even by increasing $\epsilon$ to be very large, this method can fail in misclassifying all samples.

2 Results

We report in Table 1 the accuracy and average robustness $\hat{\rho}_{\text{adv}}$ of each classifier computed using different methods. We also show the running time required for each method to compute one adversarial sample.

It can be seen that DeepFool estimates smaller perturbations (hence closer to minimal perturbation defined in (1)) than the ones computed using the competitive approaches. For example, the average perturbation obtained using DeepFool is $5$ times lower than the one estimated with . On the ILSVRC2012 challenge dataset, the average perturbation is one order of magnitude smaller compared to the fast gradient method. It should be noted moreover that the proposed approach also yields slightly smaller perturbation vectors than the method in . The proposed approach is hence more accurate in detecting directions that can potentially fool neural networks. As a result, DeepFool can be used as a valuable tool to accurately assess the robustness of classifiers. On the complexity aspect, the proposed approach is substantially faster than the standard method proposed in . In fact, while the approach involves a costly minimization of a series of objective functions, we observed empirically that DeepFool converges in a few iterations (i.e., less than $3$ ) to a perturbation vector that fools the classifier. Hence, the proposed approach reaches a more accurate perturbation vector compared to state-of-the-art methods, while being computationally efficient. This makes it readily suitable to be used as a baseline method to estimate the robustness of very deep neural networks on large-scale datasets. In that context, we provide the first quantitative evaluation of the robustness of state-of-the-art classifiers on the large-scale ImageNet dataset. It can be seen that despite their very good test accuracy, these methods are extremely unstable to adversarial perturbations: a perturbation that is $1000$ smaller in magnitude than the original image is sufficient to fool state-of-the-art deep neural networks.

We illustrate in Figure 1 perturbed images generated by the fast gradient sign and DeepFool. It can be observed that the proposed method generates adversarial perturbations which are hardly perceptible, while the fast gradient sign method outputs a perturbation image with higher norm.

In this section, we fine-tune the networks of Table 1 on adversarial examples to build more robust classifiers for the MNIST and CIFAR-10 tasks. Specifically, for each network, we performed two experiments: (i) Fine-tuning the network on DeepFool’s adversarial examples, (ii) Fine-tuning the network on the fast gradient sign adversarial examples. We fine-tune the networks by performing 5 additional epochs, with a $50\%$ decreased learning rate only on the perturbed training set. For each experiment, the same training data was used through all $5$ extra epochs. For the sake of completeness, we also performed $5$ extra epochs on the original data. The evolution of $\hat{\rho}_{\text{adv}}$ for the different fine-tuning strategies is shown in Figures 6(a) to 6(d), where the robustness $\hat{\rho}_{\text{adv}}$ is estimated using DeepFool, since this is the most accurate method, as shown in Table 1. Observe that fine-tuning with DeepFool adversarial examples significantly increases the robustness of the networks to adversarial perturbations even after one extra epoch. For example, the robustness of the networks on MNIST is improved by 50% and NIN’s robustness is increased by about 40%. On the other hand, quite surprisingly, the method in can lead to a decreased robustness to adversarial perturbations of the network. We hypothesize that this behavior is due to the fact that perturbations estimated using the fast gradient sign method are much larger than minimal adversarial perturbations. Fine-tuning the network with overly perturbed images decreases the robustness of the networks to adversarial perturbations. To verify this hypothesis, we compare in Figure 7 the adversarial robustness of a network that is fine-tuned with the adversarial examples obtained using DeepFool, where norms of perturbations have been deliberately multiplied by $\alpha=1,2,3$ . Interestingly, we see that by magnifying the norms of the adversarial perturbations, the robustness of the fine-tuned network is decreased. This might explain why overly perturbed images decrease the robustness of MNIST networks: these perturbations can really change the class of the digits, hence fine-tuning based on these examples can lead to a drop of the robustness (for an illustration, see Figure 8). This lends credence to our hypothesis, and further shows the importance of designing accurate methods to compute minimal perturbations.

Table 3 lists the accuracies of the fine-tuned networks. It can be seen that fine-tuning with DeepFool can improve the accuracy of the networks. Conversely, fine-tuning with the approach in has led to a decrease of the test accuracy in all our experiments. This confirms the explanation that the fast gradient sign method outputs overly perturbed images that lead to images that are unlikely to occur in the test data. Hence, it decreases the performance of the method as it acts as a regularizer that does not represent the distribution of the original data. This effect is analogous to geometric data augmentation schemes, where large transformations of the original samples have a counter-productive effect on generalization.While the authors of reported an increased generalization performance on the MNIST task (from $0.94\%$ to $0.84\%$ ) using adversarial regularization, it should be noted that the their experimental setup is significantly different as trained the network based on a modified cost function, while we performed straightforward fine-tuning.

To emphasize the importance of a correct estimation of the minimal perturbation, we now show that using approximate methods can lead to wrong conclusions regarding the adversarial robustness of networks. We fine-tune the NIN classifier on the fast gradient sign adversarial examples. We follow the procedure described earlier but this time, we decreased the learning rate by 90%. We have evaluated the adversarial robustness of this network at different extra epochs using DeepFool and the fast gradient sign method. As one can see in Figure 9, the red plot exaggerates the effect of training on the adversarial examples. Moreover, it is not sensitive enough to demonstrate the loss of robustness at the first extra epoch. These observations confirm that using an accurate tool to measure the robustness of classifiers is crucial to derive conclusions about the robustness of networks.

Conclusion

In this work, we proposed an algorithm, DeepFool, to compute adversarial examples that fool state-of-the-art classifiers. It is based on an iterative linearization of the classifier to generate minimal perturbations that are sufficient to change classification labels. We provided extensive experimental evidence on three datasets and eight classifiers, showing the superiority of the proposed method over state-of-the-art methods to compute adversarial perturbations, as well as the efficiency of the proposed approach. Due to its accurate estimation of the adversarial perturbations, the proposed DeepFool algorithm provides an efficient and accurate way to evaluate the robustness of classifiers and to enhance their performance by proper fine-tuning. The proposed approach can therefore be used as a reliable tool to accurately estimate the minimal perturbation vectors, and build more robust classifiers.

This work has been partly supported by the Hasler Foundation, Switzerland, in the framework of the CORA project.