Learning with a Strong Adversary

Ruitong Huang, Bing Xu, Dale Schuurmans, Csaba Szepesvari

Introduction

Deep Neural Network (DNN) models have recently demonstrated impressive learning results in many visual and speech classification problems (Krizhevsky et al., 2012; Hinton et al., 2012a). One reason for this success is believed to be the expressive capacity of deep network architectures. Even though classifiers are typically evaluated by their misclassification rate, robustness is also a highly desirable property: intuitively, it is desirable for a classifier to be ‘smooth’ in the sense that a small perturbation of its input should not change its predictions significantly. An intriguing recent discovery is that DNN models do not typically possess such a robustness property (Szegedy et al., 2013). An otherwise highly accurate DNN model can be fooled into misclassifying typical data points by introducing a human-indistinguishable perturbation of the original inputs. We call such a perturbed data set ‘adversarial examples’. An even more curious fact is that the same set of such adversarial examples is also misclassified by a diverse set of classification models, such as KNN, Boosting Tree, even if they are trained with different architectures and different hyperparameters.

Since the appearance of Szegedy et al. (2013), increasing attention has been paid to the curious phenomenon of ‘adversarial perturbation’ in the deep learning community; see, for example (Goodfellow et al., 2014; Fawzi et al., 2015; Miyato et al., 2015; Nøkland, 2015; Tabacof & Valle, 2015). Goodfellow et al. (2014) suggest that one reason for the detrimental effect of adversarial examples lies in the implicit linearity of the classification models in high dimensional spaces. Additional exploration by Tabacof & Valle (2015) has demonstrated that, for image classification problems, adversarial images inhabit large ”adversarial pockets” in the pixel space. Based on these observations, different ways of finding adversarial examples have been proposed, among which the most relevant to our study is that of (Goodfellow et al., 2014), where a linear approximation is used to obviate the need for any auxiliary optimization problem to be solved. In this paper, we further investigate the role of adversarial training on classifier robustness and propose a simple new approach to finding ’stronger’ adversarial examples. Experimental results suggest that the proposed method is more effective than previous approaches in the sense that the resulting DNN classifiers obtain worse performance under the same magnitude of perturbation.

The main achievement of this paper is a training method that is able to produce robust classifiers with high classification accuracy in the face of stronger data perturbation. The approach we propose differs from previous approaches in a few ways. First, Goodfellow et al. (2014) suggests using an augmented objective that combines the original training objective with an additional objective that is measured after after the training inputs have been perturbed. Alternatively, (Nøkland, 2015) suggest, as a specialization of the method in (Goodfellow et al., 2014), to only use the objective defined on the perturbed data. However, there is no theoretical analysis to justify that classifiers learned in this way are indeed robust; both methods are proposed heuristically. In our proposed approach, we formulate the learning procedure as a min-max problem that forces the learned DNN model to be robust against adversarial examples, so that the learned classifier is inherently robust. In particular, we allow an adversary to apply perturbations to each data point in an attempt to maximize classification error, while the learning procedure attempts to minimize misclassification error against the adversary. We call this learning procedure ‘learning with a strong adversary’. Such min-max formulation has been discussed specifically in (Goodfellow et al., 2014), but emphasis is on its regularization effect particularly for logistic regression. Our setting is more general and applicable to different loss functions and different types of perturbations, and is the origin of our learning procedure. Analysis of the regularization effect of such min-max formulation is postponed in the appendix, since it is not closly related to the main content of the paper. It turns out that an efficient method for finding such adversarial examples is required as an intermediate step to solve such a min-max problem, which is the first problem we address. Then we develop the full min-max training procedure that incorporates robustness to adversarially perturbed training data. The learning procedure that results turns out to have some similarities to the one proposed in (Nøkland, 2015). Another min-max formulation is proposed in Miyato et al. (2015) but still with the interpretation of regularization. These approaches are based on significantly different understandings of this problem. Recently, a theoretical exploration of the robustness of classifiers (Fawzi et al., 2015) suggests that, as expected, there is a trade-off between expressive power and robustness. This paper can be considered as an exploration into this same trade-off from an engineering perspective.

The remainder of the paper is organized as follows. First, we propose a new method for finding adversarial examples in Section 2. Section 3 is then devoted to developing the main method: a new procedure for learning with a stronger form of adversary. Finally, we provide an experimental evaluation of the proposed method on MNIST and CIFAR-10 in Section 4.

Finding adversarial examples

Consider an example $(x,y)\in\mathcal{X}\times\{1,2,\ldots,K\}$ and assume that $\mathcal{N}(x)=y$ , where $y$ is the true label for $x$ . Our goal is to find a small perturbation $r\in\mathcal{X}$ so that $\mathcal{N}(X+r)\neq y$ . This problem was originally investigated by Szegedy et al. (2013), who propose the following perturbation procedure: given $x$ , solve

The simple method we propose to find such a perturbation $r$ is based on the linear approximation of $g(x)$ , $\hat{g}(x+r)=g(x)+Hr$ , where $H=\frac{\partial g}{\partial w}|_{x}$ is the Jacobian matrix.

As an alternative, we consider the following question: for a fixed index $j\neq y$ , what is the minimal $r_{(j)}$ satisfying $\mathcal{N}(x+r_{(j)})=j$ ? Replacing $g$ by its linear approximation $\hat{g}$ , one of the necessary conditions for such a perturbation $r$ is:

where $H_{j}$ is the $j$ -th row of $H$ . Therefore, the norm of the optimal $r_{(j)}^{*}$ is greater than the following objective value:

The optimal solution to this problem is provided in Proposition 1.

It is straightforward that the optimal objective value is $\|r_{(j)}\|=\frac{\alpha_{y}-\alpha_{j}}{\|H_{j}-H_{y}\|_{*}}$ . In particular, the optimal $r_{(j)}^{*}$ for common norms are:

If $\|\cdot\|$ is the $L_{2}$ norm, then $r_{(j)}^{*}=\frac{\alpha_{y}-\alpha_{j}}{\|H_{j}-H_{y}\|_{2}^{2}}(H_{j}-H_{y})$ ;

If $\|\cdot\|$ is the $L_{\infty}$ norm, then $r_{(j)}^{*}=\frac{\alpha_{y}-\alpha_{j}}{\|H_{j}-H_{y}\|_{1}}\text{sign}(H_{j}-H_{y})$ ;

If $\|\cdot\|$ is the $L_{1}$ norm, then $r_{(j)}^{*}=\frac{c}{\|H_{j}-H_{y}\|_{\infty}}e_{k}$ where $k$ satisfies $|(H_{j}-H_{y})_{k}|=\|H_{j}-H_{y}\|_{\infty}$ . Here $V_{k}$ is the $k$ -th element of $V$ .

However, such $r_{(j)}^{*}$ is necessary but not sufficient to guarantee that $\arg\!\max_{i}\hat{g}(x+r_{(j)})_{i}=j$ . The following proposition shows that in order to have $\hat{g}$ make a wrong prediction, it is enough to use the minimum among all $r_{(j)}^{*}$ ’s.

Let $I=\arg\!\min_{i}\|r_{(i)}^{*}\|$ . Then $r_{I}^{*}$ is the solution of the following problem:

Putting these observations together, we achieve an algorithm for finding adversarial examples, as shown in Algorithm 1.

Toward Robust Neural Networks

We enhance the robustness of a neural network model by preparing the network for the worst examples by training with the following objective:

To solve the problem (2) using SGD, one needs to compute the derivative of $L_{i}$ with respect to (the parameters that define) $g$ . The following preliminary proposition suggests a way of computing this derivative.

Given $h:\mathcal{U}\times\mathcal{V}\rightarrow\mathcal{W}$ differentiable almost everywhere, define $L(v)=\max_{u\in\mathcal{U}}h(u,v)$ . Assume that $L$ is uniformly Lipschitz-continuous as a function of $v$ , then the following results holds almost everywhere:

Note that $L$ is uniformly Lipschitz-continuous, therefore by Rademacher’s theorem, $L$ is differentiable almost everywhere. For $v_{0}$ where $L$ is differentiable, the Fréchet subderivative of $L$ is actually a singleton set of its derivative.

Consider the function $\hat{L}(v)=h(u^{*},v)$ . Since $h$ is differentiable, $\frac{\partial h}{\partial v}(u^{*},v_{0})$ is the derivative of $\hat{L}$ at point $v_{0}$ . Also $\hat{L}(v_{0})=L(v_{0})$ . Thus, by Proposition 2 of (Neu & Szepesvári, 2012), $\frac{\partial h}{\partial v}(u^{*},v_{0})$ also belongs to the subderivative of $L$ . Therefore,

The differentiability of $h$ in Proposition 3 usually holds. The uniformly Lipschitz-continuous of neural networks was also discussed in the paper of Szegedy et al. (2013). It still remains to compute $u^{*}$ in Proposition 3. In particular given $(x_{i},y_{i})$ , we need to solve

We postpone the solution for the above problem to the end of this section. Given that we can have an approximate solution for Equation (3), a simple SGD method to compute a local solution for Equation (2) is then shown in Algorithm 2.

For complex prediction problems, deeper neural networks are usually proposed, which can be interpreted as consisting of two parts: the lower layers of the network can be interpreted as learning a representation for the input data, while the upper layers can be interpreted as learning a classification model on top of the learned representation. The number of layers that should be interpreted as providing representations versus classifications is not precise and varies between datasets; we treat this as a hyperparameter in our method. Given such an interpretation, denote the representation layers of the network as $\mathcal{N}_{\rm rep}$ and the classification layers of the network as $\mathcal{N}_{\rm cla}$ . We propose to perform the perturbation over the output of $\mathcal{N}_{\rm rep}$ rather than the raw data. Thus the problem of learning with an adversary can be formulated as follows:

Similarly, Equation (4) can be solved by the following SGD method given in Algorithm 3.

We propose two different perturbation methods based on two different principles. The first proposed method, similar to that of (Goodfellow et al., 2014), does not require the solution of an optimization problem. Experimental results show that this method, compared to the method proposed in (Goodfellow et al., 2014), is more effective in the sense that, under the same magnitude of perturbation, the accuracy reduction in the network is greater.

where $x_{i}$ could be the raw data or the output of $\mathcal{N}_{\rm rep}$ . Since $h$ is decreasing, $r^{*}=\arg\min_{\|r^{(i)}\|\leq c}g(x_{i}+r^{(i)})_{y_{i}}$ .

The optimal solutions for $r^{*}$ given common norms are:

If $\|\cdot\|$ is the $L_{2}$ norm, then $r_{(j)}^{*}=c\,\frac{H_{y_{i}}}{\|H_{y_{i}}\|_{2}}$ ;

If $\|\cdot\|$ is the $L_{\infty}$ norm, then $r_{(j)}^{*}=c\,\text{sign}(H_{y_{i}})$ ;

If $\|\cdot\|$ is the $L_{1}$ norm, then $r_{(j)}^{*}=c\,e_{k}$ , where $k$ satisfies $|H_{y_{i}}|=\|H_{y_{i}}\|_{\infty}$ .

1.2 Misclassification Based Loss

where $r_{I}^{*}$ is the output of Algorithm 1.

Experimental Evaluation

To investigate the training method proposed above, in conjunction with the different approaches for determining perturbations, we consider the MNIST (LeCun et al., 1998b) and CIFAR-10 data sets. The MNIST data set contains 28x28 grey scale images of handwritten digits. We normalize the pixel values into the range by dividing by 256. The CIFAR-10 (Krizhevsky & Hinton, 2009) dataset is a tiny nature image dataset. CIFAR-10 datasets contains 10 different classes images, each image is an RGB image in size of 32x32. Input images are subtracted by mean value 117, and randomly cropped to size 28x28. We also normalize the pixel value into the range (-1, 1) by dividing 256. This normalization is for evaluating perturbation magnitude by $L_{2}$ norm. For both datasets, we randomly choose 50,000 images for training and 10,000 for testing.

All experiment models are trained by using MXNet (Chen et al., 2015)Reproduce code: https://github.com/Armstring/LearningwithaStrongAdversary.

The networks classification accuracy decreases with increasing magnitude of the perturbation. These results suggest that Adv_Alpha is consistently, but slightly, more effective than Adv_Loss, and these two method are significantly more effective than Adv_Loss_Sign.

2 Learning with an adversary

We sumarize the results in Table 2. Note that the normal method can not afford any perturbation on the validation set, showing that it is highly non-robust. By training with dropout, both the accuracy and robustness of the neural network are improved, but robustness remains weak; for the adversarial set generated by Adv_Alpha in particular, the resulting classification accuracy is only $19.3\%$ . Goodfellow’s method improves the network’s robustness greatly, compared to the previous methods. However, the best accuracy and the most robustness are both achieved by LWA. In particular, on the adversarial sets generated by our methods (Adv_Loss and Adv_Alpha), the performance is improved from $84.4\%$ to $86.7\%$ , and from $83.6\%$ to $86.2\%$ . The result of LWA_Rep is also reported for comparison. Overall, it achieves worse performance than Goodfellow’s method (Goodfellow et al., 2014) and LWA, but still much more robust than Dropout.

We also evaluated these learning methods on the LeNet model (LeCun et al., 1998a), which is more complex, including convolution layers. We use Dropout for Goodfellow’s method and LWA. The resulting learning curves are reported in Figure 2. It is interesting that we do not observe the trade-off between robustness and accuracy once again; this phenomenon also occurred with the 2-hidden-layers neural network.

The final result is summarized in Table 3, which shows the great robustness of LWA. We don’t observe the superiority of perturbing the representation layer to perturbing the raw data. In the learned networks, a small perturbation on the raw data incurs a much larger perturbation on the representation layer, which makes LWA_Rep difficult to achieve both high accuracy and robustness. How to avoid such perturbation explosion in the representation network remains open for future investigation.

Experiments on CIFAR-10: CIFAR-10 is a more difficult task compared to MNIST. Inspired by VGG-D Network (Simonyan & Zisserman, 2014), we use a network formed by 6 convolution layers with 3 fully connected layers. Same to VGG-Network, we use ReLU as the activation function. For each convolution stage, we use three 3x3 convolution layers followed by a max-pooling layer with stride of 2. We use 128, 256 filters for each convolution stage correspondingly. For three fully connected layer, we use 2048, 2048, 10 hidden units. We also split this network in representation learner and classifier view: the last two fully connected layers with hidden units 2048 and 10 are classifier and other layers below formed a representation learner. We compare the following methods: 1. Normal training (Normal); 2. Normal training with Dropout (Dropout); 3. Goodfellow’s method with Dropout (Goodfellow’s method); 3. Learning with a strong adversary on raw data with Dropout (LWA); 4. Learning with a strong adversary at the representation layer with Dropout (LWA_Rep). We use batch normalization (BN) (Ioffe & Szegedy, 2015) to stabilize the learning of LWA_Rep. We also test the performances of different methods with batch normalization (BN). The results are summarized in Table 4. Learning with a strong adversary again achieves better robustness, but we also observe a small decrease on their classification accuracies.

Conclusion

We investigate the curious phenomenon of ’adversarial perturbation’ in a formal min-max problem setting. A generic algorithm is developed based on the proposed min-max formulation, which is more general and allows to replace previous heuristic algorithms with formally derived ones. We also propose a more efficient way in finding adversarial examples for a given network. The experimental results suggests that learning with a strong adversary is promising in the sense that compared to the benchmarks in the literature, it achieves significantly better robustness while maintain high normal accuracy.

We thank Ian Goodfellow, Naiyan Wang, Yifan Wu for meaningful discussions. Also we thank Mu Li for granting us access of his GPU machine to run experiments. This work was supported by the Alberta Innovates Technology Futures and NSERC.

References

Appendix A Analysis of the Regularization Effect of Learning with an Adversary on Logistic Regression

The induced regularization $R_{z}(w)$ is non-convex.

An counter example is enough to prove that $R_{z}(w)$ is non-convex. Let $c=0.5$ and $z=(x,y)=(1,1)$ , then $R_{z}(w):\real\rightarrow\real$ is

The figure of this function clearly shows its non-convexity. ∎

One of the common problem about logistic regression without regularization is that there is no valid solution given a linearly separable data set $\underline{Z}$ . The following proposition shows that this issue is relieved by learning with an adversary.

Assume that the linearly separable data set $\underline{Z}$ has a margin less than $2c$ , then logistic regression with adversary is guaranteed to have bounded solutions. Moreover, if $\|\cdot\|_{*}$ is twice differentiable almost everywhere and $\sum_{i}\frac{\partial Q_{i}}{\partial w}\frac{\partial Q_{i}}{\partial w}^{\top}$ is positive definite where $Q_{i}=-y_{i}x_{i}^{\top}w+\|w\|_{*}$ , then logistic regression with adversary has unique solution.

Since the margin of $\underline{Z}$ is less than $2c$ , after the data is perturbed, it is no longer linear separable. Therefore for any $w$ , there exists some $i$ such that $-y_{i}w^{\top}x_{i}+c\|w\|_{*}$ . Note that $-y_{i}w^{\top}x_{i}+c\|w\|_{*}$ is homogeneous in $w$ . Thus, if the optimal solution $w^{*}$ has a infinity norm, then

However, assigning $w=0$ have finite loss. Therefore, $w^{*}$ is bounded.

Moreover, since $\|\cdot\|_{*}$ is twice differentiable, consider the Hession matrix

Unlike adding a regularization term to the objective function, learning with an adversary can only guarantee unique solutions, when the data set is linearly separable with small margin. In such sense, learning with an adversary is a weaker regularization.