Adversarial Robustness through Local Linearization

Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, Pushmeet Kohli

stat.ML cs.LG

Introduction

In a seminal paper, Szegedy et al. demonstrated that neural networks are vulnerable to visually imperceptible but carefully chosen adversarial perturbations which cause neural networks to output incorrect predictions. After this revealing study, a flurry of research has been conducted with the focus of making networks robust against such adversarial perturbations . Concurrently, researchers devised stronger attacks that expose previously unknown vulnerabilities of neural networks .

Of the many approaches proposed , adversarial training is empirically the best performing algorithm to train networks robust to adversarial perturbations. However, the cost of adversarial training becomes prohibitive with growing model complexity and input dimensionality. This is primarily due to the cost of computing adversarial perturbations, which is incurred at each step of adversarial training. In particular, for each new mini-batch one must perform multiple iterations of a gradient-based optimizer on the network’s inputs to find said perturbations.While computing the globally optimal adversarial example is NP-hard , gradient descent with several random restarts was empirically shown to be quite effective at computing adversarial perturbations of sufficient quality. As each step of this optimizer requires a new backwards pass, the total cost of adversarial training scales as roughly the number of such steps. Unfortunately, effective adversarial training of ImageNet often requires large number of steps to avoid problems of gradient obfuscation , making it much more expensive than conventional training, almost prohibitively so.

One approach which can alleviate the cost of adversarial training is training against weaker adversaries that are cheaper to compute. For example, by taking fewer gradient steps to compute adversarial examples during training. However, this can produce models which are robust against weak attacks, but break down under strong attacks – often due to gradient obfuscation. In particular, one form of gradient obfuscation occurs when the network learns to fool a gradient based attack by making the loss surface highly convoluted and non-linear (see Fig 1), this has also been observed by Papernot et al . In turn the non-linearity prevents gradient based optimization methods from finding an adversarial perturbation within a small number of iterations . In contrast, if the loss surface was linear in the vicinity of the training examples, which is to say well-predicted by local gradient information, gradient obfuscation cannot occur. In this paper, we take up this idea and introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data. We call this regularizer the local linearity regularizer (LLR). Empirically, we find that networks trained with LLR exhibit far less gradient obfuscation, and are almost equally robust against strong attacks as they ares against weak attacks.

The main contributions of our paper are summarized below:

We show that training with LLR is significantly faster than adversarial training, allowing us to train a robust ImageNet model with a $5\times$ speed up when training on 128 TPUv3 cores .

We show that LLR trained models exhibit higher robustness relative to adversarially trained models when evaluated under strong attacks. Adversarially trained models can exhibit a decrease in accuracy of 6% when increasing the attack strength at test time for CIFAR-10, whereas LLR shows only a decrease of 2%.

We achieve new state of the art results for adversarial accuracy against untargeted white-box attack for ImageNet (with $\epsilon=4/255$ This means that every pixel is perturbed independently by up to 4 units up or down on a scale where pixels take values ranging between 0 and 255.): $47\%$ . Furthermore, we match state of the art results for CIFAR 10 (with $\epsilon=8/255$ ): $52.81\%$ We note that TRADES gets 55% against a much weaker attack; under our strongest attack, it gets 52.5%..

We perform a large scale evaluation of existing methods for adversarially robust training under consistent, strong, white-box attacks. For this we recreate several baseline models from the literature, training them both for CIFAR-10 and ImageNet (where possible).Baselines created are adversarial training, TRADES and CURE . To the contrary of CIFAR-10, we are currently unable to achieve consistent and competitive results on ImageNet at $\epsilon=4/255$ using TRADES.

Background and Related Work

where $p_{i}(x;\theta)$ is defined as above, and $y$ is a 1-hot vector representing the class label. While ERM is effective at training neural networks that perform well on holdout test data, the accuracy on the test set goes to zero under adversarial evaluation. This is a result of a distribution shift in the data induced by the attack. To rectify this, adversarial training seeks to perturb the data distribution by performing adversarial attacks during training. More concretely, adversarial training minimizes the loss function

Since the invention of adversarial training, a corpus of work has researched alternative ways of making networks robust. One such approach is the TRADES method which is a form of regularization that specifically maximizes the trade-off between robustness and accuracy – as many studies have observed these two quantities to be at odds with each other . Others, such as work by Ding et al adaptively increase the perturbation radius by find the minimal length perturbation which changes the output label. Some have proposed architectural changes which promote adversarial robustness, such as the "denoise" model for ImageNet.

Motivating the Local Linearity Regularizer

is an indicator of how linear the surface is. Consequently, we consider the quantity

to be a measure of how linear the surface is within a neighbourhood $B(\epsilon)$ . We call this quantity the local linearity measure.

2 Empirical Observations on Adversarial Training

Local Linearity Regularizer (LLR)

2 Local Linearity Regularization (LLR)

Following the analysis above, we propose the following objective for adversarially robust training

3 Local Linearity γ(ϵ;x)𝛾italic-ϵ𝑥\gamma(\epsilon;x) is a sufficient regularizer by itself

Experiments and Results

We perform experiments using LLR on both CIFAR-10 and ImageNet datasets. We show that LLR gets state of the art adversarial accuracy on CIFAR-10 (at $\epsilon=8/255$ ) and ImageNet (at $\epsilon=4/255$ ) evaluated under a strong adversarial attack. Moreover, we show that as the attack strength increases, the degradation in adversarial accuracy is more graceful for networks trained using LLR than for those trained with standard adversarial training. Further, we demonstrate that training using LLR is $5\times$ faster for ImageNet. Finally, we show that, by linearizing the loss surface, models are less prone to gradient obfuscation.

ImageNet: The perturbation radii considered are $\epsilon=4/255$ and $\epsilon=16/255$ . The architecture used for this is from which is ResNet-152. We use softplus as activation function. For $\epsilon=4/255$ , the baselines we compare our results against is our recreated versions of ADV and denoising model (DENOISE) .We attempted to use TRADES on ImageNet but did not manage to get competitive results. Thus they are omitted from the baselines. For $\epsilon=16/255$ , we compare LLR to ADV and DENOISE networks which have been published from the the literature. Due to computational constraints, we limit ourselves to evaluating all models on the first 1K images of the test set.

To make sure that we have a close estimate of the true robustness, we evaluate all the models on a wide range of attacks these are described below.

To accurately gauge the true robustness of our network, we tailor our attack to give the lowest possible adversarial accuracy. The two parts which we tune to get the optimal attack is the loss function for the attack and its corresponding optimization procedure. The loss functions used are described below, for the optimization procedure please refer to Appendix F.1.

Loss Functions: The three loss functions we consider are summarized in Table 1. We use the difference between logits for the loss function rather than the cross-entropy loss as we have empirically found the former to yield lower adversarial accuracy.

2 Results for Robustness

For CIFAR-10, the main adversarial accuracy results are given in Table 2. We compare LLR training to ADV , CURE and TRADES , both with our re-implementation and the published models Note the network published for TRADES uses a Wide-ResNet-34-10 so this is not shown in the table but under the same rigorous evaluation we show that TRADES get 84.91% nominal accuracy, 53.41% under Untargeted and 52.58% under Multi-Targeted. We’ve also ran DeepFool (not in the table as the attack is weaker) giving ADV(S): 64.29%, CURE(S): 58.73%, TRADES(S): 63.4%, LLR(S): 65.87%.. Note that our re-implementation using softplus activations perform at or above the published results for ADV, CURE and TRADES. This is largely due to the learning rate schedule used, which is the similar to the one used by TRADES .

Interestingly, for adversarial training (ADV), using the Multi-Targeted attack for evaluation gives significantly lower adversarial accuracy compared to Untargeted. The accuracy obtained are $49.79\%$ and $55.26\%$ respectively. Evaluation using Multi-Targeted attack consistently gave the lowest adversarial accuracy throughout. Under this attack, the methods which stand out amongst the rest are LLR and TRADES. Using LLR we get state of the art results with $52.81\%$ adversarial accuracy.

For ImageNet, we compare against adversarial training (ADV) and the denoising model (DENOISE) . The results are shown in Table 3. For a perturbation radius of 4/255, LLR gets 47% adversarial accuracy under the Untargeted attack which is notably higher than the adversarial accuracy obtained via adversarial training which is 39.70%. Moreover, LLR is trained with just two-steps of PGD rather than 30 steps for adversarial training. The amount of computation needed for each method is further discussed in Sec 5.2.1.

Further shown in Table 3 are the results for $\epsilon=16/255$ . We note a significant drop in nominal accuracy when we train with LLR to perturbation radius 16/255. When testing for perturbation radius of 16/255 we also show that the adversarial accuracy under Untargeted is very poor (below 8%) for all methods. We speculate that this perturbation radius is too large for the robustness problem. Since adversarial perturbations should be, by definition, imperceptible to the human eye, upon inspection of the images generated using an adversarial attack (see Fig 8) - this assumption no longer holds true. The images generated appear to consist of super-imposed object parts of other classes onto the target image. This leads us to believe that a more fine-grained analysis of what should constitute "robustness for ImageNet" is an important topic for debate.

For ImageNet, we trained on 128 TPUv3 cores , the total training wall time for the LLR network (4/255) is 7 hours for 110 epochs. Similarly, for the adversarially trained (ADV) networks the total wall time is 36 hours for 110 epochs. This is a $5\times$ speed up.

2.2 Accuracy Degradation: Strong vs Weak Evaluation

The resulting model trained using LLR degrades gracefully in terms of adversarial accuracy when we increase the strength of attack, as shown in Fig 3. In particular, Fig 3(a) shows that, for CIFAR-10, when the attack changes from Untargeted to Multi-Targeted, the LLR’s accuracy remains similar with only a $2.18\%$ drop in accuracy. Contrary to adversarial training (ADV), where we see a $5.64\%$ drop in accuracy. We also see similar trends in accuracy in Table 2. This could indicate that some level of obfuscation may be happening under standard adversarial training.

As we empirically observe that LLR evaluates similarly under weak and strong attacks, we hypothesize that this is because LLR explicitly linearizes the loss surface. An extreme case would be when the surface is completely linear - in this instance the optimal adversarial perturbation would be found with just one PGD step. Thus evaluation using a weak attack is often good enough to get an accurate gauge of how it will perform under a stronger attack.

For ImageNet, see Fig 3(b), the adversarial accuracy trained using LLR remains significantly higher (7.5%) than the adversarially trained network going from a weak to a stronger attack.

3 Resistance to Gradient Obfuscation

Conclusions

We show that, by promoting linearity, deep classification networks are less susceptible to gradient obfuscation, thus allowing us to do fewer gradient descent steps for the inner optimization. Our novel linearity regularizer promotes locally linear behavior as justified from a theoretical perspective. The resulting models achieve state of the art adversarial robustness on the CIFAR-10 and Imagenet datasets, and can be trained $5\times$ faster than regular adversarial training.

References

Appendix A Empirical Observations on Adversarial Training: Supplementary

Appendix B Local Linearity Upper Bounds Robustness: Proof of Proposition 4.1

Appendix C Local Linearity γ(ϵ,x)𝛾italic-ϵ𝑥\gamma(\epsilon,x) is a sufficient regularizer by itself

The starting point for proving our bounds will be the following local quadratic approximation of the loss:

Here, $G(x)$ is the Generalized Gauss-Newton matrix (GGN) , and $\varepsilon(\delta)$ denotes the error of the approximation.

where $J$ is the Jacobian of $f$ , and $H_{\nu}$ is the Hessian of $\nu(y,z)$ with respect to $z$ .

C.2 Bounds for common loss functions

Our basic strategy in proving the following results is to rearrange Eq (9) to establish the following bound on the curvature in terms of $g(\delta;x)$ which is defined in Eq (5) in the main text:

Suppose that $\nu(y,z)=\frac{1}{2}\|y-z\|^{2}$ is the squared error and $z=f(x;\theta)$ is the output of the neural network. Then for any perturbation vector $\delta\in B(\epsilon)$ we have

where $\varepsilon(\delta)$ is the error of the local quadratic approximation defined in Equation 9.

Suppose that $\nu(y,z)=\log(y^{\top}p(z))$ is the softmax cross-entropy error, where $y$ is a 1-hot target vector, and $p(z)$ is the vector of probabilities computed via the softmax function. Then for any perturbation vector $\delta\in B(\epsilon)$ we have

where $\varepsilon(\delta)$ is the error of the local quadratic approximation defined in Equation 9.

We note $p^{\top}y$ is just the probability of the target label under the model. And so $1/p^{\top}y$ won’t be very big, provided that the model is properly classifying the data with some reasonable degree of certainty. (Indeed, for highly certain predictions it will be close to $1$ .) Thus the upper bound given in Proposition C.2 should shrink at a reasonable rate as the regularizer $\gamma(\epsilon;x)$ does, provided that error term $\varepsilon(\delta)$ is negligable.

Appendix D Proofs

Using these facts, and applying the Cauchy-Schwarz inequality, we get

Taking the square root of both sides yields the claim. ∎

D.2 Proof of Proposition C.2

Because the entries of $p$ are non-negative and sum to $1$ we can factor this as

and where $q$ is defined as the entry-wise square root of the vector $p$ . To see that this is correct, note that

where we have used the properties of $q$ and $p$ , such as $q^{\top}q=1$ , $\operatorname{diag}(q)q=p$ , etc.

Using this factorization we can rewrite the curvature term as

where we have defined $\Delta z=J\delta$ (intuitively, this is “the change in $z$ due to $\delta$ ”). Thus by Equation 10 we have

Let $v=\frac{1}{q^{\top}y}y$ , which is well defined because $q$ is entry-wise positive (since $p$ must be), and $y$ is a one-hot vector. Using said properties of $y$ and $q$ we have that

where $\odot$ denotes the entry-wise product.

Using the above facts, and applying the Cauchy-Schwarz inequality, we arrive at

where we have used the facts that $(q^{\top}y)^{2}=p^{\top}y$ and $\|y\|=1$ . Taking the square root of both sides yields the claim. ∎

Appendix E Local Linearity Regularizer - Algorithm

Appendix F Experiments and Results: Supplementary

F.2 Training and Hyperparameters

The hyperparameters for $\lambda$ and $\mu$ are chosen by doing a hyperparameter sweep.

To train the LLR network the initial learning rate is 0.1, the decay schedule is similar to , we decay by 0.1 after 35, 70 and 95 epochs. We train for 100 epochs. The LLR hyperparameters are $\lambda=3$ and $\mu=6$ , the weights placed on the nominal loss is 3. We use $l_{2}$ -regularization of 1e-4. The training is done on batch size of 512. We slowly increase the perturbation radius over 20 epochs from 0 to 4/255. We train with 2 steps of PGD using Adam and step size 0.1.

To train the LLR network the initial learning rate is 0.1, we decay by 0.1 after 17 and 35 epochs and 50 epochs – we train to 55 epochs. The LLR hyperparameters are $\lambda=3$ and $\mu=9$ , the weights placed on the nominal loss is 3. We use $l_{2}$ -regularization of 1e-4. The training is done on batch size of 512. We slowly increase the perturbation radius over 90 epochs from 0 to 16/255. We train with 10 steps of PGD using Adam with step size of 0.1.

F.3 Ablation Studies

F.4 Resistance to Gradient Obfuscation

In Fig 7 we show the adversarial perturbations for networks ADV-2 and LLR-2. We see that, in contrast to LLR-2, the adversarial perturbation for ADV-2 looks similar to random noise. When the adversarial perturbation resembles random noise, this is often a sign that the network is gradient obfuscated.

Furthermore, we show that the adversarial accuracy for LLR-2 is 44.50% as opposed to ADV-2 which is 0%. Surprisingly, even training with just 1 step of PGD for LLR (LLR-1) we obtain non-zero adversarial accuracy.

In Fig 7, we show the values of $\gamma(\epsilon,x)$ we obtain when we train with LLR or adversarial training (ADV). To find $\gamma(\epsilon,x)=\max_{\delta\in B(\epsilon)}g(\delta,x)$ we maximize $g(\delta,x)$ by running 50 steps of PGD with step size 0.1. Here, we see that values of $\gamma(\epsilon,x)$ for adversarial training with 20 steps of PGD is similar to LLR-2. In contrast, adversarial training (ADV-2) with just two steps of PGD gives much higher values of $\gamma(\epsilon,x)$ .

F.5 Adversarially Perturbed Images for 16/255

The perturbation radius 16/255 has become the norm to use to gauge how robust a network is on ImageNet. However, to be robust we need to make sure that the perturbation is sufficiently small such that it does not significantly affect our visual perception. We hypothesize that this perturbation radius is outside of this regime. Fig 8 shows that we can find examples which not only wipe out objects (the curbs) in the image, but can actually add faint images onto the white background. This significantly affects our visual perception of the image.