Logit Pairing Methods Can Fool Gradient-Based Attacks

Marius Mosbach, Maksym Andriushchenko, Thomas Trost, Matthias Hein, Dietrich Klakow

Introduction

Szegedy et al. showed that state-of-the-art image classifiers are not robust against certain small perturbations of the inputs, known as adversarial examples. Since then, many new attacks have been proposed aiming at better ways of crafting adversarial examples, and also many new defenses to increase the robustness of classifiers. Notably, almost all defenses previously proposed have been broken by applying different attacks [Carlini and Wagner, 2017, Athalye and Sutskever, 2017, Athalye and Carlini, 2018, Athalye et al., 2018].

A prominent defense that could not be broken so far is adversarial training [Goodfellow et al., 2014, Madry et al., 2017]. There is also a line of work on the provable robustness of classifiers [Hein and Andriushchenko, 2017, Wong and Kolter, 2018, Raghunathan et al., 2018], classifiers which by definition cannot be broken because they derive and report a lower bound on the worst-case adversarial accuracy, i.e. the accuracy of the model on the strongest possible adversarial examples under the given threat model. On the other hand, any attack provides only an upper bound on the worst-case adversarial accuracy, which we will refer to as just adversarial accuracy. The problem with many recently proposed defenses is that they evaluate only upper bounds, which might be arbitrarily loose, i.e. there might exist an attack which is able to reduce the adversarial accuracy significantly compared to some baseline attack. However, one of the main issues with lower bounds is that they are usually too small to be useful, so a special way of maximizing them is applied during training. This may interfere with a proposed defense which one aims to evaluate. Thus, providing non-trivial lower bounds – with or without special training – is an important and active area of research [Wong et al., 2018, Zhang et al., 2018, Xiao et al., 2018, Croce et al., 2018]. Unfortunately, these methods do not yet scale to large-scale datasets like ImageNet, and thus one still has to rely solely on upper bounds on adversarial accuracy to estimate the robustness on these datasets.
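The relationship between the two kinds of bounds can be written compactly. The symbols below are introduced here only for illustration and do not appear elsewhere in the paper: $\mathrm{acc}_{\mathrm{cert}}$ is the accuracy certified by a provable method, $\mathrm{acc}^{*}_{\mathrm{adv}}$ is the worst-case adversarial accuracy under the given threat model, and $\mathrm{acc}_{\mathrm{attack}}$ is the accuracy measured under one particular attack.

$$\mathrm{acc}_{\mathrm{cert}} \;\le\; \mathrm{acc}^{*}_{\mathrm{adv}} \;\le\; \mathrm{acc}_{\mathrm{attack}}$$

A stronger attack can only lower the right-hand side, so an evaluation that reports $\mathrm{acc}_{\mathrm{attack}}$ for a single weak attack may substantially overestimate $\mathrm{acc}^{*}_{\mathrm{adv}}$.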

Given the recent history of breaking most of the defenses accepted at ICLR 2018 [Athalye et al., 2018, Uesato et al., 2018], it is now natural to question any new non-certified defense. Many recent papers [Buckman et al., 2018, Kannan et al., 2018, Yao et al., 2018] that claim robustness of their models rely mainly on the PGD attack of Madry et al. with the default settings. They assume that they evaluate their models against a “strong adversary” and that the adversarial accuracy they obtain is close to the minimal possible. In this paper, we show that this is not the case for the clean logit pairing (CLP), logit squeezing (LSQ), and some adversarial logit pairing (ALP) models proposed by Kannan et al.

We consider the classification of images with pixel values in $[0, 255]$ and focus on the white-box threat model, i.e. an attacker has complete knowledge of the model. We consider adversarial perturbations bounded in the $L_{\infty}$ norm by $\epsilon=76.5$ for MNIST and by $\epsilon=16$ for CIFAR-10 and Tiny ImageNet. We follow the settings of Kannan et al. and evaluate MNIST and CIFAR-10 models with untargeted attacks. Tiny ImageNet models are evaluated with targeted attacks, which is consistent with Athalye et al. For crafting adversarial examples, we use the PGD attack [Madry et al., 2017] with maximum adversarial perturbation $\epsilon$ and experiment with different numbers of iterations $n$, step sizes $\epsilon_{i}$, and restarts $r$. For further comparison, we also evaluate all MNIST and CIFAR-10 models against the SPSA attack [Uesato et al., 2018]. When crafting untargeted adversarial examples, we maximize the loss using the true label.
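To make the attack setup concrete, the following is a minimal NumPy sketch of an untargeted $L_{\infty}$ PGD attack with random restarts. It is not the Cleverhans implementation used in our experiments. The toy linear softmax model (`W`, `b`), the uniform random start inside the $\epsilon$-ball, the pixel range of 0 to 255, and all function names are assumptions made only for this illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss_and_grad(x, y, W, b):
    """Cross-entropy of a toy linear model and its gradient w.r.t. the input x."""
    p = softmax(x @ W + b)
    loss = -np.log(p[y] + 1e-12)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return loss, W @ (p - onehot)

def pgd_linf(x, y, W, b, eps, step, n_iter, rng):
    """Untargeted PGD: ascend the loss of the true label y inside the L-inf ball."""
    # Assumed random start: uniform noise inside the eps-ball around x.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 255.0)
    for _ in range(n_iter):
        _, grad = loss_and_grad(x_adv, y, W, b)
        x_adv = x_adv + step * np.sign(grad)      # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 255.0)        # keep valid pixel values
    return x_adv

def pgd_with_restarts(x, y, W, b, eps, step, n_iter, restarts, seed=0):
    """Run PGD from several random starts and keep the highest-loss result."""
    rng = np.random.default_rng(seed)
    best_x, best_loss = x, -np.inf
    for _ in range(restarts):
        cand = pgd_linf(x, y, W, b, eps, step, n_iter, rng)
        cand_loss, _ = loss_and_grad(cand, y, W, b)
        if cand_loss > best_loss:
            best_x, best_loss = cand, cand_loss
    return best_x, best_loss
```

In this sketch, `eps`, `step`, `n_iter`, and `restarts` play the roles of $\epsilon$, $\epsilon_{i}$, $n$, and $r$, respectively.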

Contribution.

First, we give empirical evidence that CLP, LSQ, and some ALP models distort the loss surface in the input space and thus fool gradient-based attacks without providing actual robustness. This can be seen as a particular case of masked or obfuscated gradients [Papernot et al., 2017, Athalye et al., 2018]. We illustrate this by analyzing the input space loss surface in two random directions (Figure 1). Second, we provide an extensive experimental evaluation of the robustness of CLP, LSQ, and ALP models on the MNIST, CIFAR-10, and Tiny ImageNet datasets against the PGD attack with a large number of iterations and random restarts, reducing the adversarial accuracy of e.g. the MNIST-LSQ model from 70.6% to 5.0% (Table 1). Our results suggest that while CLP and LSQ do not provide actual robustness, ALP may provide additional robustness on top of adversarial training. However, the increase is much smaller than claimed by Kannan et al. (Table 3), and it remains unclear whether this is actual robustness or whether the increase in adversarial accuracy is only due to the distortion of the loss surface in the input space. Finally, we highlight the importance of performing many random restarts and an exhaustive grid search over the attack parameters of PGD, especially when the loss surface is distorted. We illustrate this by plotting the distribution of the loss values over different restarts of PGD (Figure 3) as well as heatmaps of the adversarial accuracy for different PGD attack parameters (Figures 2, 9, 10).

Related work.

We note that a regularization term similar to ALP was proposed earlier by Heinze-Deml and Meinshausen (the conditional variance penalty) in the context of adversarial domain shifts. However, they do not study its effect in the context of $L_{p}$-bounded adversarial examples; we therefore only analyze the results of ALP from Kannan et al.

Recently, Engstrom et al. evaluated the robustness of ALP on a single ImageNet model. However, there are important differences compared to our work. First, they do not explore the robustness of the computationally cheap methods CLP and LSQ, which are a significant contribution of Kannan et al. Second, they only test an ALP model that was trained on clean examples, while Kannan et al. mainly advocate for the usage of ALP combined with mixed-minibatch PGD (i.e. training on 50% clean and 50% adversarial examples), which we explore in detail. Finally, while Engstrom et al. only consider a single ImageNet model, we perform experiments on MNIST, CIFAR-10, and Tiny ImageNet and show that the conclusions depend on the dataset, highlighting the importance of evaluating multiple models on multiple datasets before attempting to draw general conclusions.

Experiments

Neither the CLP nor the LSQ models were officially released by Kannan et al. Thus, following Kannan et al., we train our CLP and LSQ models from scratch with Gaussian data augmentation (denoted by $\mathcal{N}(\mu,\sigma)$ in all tables). On MNIST, we use the same LeNet architecture as Kannan et al. and train all models for $500$ epochs with a batch size of $200$. For CIFAR-10, we use the ResNet20-v2 architecture [He et al., 2016] and train all models for $100$ epochs with a batch size of $128$. On Tiny ImageNet, we use the same ResNet50-v2 architecture as Kannan et al. and analyze their pre-trained ALP models as well as our own models, which we trained from scratch with standard data augmentation and weight decay for $100$ epochs using a batch size of $256$. Note that we do not fine-tune from pre-trained ImageNet models, in order to better understand the contribution of the logit pairing methods. For all models, we use the Adam optimizer [Kingma and Ba, 2015] and evaluate on 1000 images drawn randomly from the test data. For crafting adversarial examples, we use the PGD and SPSA attacks implemented in the Cleverhans library [Papernot et al., 2018]. We apply adversarial training using the PGD attack with a step size of $2.55$ and $40$ iterations for MNIST, and a step size of $2.0$ and $10$ iterations for CIFAR-10 and Tiny ImageNet.

We visualize the cross-entropy loss in a two-dimensional subspace of the input space in the vicinity of an input point $x$, where the subspace is spanned by two random signed vectors scaled by $\epsilon=38.25$ for MNIST and $\epsilon=16.0$ for CIFAR-10. We observe that the loss surfaces of models trained with CLP, LSQ, and ALP can contain many local maxima, which makes gradient-based attacks such as PGD and SPSA difficult (Figure 1 and Figures 4, 5, 6, 7 in the Appendix). To deal with this, and in contrast to the evaluations of Kannan et al. and Engstrom et al., we first perform a grid search over the step size $\epsilon_{i}$ and the number of iterations $n$ of the PGD attack. We then run our attacks with multiple random restarts $r$ and report the adversarial accuracy over the most harmful restarts. We illustrate the importance of performing a grid search over the PGD attack parameters in Figure 2. Additionally, we show the importance of having many random restarts by plotting the distribution of the loss of the PGD attack across many restarts in Figure 3, which highlights that there are cases where many random restarts are needed in order to find an adversarial example.
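The loss-surface visualization described above can be sketched as follows. This is an illustrative NumPy sketch, not the plotting code from our repository: the callback `loss_fn`, the coefficient range $[-1, 1]$ along each direction, the clipping to pixel values between 0 and 255, and the simple four-neighbour test for local maxima are all assumptions made for this example.

```python
import numpy as np

def loss_surface(loss_fn, x, eps, n_points=21, seed=0):
    """Evaluate loss_fn on a 2D grid around x spanned by two random signed vectors.

    loss_fn is a hypothetical callback mapping an input array to a scalar loss.
    """
    rng = np.random.default_rng(seed)
    d1 = rng.choice([-1.0, 1.0], size=x.shape)  # first random signed direction
    d2 = rng.choice([-1.0, 1.0], size=x.shape)  # second random signed direction
    coeffs = np.linspace(-1.0, 1.0, n_points)   # assumed coefficient range
    surface = np.empty((n_points, n_points))
    for i, a in enumerate(coeffs):
        for j, c in enumerate(coeffs):
            point = np.clip(x + eps * (a * d1 + c * d2), 0.0, 255.0)
            surface[i, j] = loss_fn(point)
    return coeffs, surface

def count_local_maxima(surface):
    """Count interior grid points strictly larger than their four neighbours."""
    inner = surface[1:-1, 1:-1]
    is_max = (
        (inner > surface[:-2, 1:-1]) & (inner > surface[2:, 1:-1])
        & (inner > surface[1:-1, :-2]) & (inner > surface[1:-1, 2:])
    )
    return int(is_max.sum())
```

A surface with many isolated grid maxima is a rough indication that a gradient-based attack started at a random point may get stuck far from the highest loss value in the neighbourhood.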

We make our code and models publicly available at https://github.com/uds-lsv/evaluating-logit-pairing-methods.

1 Results on MNIST

The results of our evaluation on MNIST are given in Table 1. We find that when performing only a single restart of the PGD attack with the default settings, the model trained with LSQ provides an adversarial accuracy of $70.6\%$. However, as can be seen in Figure 2, the default PGD settings on MNIST ($\epsilon_{i}=2.55$, $n=40$) are suboptimal compared to a larger step size and more iterations. By increasing the step size as well as the number of iterations and restarts, we can reduce the adversarial accuracy of the LSQ model to $5.0\%$. Following the same approach, we can reduce the adversarial accuracy of the model trained with CLP from $62.4\%$ to $4.1\%$. We could not achieve a similar reduction in accuracy by using the SPSA attack with $r=1$.

For the adversarially trained models, the situation is different. Even our strongest attack could not reduce the adversarial accuracies of the models combining adversarial training with ALP below $89.9\%$ and $85.7\%$. Further, the ALP model that was trained on clean samples only achieves a comparable adversarial accuracy of $88.9\%$ against our strongest attack, an improvement of $1.7\%$ over the models trained using adversarial training only.
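The way restarts and the grid search over PGD parameters combine into a single reported number can be sketched as follows. This is an illustrative sketch under our reading of "most harmful restarts": an image counts as robust only if the model classifies it correctly after every restart. The callback `run_attack` and both function names are hypothetical and are not taken from our released code.

```python
import numpy as np

def adversarial_accuracy(correct):
    """Worst case over restarts.

    correct: boolean array of shape (restarts, n_images); an entry is True
    where the model still classifies the adversarial example correctly.
    """
    correct = np.asarray(correct, dtype=bool)
    # An image is robust only if no restart produced a misclassification.
    return float(correct.all(axis=0).mean())

def grid_search(run_attack, step_sizes, iteration_counts):
    """Try every (step size, iterations) pair and keep the most harmful one.

    run_attack(step, n_iter) is a hypothetical callback returning a boolean
    array of shape (restarts, n_images) as expected by adversarial_accuracy.
    """
    results = {}
    for step in step_sizes:
        for n_iter in iteration_counts:
            results[(step, n_iter)] = adversarial_accuracy(run_attack(step, n_iter))
    best = min(results, key=results.get)  # setting with lowest adversarial accuracy
    return best, results
```

Aggregating with a logical AND over restarts means that adding restarts can only lower the reported adversarial accuracy, which matches the role restarts play in tightening the upper bound.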

2 Results on CIFAR-10

Results on CIFAR-10 can be found in Table 2. Again, we find that neither CLP nor LSQ gives actual robustness, as the accuracy of the models trained using CLP or LSQ can be reduced to $0.0\%$ and $1.7\%$, respectively. This clearly shows that the robustness of $27.0\%$ of the LSQ model against the baseline PGD attack is misleading. On the other hand, we find that ALP can lead to some robustness even against our strongest PGD attack and outperforms adversarial training by $3.4\%$, which is in agreement with our findings on MNIST. Note that for CIFAR-10, too, we cannot achieve a similar drop in accuracy by simply using the SPSA attack.

3 Results on Tiny ImageNet

The results of our experiments on Tiny ImageNet are given in Table 3. Note that we use targeted PGD attacks for crafting adversarial examples and report adversarial accuracy instead of success rate in order to be consistent with Kannan et al. We again find that neither CLP nor LSQ models provide actual robustness. Next, we analyze the model provided by Kannan et al., “Fine-tuned Plain + ALP LL ($\lambda=0.5$)”, which was fine-tuned from a model trained on full ImageNet. Our results show that we can reduce the adversarial accuracy from $31.8\%$ to $3.6\%$. This suggests that this model does not provide state-of-the-art robustness against white-box PGD attacks. However, when combined with training only on targeted adversarial examples, ALP marginally improves the adversarial accuracy over plain targeted adversarial training by $0.2\%$ while sacrificing $4.2\%$ of clean accuracy. In additional experiments (Tables 4 and 5 in the Appendix), we confirm the hypothesis of Engstrom et al. that adversarial training using an untargeted PGD attack leads to improved adversarial accuracy. Note that Engstrom et al. can reduce the adversarial accuracy of a pre-trained ALP model trained on full ImageNet to $0.6\%$. In contrast, we consider a different model released by Kannan et al. that was fine-tuned on Tiny ImageNet. This explains the difference in adversarial accuracy reported in Table 3.

Conclusions

We perform an empirical evaluation investigating the robustness of the logit pairing methods introduced by Kannan et al. We find that both CLP and LSQ deteriorate the input space loss surface and make crafting adversarial examples with gradient-based attacks difficult, without providing actual robustness. This suggests that the current practice of evaluating against the PGD attack with default settings can be misleading. Therefore, one should consider performing an exhaustive grid search over the PGD attack parameters in addition to performing many random restarts of PGD, which helps to find adversarial examples in such cases (Figures 2, 3). Finally, we show that the ALP models of Kannan et al. do not improve drastically over adversarial training alone.

We would like to thank Michael Hedderich, Francesco Croce, and Dave Howcroft for their helpful feedback on this paper. Furthermore, we thank the reviewers for their valuable comments. Marius Mosbach acknowledges partial support by the German Research Foundation (DFG) as part of SFB 1102.