Theoretically Principled Trade-off between Robustness and Accuracy

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, Michael I. Jordan

Introduction

In response to the vulnerability of deep neural networks to small perturbations around input data [SZS+13], adversarial defenses have been an imperative object of study in machine learning [HPG+17], computer vision [SKN+18, XWZ+17, MC17], natural language processing [JL17], and many other domains. In machine learning, study of adversarial defenses has led to significant advances in understanding and defending against adversarial threat [HWC+17]. In computer vision and natural language processing, adversarial defenses serve as indispensable building blocks for a range of security-critical systems and applications, such as autonomous cars and speech recognition authorization. The problem of adversarial defenses can be stated as that of learning a classifier with high test accuracy on both natural and adversarial examples. The adversarial example for a given labeled data (x,y)(\bm{x},y) is a data point x\bm{x}^{\prime} that causes a classifier cc to output a different label on x\bm{x}^{\prime} than yy, but is “imperceptibly similar” to x\bm{x}. Given the difficulty of providing an operational definition of “imperceptible similarity,” adversarial examples typically come in the form of restricted attacks such as ϵ\epsilon-bounded perturbations [SZS+13], or unrestricted attacks such as adversarial rotations, translations, and deformations [BCZ+18, ETT+17, GAG+18, XZL+18, AAG19, ZCS+19]. The focus of this work is the former setting, though our framework can be generalized to the latter.

Despite a large literature devoted to improving the robustness of deep-learning models, many fundamental questions remain unresolved. One of the most important questions is how to trade off adversarial robustness against natural accuracy. Statistically, robustness can be be at odds with accuracy [TSE+19]. This has led to an empirical line of work on adversarial defense that incorporates various kinds of assumptions [SZC+18, KGB17]. On the theoretical front, methods such as relaxation based defenses [KW18, RSL18a] provide provable guarantees for adversarial robustness. They, however, ignore the performance of classifier on the non-adversarial examples, and thus leave open the theoretical treatment of the putative robustness/accuracy trade-off.

The problem of adversarial defense becomes more challenging when computational issues are considered. For example, the straightforward empirical risk minimization (ERM) formulation of robust classification involves minimizing the robust 0-1 loss maxx:xxϵ1{c(x)y},\max_{\bm{x}^{\prime}:\|\bm{x}^{\prime}-\bm{x}\|\leq\epsilon}\mathbf{1}\{c(\bm{x}^{\prime})\not=y\}, a loss which is NP-hard to optimize even if ϵ=0\epsilon=0 in general. Hence, it is natural to expect that some prior work on adversarial defense replaced the 0-1 loss 1()\mathbf{1}(\cdot) with a surrogate loss [MMS+18, KGB17, UOKvdO18]. However, there is little theoretical guarantee on the tightness of this approximation.

We begin with an example that illustrates the trade-off between accuracy and adversarial robustness in Section 2.4. This phenomenon was first theoretically demonstrated by [TSE+19]. We construct another toy example where the Bayes optimal classifier achieves natural error 0%0\% and robust error 100%100\%, while the trivial all-one classifier achieves both natural error and robust error 50%50\% (Table 1). While a large literature on the analysis of robust error in terms of generalization [SST+18, CBM18, YRB18] and computational complexity [BPR18, BLPR18], in this work we focus on how to address the trade-off between the natural error and the robust error.

2 Summary of contributions

Our work tackles the problem of trading accuracy off against robustness and advances the state-of-the-art in multiple ways.

Theoretically, we characterize the trade-off between accuracy and robustness for classification problems via decomposing the robust error as the sum of the natural error and the boundary error. We provide differentiable upper bounds on both terms using the theory of classification-calibrated loss, which are shown to be the tightest upper bounds uniform over all probability distributions and measurable predictors.

Algorithmically, inspired by our theoretical analysis, we propose a new formulation of adversarial defense, TRADES, as optimizing a regularized surrogate loss. The loss consists of two terms: the term of empirical risk minimization encourages the algorithm to maximize the natural accuracy, while the regularization term encourages the algorithm to push the decision boundary away from the data, so as to improve adversarial robustness (see Figure 1).

Experimentally, we show that our proposed algorithm outperforms state-of-the-art methods under both black-box and white-box threat models. In particular, the methodology won the final round of the NeurIPS 2018 Adversarial Vision Challenge.

Preliminaries

We illustrate our methodology using the framework of binary classification, but it can be generalized to other settings as well.

2 Robust (classification) error

3 Boundary error

4 Trade-off between natural and robust errors

Our study is motivated by the trade-off between natural and robust errors. [TSE+19] theoretically showed that training models to be robust may lead to a reduction of standard accuracy by constructing a toy example. To illustrate the phenomenon, we provide another toy example here.

Example. Consider the case (X,Y)D(X,Y)\sim\mathcal{D}, where the marginal distribution over the instance space is a uniform distribution over $,andfor, and fork=0,1,...,\lceil\frac{1}{2\epsilon}-1\rceil$,

See Figure 2 for the visualization of η(x)\eta(x). We consider two classifiers: a) the Bayes optimal classifier sign(2η(x)1)\textup{{sign}}(2\eta(x)-1); b) the all-one classifier which always outputs “positive.” Table 1 displays the trade-off between natural and robust errors: the minimal natural error is achieved by the Bayes optimal classifier with large robust error, while the optimal robust error is achieved by the all-one classifier with large natural error.

Our goal. In practice, one may prefer to trade-off between robustness and accuracy by introducing weights in (1) to bias more towards the natural error or the boundary error. Noting that both the natural error and the boundary error involve 0-1 loss functions, our goal is to devise tight differentiable upper bounds on both of these terms. Towards this goal, we utilize the theory of classification-calibrated loss.

5 Classification-calibrated surrogate loss

and define H(η):=infα(2η1)0Cη(α)H^{-}(\eta):=\inf_{\alpha(2\eta-1)\leq 0}C_{\eta}(\alpha). The classification-calibrated condition requires that imposing the constraint that α\alpha has an inconsistent sign with the Bayes decision rule sign(2η1)\textup{{sign}}(2\eta-1) leads to a strictly larger ϕ\phi-risk:

We assume that the surrogate loss ϕ\phi is classification-calibrated, meaning that for any η1/2\eta\not=1/2, H(η)>H(η)H^{-}(\eta)>H(\eta).

We argue that Assumption 1 is indispensable for classification problems, since without it the Bayes optimal classifier cannot be the minimizer of the ϕ\phi-risk. Examples of classification-calibrated loss include hinge loss, sigmoid loss, exponential loss, logistic loss, and many others (see Table 2).

Properties. Classification-calibrated loss has many structural properties that one can exploit. We begin by introducing a functional transform of classification-calibrated loss ϕ\phi which was proposed by [BJM06]. Define the function ψ:[0,)\psi:\rightarrow[0,\infty) by ψ=ψ~\psi=\widetilde{\psi}^{**}, where ψ~(θ):=H(1+θ2)H(1+θ2)\widetilde{\psi}(\theta):=H^{-}\left(\frac{1+\theta}{2}\right)-H\left(\frac{1+\theta}{2}\right). Indeed, the function ψ(θ)\psi(\theta) is the largest convex lower bound on H(1+θ2)H(1+θ2)H^{-}\left(\frac{1+\theta}{2}\right)-H\left(\frac{1+\theta}{2}\right). The value H(1+θ2)H(1+θ2)H^{-}\left(\frac{1+\theta}{2}\right)-H\left(\frac{1+\theta}{2}\right) characterizes how close the surrogate loss ϕ\phi is to the class of non-classification-calibrated losses.

Under Assumption 1, the function ψ\psi has the following properties: ψ\psi is non-decreasing, continuous, convex on $andand\psi(0)=0$.

Relating 0-1 loss to Surrogate Loss

In this section, we present our main theoretical contributions for binary classification and compare our results with prior literature. Binary classification problems have received significant attention in recent years as many competitions evaluate the performance of robust models on binary classification problems [BCZ+18]. We defer the discussion of multi-class problems to Section 4.

2 Lower bound

Theorem 3.2 demonstrates that in the presence of extra conditions on the loss function, i.e., limx+ϕ(x)=0\lim_{x\rightarrow+\infty}\phi(x)=0, the upper bound in Section 3.1 is tight. The condition holds for all the losses in Table 2.

Algorithmic Design for Defenses

We name our method TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization).

Intuition behind the optimization. Problem (3) captures the trade-off between the natural and robust errors: the first term in (3) encourages the natural error to be optimized by minimizing the “difference” between f(X)f(\bm{X}) and YY, while the second regularization term encourages the output to be smooth, that is, it pushes the decision boundary of classifier away from the sample instances via minimizing the “difference” between the prediction of natural example f(X)f(\bm{X}) and that of adversarial example f(X)f(\bm{X}^{\prime}). This is conceptually consistent with the argument that smoothness is an indispensable property of robust models [CBG+17]. The tuning parameter λ\lambda plays a critical role on balancing the importance of natural and robust errors. To see how the λ\lambda affects the solution in the example of Section 2.4, problem (3) tends to the Bayes optimal classifier when λ+\lambda\rightarrow+\infty, and tends to the all-one classifier when λ0\lambda\rightarrow 0.

Comparisons with prior work. We compare our approach with several related lines of research in the prior literature. One of the best known algorithms for adversarial defense is based on robust optimization [MMS+18, KW18, WSMK18, RSL18a, RSL18b]. Most results in this direction involve algorithms that approximately minimize

A related line of research is adversarial training by regularization [MMIK18, KGB17, RDV17, ZSLG16]. There are several key differences between the results in this paper and those of [KGB17, RDV17, ZSLG16]. Firstly, the optimization formulations are different. In the previous works, the regularization term either measures the “difference” between f(X)f(\bm{X}^{\prime}) and YY [KGB17], or its gradient [RDV17]. In contrast, our regularization term measures the “difference” between f(X)f(\bm{X}) and f(X)f(\bm{X}^{\prime}). While [ZSLG16] generated the adversarial example X\bm{X}^{\prime} by adding random Gaussian noise to X\bm{X}, our method simulates the adversarial example by solving the inner maximization problem in Eqn. (3). Secondly, we note that the losses in [MMIK18, KGB17, RDV17, ZSLG16] lack of theoretical guarantees. Our loss, with the presence of the second term in problem (3), makes our theoretical analysis significantly more subtle. Moreover, our algorithm takes the same computational resources as [KGB17], which makes our method scalable to large-scale datasets. We defer the experimental comparisons of various regularization based methods to Table 5.

Heuristic algorithm. In response to the optimization formulation (3), we use two heuristics to achieve more general defenses: a) extending to multi-class problems by involving multi-class calibrated loss; b) approximately solving the minimax problem via alternating gradient descent. For multi-class problems, a surrogate loss is calibrated if minimizers of the surrogate risk are also minimizers of the 0-1 risk [PS16]. Examples of multi-class calibrated loss include cross-entropy loss. Algorithmically, we extend problem (3) to the case of multi-class classifications by replacing ϕ\phi with a multi-class calibrated loss L(,)\mathcal{L}(\cdot,\cdot):

where f(X)f(\bm{X}) is the output vector of learning model (with softmax operator in the top layer for the cross-entropy loss L(,)\mathcal{L}(\cdot,\cdot)), Y\bm{Y} is the label-indicator vector, and λ>0\lambda>0 is the regularization parameter. One can also exchange f(X)f(\bm{X}) and f(X)f(\bm{X}^{\prime}) in the second term of (5). The pseudocode of adversarial training procedure, which aims at minimizing the empirical form of problem (5), is displayed in Algorithm 1.

The key ingredient of the algorithm is to approximately solve the linearization of inner maximization in problem (5) by the projected gradient descent (see Step 7). We note that xi\bm{x}_{i} is a global minimizer with zero gradient to the objective function g(x):=L(f(xi),f(x))g(\bm{x}^{\prime}):=\mathcal{L}(f(\bm{x}_{i}),f(\bm{x}^{\prime})) in the inner problem. Therefore, we initialize xi\bm{x}_{i}^{\prime} by adding a small, random perturbation around xi\bm{x}_{i} in Step 5 to start the inner optimizer. More exhaustive approximations of the inner maximization problem in terms of either optimization formulations or solvers would lead to better defense performance.

Semi-supervised learning. We note that TRADES problem (5) can be straightforwardly applied to the semi-supervised learning framework, as the second term in problem (5) does not depend on the label Y\bm{Y}. Therefore, with more unlabeled data points, one can approximate the second term (in the expectation form) better by the empirical loss minimization. There are many interesting recent works which explore the benefits of invloving unlabeled data [CRS+19, SFK+19, ZCH+19].

Acceleration. Adversarial training is typically more than 10x slower than natural training. To resolve this issue for TRADES, [SNG+19, ZZL+19] proposed new algorithms to solve problem (5) at negligible additional cost compared to natural training.

Experimental Results

We verify the tightness of the established upper bound in Theorem 3.1 for binary classification problem on MNIST dataset. The negative examples are ‘1’ and the positive examples are ‘3’. Here we use a Convolutional Neural Network (CNN) with two convolutional layers, followed by two fully-connected layers. The output size of the last layer is 1. To learn the robust classifier, we minimize the regularized surrogate loss in Eqn. (3), and use the hinge loss in Table 2 as the surrogate loss ϕ\phi, where the associated ψ\psi-transform is ψ(θ)=θ\psi(\theta)=\theta.

To verify the tightness of our upper bound, we calculate the left hand side in Theorem 3.1, i.e.,

The results in Table 3 show the tightness of our upper bound in Theorem 3.1. It shows that the differences between ΔRHS\Delta_{\text{RHS}} and ΔLHS\Delta_{\text{LHS}} under various λ\lambda’s are very small.

2 Sensitivity of regularization hyperparameter λ𝜆\lambda

The regularization parameter λ\lambda is an important hyperparameter in our proposed method. We show how the regularization parameter affects the performance of our robust classifiers by numerical experiments on two datasets, MNIST and CIFAR10. For both datasets, we minimize the loss in Eqn. (5) to learn robust classifiers for multi-class problems, where we choose L\mathcal{L} as the cross-entropy loss.

MNIST setup. We use the CNN which has two convolutional layers, followed by two fully-connected layers. The output size of the last layer is 10. We set perturbation ϵ=0.1\epsilon=0.1, perturbation step size η1=0.01\eta_{1}=0.01, number of iterations K=20K=20, learning rate η2=0.01\eta_{2}=0.01, batch size m=128m=128, and run 5050 epochs on the training dataset. To evaluate the robust error, we apply FGSMk (white-box) attack with 4040 iterations and 0.0050.005 step size. The results are in Table 4.

CIFAR10 setup. We apply ResNet-18 [HZRS16] for classification. The output size of the last layer is 10. We set perturbation ϵ=0.031\epsilon=0.031, perturbation step size η1=0.007\eta_{1}=0.007, number of iterations K=10K=10, learning rate η2=0.1\eta_{2}=0.1, batch size m=128m=128, and run 100100 epochs on the training dataset. To evaluate the robust error, we apply FGSMk (white-box) attack with 2020 iterations and the step size is 0.0030.003. The results are in Table 4.

3 Adversarial defenses under various attacks

Previously, [ACW18] showed that 7 defenses in ICLR 2018 which relied on obfuscated gradients may easily break down. In this section, we verify the effectiveness of our method with the same experimental setup under both white-box and black-box threat models.

MNIST setup. We use the CNN architecture in [CW17] with four convolutional layers, followed by three fully-connected layers. We set perturbation ϵ=0.3\epsilon=0.3, perturbation step size η1=0.01\eta_{1}=0.01, number of iterations K=40K=40, learning rate η2=0.01\eta_{2}=0.01, batch size m=128m=128, and run 100100 epochs on the training dataset.

CIFAR10 setup. We use the same neural network architecture as [MMS+18], i.e., the wide residual network WRN-34-10 [ZK16]. We set perturbation ϵ=0.031\epsilon=0.031, perturbation step size η1=0.007\eta_{1}=0.007, number of iterations K=10K=10, learning rate η2=0.1\eta_{2}=0.1, batch size m=128m=128, and run 100100 epochs on the training dataset.

We summarize our results in Table 5 together with the results from [ACW18]. We also implement methods in [ZSLG16, KGB17, RDV17] on the CIFAR10 dataset as they are also regularization based methods. For MNIST dataset, we apply FGSMk (white-box) attack with 4040 iterations and the step size is 0.010.01. For CIFAR10 dataset, we apply FGSMk (white-box) attack with 2020 iterations and the step size is 0.0030.003, under which the defense model in [MMS+18] achieves 47.04%47.04\% robust accuracy. Table 5 shows that our proposed defense method can significantly improve the robust accuracy of models, which is able to achieve robust accuracy as high as 56.61%56.61\%. We also evaluate our robust model on MNIST dataset under the same threat model as in [SKC18] (C&W white-box attack [CW17]), and the robust accuracy is 99.46%99.46\%. See appendix for detailed information of models in Table 5.

3.2 Black-box attacks

We verify the robustness of our models under black-box attacks. We first train models without using adversarial training on the MNIST and CIFAR10 datasets. We use the same network architectures that are specified in the beginning of this section, i.e., the CNN architecture in [CW17] and the WRN-34-10 architecture in [ZK16]. We denote these models by naturally trained models (Natural). The accuracy of the naturally trained CNN model is 99.50%99.50\% on the MNIST dataset. The accuracy of the naturally trained WRN-34-10 model is 95.29%95.29\% on the CIFAR10 dataset. We also implement the method proposed in [MMS+18] on both datasets. We denote these models by Madry’s models (Madry). The accuracy of [MMS+18]’s CNN model is 99.36%99.36\% on the MNIST dataset. The accuracy of [MMS+18]’s WRN-34-10 model is 85.49%85.49\% on the CIFAR10 dataset.

For both datasets, we use FGSMk (black-box) method to attack various defense models. For MNIST dataset, we set perturbation ϵ=0.3\epsilon=0.3 and apply FGSMk (black-box) attack with 4040 iterations and the step size is 0.010.01. For CIFAR10 dataset, we set ϵ=0.031\epsilon=0.031 and apply FGSMk (black-box) attack with 2020 iterations and the step size is 0.0030.003. Note that the setup is the same as the setup specified in Section 5.3.1. We summarize our results in Table 6 and Table 7. In both tables, we use two source models (noted in the parentheses) to generate adversarial perturbations: we compute the perturbation directions according to the gradients of the source models on the input images. It shows that our models are more robust against black-box attacks transfered from naturally trained models and [MMS+18]’s models. Moreover, our models can generate stronger adversarial examples for black-box attacks compared with naturally trained models and [MMS+18]’s models.

4 Case study: NeurIPS 2018 Adversarial Vision Challenge

Competition settings. In the adversarial competition, the adversarial attacks and defenses are under the black-box setting. The dataset in this competition is Tiny ImageNet, which consists of 550,000 data (with our data augmentation) and 200 classes. The robust models only return label predictions instead of explicit gradients and confidence scores. The task for robust models is to defend against adversarial examples that are generated by the top-5 submissions in the un-targeted attack track. The score for each defense model is evaluated by the smallest perturbation distance that makes the defense model fail to output correct labels.

Conclusions

In this paper, we study the problem of adversarial defenses against structural perturbations around input data. We focus on the trade-off between robustness and accuracy, and show an upper bound on the gap between robust error and optimal natural error. Our result advances the state-of-the-art work and matches the lower bound in the worst-case scenario. The bounds motivate us to minimize a new form of regularized surrogate loss, TRADES, for adversarial training. Experiments on real datasets and adversarial competition demonstrate the effectiveness of our proposed algorithms. It would be interesting to combine our methods with other related line of research on adversarial defenses, e.g., feature denoising technique [XWvdM+18] and network architecture design [CBG+17], to achieve more robust learning systems.

Acknowledgements. We thank Maria-Florina Balcan, Avrim Blum, Zico Kolter, and Aleksander Mądry for valuable comments and discussions.

References

Appendix A Other Related Works

Attack methods. Although deep neural networks have achieved great progress in various areas [ZSS19, ZXJ+18], they are brittle to adversarial attacks. Adversarial attacks have been extensively studied in the recent years. One of the baseline attacks to deep nerual networks is the Fast Gradient Sign Method (FGSM) [GSS15]. FGSM computes an adversarial example as

They can be adapted to the purpose of black-box attacks by running the algorithms on another similar network which is white-box to the algorithms [TKP+18]. Though defenses that cause obfuscated gradients defeat iterative optimization based attacks, [ACW18] showed that defenses relying on this effect can be circumvented. Other attack methods include MI-FGSM [DLP+18] and LBFGS attacks [TV16].

Robust optimization based defenses. Compared with attack methods, adversarial defense methods are relatively fewer. Robust optimization based defenses are inspired by the above-mentioned attacks. Intuitively, the methods train a network by fitting its parameters to the adversarial examples:

Following this framework, [HXSS15, SYN15] considered one-step adversaries, while [MMS+18] worked with multi-step methods for the inner maximization problem. There are, however, two critical differences between the robust optimization based defenses and the present paper. Firstly, robust optimization based defenses lack of theoretical guarantees. Secondly, such methods do not consider the trade-off between accuracy and robustness.

Relaxation based defenses. We mention another related line of research in adversarial defenses—relaxation based defenses. Given that the inner maximization in problem (6) might be hard to solve due to the non-convexity nature of deep neural networks, [KW18] and [RSL18a] considered a convex outer approximation of the set of activations reachable through a norm-bounded perturbation for one-hidden-layer neural networks. [WSMK18] later scaled the methods to larger models, and [RSL18b] proposed a tighter convex approximation. [SND18, VNS+18] considered a Lagrangian penalty formulation of perturbing the underlying data distribution in a Wasserstein ball. These approaches, however, do not apply when the activation function is ReLU.

Appendix B Proofs of Main Results

In this section, we provide the proofs of our main results.

B.2 Proof of Theorem 3.2

The first inequality follows from Theorem 3.1. Thus it suffices to prove the second inequality.

where ff^{*} is the Bayes optimal classifier which outputs “positive” for all data points. ∎

Appendix C Extra Theoretical Results

In this section, we provide extra theoretical results for adversarial defenses.

provided that the marginal distribution over X\mathcal{X} is products of log-concave measures. A measure is log-concave if the logarithm of its density is a concave function. The class of log-concave measures contains many well-known (classes of) distributions as special cases, such as Gaussian and uniform measure over ball.

Our results are inspired by the isoperimetric inequality of log-concave distributions by the work of [Bar01]. Intuitively, the isoperimetric problem consists in finding subsets of prescribed measure, such that its measure increases the less under enlargement. Our analysis leads to the following guarantee on the quantity (7).

for an absolute constant c>0c>0. Furthermore, among all such probability measures and classifiers, the linear classifier over products of Gaussian measure with mean and variance 1/(2π)1/(2\pi) achieves the lower bound.

For a Borel set A\mathcal{A} and for ϵ>0\epsilon>0, denote by Aϵ={x:d(x,A)ϵ}\mathcal{A}_{\epsilon}=\{\bm{x}:d(\bm{x},\mathcal{A})\leq\epsilon\}. The boundary measure of A\mathcal{A} is then defined as

Before proceeding, we cite the following results from [Bar01].

To apply Lemma C.2, we set the A\mathcal{A} in Lemma C.2 as the event {sign(f(X))=+1}\{\textup{{sign}}(f(\bm{X}))=+1\}. Therefore, the set

Adding both sides of Eqns. (9) and (10), we have

C.2 Margin based generalization bounds

for all w\bm{w} and γ\gamma, where for each fixed γ\gamma, we use ii to denote the smallest index such that γiγ\gamma_{i}\leq\gamma.

If xpb\|\bm{x}\|_{p}\leq b and wqa\|\bm{w}\|_{q}\leq a, where 2p<2\leq p<\infty and 1/p+1/q=11/p+1/q=1, then γ>0\forall\gamma>0,

The theorem is a straightforward result of Lemmas C.3 and C.4 with

and γi=b/2i\gamma_{i}=b/2^{i} and pi=1/2ip_{i}=1/2^{i}. ∎

Therefore, we can design the following algorithm—Algorithm 2.

C.3 A lemma

We denote by f():=2η()1f^{*}(\cdot):=2\eta(\cdot)-1 the Bayes decision rule throughout the proofs.

Appendix D Extra Experimental Results

In this section, we provide extra experimental results to verify the effectiveness of our proposed method TRADES.

We use the same model, i.e., the WRN-34-10 architecture in [ZK16], to implement the methods in [ZSLG16], [KGB17] and [RDV17]. The experimental setup is the same as TRADES, which is specified in the beginning of Section 5. For example, we use the same batch size and learning rate for all the methods. More specifically, we find that using one-step adversarial perturbation method like FGSM in the regularization term, defined in [KGB17], cannot defend against FGSMk (white-box) attack. Therefore, we use FGSMk with the cross-entropy loss to calculate the adversarial example X\bm{X}^{\prime} in the regularization term, and the perturbation step size η1\eta_{1} and number of iterations KK are the same as in the beginning of Section 5.

As for defense models in Table 5, we implement the ‘TRADES’ models, the models trained by using other regularization losses in [KGB17, RDV17, ZSLG16], and the defense model ‘Madry’ in the antepenultimate line of Table 5. We evaluate [WSMK18]’s model based on the checkpoint provided by the authors. The rest of the models in Table 5 are reported in [ACW18].

D.2 Extra attack results in Section 5.3.1

Extra white-box attack results are provided in Table 8.

D.3 Extra attack results in Section 5.3.2

Extra black-box attack results are provided in Table 9 and Table 10. We apply black-box FGSM attack on the MNIST dataset and the CIFAR10 dataset.

D.4 Experimental setup in Section 5.3.2

The robust accuracy of [MMS+18]’s CNN model is 96.01%96.01\% on the MNIST dataset. The robust accuracy of [MMS+18]’s WRN-34-10 model is 47.66%47.66\% on the CIFAR10 dataset. Note that we use the same white-box attack method introduced in the Section 5.3.1, i.e., FGSM20, to evaluate the robust accuracies of Madry’s models.

D.5 Interpretability of the robust models trained by TRADES

D.5.2 Adversarial examples on Bird-or-Bicycle dataset

We find that the robust models trained by TRADES have strong interpretability. To see this, we apply a (spatial-tranformation-invariant) version of TRADES to train ResNet-50 models in response to the unrestricted adversarial examples of the bird-or-bicycle dataset [BCZ+18]. The dataset is bird-or-bicycle, which consists of 30,000 pixel-224×224224\times 224 images with label either ‘bird’ or ‘bicycle’. The unrestricted threat models include structural perturbations, rotations, translations, resizing, 17+ common corruptions, etc.

We show in Figures 7 and 8 the adversarial examples by the boundary attack with random spatial transformation on our robust model trained by the variant of TRADES. The boundary attack [BRB18] is a black-box attack method which searches for data points near the decision boundary and attack robust models by these data points. Therefore, the adversarial images obtained by boundary attack characterize the images around the decision boundary of robust models. We attack our model by boundary attack with random spatial transformations, a baseline in the competition. The classification accuracy on the adversarial test data is as high as 95%95\% (at 80%80\% coverage), even though the adversarial corruptions are perceptible to human. We observe that the robust model trained by TRADES has strong interpretability: in Figure 7 all of adversarial images have obvious feature of ‘bird’, while in Figure 8 all of adversarial images have obvious feature of ‘bicycle’. This shows that images around the decision boundary of truly robust model have features of both classes.