Analysis of classifiers' robustness to adversarial perturbations

Alhussein Fawzi, Omar Fawzi, Pascal Frossard

Introduction

State-of-the-art deep networks have recently been shown to be surprisingly unstable to adversarial perturbations (Szegedy et al, 2014). Unlike random noise, adversarial perturbations are minimal perturbations that are sought to switch the estimated label of the classifier. On vision tasks, the results of Szegedy et al (2014) have shown that perturbations that are hardly perceptible to the human eye are sufficient to change the decision of a deep network, even if the classifier has a performance that is close to the human visual system. This surprising instability raises interesting theoretical questions that we initiate in this paper. What causes classifiers to be unstable to adversarial perturbations? Are deep networks the only classifiers that have such unstable behaviour? Is it at all possible to design training algorithms to build deep networks that are robust or is the instability to adversarial noise an inherent feature of all deep networks? Can we quantify the difference between random noise and adversarial noise? Providing theoretical answers to these questions is crucial in order to achieve the goal of building classifiers that are robust to adversarial hostile perturbations.

In this paper, we introduce a framework to formally study the robustness of classifiers to adversarial perturbations in the binary setting. We provide a general upper bound on the robustness of classifiers to adversarial perturbations, and then illustrate and specialize the obtained upper bound for the families of linear and quadratic classifiers. In both cases, our results show the existence of a fundamental limit on the robustness to adversarial perturbations. This limit is expressed in terms of a distinguishability measure between the classes, which depends on the considered family of classifiers. Specifically, for linear classifiers, the distinguishability is defined as the distance between the means of the two classes, while for quadratic classifiers, it is defined as the distance between the matrices of second order moments of the two classes. For both classes of functions, our upper bound on the robustness is valid for all classifiers in the family independently of the training procedure, and we see the fact that the bound is independent of the training procedure as a strength. This result has the following important implication: in difficult classification tasks involving a small value of distinguishability, any classifier in the set with low misclassification rate is vulnerable to adversarial perturbations. Importantly, the distinguishability parameter related to quadratic classifiers is much larger than that of linear classifiers for many datasets of interest, and suggests that it is harder to find adversarial examples for more flexible classifiers. We further compare the robustness to adversarial perturbations of linear classifiers to the more traditional notion of robustness to random uniform noise. The latter robustness is shown to be larger than the former by a factor of $\sqrt{d}$ (with $d$ the dimension of input signals), thereby showing that in high dimensional classification tasks, linear classifiers can be robust to random noise even for small values of the distinguishability. We illustrate the newly introduced concepts and our theoretical results on a running example used throughout the paper. We complement our theoretical analysis with experimental results, and show that the intuition obtained from the theoretical analysis also holds for more complex classifiers.

The phenomenon of adversarial instability has recently attracted a lot of attention from the deep network community. Following the original paper (Szegedy et al, 2014), several attempts have been made to make deep networks robust to adversarial perturbations (Chalupka et al, 2014; Gu and Rigazio, 2014). Moreover, a distinct but related phenomenon has been explored in Nguyen et al (2014). Closer to our work, the authors of (Goodfellow et al, 2015) provided an empirical explanation of the phenomenon of adversarial instability, and designed an efficient method to find adversarial examples. Specifically, contrarily to the original explanation provided in Szegedy et al (2014), the authors argue that it is the “linear” nature of deep nets that causes the adversarial instability. Instead, our paper adopts a rigorous mathematical perspective to the problem of adversarial instability and shows that adversarial instability is due to the low flexibility of classifiers compared to the difficulty of the classification task.

Our work should not be confused with works on the security of machine learning algorithms under adversarial attacks (Biggio et al, 2012; Barreno et al, 2006; Dalvi et al, 2004). These works specifically study attacks that manipulate the learning system (e.g., change the decision function by injecting malicious training points), as well as defense strategies to counter these attacks. This setting significantly differs from ours, as we examine the robustness of a fixed classifier to adversarial perturbations (that is, the classifier cannot be manipulated). The stability of learning algorithms has also been defined and extensively studied in (Bousquet and Elisseeff, 2002; Lugosi and Pawlak, 1994). Again, this notion of stability differs from the one studied here, as we are interested in the robustness of fixed classifiers, and not of learning algorithms.

The construction of learning algorithms that achieve robustness of classifiers to data corruption has been an active area of research in machine learning and robust optimization (see e.g., Caramanis et al (2012) and references therein). For a specific disturbance model on the data samples, the robust optimization approach for constructing robust classifiers seeks to minimize the worst possible empirical error under such disturbances (Lanckriet et al, 2003; Xu et al, 2009). It is shown that, for many disturbance models, the desired objective function can be written as a tractable convex optimization problem. Our work studies the robustness of classifiers from a different perspective; we establish upper bounds on the robustness of classifiers independently of the learning algorithms. That is, using our bounds, we can certify the instability of a class of classifiers to adversarial perturbations, independently of the learning mechanism. In other words, while algorithmic and optimization aspects of robust classifiers have been studied in the above works, we focus on fundamental limits on the adversarial robustness of classifiers that are independent of the learning scheme.

The paper is structured as follows: Sec. 2 introduces the problem setting. In Sec. 3, we introduce a running example that is used throughout the paper. We introduce in Sec. 4 a theoretical framework for studying the robustness to adversarial perturbations. In the following two sections, two case studies are analyzed in detail. The robustness of linear classifiers (to adversarial and random noise) is studied in Sec. 5. In Sec. 6, we study the adversarial robustness of quadratic classifiers. Experimental results illustrating our theoretical analysis are given in Section 7. Proofs and additional discussion on the choice of the norms to measure perturbations are finally deferred to the appendix.

Problem setting

In words, $\rho_{\text{adv}}(f)$ is defined as the average norm of the minimal perturbations required to flip the estimated labels of the datapoints. Note that $\rho_{\text{adv}}(f)$ is a property of both the classifier $f$ and the distribution $\mu$ , but it is independent of the true labels of the datapoints $y$ .In that aspect, our definition slightly differs from the one proposed in Szegedy et al (2014), which defines the robustness to adversarial perturbations as the average of the norms of the minimal perturbations required to misclassify all datapoints. As our notion of robustness is larger, the upper bounds derived in our paper also directly apply for the definition of robustness in Szegedy et al (2014). Moreover, it should be noted that $\rho_{\text{adv}}$ is different from the margin considered by SVMs. In fact, SVM margins are traditionally defined as the minimal distance to the (linear) boundary over all training points, while $\rho_{\text{adv}}$ is defined as the average distance to the boundary over all training points. In addition, distances in our case are measured in the input space, while the margin is defined in the feature space for kernel SVMs.

In this paper, we also study the robustness of classifiers to random uniform noise, that we define as follows. For a given $\epsilon\in$ , let

We summarize the quantities of interest in Table 1.

Running example

We introduce in this section a running example used throughout the paper to illustrate the notion of adversarial robustness, and highlight its difference with the notion of risk. We consider a binary classification task on square images of size $\sqrt{d}\times\sqrt{d}$ . Images of class $1$ (resp. class $-1$ ) contain exactly one vertical line (resp. horizontal line), and a small constant positive number $a$ (resp. negative number $-a$ ) is added to all the pixels of the images. That is, for class $1$ (resp. $-1$ ) images, background pixels are set to $a$ (resp. $-a$ ), and pixels belonging to the line are equal to $1+a$ (resp. $1-a$ ). Fig. 2 illustrates the classification problem for $d=25$ . The number of datapoints to classify is $N=2\sqrt{d}$ .

Clearly, the most relevant concept (in terms of visual appearance) that permits to separate the two classes is the orientation of the line (i.e., horizontal vs. vertical). The bias of the image (i.e., the sum of all its pixels) is also a valid concept for this task, as it separates the two classes, despite being much more difficult to detect visually. The class of an image can therefore be correctly estimated from its orientation or from the bias. Let us first consider the linear classifier defined by

where $\mathbf{1}$ is the vector of size $d$ whose entries are all equal to $1$ , and $x$ is the vectorized image, exploits the difference of bias between the two classes and achieves a perfect classification accuracy for all $a>0$ . Indeed, a simple computation gives $f_{\text{lin}}(x)=\sqrt{d}a$ (resp. $f_{\text{lin}}(x)=-\sqrt{d}a$ ) for class $1$ (resp. class $-1$ ) images. Therefore, the risk of $f_{\text{lin}}$ is $R(f_{\text{lin}})=0$ . It is important to note that $f_{\text{lin}}$ only achieves zero risk because it captures the bias, but fails to distinguish between the images based on the orientation of the line. Indeed, when $a=0$ , the datapoints are not linearly separable. Despite its perfect accuracy for any $a>0$ , $f_{\text{lin}}$ is not robust to small adversarial perturbations when $a$ is small, as a minor perturbation of the bias switches the estimated label. Indeed, a simple computation gives $\rho_{\text{adv}}(f_{\text{lin}})=\sqrt{d}a$ ; therefore, the adversarial robustness of $f_{\text{lin}}$ can be made arbitrarily small by choosing $a$ to be small enough. More than that, among all linear classifiers that satisfy $R(f)=0$ , $f_{\text{lin}}$ is the one that maximizes $\rho_{\text{adv}}(f)$ (as we show later in Section 5). Therefore, all zero-risk linear classifiers are not robust to adversarial perturbations, for this classification task.

Unlike linear classifiers, a more flexible classifier that correctly captures the orientation of the lines in the images will be robust to adversarial perturbation, unless this perturbation significantly alters the image and modifies the direction of the line. To illustrate this point, we compare the adversarial robustness of $f_{\text{lin}}$ to that of a second order polynomial classifier $f_{\text{quad}}$ that achieves zero risk in Fig. 3, for $d=4$ .We postpone the detailed analysis of $f_{\text{quad}}$ to Section 6. While a hardly perceptible change of the image is sufficient to switch the estimated label for the linear classifier, the minimal perturbation for $f_{\text{quad}}$ is one that modifies the direction of the line, to a great extent.

The above example highlights several important facts, which are summarized as follows:

Risk and adversarial robustness are two distinct properties of a classifier. While $R(f_{\text{lin}})=0$ , $f_{\text{lin}}$ is definitely not robust to small adversarial perturbations.The opposite is also possible, since a constant classifier (e.g., $f(x)=1$ for all $x$ ) is clearly robust to perturbations, but does not achieve good accuracy. This is due to the fact that $f_{\text{lin}}$ only captures the bias in the images and ignores the orientation of the line.

To capture orientation (i.e., the most visual concept), one has to use a classifier that is flexible enough for the task. Unlike the class of linear classifiers, the class of polynomial classifiers of degree $2$ correctly captures the line orientation, for $d=4$ .

The robustness to adversarial perturbations provides a quantitative measure of the strength of a concept. Since $\rho_{\text{adv}}(f_{\text{lin}})\ll\rho_{\text{adv}}(f_{\text{quad}})$ , one can confidently say that the concept captured by $f_{\text{quad}}$ is stronger than that of $f_{\text{lin}}$ , in the sense that the essence of the classification task is captured by $f_{\text{quad}}$ , but not by $f_{\text{lin}}$ (while they are equal in terms of misclassification rate). In general classification problems, the quantity $\rho_{\text{adv}}(f)$ provides a natural way to evaluate and compare the learned concept; larger values of $\rho_{\text{adv}}(f)$ indicate that stronger concepts are learned, for comparable values of the risk.

As illustrated in the above toy example, the robustness to adversarial perturbations is key to assess the strength of a concept. In real-world classification tasks, weak concepts correspond to partial information about the classification task (which are possibly sufficient to achieve a good accuracy), while strong concepts capture the essence of the classification task.

Upper limit on the adversarial robustness

We now introduce our theoretical framework for analyzing the robustness to adversarial perturbations. We first present a key assumption on the classifier $f$ for the analysis of adversarial robustness.

Assumption (A). There exist $\tau>0$ and $0<\gamma\leq 1$ such that, for all $x\in\mathcal{B}$ ,

where $\operatorname{dist}(x,S)=\min_{y}\{\|x-y\|_{2}:y\in S\}$ and $S_{+}$ (resp. $S_{-}$ ) is the set of points $x$ such that $f(x)\geq 0$ (resp. $f(x)\leq 0$ ):

In words, the assumption (A) states that for any datapoint $x$ , the residual $\max(0,f(x))$ (resp. $\max(0,-f(x))$ ) can be used to bound the distance from $x$ to a datapoint $y$ classified $-1$ (resp. $1$ ).

Bounds of the form Eq. (6) have been established for various classes of functions since the early of work of Lojasiewicz (Łojasiewicz, 1961) in algebraic geometry and have found applications in areas such as mathematical optimization (Pang, 1997; Lewis and Pang, 1998). For example, Lojasiewicz (Łojasiewicz, 1961) and later (Luo and Pang, 1994) have shown that, quite remarkably, assumption (A) holds for the general class of analytic functions. In (Ng and Zheng, 2003), (A) is shown to hold with $\gamma=1$ for piecewise linear functions. In (Luo and Luo, 1994), error bounds on polynomial systems are studied. Proving inequality (6) with explicit constants $\tau$ and $\gamma$ for different classes of functions is still an active area of research (Li et al, 2014). In Sections 5 and 6, we provide examples of function classes for which (A) holds, and explicit formulas for the parameters $\tau$ and $\gamma$ .

The following result establishes a general upper bound on the robustness to adversarial perturbations:

Let $f$ be an arbitrary classifier that satisfies (A) with parameters $(\tau,\gamma)$ . Then,

The proof can be found in Appendix A.1. The above result provides an upper bound on the adversarial robustness that depends on the risk of the classifier, as well as a measure of the separation between the expectations of the classifier values computed on distribution $\mu_{1}$ and $\mu_{-1}$ . This result is general, as we only assume that $f$ satisfies assumption $(A)$ . In the next two sections, we apply Lemma 4.1 to two classes of classifiers, and derive interpretable upper bounds in terms of a distinguishibality measure (that depends only on the dataset) which quantifies the notion of difficulty of a classification task. Studying the general result in Lemma 4.1 through two practical classes of classifiers shows the implications of such a fundamental limit on the adversarial robustness, and illustrates the methodology for deriving class-specific and practical upper bounds on adversarial robustness from the general upper bound.

Robustness of linear classifiers to adversarial and random perturbations

The goal of this section is two-fold; first, we specialize Lemma 4.1 to the class of linear functions, and derive interpretable upper bounds on the robustness of classifiers to adversarial perturbations (Section 5.1). Then, we derive a formal relation between the robustness of linear classifiers to adversarial robustness, and the robustness to random uniform noise (Section 5.2).

We define the classification function $f(x)=w^{T}x+b$ . Note that any linear classifier for which $|b|>M\|w\|_{2}$ is a trivial classifier that assigns the same label to all points, and we therefore assume that $|b|\leq M\|w\|_{2}$ .

We first show that the family of linear classifiers satisfies assumption (A), with explicit parameters $\tau$ and $\gamma$ .

Assumption (A) holds for linear classifiers $f(x)=w^{T}x+b$ with $\tau=1/\|w\|_{2}$ and $\gamma=1$ .

Let $x$ be such that $f(x)\geq 0$ , and the goal is to prove that $\text{dist}(x,S_{-})\leq\tau f(x)^{\gamma}$ (the other inequality can be handled in a similar way). We have $f(x)=w^{T}x+b$ . $\text{dist}(x,S_{-})=f(x)/\|w\|_{2}\implies\tau=1/\|w\|_{2},\gamma=1$ . ∎

Using Lemma 4.1, we now derive an interpretable upper bound on the robustness to adversarial perturbations. In particular, the following theorem bounds $\rho_{\text{adv}}(f)$ from above in terms of the first moments of the distributions $\mu_{1}$ and $\mu_{-1}$ , and the classifier’s risk:

Let $f(x)=w^{T}x+b$ such that $|b|\leq M\|w\|_{2}$ . Then,

In the balanced setting where $p_{1}=p_{-1}=1/2$ , and if the intercept $b=0$ the following inequality holds:

Using Lemma 4.1 with $\tau=1/\|w\|_{2}$ and $\gamma=1$ , we have

$b(p_{1}-p_{-1})\leq M\|w\|_{2}|p_{1}-p_{-1}|$ using the assumption $|b|\leq M\|w\|_{2}$ ,

$\|f\|_{\infty}=\max_{x:\|x\|_{2}\leq M}\{|w^{T}x+b|\}\leq 2M\|w\|_{2}$ .

By plugging the three inequalities in Eq. (9), we obtain the desired result in Eq. (7).

When $p_{1}=p_{-1}=1/2$ , and the intercept $b=0$ , inequality (iii) can be tightened to $\|f\|_{\infty}\leq M\|w\|_{2}$ , and directly leads to the stated result Eq. (8). ∎

2 Random uniform noise

We now examine the robustness of linear classifiers to random uniform noise. The following theorem compares the robustness of linear classifiers to random uniform noise with their robustness to adversarial perturbations.

Let $f(x)=w^{T}x+b$ . For any $\epsilon\in[0,1/12)$ , we have the following bounds on $\rho_{\text{unif},\epsilon}(f)$ :

with $C_{1}(\epsilon)=(2\ln(2/\epsilon))^{-1/2}$ , $\widetilde{C_{2}}(\epsilon,d)=(1-(12\epsilon)^{1/d})^{-1/2}$ and $C_{2}(\epsilon)=(1-12\epsilon)^{-1/2}$ .

Our results can be put in perspective with the empirical results of Szegedy et al (2014), that showed a large gap between the two notions of robustness on neural networks. Our analysis provides a confirmation of this high dimensional phenomenon on linear classifiers.

3 Illustration of the results on the running example

We now focus on the robustness to uniform random noise of $f_{\text{lin}}$ . For various values of $d$ , we compute the upper and lower bounds on the robustness to random uniform noise (Theorem 5.3) of $f_{\text{lin}}$ , where we fix $\epsilon$ to $0.01$ . In addition, we compute a simple empirical estimate $\widehat{\rho}_{\text{unif},\epsilon}$ of the robustness to random uniform noise of $f_{\text{lin}}$ (see Sec. 7 for details on the computation of this estimate). The results are illustrated in Fig. 5. While the adversarial noise robustness is constant with the dimension (equal to $0.1$ , as $\rho_{\text{adv}}(f_{\text{lin}})=\sqrt{d}a$ and $a=0.1/\sqrt{d}$ ), the robustness to random uniform noise increases with $d$ . For example, for $d=2500$ , the value of $\rho_{\text{unif},\epsilon}$ is at least $15$ times larger than the adversarial robustness $\rho_{\text{adv}}$ . In high dimensions, a linear classifier is therefore much more robust to random uniform noise than adversarial noise.

Adversarial robustness of quadratic classifiers

In this section, we derive specialized upper bounds on the robustness to adversarial perturbations of quadratic classifers using Lemma 4.1.

We study the robustness to adversarial perturbations of quadratic classifiers of the form $f(x)=x^{T}Ax$ , where $A$ is a symmetric matrix. Besides the practical use of quadratic classifiers in some applications (Goldberg and Elhadad, 2008; Chang et al, 2010), they represent a natural extension of linear classifiers. The study of linear vs. quadratic classifiers provides insights into how adversarial robustness depends on the family of considered classifiers. Similarly to the linear setting, we exclude the case where $f$ is a trivial classifier that assigns a constant label to all datapoints. That is, we assume that $A$ satisfies

where $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ are the smallest and largest eigenvalues of $A$ . We moreover impose that the eigenvalues of $A$ satisfy

for some constant value $K\geq 1$ . This assumption imposes an approximate symmetry around of the extremal eigenvalues of $A$ , thereby disallowing a large bias towards any of the two classes.

We first show that the assumption (A) is satisfied for quadratic classifiers, and derive explicit formulas for $\tau$ and $\gamma$ .

Assumption (A) holds for the class of quadratic classifiers $f(x)=x^{T}Ax$ where $\lambda_{\min}(A)<0$ , $\lambda_{\max}(A)>0$ with $\tau=\max(|\lambda_{\min}(A)|^{-1/2},|\lambda_{\max}(A)|^{-1/2})$ , and $\gamma=1/2$ ,

Let $x$ be such that $f(x)\geq 0$ , and the goal is to prove that $\text{dist}(x,S_{-})\leq\tau f(x)^{\gamma}$ (the other inequality can be handled in a similar way). Assume without loss of generality that $A$ is diagonal (this can be done using an appropriate change of basis). Let $\nu=-\lambda_{\min}(A)$ . We have $f(x)=\sum_{i=1}^{d-1}\lambda_{i}x_{i}^{2}-\nu x_{d}^{2}$ . By setting $r_{i}=0$ for all $i\in\{1,\dots,d-1\}$ and $r_{d}=\text{sign}(x_{d})\sqrt{f(x)/\nu}$ , (where $\text{sign}(x)=1$ if $x\geq 0$ and $-1$ otherwise) we have

Hence, $\text{dist}(x,S_{-})\leq\|r\|_{2}=\nu^{-1/2}\sqrt{f(x)}\implies\tau=\nu^{-1/2},\gamma=1/2$ . ∎

The following result builds on Lemma 4.1 and bounds the adversarial robustness of quadratic classifiers as a function of the second order moments of the distribution and the risk.

Let $f(x)=x^{T}Ax$ , where $A$ satisfies Eqs. (12) and (13). Then,

The class of classifiers under study satisfies assumption (A) with $\tau=\max(|\lambda_{\min}(A)|^{-1/2},|\lambda_{\max}(A)|^{-1/2})$ , and $\gamma=1/2$ (see Lemma 6.1). By applying Lemma 4.1, we have

$|f(x)|=|x^{T}Ax|\leq\|A\|\|x\|\leq\|A\|M$ ,

$\|A\|^{1/2}\tau=\max(|\lambda_{\min}(A)|,|\lambda_{\max}(A)|)^{1/2}\max(|\lambda_{\min}(A)|^{-1/2},|\lambda_{\max}(A)|^{-1/2})\leq\sqrt{K}$ .

Applying these three inequalities, we obtain

In words, the upper bound on the adversarial robustness depends on a distinguishability measure, defined by $\|C_{1}-C_{-1}\|_{*}$ , and the classifier’s risk. In difficult classification tasks, where $\|C_{1}-C_{-1}\|_{*}$ is small, any quadratic classifier with low risk that satisfies our assumptions in Eq. (12, 13) is not robust to adversarial perturbations.

It should be noted that, while the distinguishability is measured with the distance between the means of the two distributions in the linear case, it is defined here as the difference between the second order moments matrices $\|C_{1}-C_{-1}\|_{*}$ . Therefore, in classification tasks involving two distributions with close means, and different second order moments, any zero-risk linear classifier will not be robust to adversarial noise, while zero-risk and robust quadratic classifiers are a priori possible according to our upper bound in Theorem 6.2. This suggests that robustness to adversarial perturbations can be larger for more flexible classifiers, for comparable values of the risk.

2 Illustration of the results on the running example

We now illustrate our results on the running example of Section 3, with $d=4$ . In this case, a simple computation gives $\|C_{1}-C_{-1}\|_{*}=2+8a\geq 2$ . This term is significantly larger than the difference of means (equal to $4a$ ), and there is therefore hope to have a quadratic classifier that is accurate and robust to small adversarial perturbations, according to Theorem 6.2. In fact, the following quadratic classifier

outputs $1$ for vertical images, and $-1$ for horizontal images (independently of the bias $a$ ). Therefore, $f_{\text{quad}}$ achieves zero risk on this classification task, similarly to $f_{\text{lin}}$ . The two classifiers however have different robustness properties to adversarial perturbations. Using straightforward calculations, it can be shown that $\rho_{\text{adv}}(f_{\text{quad}})=1/\sqrt{2}$ , for any value of $a$ (see Appendix B for more details). For small values of $a$ , we therefore get $\rho_{\text{adv}}(f_{\text{lin}})\ll\rho_{\text{adv}}(f_{\text{quad}})$ . This result is intuitive, as $f_{\text{quad}}$ differentiates the images from their orientation, unlike $f_{\text{lin}}$ that uses the bias to distinguish them. The minimal perturbation required to switch the estimated label of $f_{\text{quad}}$ is therefore one that modifies the direction of the line, while a hardly perceptible perturbation that modifies the bias is enough to flip the label for $f_{\text{quad}}$ . This explains the result originally illustrated in Fig. 3.

Experimental results

In this section, we illustrate our results on practical classification examples. Specifically, through experiments on real data, we seek to confirm the identified limit on the robustness of classifiers, and we show the large gap between adversarial and random robustness on real data. We also study more general classifiers to suggest that the trends obtained with our theoretical results are not limited to linear and quadratic classifiers.

Given a binary classifier $f$ , and a datapoint $x$ , we use an approach close to that of Szegedy et al (2014) to approximate $\Delta_{\text{adv}}(x;f)$ . Specifically, we perform a line search to find the maximum $c>0$ for which the minimizer of the following problem satisfies $f(x)f(x+r)\leq 0$ :

where we set $L(x)=\max(0,x)$ . The above problem (for $c$ fixed) is solved with a subgradient procedure, and we denote by $\widehat{\Delta}_{\text{adv}}(x;f)$ the obtained solution.This procedure is not guaranteed to provide the optimal solution (for arbitrary classifiers $f$ ), as the problem is clearly non convex. Strictly speaking, the optimization procedure is only guaranteed to provide an upper bound on $\Delta_{\text{adv}}(x;f)$ . The empirical robustness to adversarial perturbations is then defined by $\widehat{\rho}_{\text{adv}}(f)=\frac{1}{m}\sum_{i=1}^{m}\widehat{\Delta}_{\text{adv}}(x_{i};f)$ , where $x_{1},\dots,x_{m}$ denote the training points. To evaluate the robustness of $f$ , we compare $\widehat{\rho}_{\text{adv}}(f)$ to the following quantity:

It represents the average norm of the minimal perturbation required to “transform” a training point to a training point of the opposite class, and can be seen as a distance measure between the two classes. The quantity $\kappa$ therefore provides a baseline for comparing the robustness to adversarial perturbations, and we say that $f$ is not robust to adversarial perturbations when $\widehat{\rho}_{\text{adv}}(f)\ll\kappa$ . We also compare the adversarial robustness of the classifiers with their robustness to random uniform noise. We estimate $\Delta_{\text{unif},\epsilon}(x;f)$ using a line search procedure that finds the largest $\eta$ for which the condition

2 Binary classification using SVM

We perform experiments on several classifiers: linear SVM (denoted L-SVM), SVM with polynomial kernels of degree $q$ (denoted poly-SVM ( $q$ )), and SVM with RBF kernel with a width parameter $\sigma^{2}$ (RBF-SVM( $\sigma^{2}$ )). To train the classifiers, we use the efficient Liblinear (Fan et al, 2008) and LibSVM (Chang and Lin, 2011) implementations, and we fix the regularization parameters using a cross-validation procedure.

We now turn to a natural image classification task, with images taken from the CIFAR-10 database (Krizhevsky and Hinton, 2009). The database contains $10$ classes of $32\times 32$ RGB images. We restrict the dataset to the first two classes (“airplane” and “automobile”), and consider a subset of the original data, with $1,000$ images for training, and $1,000$ for testing. Moreover, all images are normalized to be of unit Euclidean norm. Compared to the first dataset, this task is more difficult, as the variability of the images is much larger than for digits. We report the results in Table 3. It can be seen that all classifiers are not robust to adversarial perturbations for this experiment, as $\rho_{\text{adv}}(f)\ll\kappa=0.39$ . Despite that, all classifiers (except L-SVM) achieve an accuracy around $85\%$ , and a training accuracy above $92\%$ , and are robust to uniform random noise. Fig. 7 illustrates the robustness to adversarial and random noise of the learned classifiers, on an example image of the dataset. Compared to the digits dataset, the distinguishability measures for this task are smaller (see Table 4). Our theoretical analysis therefore predicts a lower limit on the adversarial robustness of linear and quadratic classifiers for this task (even though the bound for quadratic classifiers is far from the achieved robustness of poly-SVM(2) in this example).

The instability of all classifiers to adversarial perturbations on this task suggests that the essence of the classification task was not correctly captured by these classifiers, even if a fairly good test accuracy is reached. To reach better robustness, two possibilities exist: use a more flexible family of classifiers (as our theoretical results suggest that more flexible families of classifiers achieve better robustness), or use a better training algorithm for the tested nonlinear classifiers. The latter solution seems possible, as the theoretical limit for quadratic classifiers suggests that there is still room to improve the robustness of these classifiers.

3 Multiclass classification using CNN

Since our theoretical results suggest that more flexible classifiers achieve better robustness to adversarial perturbations in the binary case, we now explore empirically whether the same intuitions hold in scenarios that depart from the theory in two different ways: (i) we consider multiclass classification problems, and (ii) we consider convolutional neural network architectures. While classifiers’ flexibility is relatively well quantified for polynomial classifiers by the degree of the polynomials, this is not straightforward to do for neural network architectures. In this section, we examine the effect of breadth and depth on the robustness to adversarial perturbations of classifiers.

We perform experiments on the multiclass CIFAR-10 classification task, and use the recent method in (Moosavi-Dezfooli et al, 2016) to compute adversarial examples in the multiclass case. We focus on baseline CNN classifiers, and learn architectures with 1, 2 and 3 hidden layers. Specifically, each layer consists of a successive combination of convolutional, rectified linear units and pooling operations. The convolutional layers consist of 5 $\times$ 5 filters with 50 feature maps for each layer, and the pooling operations are done on a window of size 3 $\times$ 3 with a stride parameter of 2. We build the three architectures gradually, by successively stacking a new hidden layer on top of the previous architecture (kept fixed). The last hidden layer is then connected to a fully connected layer, and the softmax loss is used. All architectures are trained with stochastic gradient descent. To provide a fair comparison of the different classifiers, all three classifiers have approximately similar classification error ( $35\%$ ). To ensure similar accuracies, we perform an early stop of the training procedure when necessary. The empirical normalized robustness to adversarial perturbations of the three networks are compared in Figure 8 (a).More precisely, we report $\frac{1}{m}\sum_{i=1}^{m}\frac{\hat{\Delta}_{\text{adv}}(x_{i};f)}{\|x_{i}\|_{2}}$ .

We observe first that increasing the depth of the network leads to a significant increase in the robustness to adversarial perturbations, especially from $1$ to $2$ layers. The depth of a neural network has an important impact on the robustness of the classifier, just like the degree of a polynomial classifier is an important factor for the robustness. Going from $2$ to $3$ layers however seems to have a marginal effect on the robustness. It should be noted that, despite the increase of the robustness with the depth, the normalized robustness computed for all classifiers is relatively small, which suggests that none of these classifiers is really robust to adversarial perturbations. Note also that the results in Figure 8 (a) showing an increase of the robustness with the depth are inline with recent results showing that depth provides robustness to adversarial geometric transformations (Fawzi and Frossard, 2015). In Fig. 8 (b), we show the effect of the number of feature maps in the CNN (for a one layer CNN) on the estimated normalized robustness to adversarial perturbations. Unlike the effect of depth, we observe that the number of feature maps has barely any effect on the robustness to adversarial perturbations. Finally, a comparison of the normalized robustness measures of very deep networks VGG-16 and VGG-19 (Simonyan and Zisserman, 2014) on ImageNet shows that these two networks behave very similarly in terms of robustness (both achieve a normalized robustness of $3\cdot 10^{-3}$ ). This experiment, along with the experiment in Figure 8 (a), empirically suggest that adding layers on top of shallow network helps in terms of adversarial robustness, but if the depth of the network is already sufficiently large, then adding layers only moderately changes that robustness.

Discussion and perspectives

The existence of a general limit on the adversarial robustness of classifiers (established in Lemma 4.1) is an important phenomenon with many practical implications. To better understand the implications of this limit, we derived specialized upper bounds for two families of classifiers. For the family of linear classifiers, the established limit is very small for most problems of interest. Hence, linear classifiers are usually not robust to adversarial noise (even though robustness to random noise might be achieved). This is however different for nonlinear classifiers: for the family of quadratic classifiers, the limit on adversarial robustness is usually larger than for linear classifiers, which gives hope to have classifiers that are robust to adversarial perturbations. In fact, by using an appropriate training procedure, it might be possible to get closer to the theoretical bound. For general nonlinear classifiers (e.g., neural networks), designing training procedures that specifically take into account the robustness in the learning is an important future work. We also believe that the application of our general upper bound in Lemma 4.1 to derive explicit upper bounds that are specific to e.g., deep neural networks is an important future work. To do that, we believe that it is important to derive explicitly the parameters $(\tau,\gamma)$ of assumption (A) for the class of functions under consideration. Results from algebraic geometry seem to suggest that establishing such results might be possible for general classes of functions (e.g., piecewise linear functions). In addition, experimental results suggest that, unlike the breadth of the neural network, the depth plays a crucial role in the adversarial robustness. Identifying an upper bound on the adversarial robustness of deep neural networks in terms of the depth of the network would be a great step towards having a better understanding of such systems.

Acknowledgments

We thank Hamza Fawzi, Ian Goodfellow for discussions and comments on an early draft of the paper, and Guillaume Aubrun for pointing out a reference for Theorem A.2. We also thank Seyed Mohsen Moosavi for his help in preparing experiments.

References

Appendix A Proofs

We begin by proving the following inequality:

Let $z_{1},\dots,z_{n}$ be non-negative real numbers, and let $0\leq\gamma\leq 1$ . Then,

is bounded from above by $n^{1-\gamma}$ . To do so, let $u_{i}=\frac{z_{i}}{\sum_{i=1}^{n}z_{i}}$ , and let us determine the maximum of the concave function $g(u_{1},\dots,u_{n-1})=u_{1}^{\gamma}+\dots+(1-u_{1}-\dots-u_{n-1})^{\gamma}$ . Setting the derivative of $g$ with respect to $u_{i}$ to zero, we get

hence $u_{i}=1-u_{1}-\dots-u_{n-1}$ . We therefore get $u_{1}=\dots=u_{n-1}$ , and conclude that the maximum of $\sum_{i=1}^{n}\left(\frac{z_{i}}{\sum_{i=1}^{n}z_{i}}\right)^{\gamma}$ is reached when $z_{1}=\dots=z_{n}$ and the value of the maximum is $n^{1-\gamma}$ . ∎

Using assumption (A), the following upper bounds hold:

Hence, we obtain the following inequality on $\rho_{\text{adv}}(f)$ :

We use the result in Lemma A.1 with $n=4$ , and obtain

Note moreover that the following equality holds

A.2 Proof of Theorem 5.3

The proof of this theorem relies on the concentration of measure on the sphere. The following result from (Matoušek, 2002) precisely bounds the measure of a spherical cap.

Based on Theorem A.2, we show the following result:

Using an appropriate change of basis, we can assume that $w=(1,0,\dots,0)^{T}$ . For $\tau\in[\sqrt{2/d},1)$ , we have

where (a) uses the upper bound of Theorem A.2, and (b) uses the inequality $(1-\tau^{2})\leq\exp(-\tau^{2})$ . Note moreover that for $\tau\in[0,\sqrt{2/d})$ , the inequality $2\exp(-\tau^{2}d/2)\geq 2\exp(-1)\geq 1/2$ holds, which proves the upper bound.

We now prove the lower bound. Observe that the following lower bound holds for $(\tau\sqrt{d})^{-1}$ , for any $\tau\in[\sqrt{2/d},1)$ :

To see this, note that the maximum of the function $a\mapsto\ln(a)/a^{2}$ is equal to $1/(2e)\leq 1/2$ . Therefore, $\ln(\tau\sqrt{d})/(\tau^{2}d)\leq 1/2$ , or equivalently, $(\tau\sqrt{d})^{-1}\geq\exp(-\tau^{2}d/2)$ . Therefore, we get $\frac{1}{\tau\sqrt{d}}\geq(1-\tau^{2})^{d/2}$ , and using Theorem A.2, we obtain for any $\tau\in[\sqrt{2/d},1)$ :

Note also that this inequality holds for $\tau\in[0,\sqrt{2/d}]$ , as $\frac{1}{12}(1-\tau^{2})^{d}\leq\frac{1}{12}$ . ∎

Armed with the concentration of measure result on the sphere, we now focus on the proof of Theorem 5.3. Let $f(x)=w^{T}x+b$ . Let $x$ be fixed such that $f(x)>0$ , and let $\eta>0$ and $\epsilon\in(0,1/12)$ . Then,

Using the upper bound in Lemma A.3, we obtain:

Using the lower bound result of Lemma A.3, we have:

We also derive a lower bound on $\Delta_{\text{unif},\epsilon}(x;f)$ of the form $C_{2}(\epsilon)\sqrt{d}\Delta_{\text{adv}}(x;f)$ by noting that

where we have used the fact that $\frac{1}{\sqrt{d(1-(12\epsilon)^{1/d})}}$ is a decreasing function of $d$ . To see that this function is indeed decreasing, note that its derivative (with respect to $d$ ) can be written as $P(d)\left(d(\overline{\epsilon}^{1/d}-1)-\overline{\epsilon}^{1/d}\ln(\overline{\epsilon})\right)$ , with $P(d)$ non-negative, and $\overline{\epsilon}=12\epsilon$ . Then, by using the inequality $\ln((1/\overline{\epsilon})^{1/d})\leq\left(1/\overline{\epsilon}\right)^{1/d}-1$ , the negativity of the derivative follows.

By combining the lower and upper bounds, and taking the expectations on both sides of the inequality, we obtain:

A similar result can be proven for $x$ such that $f(x)\leq 0$ . We therefore conclude that

where we have used the inequality $\rho_{\text{unif},\epsilon}(f)\geq\rho_{\text{adv}}(f)$ .

Appendix B Vertical-horizontal example: quadratic classifier

We consider the quadratic classifier $f_{\text{quad}}(x)=x^{T}Ax$ , with

We perform a change of basis, and work in the diagonalizing basis of $A$ , denoted by $P$ . We have

Given a point $x$ and label $y$ , the following problem is solved to find the minimal perturbation that switches the estimated label:

Appendix C Discussion on the norms used to measure the magnitude of adversarial perturbations

The goal of this section is to discuss different ways of measuring the robustness to adversarial perturbations.

Given a datapoint $x$ , let $\eta>0$ be such that we know a priori that all points in the region

The classifier $f$ is said to be not robust at $x$ if

In words, this means that there exists a point $z$ in the region $\mathcal{R}(x)$ (i.e., $z$ and $x$ are classified in the same way by a human observer), but $z$ is classified differently than $x$ by $f$ . Our main theoretical result provides upper bounds to $\rho_{\text{adv}}(f)$ (the expectation of $\Delta_{\text{adv}}(x;f)$ ) in terms of interpretable quantities (i.e., distinguishability and risk): $\rho_{\text{adv}}(f)\leq U(\mu,R(f))$ . Using this upper bound and Eq. (15), we certify that $f$ is not robust to adversarial perturbations when the following sufficient condition holds:

We assume that the image $x+r$ is of the same underlying label as $x$ if $\|r\|_{2}$ is one order of magnitude smaller than $\kappa$ . This corresponds to setting $\eta=\kappa/10$ . A summary of the different choices is shown in Table 5.

For a fixed $\eta_{0}>0$ , define the regions: