Certified Defenses against Adversarial Examples

Aditi Raghunathan, Jacob Steinhardt, Percy Liang

Introduction

While a proposed defense (classifier) is often empirically shown to be successful against the set of attacks known at the time, new stronger attacks are subsequently discovered that render the defense useless. For example, defensive distillation (Papernot et al., 2016c) and adversarial training against the Fast Gradient Sign Method (Goodfellow et al., 2015) were two defenses that were later shown to be ineffective against stronger attacks (Carlini & Wagner, 2016; Tramèr et al., 2017). In order to break this arms race between attackers and defenders, we need to come up with defenses that are successful against all attacks within a certain class.

Setup

Attack model.

Certificate on the adversarial loss

Note that this bound is tight, obtained by taking $A_{\text{opt}}(x)_{i}=x_{i}+\epsilon\operatorname*{sign}(W_{1i}-W_{2i})$ .

2 General classifiers

For more general classifiers, we cannot compute $f(A_{\text{opt}}(x))$ exactly, but motivated by the above, we can use the gradient to obtain a linear approximation $g$ :

3 Two-layer neural networks

where (i) follows from the chain rule, (ii) uses the fact that $\sigma$ has bounded derivatives $\sigma^{\prime}(z)\in$ , and (iii) follows from the identity $\|z\|_{1}=\max_{t\in^{d}}t^{\top}z$ . (Note that for sigmoid networks, where $\sigma^{\prime}(z)\in[0,\frac{1}{4}]$ , we can strengthen the above bound by a corresponding factor of $\frac{1}{4}$ .) Substituting the bound (5) into (3.2), we obtain an upper bound on the adversarial loss that we call $f_{\text{QP}}$ :

Unfortunately, (6) still involves a non-convex optimization problem (since $W^{\top}\operatorname*{diag}(v)$ is not necessarily negative semidefinite). In fact, it is similar to the NP-hard MAXCUT problem, which requires maximizing $x^{\top}Lx$ over $x\in^{d}$ for a graph with Laplacian matrix $L$ .

While MAXCUT is NP-hard, it can be efficiently approximated, as shown by the celebrated semidefinite programming relaxation for MAXCUT in Goemans & Williamson (1995). We follow a similar approach here to obtain an upper bound on $f_{\text{QP}}(x)$ .

First, to make our variables lie in $^{m}$ instead of $^{m}$ , we reparametrize $s$ to produce:

In terms of these new objects, our objective takes the form:

Note that every valid vector $y\in[-1,+1]^{m+d+1}$ satisfies the constraints $yy^{\top}\succeq 0$ and $(yy^{\top})_{jj}=1$ . Defining $P=yy^{\top}$ , we obtain the following convex semidefinite relaxation of our problem:

Note that the optimization of the semidefinite program depends only on the weights $v$ and $W$ and does not depend on the inputs $x$ , so it only needs to be computed once for a model $(v,W)$ .

Semidefinite programs can be solved with off-the-shelf optimizers, although these optimizers are somewhat slow on large instances. In Section 4 we propose a fast stochastic method for training, which only requires computing the top eigenvalue of a matrix.

The preceding arguments all generalize to the pairwise margins $f^{ij}$ , to give:

Training the certificate

where $\lambda^{ij}>0$ are the regularization hyperparameters. However, computing the gradients of the above objective involves finding the optimal solution of a semidefinite program, which is slow.

Our computational burden is lifted by the beautiful theory of duality, which provides the following equivalence between the primal maximization problem over $P$ , and a dual minimization problem over new variables $c$ (see Section A for details):

The final objective.

Using (20), we end up optimizing the following training objective:

The objective in (21) can be optimized efficiently. The most expensive operation is $\lambda_{\max}^{+}$ , which requires computing the maximum eigenvector of the matrix $M^{ij}-\operatorname*{diag}(c^{ij})$ in order to take gradients. This can be done efficiently using standard implementations of iterative methods like Lanczos. Further implementation details (including tuning of $\lambda^{ij}$ ) are presented in Section 6.3.

Dual certificate of robustness.

The dual formulation is also useful because any value of the dual is an upper bound on the optimal value of the primal. Specifically, if $(W[t],V[t],c[t])$ are the parameters at iteration $t$ of training, then

for any attack $A$ . As we train the network, we obtain a quick upper bound on the worst-case adversarial loss directly from the regularization loss, without having to optimize an SDP each time.

Other upper bounds

In Section 3, we described a function $f^{ij}_{\text{SDP}}$ that yields an efficient upper bound on the adversarial loss, which we obtained using convex relaxations. One could consider other simple ways to upper bound the loss; we describe here two common ones based on the spectral and Frobenius norms.

This measure of vulnerability to adversarial examples based on the spectral norms of the weights of each layer is considered in Szegedy et al. (2014) and Cisse et al. (2017).

Frobenius bound: For ease in training, often the Frobenius norm is regularized (weight decay) instead of the spectral norm. Since $\|W\|_{\text{F}}\geq\|W\|_{2}$ , we get a corresponding upper bound $f_{\text{frobenius}}$ :

In Section 6, we empirically compare our proposed bound using $f^{ij}_{\text{SDP}}$ to these two upper bounds.

Experiments

We evaluated our method on the MNIST dataset of handwritten digits, where the task is to classify images into one of ten classes. Our results can be summarized as follows: First, in Section 6.1, we show that our certificates of robustness are tighter than those based on simpler methods such as Frobenius and spectral bounds (Section 5), but our bounds are still too high to be meaningful for general networks. Then in Section 6.2, we show that by training on the certificates, we obtain networks with much better bounds and hence meaningful robustness. This reflects an important point: while accurately analyzing the robustness of an arbitrary network is hard, training the certificate jointly leads to a network that is robust and certifiably so. In Section 6.3, we present implementation details, design choices, and empirical observations that we made while implementing our method.

In this work, we focus on two layer networks. In all our experiments, we used neural networks with $m=500$ hidden units, and TensorFlow’s implementation of Adam (Kingma & Ba, 2014) as the optimizer; we considered networks with more hidden units, but these did not substantially improve accuracy. We experimented with both the multiclass hinge loss and cross-entropy. All hyperparameters (including the choice of loss function) were tuned based on the error of the Projected Gradient Descent (PGD) attack (Madry et al., 2017) at $\epsilon=0.1$ ; we report the hyperparameter settings below. We considered the following training objectives providing 5 different networks:

Normal training (NT-NN). Cross-entropy loss and no explicit regularization.

Frobenius norm regularization (Fro-NN). Hinge loss and a regularizer $\lambda(\|W\|_{\text{F}}+\|v\|_{2})$ with $\lambda=0.08$ .

Spectral norm regularization (Spe-NN). Hinge loss and a regularizer $\lambda(\|W\|_{2}+\|v\|_{2})$ with $\lambda=0.09$ .

Adversarial training (AT-NN). Cross-entropy with the adversarial loss against PGD

as a regularizer, with the regularization parameter set to $0.5$ . We found that this regularized loss works better than optimizing only the adversarial loss, which is the defense proposed in Madry et al. (2017). We set the step size of the PGD adversary to $0.1$ , number of iterations to $40$ , and perturbation size to $0.3$ .

Proposed training objective (SDP-NN). Dual SDP objective described in Equation 21 of Section 4. Implementation details and hyperparameter values are detailed in Section 6.3.

Evaluating upper bounds.

1 Quality of the upper bound

For each of the five networks described above, we computed upper bounds on the 0-1 loss based on our certificate (which we refer to as the “SDP bound” in this section), as well as the Frobenius and spectral bounds described in Section 5. While Section 4 provides a procedure for efficiently obtaining an SDP bound as a result of training, for networks not trained with our method we need to solve an SDP at the end of training to obtain certificates. Fortunately, this only needs to be done once for every pair of classes. In our experiments, we use the modeling toolbox YALMIP (Löfberg, 2004) with Sedumi (Sturm, 1999) as a backend to solve the SDPs, using the dual form (20); this took roughly 10 minutes per SDP (around 8 hours in total for a given model).

In Figure 2, we display average values of the different upper bounds over the $10,000$ test examples, as well as the corresponding lower bound from PGD. We find that our bound is tighter than the Frobenius and spectral bounds for all the networks considered, but its tightness relative to the PGD lower bound varies across the networks. For instance, our bound is relatively tight on Fro-NN, but unfortunately Fro-NN is not very robust against adversarial examples (the PGD attack exhibits large error). In contrast, the adversarially trained network AT-NN does appear to be robust to attacks, but our certificate, despite being much tighter than the Frobenius and spectral bounds, is far away from the PGD lower bound. The only network that is both robust and has relatively tight upper bounds is SDP-NN, which was explicitly trained to be both robust and certifiable as described in Section 4; we examine this network and the effects of training in more detail in the next subsection.

2 Evaluating proposed training objective.

In the previous section, we saw that the SDP bound, while being tighter than simpler upper bounds, could still be quite loose on arbitrary networks. However, optimizing against the SDP certificate seemed to make the certificate tighter. In this section, we explore the effect of different optimization objectives in more detail. First, we plot on a single axis the best upper bound (i.e., the SDP bound) and the lower bound (from PGD) on the adversarial loss obtained with each of the five training objectives discussed above. This is given in Figure 3(a).

Neither spectral nor Frobenius norm regularization seems to be helpful for encouraging adversarial robustness—the actual performance of those networks against the PGD attack is worse than the upper bound for SDP-NN against all attacks. This shows that the SDP certificate actually provides a useful training objective for encouraging robustness compared to other regularizers.

Separately, we can ask whether SDP-NN is robust to actual attacks. We explore the robustness of our network in Figure 3(b), where we plot the performance of SDP-NN against $3$ attacks—the PGD attack from before, the Carlini-Wagner attack (Carlini & Wagner, 2017b) (another strong attack), and the weaker Fast Gradient Sign Method (FGSM) baseline. We see substantial robustness against all $3$ attacks, even though our method was not explicitly trained with any of them in mind.

Next, we compare to other bounds reported in the literature. A rough ceiling is given by the network of Madry et al. (2017), which is a relatively large four-layer convolutional network adversarially trained against PGD. While this network has no accompanying certificate of robustness, it was evaluated against a number of attack strategies and had worst-case error $11\%$ at $\epsilon=0.3$ . Another set of numbers comes from Carlini et al. (2017), who use formal verification methods to compute $A_{\text{opt}}$ exactly on $10$ input examples for a small ( $72$ -node) variant of the Madry et al. network. The authors reported to us that this network misclassifies $6$ out of $10$ examples at $\epsilon=0.05$ (we note that $4$ out of $10$ of these were misclassified to start with, but $3$ of the $4$ can also be flipped to a different wrong class with some $\epsilon<0.07$ ).

At the value $\epsilon=0.1$ for which it was tuned, SDP-NN has error $16\%$ against the PGD attack, and an upper bound of $35\%$ error against any attack. This is substantially better than the small $72$ -node network, but also much worse than the full Madry et al. network. How much of the latter looseness comes from conservatism in our method, versus the fact that our network has only two layers? We can get some idea by considering the AT-NN network, which was trained similarly to Madry et al., but uses the same architecture as SDP-NN. From Figure 3(a), we see that the error of SDP-NN against PGD ( $16\%$ ) is not much worse than that of AT-NN ( $11\%$ ), even though AT-NN was explicitly trained against the PGD attack. This suggests that most of the gap comes from the smaller network depth, rather than from conservatism in the SDP bound. We are currently in the process of extending our approach to deeper networks, and optimistic about obtaining improved bounds with such networks.

Finally, we compare with the approach proposed in Kolter & Wong (2017) whose work appeared shortly after an initial version of our paper. They provide an upper bound on the adversarial loss using linear programs (LP) followed by a method to efficiently train networks to minimize this upper bound. In order to compare with SDP-NN, the authors provided us with a network with the same architecture as SDP-NN, but trained using their LP based objective. We call this network LP-NN.

Table 1 shows that LP-NN and SDP-NN are comparable in terms of their robustness against PGD, and the robustness guarantees that they come with.

Interestingly, the SDP and LP approaches provide vacuous bounds for networks not trained to minimize the respective upper bounds (though these networks are indeed robust). This suggests that these two approaches are comparable, but complementary. Finally, we note that in contrast to this work, the approach of Kolter & Wong (2017) extends to deeper networks, which allows them to train a four-layer CNN with a provable upper bound on adversarial error of $5.7\%$ error.

3 Implementation details

We implemented our training objective in TensorFlow, and implemented $\lambda_{\text{max}}^{+}$ as a custom operator using SciPy’s implementation of the Lanczos algorithm for fast top eigenvector computation; occasionally Lanczos fails to converge due to a small eigen-gap, in which case we back off to a full SVD. We used hinge loss as the classification loss, and decayed the learning rate in steps from $10^{-3}$ to $10^{-5}$ , decreasing by a factor of $10$ every $30$ epochs. Each gradient step involves computing top eigenvectors for $45$ different matrices, one for each pair of classes $(i,j)$ . In order to speed up computation, for each update, we randomly pick $i_{t}$ and only compute gradients for pairs $(i_{t},j),j\neq i_{t}$ , requiring only $9$ top eigenvector computations in each step.

For the regularization parameters $\lambda^{ij}$ , the simplest idea is to set them all equal to the same value; this leads to the unweighted regularization scheme where $\lambda^{ij}=\lambda$ for all pairs $(i,j)$ . We tuned $\lambda$ to $0.05$ , which led to reasonably good bounds. However, we observed that certain pairs of classes tended to have larger margins $f^{ij}(x)$ than other classes, which meant that certain label pairs appeared in the maximum of (18) much more often. That led us to consider a weighted regularization scheme with $\lambda^{ij}=w^{ij}\lambda$ , where $w^{ij}$ is the fraction of training points for which the the label $i$ (or $j$ ) appears as the maximizing term in (18). We updated the values of these weights every $20$ epochs. Figure 4(a) compares the PGD lower bound and SDP upper bound for the unweighted and weighted networks. The weighted network is better than the unweighted network for both the lower and upper bounds.

Finally, we saw in Equation 22 of Section 4 that the dual variables $c^{ij}$ provide a quick-to-compute certificate of robustness. Figure 4(b) shows that the certificates provided by these dual variables are very close to what we would obtain by fully optimizing the semidefinite programs. These dual certificates made it easy to track robustness across epochs of training and to tune hyperparameters.

Discussion

In this work, we proposed a method for producing certificates of robustness for neural networks, and for training against these certificates to obtain networks that are provably robust against adversaries.

Madry et al. (2017) perform adversarial training against PGD on the MNIST and CIFAR-10 datasets, obtaining networks that they suggest are “secure against first-order adversaries”. However, this is based on an empirical observation that PGD is nearly-optimal among gradient-based attacks, and does not correspond to any formal robustness guarantee.

Finally, the notion of a certificate appears in the theory of convex optimization, but means something different in that context; specifically, it corresponds to a proof that a point is near the optimum of a convex function, whereas here our certificates provide upper bounds on non-convex functions. Additionally, while robust optimization (Bertsimas et al., 2011) provides a tool for optimizing objectives with robustness constraints, applying it directly would involve the same intractable optimization for $A_{\text{opt}}$ that we deal with here.

Other approaches to verification.

While they have not been explored in the context of neural networks, there are approaches in the control theory literature for verifying robustness of dynamical systems, based on Lyapunov functions (Lyapunov, 1892; 1992). We can think of the activations in a neural network as the evolution of a time-varying dynamical system, and attempt to prove stability around a trajectory of this system (Tedrake et al., 2010; Tobenkin et al., 2011). Such methods typically use sum-of-squares verification (Papachristodoulou & Prajna, 2002; 2005; Parrilo, 2003) and are restricted to relatively low-dimensional dynamical systems, but could plausibly scale to larger settings. Another approach is to construct families of networks that are provably robust a priori, which would remove the need to verify robustness of the learned model; to our knowledge this has not been done for any expressive model families.

Adversarial examples and secure ML.

There has been a great deal of recent work on the security of ML systems; we provide only a sampling here, and refer the reader to Barreno et al. (2010), Biggio et al. (2014a), Papernot et al. (2016b), and Gardiner & Nagaraja (2016) for some recent surveys.

Adversarial examples for neural networks were first discovered by Szegedy et al. (2014), and since then a number of attacks and defenses have been proposed. We have already discussed gradient-based methods as well as defenses based on adversarial training. There are also other attacks based on, e.g., saliency maps (Papernot et al., 2016a), KL divergence (Miyato et al., 2015), and elastic net optimization (Chen et al., 2017); many of these attacks are collated in the cleverhans repository (Goodfellow et al., 2016). For defense, rather than making networks robust to adversaries, some work has focused on simply detecting adversarial examples. However, Carlini & Wagner (2017a) recently showed that essentially all known detection methods can be subverted by strong attacks.

As explained in Barreno et al. (2010), there are a number of different attack models beyond the test-time attacks considered here, based on different attacker goals and capabilities. For instance, one can consider data poisoning attacks, where an attacker modifies the training set in an effort to affect test-time performance. Newsome et al. (2006), Laskov & Šrndic̀ (2014), and Biggio et al. (2014b) have demonstrated poisoning attacks against real-world systems.

Other types of certificates.

Certificates of performance for machine learning systems are desirable in a number of settings. This includes verifying safety properties of air traffic control systems (Katz et al., 2017a; b) and self-driving cars (O’Kelly et al., 2016; 2017), as well as security applications such as robustness to training time attacks (Steinhardt et al., 2017). More broadly, certificates of performance are likely necessary for deploying machine learning systems in critical infrastructure such as internet packet routing (Winstein & Balakrishnan, 2013; Sivaraman et al., 2014). In robotics, certificates of stability are routinely used both for safety verification (Lygeros et al., 1999; Mitchell et al., 2005) and controller synthesis (Başar & Bernhard, 2008; Tedrake et al., 2010).

In traditional verification work, Rice’s theorem (Rice, 1953) is a strong impossibility result essentially stating that most properties of most programs are undecidable. Similarly, we should expect that verifying robustness for arbitrary neural networks is hard. However, the results in this work suggest that it is possible to learn neural networks that are amenable to verification, in the same way that it is possible to write programs that can be formally verified. Optimistically, given expressive enough certification methods and model families, as well as strong enough specifications of robustness, one could even hope to train vector representations of natural images with strong robustness properties, thus finally closing the chapter on adversarial vulnerabilities in the visual domain.

References

Appendix A Duality

In this section we justify the duality relation (20). Recall that the primal program is

Rather than taking the dual directly, we first add the redundant constraint $\operatorname*{tr}(P)\leq d+m+1$ (it is redundant because the SDP is in $d+m+1$ dimensions and $\operatorname*{diag}(P)\leq 1$ ). This yields

We now form the Lagrangian for the constraints $\operatorname*{diag}(P)\leq 1$ , leaving the other two constraints as-is. This yields the equivalent optimization problem

Now, we apply minimax duality to swap the order of min and max; the value of (27) is thus equal to

This is almost the form given in (20), except that $c$ is constrained to be non-negative and we have $\mathbf{1}^{\top}c$ instead of $\mathbf{1}^{\top}\max(c,0)$ . However, note that for the $\lambda_{\max}^{+}$ term, it is always better for $c$ to be larger; therefore, replacing $c$ with $\max(c,0)$ means that the optimal value of $c$ will always be non-negative, thus allowing us to drop the $c\geq 0$ constraint and optimize $c$ in an unconstrained manner. This finally yields the claimed duality relation (20).