Scaling provable adversarial defenses

Eric Wong, Frank R. Schmidt, Jan Hendrik Metzen, J. Zico Kolter

Introduction

In this paper, we make substantial progress towards the goal of scaling these provably robust networks to realistic sizes. Specifically, we extend the techniques of Wong and Kolter (2017) in three key ways. First, while past work has only applied to pure feedforward networks, we extend the framework to deal with arbitrary residual/skip connections (a hallmark of modern deep network architectures), and arbitrary activation functions (Dvijotham et al. (2018) also worked with arbitrary activation functions, but only for feedforward networks, and just discusses network verification rather than robust training). Second, and possibly most importantly, computing the upper bound on the robust loss in (Wong and Kolter, 2017) in the worst case scales quadratically in the number of hidden units in the network, making the approach impractical for larger networks. In this work, we use a nonlinear random projection technique to estimate the bound in manner that scales only linearly in the size of the hidden units (i.e., only a constant multiple times the cost of traditional training), and which empirically can be used to train the networks with no degradation in performance from the previous work. Third, we show how to further improve robust performance of these methods, though at the expense of worse non-robust error, using multi-stage cascade models. Through these extensions, we are able to improve substantially upon the verified robust errors obtained by past work.

Background and related work

Work in adversarial defenses typically falls in one of three primary categories. First, there is ongoing work in developing heuristic defenses against adversarial examples: (Goodfellow et al., 2015; Papernot et al., 2016; Kurakin et al., 2017; Metzen et al., 2017) to name a few. While this work is largely empirical at this point, substantial progress has been made towards developing networks that seem much more robust than previous approaches. Although a distressingly large number of these defenses are quickly “broken” by more advanced attacks (Athalye et al., 2018), there have also been some methods that have proven empirically resistant to the current suite of attacks; the recent NIPS 2017 adversarial example challenge (Kurakin et al., 2018), for example, highlights some of the progress made on developing classifiers that appear much stronger in practice than many of the ad-hoc techniques developed in previous years. Many of the approaches, though not formally verified in the strict sense during training, nonetheless have substantial theoretical justification for why they may perform well: Sinha et al. (2018) uses properties of statistical robustness to develop an approach that is not much more difficult to train and which empirically does achieve some measure of resistance to attacks; Madry et al. (2017) considers robustness to a first-order adversary, and shows that a randomized projected gradient descent procedure is optimal in this setting. Indeed, in some cases the classifiers trained via these methods can be verified to be adversarially robust using the verification techniques discussed below (though only for very small networks). Despite this progress, we believe it is also crucially important to consider defenses that are provably robust, to avoid any possible attack.

Second, our work in this paper relates closely to techniques for the formal verification of neural networks systems (indeed, our approach can be viewed as a convex procedure for verification, coupled with a method for training networks via the verified bounds). In this area, most past work focuses on using exact (combinatorial) solvers to verify the robustness properties of networks, either via Satisfiability Modulo Theories (SMT) solvers (Huang et al., 2017; Ehlers, 2017; Carlini and Wagner, 2017) or integer programming approaches (Lomuscio and Maganti, 2017; Tjeng and Tedrake, 2017; Cheng et al., 2017). These methods have the benefit of being able to reason exactly about robustness, but at the cost of being combinatorial in complexity. This drawback has so far prevented these methods from effectively scaling to large models or being used within a training setting. There have also been a number of recent attempts to verify networks using non-combinatorial methods (and this current work fits broadly in this general area). For example, Gehr et al. (2018) develop a suite of verification methods based upon abstract interpretations (these can be broadly construed as relaxations of combinations of activations that are maintained as they pass through the network). Dvijotham et al. (2018) use an approach based upon analytically solving an optimization problem resulting from dual functions of the activations (which extends to activations beyond the ReLU). However, these methods apply to simple feedforward architectures without skip connections, and focus only on verification of existing networks.

This paper fits into this third category of integrating verification into training, and makes substantial progress towards scaling these methods to realistic settings. While we cannot yet reach e.g. ImageNet scales, even in this current work, we show that it is possible to overcome the main hurdles to scalability of past approaches. Specifically, we develop a provably robust training procedure, based upon the approach in (Wong and Kolter, 2017), but extending it in three key ways. The resulting method: 1) extends to general networks with skip connections, residual layers, and activations besides the ReLU; we do so by using a general formulation based on the Fenchel conjugate function of activations; 2) scales linearly in the dimensionality of the input and number of hidden units in the network, using techniques from nonlinear random projections, all while suffering minimal degradation in accuracy; and 3) further improves the quality of the bound with model cascades. We describe each of these contributions in the next section.

Scaling provably robust networks

This section presents an architecture for constructing provably robust bounds for general deep network architectures, using Fenchel duality. Importantly, we derive the dual of each network operation in a fully modular fashion, simplifying the problem of deriving robust bounds of a network to bounding the dual of individual functions. By building up a toolkit of dual operations, we can automatically construct the dual of any network architecture by iterating through the layers of the original network.

To bound the adversarial problem, we look to its dual optimization problem using the machinery of Fenchel conjugate functions (Fenchel, 1949), described in Definition 1.

The conjugate of a function $f$ is another function $f^{*}$ defined by

Let the indicator function for the $i$ th constraint be

for $i=2,\dots,k$ , and consider the joint indicator function $\sum_{i=2}^{k}\chi_{i}(z_{1:i})$ . Then, the joint indicator is lower bounded by $\max_{\nu_{1:k}}\nu_{k}^{T}z_{k}-\nu_{1}^{T}z_{1}-\sum_{i=1}^{k-1}\chi_{i}^{*}(-\nu_{i},\nu_{i+1:k})$ , where

for $i=1,\dots,k-1$ . Note that $\chi_{i}^{*}(\nu_{i:k})$ is the exact conjugate of the indicator for the set $\{x_{i:k}:x_{j}=f_{ji}(x_{i})\;\;\forall j>i\}$ , which is different from the set indicated by $\chi_{i}$ . However, when there are no skip connections (i.e. $z_{i}$ only depends on $z_{i-1}$ ), $\chi_{i}^{*}$ is exactly the conjugate of $\chi_{i}$ .

Let $g_{ij}$ and $h_{i}$ be any functions such that

for $i=1,\dots,k-1$ . Then, the adversarial problem from Equation 2 is lower bounded by

where $\|\cdot\|_{*}$ is the dual norm, and $\nu_{1:k}=g(c)$ is the output of a $k$ layer neural network $g$ on input $c$ , given by the equations

We denote the upper bound on the conjugate function from Equation 6 a dual layer, and defer the proof to Appendix A.2. To give a concrete example, we present two possible dual layers for linear operators and ReLU activations in Corollaries 1 and 2 (their derivations are in Appendix B), and we also depict an example dual residual block in Figure 1.

The dual layer for a linear operator $\hat{z}_{i+1}=W_{i}z_{i}+b_{i}$ is

where $\mathcal{I}^{-}_{i},\mathcal{I}^{+}_{i},\mathcal{I}$ denote the index sets where the bounds are negative, positive or spanning the origin respectively, and where $D_{i}$ is a diagonal matrix with entries

We briefly note that these dual layers recover the original dual network described in Wong and Kolter (2017). Furthermore, the dual linear operation is the exact conjugate and introduces no looseness to the bound, while the dual ReLU uses the same relaxation used in Ehlers (2017); Wong and Kolter (2017). More generally, the strength of the bound from Theorem 1 relies entirely on the tightness of the individual dual layers to their respective conjugate functions in Equation 6. While any $g_{ij}$ , $h_{i}$ can be chosen to upper bound the conjugate function, a tighter bound on the conjugate results in a tighter bound on the adversarial problem.

If the dual layers for all operations are linear, the bounds for all layers can be computed with a single forward pass through the dual network using a direct generalization of the form used in Wong and Kolter (2017) (due to their similarity, we defer the exact algorithm to Appendix F). By trading off tightness of the bound with computational efficiency by using linear dual layers, we can efficiently compute all bounds and construct the dual network one layer at a time. The end result is that we can automatically construct dual networks from dual layers in a fully modular fashion, completely independent of the overall network architecture (similar to how auto-differentiation tools proceed one function at a time to compute all parameter gradients using only the local gradient of each function). With a sufficiently comprehensive toolkit of dual layers, we can compute provable bounds on the adversarial problem for any network architecture.

For other dual layers, we point the reader to two resources. For the explicit form of dual layers for hardtanh, batch normalization, residual connections, we direct the reader to Appendix B. For analytical forms of conjugate functions of other activation functions such as tanh, sigmoid, and max pooling, we refer the reader to Dvijotham et al. (2018).

. Let $\nu_{1:k}$ be the dual network from Equation 1 with linear dual layers and let $r>0$ be the projection dimension. Then, we can estimate

where $R$ is a $|z_{1}|\times r$ standard Cauchy random matrix and the median is taken over the second axis. Furthermore, we can estimate

where $R$ is a $|z_{i}|\times r$ standard Cauchy random matrix, and the median is taken over the second axis.

This estimate has two main advantages: first, it is simple to compute, as evaluating $\nu_{1}^{T}R$ involves passing the random matrix forward through the dual network (similarly, the other term requires passing a modified random matrix through the dual network; the exact algorithm is detailed in 1). Second, it is memory efficient in the backward pass, as the gradient need only propagate through the median entries.

3 Bias reduction with cascading ensembles

A final major challenge of training models to minimize a robust bound on the adversarial loss, is that the robustness penalty acts as a regularization. For example, in a two-layer ReLU network, the robust loss penalizes $\epsilon\|\nu_{1}\|_{1}=\epsilon\|W_{1}D_{1}W_{2}\|_{1}$ , which effectively acts as a regularizer on the network with weight $\epsilon$ . Because of this, the resulting networks (even those with large representational capacity), are typically overregularized to the point that many filters/weights become identically zero (i.e., the network capacity is not used).

To address this point, we advocate for using a robust cascade of networks: that is, we train a sequence of robust classifiers, where later elements of the cascade are trained (and evaluated) only on those examples that the previous elements of the cascade cannot certify (i.e., those examples that lie within $\epsilon$ of the decision boundary). This procedure is formally described in the Appendix in Algorithm 2.

Experiments

In the MNIST dataset (the only data set where it is trivial to run exact training without projection), we have evaluated our approach using different projection dimensions as well as exact training (i.e., without random projections). We note that using substantially lower projection dimension does not have a significant impact on the test error. This fact is highlighted in Figure 2. Using the same convolutional architecture used by Wong and Kolter (2017), which previously required gigabytes of memory and took hours to train, it is sufficient to use only 10 random projections to achieve comparable test error performance to training with the exact bound. Each training epoch with 10 random projections takes less than a minute on a single GeForce GTX 1080 Ti graphics card, while using less than 700MB of memory, achieving significant speedup and memory reduction over Wong and Kolter (2017). The estimation quality and the corresponding speedups obtained are explored in more detail in Appendix E.6.

Finally, we consider the performance of the cascaded versus non-cascaded models. In all cases, cascading the models is able to improve the robust error performance, sometimes substantially, for instance decreasing the robust error on CIFAR10 from 46.1% to 36.4% for $\epsilon=2/255$ . However, this comes at a cost as well: the nominal error increases throughout the cascade (this is to be expected, since the cascade essentially tries to force the robust and nominal errors to match). Thus, there is substantial value to both improving the single-model networks and integrating cascades into the prediction.

Conclusion

References

Appendix A Conjugates and lower bounds with duality

Here, we derive a lower bound on $\sum_{i=2}^{k}\chi_{i}(z_{1:i})$ . It is mathematically convenient to introduce addition variables $\hat{z}_{1:k}$ such that $\hat{z}_{i}=z_{i}$ for all $i=1,\dots,k$ , and rephrase it as the equivalent constrained optimization problem.

Note that we do not optimize over $\hat{z}_{1}$ and $z_{k}$ yet, to allow for future terms on the inputs and outputs of the network, so this is analyzing just the network structure. We introduce Lagrangian variables $\nu_{1:k},\hat{\nu}_{2:k}$ to get the following Lagrangian:

Grouping up terms by $z_{i},\hat{z}_{i}$ and rearranging the double sum results in the following expression:

From the KKT stationarity conditions for the derivative with respect to $\hat{z}_{i}$ , we know that $\hat{\nu}_{i}=\nu_{i}$ . Also note that in the summand, the last term for $i=k$ has no double summand, so we move it out for clarity.

Finally, we minimize over $z_{i}$ for $i=2,\dots,k-1$ to get the conjugate form for the lower bound via weak duality.

A.2 Proof of Theorem 1

First, we rewrite the primal problem by bringing the function and input constraints into the objective with indicator functions $I$ . We can then apply Lemma 1 to get the following lower bound on the adversarial problem:

Minimizing over $z_{1}$ and $z_{k}$ , note that

Note that if $\mathcal{B}(x)=\{x+\Delta:\|\Delta\|\leq\epsilon\}$ for some norm, then $I^{*}_{\mathcal{B}(x)}(\nu_{1})=\nu_{1}^{T}x+\epsilon\|\nu_{1}\|_{*}$ where $\|\cdot\|$ is the dual norm, but any sort of input constraint can be used so long as its conjugate can be bounded. Finally, the last term can be bounded with the dual layer:

Combining these all together, we get that the adversarial problem from Equation 2 is lower bounded by

Appendix B Dual layers

In this section, we derive the dual layers for standard building blocks of deep learning.

Suppose $f_{i}(z_{i})=W_{i}z_{i}+b_{i}$ for some linear operator $W_{i}$ and bias terms $b_{i}$ . Then,

B.2 Residual linear connections

Suppose $f_{i}(z_{i},z_{j})=W_{i}z_{i}+z_{j}+b_{i}$ and $z_{j+1}=W_{j}z_{j}+b_{j}$ for some $j<i-1$ for linear operators $W_{i},W_{j}$ and bias term $b_{i},b_{j}$ . Then,

B.3 ReLU activations

The proof here is the same as that presented in Appendix A3 of Wong and Kolter , however we reproduce a simplified version here for the reader. The conjugate function for the ReLU activation is the following:

Let $\mathcal{I}_{i}^{-},\mathcal{I}_{i}^{+},\mathcal{I}$ and $D_{i}$ be as defined in the corollary. Combining these three cases together, we get the final upper bound:

B.4 Hardtanh

Here, we derive a dual layer for the hardtanh activation function. The hard tanh activation function is given by

Since this is an activation function (and has no skip connections), we only need to bound the following:

Taking the maximum over these two cases, we have our upper bound of the conjugate is

This dual layer is linear, and so we can continue to use random projections for efficient bound estimation.

B.4.2 u≤−1𝑢1u\leq-1

Then, $\mathcal{S}=\{z_{i+1}=-1\}$ and so

Then, $\mathcal{S}=\{z_{i+1}=z_{i}\}$ and so

B.5 Batch normalization

As mentioned before, we only consider the case of batch normalization with a fixed mean and variance. This is true during test time, and at training time we can use the batch statistics as a heuristic. Let $\mu_{i},\sigma_{i}$ be the fixed mean and variance statistics, so batch normalization has the following form:

where $\gamma,\beta$ are the batch normalization parameters. Then,

where $D_{i+1}=\operatorname*{diag}\left(\frac{\gamma}{\sqrt{\sigma^{2}+\epsilon}}\right)$ and $d_{i+1}=\beta-\frac{\mu}{\sqrt{\sigma^{2}+\epsilon}}$ . and so we can simply plug this into the linear case to get

Note however, that batch normalization has the effect of shifting the activations to be centered more around the origin, which is exactly the case in which the robust bound becomes looser. In practice, we find that while including batch normalization may improve convergence, it reduces the quality of the bound.

Appendix C Cascade construction

The full algorithm for constructing cascades as we describe in the main text is shown in Algorithm 2. To illustrate the use of the cascade, Figure 5 shows a two stage cascade on a few data points in two dimensional space. The boxes denote the adversarial ball around each example, and if the decision boundary is outside of the box, the example is certified.

Appendix D Estimation using Cauchy random projections

where we include the identity term to make explicit the fact that we compute this by passing an identity matrix through the network $g$ . Estimating this term is straightforward: we simply pass in a Cauchy random matrix $R$ , and take the median absolute value:

where the median is taken over the minibatch axis.

\sum_{i}[\nu_{i,:}]_{+} Recall the form of $\nu=\nu_{j}$ for some layer $j$ ,

which involves using the same median estimator and also passing in a single example of ones through the network.

Note that since $g_{j}$ is just a linear function that does a forward pass through the network, for any matrix $A,B$ ,

Similarly, we can view the summation over the index set $\mathcal{I}$ as a summation after multiplying by an indicator matrix $1_{\mathcal{I}}$ which zeros out the ignored rows. Since this is also linear, we can move it to be an operation on the input to the network.

The latter term is cheap to compute, since we only pass a single vector. We can approximate the first term using the median estimator on the compound operations $g\circ h$ for a Cauchy random matrix $R$ :

Appendix E High probability bounds

In this section, we derive high probability certificates for robustness against adversarial examples. Recall that the original certificate is of the form

so if this holds we are guaranteed that the example cannot be adversarial. What we will show is an equivalent high probability statement: for $\delta>0$ , with probability at least $(1-\delta)$ ,

While the median estimator is a good heuristic for training, it is still only an estimate of the bound. At test time, it is possible to create a provable bound that holds with high probability, which may be desired if computing the exact bound is computationally impossible.

In this section, we derive high probability certificates for robustness against adversarial examples. Recall that the original certificate is of the form

so if this holds we are guaranteed that the example cannot be adversarial. What we will show is an equivalent high probability statement: for $\delta>0$ , with probability at least $(1-\delta)$ ,

E.2 Tail bounds for the geometric estimator

From Li et al. , the authors also provide a geometric mean estimator which comes with high probability tail bounds. The geometric estimator is

Thus, if $\exp\left(-k\frac{\epsilon^{2}}{G_{L,gm}}\right)\leq\delta$ , then with probability $1-\delta$ we have that

E.3 Upper bound on J(g(c,α))𝐽𝑔𝑐𝛼J(g(c,\alpha))

In total, this is $n_{2}+2n_{3}+\dots+(k-2)n_{k-1}=N$ total estimations. In order to say that all of these estimates hold with probability $1-\delta$ , we can do the following: we bound each estimate in Equation 55 with probability $\delta/N$ , and use the union bound over all $N$ estimates. We can then conclude that with probability at most $\delta$ , any estimate is not an upper bound, and so with probability $1-\delta$ we have a proper upper bound.

E.4 Achieving δ/N𝛿𝑁\delta/N tail probability

There is a problem here: if $\delta/N$ is small, then $\epsilon$ becomes large, and the bound gets worse. In fact, since $\epsilon<1$ , when $k$ is fixed, there’s actually a lower limit to how small $\delta/N$ can be.

To overcome this problem, we take multiple samples to reduce the probability. Specifically, instead of directly using the geometric estimator, we use the maximum over multiple geometric estimators

where $R_{i}$ are independent Cauchy random matrices. If each one has a tail probability of $\delta$ , then the maximum has a tail probability of $\delta^{m}$ , which allows us to get arbitrarily small tail probabilities at a rate exponential in $m$ .

E.5 High probability tail bounds for network certificates

Putting this altogether, let $\delta>0$ , let $N>0$ be the number of estimates needed to calculate a certificate, and let $m$ be the number of geometric estimators to take a maximum over. Then with probability $(1-\delta)$ , if we bound the tail probability for each geometric estimate with $\hat{\delta}=\left(\frac{\delta}{N}\right)^{1/m}$ , then we have an upper bound on the certificate.

As an example, suppose we use the MNIST network from Wong and Kolter . Then, let $\delta=0.01$ , $m=10$ , and note that $N=6572$ . Then, $\hat{\delta}=0.26$ , which we can achieve by using $k=200$ and $\epsilon=0.22$ .

E.6 Estimation quality and speedup

In Figure 7, we benchmark the time and memory usage on a convolutional MNIST example to demonstrate the performance improvements. While the exact bound takes time and memory that is quadratic in the number of hidden units, the median estimator is instead linear, allowing it to scale up to millions of hidden units whereas the exact bound runs out of memory out at 50,280 hidden units.

Appendix F AutoDual

In this section, we describe our generalization of the bounds computation algorithm from [Wong and Kolter, 2017] to general networks using dual layers, which we call AutoDual.

The conjugate form, and consequently the dual layer, for certain activations requires knowing lower and upper bounds for the pre-activations, as was done for ReLU activations in Algorithm 1 of Wong and Kolter . While the bound in Equation 7 can be immediately used to compute all the bounds on intermediate nodes of the network one layer at a time, this requires performing a backwards pass through the dual network whenever we need to compute the bounds. However, if the operators $g_{ij}$ of the dual layers are all affine operators $g_{ij}(\nu_{i+1})=A_{ij}^{T}\nu_{i+1}$ for some affine operator $A_{ij}$ , we can apply a generalization of the lower and upper bound computation found in Wong and Kolter to compute all lower and upper bounds, and consequently the dual layers, of the entire network with a single forward pass in a layer-by-layer fashion. With the lower and upper bounds, we can also use the same algorithm to automatically construct the dual network. The resulting algorithm, which we call AutoDual, is described in Algorithm 3.

In practice, we can perform several layer-specific enhancements on top of this algorithm. First, many of the $A_{ji}$ operators will not exist simply because most architectures (with a few exceptions) don’t have a large number of skip connections, so these become no ops and can be ignored. Second, we can lazily skip the computation of layer-wise bounds until necessary, e.g. for constructing the dual layer of ReLU activations. Third, since many of the functions $h_{j}$ in the dual layers are functions of $B^{T}\nu_{i}$ for some matrix $B$ and some $i\geq j$ , we can initialize $\nu_{i}^{(i)}$ with $B$ instead of the identity matrix, typically passing a much smaller matrix through the dual network (in many cases, $B$ is a single vector).

Appendix G Experiments

In this section, we provide more details on the experimental setup, as well as more extensive experiments on the effect of model width and model depth on the performance that were not mentioned above.

We use a parameter $k$ to control the width and depth of the architectures used in the following experiments. The Wide( $k$ ) networks have two convolutional layers of $4\times k$ and $8\times k$ filters followed by a $128\times k$ fully connected layer. The Deep( $k$ ) networks have $k$ convolutional filters with 8 filters followed by $k$ convolutional filters with 16 filters.

Similar to prior work, in all of our models we use strided convolutional layers with 4 by 4 kernels to downsample. When downsampling is not needed, we use 3 by 3 kernels without striding.

G.1 MNIST

For all MNIST experiments, we use the Adam optimizer with a learning rate of 0.001 with a batch size of 50. We schedule $\epsilon$ starting from 0.01 to the desired value over the first 20 epochs, after which we decay the learning rate by a factor of 0.5 every 10 epochs for a total of 60 epochs.

We find that increasing the capacity of the model by simply making the network deeper and wider on MNIST is able boost performance. However, when the model becomes overly wide, the test robust error performance begins to degrade due to overfitting. These results are shown in Table 3.

G.2 CIFAR10

For all CIFAR10 experiments, we use the SGD optimizer with a learning rate of 0.05 with a batch size of 50. We schedule $\epsilon$ starting from 0.001 to the desired value over the first 20 epochs, after which we decay the learning rate by a factor of 0.5 every 10 epochs for a total of 60 epochs.