On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models

Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, Pushmeet Kohli

Introduction

Despite the successes of deep learning , it is well-known that neural networks are not robust. In particular, it has been shown that the addition of small but carefully chosen deviations to the input, called adversarial perturbations, can cause the neural network to make incorrect predictions with high confidence . Starting with Szegedy et al. , there has been a lot of work on understanding and generating adversarial perturbations , and on building models that are robust to such perturbations . Unfortunately, many of the defense strategies proposed in the literature are targeted towards a specific adversary (e.g., obfuscating gradients against projected gradient attacks), and as such they are easily broken by stronger adversaries . Robust optimization techniques, like the one developed by Madry et al. , overcome this problem by trying to find the worst-case adversarial examples at each training step and adding them to the training data. While the resulting models show strong empirical evidence that they are robust against many attacks, we cannot yet guarantee that a different adversary (for example, one that does brute-force enumeration to compute adversarial perturbations) cannot find inputs that cause the model to predict incorrectly. In fact, Figure 1 provides an example that motivates why projected gradient descent (PGD) – the technique at the core of \NoHyperMadry et al.\endNoHyper’s method – does not always find the worst-case attack (a phenomenon also observed by Tjeng et al. ).

This has driven the need for formal verification: a provable guarantee that neural networks are consistent with a specification for all possible inputs to the network. Substantial progress has been made: from complete methods that use Satisfiability Modulo Theory (SMT) or Mixed-Integer Programming (MIP) to incomplete methods that rely on solving a convex relaxation of the verification problem . Complete methods, which provide exact robustness bounds, are expensive and difficult to scale (since they perform exhaustive enumeration in the worst case). Incomplete methods provide robustness bounds that can be loose. However, they scale to larger models than complete methods and, as such, can be used inside the training loop to build models that are not only robust, but also intrinsically easier to verify .

We propose several enhancements that improve the performance of IBP for verified training. In particular, we differentiate ourselves from Mirman et al. by using a different loss function, and by eliding the last linear layer of the neural network, thereby improving our estimate of the worst-case logits. We also develop a curriculum that stabilizes training and improves generalization.

We train the first provably robust model on ImageNet (downscaled to 64×6464\times 64 images) at ϵ=1/255\epsilon=1/255. Using a WideResNet-10-10, we reach 93.87% top-1 verified error rate. This constitutes the largest model to be verified beyond vacuous bounds (a random or constant classifier would achieve a 99.9% verified error rate).

Finally, the code for training provably robust neural networks using IBP is available at https://github.com/deepmind/interval-bound-propagation.

Related Work

Work on training verifiably robust neural networks typically falls in one of two primary categories. First, there are empirical approaches exemplified perfectly by Xiao et al. . This work takes advantage of the nature of MIP-based verification – the critical bottleneck being the number of integer variables the solver needs to branch over. The authors design a regularizer that aims to reduce the number of ambiguous ReLU activation units (units for which bound propagation is not able to determine whether they are on or off) so that verification after training using a MIP solver is efficient. This method, while not providing any meaningful measure of the underlying verified accuracy during training, is able to reach state-of-the-art performance once verified after training with a MIP solver.

Second, there are methods that compute a differentiable upper bound on the violation of the specification to verify. This upper bound, if fast to compute, can be used within a loss (e.g., hinge loss) to optimize models through regular Stochastic Gradient Descent (SGD). In this category, we highlight the works by Raghunathan et al. , Wong et al. , Dvijotham et al. and Mirman et al. . Raghunathan et al. use a semi-definite relaxation that provides an adaptive regularizer that encourages robustness. Wong et al. extend their previous work , which considers the dual formulation of the underlying LP. Critically, any feasible dual solution provides a guaranteed upper bound on the solution of the primal problem. This allows \NoHyperWong and Kolter\endNoHyper to fix the dual solution and focus on computing tight activation bounds that, in turn, yield a tight upper bound on the specification violation. Alternatively, Dvijotham et al. fix the activation bounds and optimize the dual solution using an additional verifier network. Finally, Mirman et al. introduce geometric abstractions that bound activations as they propagate through the network. To the contrary of the conclusions from these previous works, we demonstrate that tighter relaxations (such as the dual formulation from Dvijotham et al. , or the zonotope domain from Mirman et al. ) are not necessary to reach tight verified bounds.

IBP, which often leads to loose upper bounds for arbitrary networks, has a significant computational advantage, since computing IBP bounds only requires two forward passes through the network. This enables us to apply IBP to significantly larger models and train with extensive hyper-parameter tuning. We show that thanks to this capability, a carefully tuned verified training process using IBP is able to achieve state-of-the-art verified accuracy. Perhaps surprisingly, our results show that neural networks can easily adapt to make the rather loose bound provided by IBP much tighter – this is in contrast to previous results that seemed to indicate that more expensive verification procedures are needed to improve the verified accuracy of neural networks in image classification tasks.

Methodology

We focus on feed-forward neural networks trained for classification tasks. The input to the network is denoted x0x_{0} and its output is a vector of raw un-normalized predictions (hereafter logits) corresponding to its beliefs about which class x0x_{0} belongs to. During training, the network is fed pairs of input x0x_{0} and correct output label ytruey_{\textrm{true}}, and trained to minimize a misclassification loss, such as cross-entropy.

For clarity of presentation, we assume that the neural network is defined by a sequence of transformations hkh_{k} for each of its KK layers. That is, for an input z0z_{0} (which we define formally in the next paragraph), we have

Verification problem.

We are interested in verifying that neural networks satisfy a specification by generating a proof that this specification holds. We consider specifications that require that for all inputs in some set X(x0)\mathcal{X}(x_{0}) around x0x_{0}, the network output satisfies a linear relationship

A network is adversarially robust at a point x0x_{0} if there is no choice of adversarial perturbation that changes the classification outcome away from the true label ytruey_{\textrm{true}}, i.e., argmaxizK,i=ytrue\operatorname*{argmax}_{i}z_{K,i}=y_{\textrm{true}} for all elements z0X(x0)z_{0}\in\mathcal{X}(x_{0}). Formally, we want to verify that for each class yy:

where eie_{i} is the standard ithi^{\textrm{th}} basis vector and ϵ\epsilon is the perturbation radius.

Verifying a specification like (2) can be done by searching for a counter-example that violates the specification constraint:

If the optimal value of the above optimization problem is smaller than 0, the specification (2) is satisfied.

Interval bound propagation.

where z0(ϵ)=x0ϵ1\underline{z}_{0}(\epsilon)=x_{0}-\epsilon\mathbf{1} and z0(ϵ)=x0+ϵ1\overline{z}_{0}(\epsilon)=x_{0}+\epsilon\mathbf{1}. The above optimization problems can be solved quickly and in closed form for affine layers and monotonic activation functions. An illustration of IBP is shown in Figure 2.

For the affine layers (e.g., fully connected layers, convolutions) that can be represented in the form hk(zk1)=Wzk1+bh_{k}(z_{k-1})=Wz_{k-1}+b, solving the optimization problems (5) can be done efficiently with only two matrix multiplies:

where |\cdot| is the element-wise absolute value operator. Propagating bounds through any element-wise monotonic activation function (e.g., ReLU, tanh, sigmoid) is trivial. Concretely, if hkh_{k} is an element-wise increasing function, we have:

Notice how for element-wise non-linearities the (zk,zk)(\underline{z}_{k},\overline{z}_{k}) formulation is better, while for affine transformations (μk,rk)(\mu_{k},r_{k}) is more efficient (requiring two matrix multiplies instead of four). Switching between parametrizations depending on hkh_{k} incurs a slight computational overhead, but since affine layers are typically more computationally intensive, the formulation (6) is worth it.

Finally, the upper and lower bounds of the output logits zKz_{K} can be used to construct an upper bound on the solution of (4):

Overall, the adversarial specification (3) is upper-bounded by zK,y(ϵ)zK,ytrue(ϵ)\overline{z}_{K,y}(\epsilon)-\underline{z}_{K,y_{\textrm{true}}}(\epsilon). It corresponds to an upper bound on the worst-case logit difference between the true class ytruey_{\textrm{true}} and any other class yy.

Elision of the last layer.

Bound propagation is not necessary for the last linear layer of the network. Indeed, we can find an upper bound to the solution of (4) that is tighter than proposed by (8) by eliding the final linear layer with the specification. Assuming hK(zK1)=WzK1+bh_{K}(z_{K-1})=Wz_{K-1}+b, we have:

with c=WTcc^{\prime}=W^{\textsf{T}}c and d=cTb+dd^{\prime}=c^{\textsf{T}}b+d, which bypasses the additional relaxation induced by the last linear layer.

Loss.

In the context of classification under adversarial perturbation, solving the optimization problem (8) for each target class yytruey\neq y_{\textrm{true}} results in a set of worst-case logits where the logit of the true class is equal to its lower bound and the other logits are equal to their upper bound:

That is for all yytruey\neq y_{\textrm{true}}, we have

We can then formulate our training loss as

Training procedure.

To stabilize the training process and get a good trade-off between nominal and verified accuracy under adversarial perturbation, we create a learning curriculum by scheduling the values of κ\kappa and ϵ\epsilon:

κ\kappa controls the relative weight of satisfying the specification versus fitting the data. Hence, we found that starting with κ=1\kappa=1 and slowly reducing it throughout training helps get more balanced models with higher nominal accuracy. In practice, we found that using a final value of κ=1/2\kappa=1/2 works well on Mnist, Cifar-10, Svhn and ImageNet.

More importantly, we found that starting with ϵ=0\epsilon=0 and slowly raising it up to a target perturbation radius ϵtrain\epsilon_{\textrm{train}} is necessary. We note that ϵtrain\epsilon_{\textrm{train}} does not need to be equal to the perturbation radius used during testing, using higher values creates robust models that generalize better.

Additional details that relate to specific datasets are available in the supplementary material in Appendix A.

Results

We demonstrate that IBP can train verifiably robust networks and compare its performance to state-of-the-art methods on Mnist, Cifar-10 and Svhn. Highlights include an improvement of the verified error rate from 3.67% to 2.23% on Mnist at ϵ=0.1\epsilon=0.1, from 19.32% to 8.05% on Mnist at ϵ=0.3\epsilon=0.3, and from 78.22% to 67.96% on Cifar-10 at ϵ=8/255\epsilon=8/255. We also show that IBP can scale to larger networks by training a model on downscaled ImageNet that reaches a non-vacuous verified error rate of 93.87%93.87\% at ϵ=1/255\epsilon=1/255. Finally, Section 4.3 illustrates how training with the loss function and curriculum from Section 3 allows the training process to adapt the model to ensure that the bound computed by IBP is tight.

Unless stated otherwise, we compute the empirical adversarial accuracy (or error rate) on the test set using 200 untargeted PGD steps and 10 random restarts. As the verified error rate computed for a network varies greatly with the verification method, we calculate it using an exact solver. Several previous works have shown that training a network with a loss function derived from a specific verification procedure renders the network amenable to verification using that specific procedure only . In order to circumvent this issue and present a fair comparison, we use a complete verification algorithm based on solving a MIP – such an algorithm is expensive as it performs a brute force enumeration in the worst case. However, in practice, we find that commercial MIP solvers like Gurobi can handle verification problems from moderately sized networks. In particular, we use the MIP formulation from Tjeng et al. . For each example of the test set, a MIP is solved using Gurobi with a timeout of 10 minutes. Upon timeout, we fallback to solving a relaxation of the verification problem with a LP using Gurobi again. When both approaches fail to provide a solution within the imparted time, we count the example as attackable. Thus, the verified error rate reported may be over-estimating the exact verified error rate.As an example, for the small model trained using \NoHyperWong et al.\endNoHyper, there are 3 timeouts at ϵ=0.1\epsilon=0.1, 18 timeouts at ϵ=0.2\epsilon=0.2 and 58 timeouts at ϵ=0.3\epsilon=0.3 for the 10K examples of the Mnist test set. These timeouts would amount to a maximal over-estimation of 0.03%, 0.18% and 0.58% in verified error rate, respectively. We always report results with respect to the complete test set of 10K images for both Mnist and Cifar-10, and 26K images for Svhn. For downscaled ImageNet, we report results on the validation set of 10K images.

We compare IBP to three alternative approaches: the nominal method, which corresponds to standard training with cross-entropy loss; adversarial training, following Madry et al. , which generates adversarial examples on the fly during training; and Wong et al. , which trains models that are provably robust. We train three different model architectures for each of the four methods (see Table 1). The first two models (i.e., small and medium) are equivalent to the small and large models in Wong et al. . We do not train our large model with \NoHyperWong et al.\endNoHyper as we could not scale this method beyond the medium sized model. The third model (i.e., large) is significantly larger (in terms of number of hidden units) than any other verified model presented in the literature. On Mnist, for each model and each method, we trained models that are robust to a wide range of perturbation radii by setting ϵtrain\epsilon_{\textrm{train}} to 0.1,0.2,0.30.1,0.2,0.3 or 0.40.4. During testing, we test each of these 12 models against ϵ[0,0.45]\epsilon\in[0,0.45]. On Cifar-10, we train the same models and methods with ϵtrain{2/255,8/255}\epsilon_{\textrm{train}}\in\{2/255,8/255\} and test on the same ϵ=ϵtrain\epsilon=\epsilon_{\textrm{train}} value. On Svhn we used ϵtrain=0.01\epsilon_{\textrm{train}}=0.01 and only test on ϵ=ϵtrain\epsilon=\epsilon_{\textrm{train}}.

Figures 3(a) and 3(b) compare IBP to \NoHyperWong et al.\endNoHyper on Mnist for all perturbation radii between 0 and 0.45 across all models. Remember that we trained each model architecture against many ϵtrain\epsilon_{\textrm{train}} values. The bold lines show for each model architecture, the model trained with the perturbation radius ϵtrain\epsilon_{\textrm{train}} that performed best for a given ϵ\epsilon (i.e., x-axis) The faded lines show all individual models. Across the full spectrum, IBP achieves good accuracy under PGD attacks and higher provable accuracy (computed by an exact verifier). We observe that while \NoHyperWong et al.\endNoHyper is competitive at small perturbation radii (when both ϵ\epsilon and ϵtrain\epsilon_{\textrm{train}} are small), it quickly degrades as the perturbation radius increases (when ϵtrain\epsilon_{\textrm{train}} is large). For completeness, Figure 3(c) also compares IBP to \NoHyperMadry et al.\endNoHyper with respect to the empirical accuracy against PGD attacks of varying intensities. We observe that IBP tends to be slightly worse than \NoHyperMadry et al.\endNoHyper for similar network sizes – except for the large model where \NoHyperMadry et al.\endNoHyper is likely overfitting (as it performs worse than the medium-sized model).

Finally, we note that when training the small network on Mnist with a Titan Xp GPU (where standard training takes 1.5 seconds per epoch), IBP only takes 3.5 seconds per epoch compared to 8.5 seconds for \NoHyperMadry et al.\endNoHyper and 2 minutes for \NoHyperWong et al.\endNoHyper (using random projection of 50 dimensions). Indeed, as detailed in Section 3 (under the paragraph interval bound propagation), IBP creates only two additional passes through the network compared to \NoHyperMadry et al.\endNoHyper for which we used seven PGD steps.

2 Downscaled ImageNet

This section demonstrates the scalability of IBP by training, to the best of our knowledge, the first model with non-vacuous verifiable bounds on ImageNet. We train a WideResNet-10-10 with 8M parameters and 1.8M hidden units, almost an order of magnitude greater than the number of hidden units in our large network. The results in Table 2 are obtained through standard non-robust training, adversarial training, and robust training using IBP on downscaled images (i.e., 64×6464\times 64). We use all 1000 classes and measure robustness (either empirical or verifiably) using the same one-vs-all scheme used for Mnist, Cifar-10 and Svhn. Additional details are available in the supplementary material in Appendix A. We realize that these results are pale in comparison to the nominal accuracy obtained by larger non-robust models (i.e., Real et al. achieving 16.1% top-1 error rate). However, we emphasize that no other work has formally demonstrated robustness to norm-bounded perturbation on ImageNet, even for small perturbations like ϵ=1/255\epsilon=1/255.

3 Tightness

To judge the tightness of IBP quantitatively, we compare the final verified error rate obtained using the MIP/LP cascade describe earlier with the upper bound estimates from IBP only. Table 3 shows the differences. We observe that IBP itself is a good estimate of the verified error rate and provides estimates that are competitive with more sophisticated solvers (when models are trained using IBP). While intuitive, it is surprising to see that the IBP bounds are so close to the MIP bounds. This highlights that verification becomes easier when models are trained to be verifiable as a simple method like IBP can verify a large proportion of the MIP verified samples. This phenomenon was already observed by Dvijotham et al. and Xiao et al. and explains why some methods cannot be verified beyond trivial bounds within a reasonable computational budget.

Conclusion

We have presented an approach for training verifiable models and provided strong baseline results for Mnist, Cifar-10, Svhn and downscaled ImageNet. Our experiments have shown that the proposed approach outperforms competing techniques in terms of verified bounds on adversarial error rates in image classification problems, while also training faster. In the future, we hope that these results can serve as a useful baseline. We believe that this is an important step towards the vision of specification-driven ML.

References

Appendix A Training parameters

For IBP, across all datasets, the networks were trained using the Adam algorithm with an initial learning rate of 10310^{-3}. We linearly ramp-down the value of κ\kappa between 1 and κfinal\kappa_{\textrm{final}} after a fixed warm-up period (κfinal\kappa_{\textrm{final}} is set to both 0 or 0.5 and the best result is used). Simultaneously, we lineary ramp-up the value of ϵ\epsilon between 0 and ϵtrain\epsilon_{\textrm{train}} (for Cifar-10 and Svhn, we use a value of ϵtrain\epsilon_{\textrm{train}} that is 10% higher than the desired robustness radius). Mnist is trained on a single Nvidia V100 GPU. Cifar-10, Svhn and ImageNet are trained on 32 tensor processing units (TPU) with 2 workers with 16 TPU cores each.

For Mnist, we train on a single Nvidia V100 GPU for 100 epochs with batch sizes of 100. The total number of training steps is 60K. We decay the learning rate by 10×\times at steps 15K and 25K. We use warm-up and ramp-up durations of 2K and 10K steps, respectively. We do not use any data augmentation techniques and use full 28×2828\times 28 images without any normalization.

Cifar-10, we train for 3200 epochs with batch sizes of 1600 (training for 350 epochs with batch sizes of 50 reaches 71.70% verified error rate when ϵ=8/255\epsilon=8/255). The total number of training steps is 100K. We decay the learning rate by 10×\times at steps 60K and 90K. We use warm-up and ramp-up durations of 5K and 50K steps, respectively. During training, we add random translations and flips, and normalize each image channel (using the channel statistics from the train set).

For Svhn, we train for 2200 epochs with batch sizes of 50 (training for 240 epochs with batch sizes of 50 reaches within 1% of the verified error rate). The total number of training steps is 100K. The rest of the schedule is identical to Cifar-10. During training, we add random translations, and normalize each image channel (using the channel statistics from the train set).

For ImageNet, we train for 160 epochs with batch sizes of 1024. The total number of training steps is 200K. We decay the learning rate by 10×\times at steps 120K and 180K. We use warm-up and ramp-up durations of 10K and 100K steps, respectively. We use images downscaled to 64×6464\times 64 (resampled using pixel area relation, which gives moiré-free images). During training, we use random crops of 56×5656\times 56 and random flips. During testing, we use a central 56×5656\times 56 crop. We also normalize each image channel (using the channel statistics from the train set).

The networks trained using Wong et al. were trained using the schedule and learning rate proposed by the authors. For Madry et al. , we used a learning rate schedule identical to IBP and, for the inner optimization, adversarial examples are generated by 77 steps of PGD with Adam and a learning rate of 10110^{-1}. Note that our reported results for these two methods closely match or beat published results, giving us confidence that we performed a fair comparison.

Figure 5 shows how the empirical PGD accuracy (on the test set) increases as training progresses for IBP and \NoHyperMadry et al.\endNoHyper. This plot shows the median performance (along with the 25th and 75th percentiles across 10 independent training processes) and confirms that IBP is stable and produces consistent results. Additionally, for IBP, we clearly see the effect of ramping the value of ϵ\epsilon up during training (which happens between steps 2K and 12K).

Appendix B Convolutional filters

Figure 6 shows the first layer convolutional filters resulting from training a small robust model on Mnist against a perturbation radius of ϵ=0.1\epsilon=0.1. Overall, the filters tend to be extremely sparse – at least when compared to the filters obtained by training a nominal non-robust model (this observation is consistent with ). We can qualitatively observe that \NoHyperWong et al.\endNoHyper produces the sparsest set of filters.

Similarly, as shown in Figure 7, robust models trained on Cifar-10 exhibit high levels of sparsity in their convolutional filters. \NoHyperMadry et al.\endNoHyper seems to produce more meaningful filters, but they remain sparse compared to the non-robust model.

This analysis suggests that IBP strongly limits the capacity of the underlying network. As a consequence, larger models are often preferable. Larger models, however, can lead to the explosion of intermediate interval bounds – and this is the main reason why values of ϵ\epsilon must be carefully scheduled. Techniques that combine tighter (but slower) relaxations with IBP could be used in the initial training stages when training deeper and wider models.

Appendix C Ablation study

Table 5 reports the performance of each model for different training methods on Mnist with ϵ=0.4\epsilon=0.4 (averaged over 10 independent training processes). We chose Mnist for ease of experimentation and a large perturbation radius to stress the limit of IBP. The table shows the effect of the elision of the last layer and the cross-entropy loss independently. We do not report results without the schedule on ϵ\epsilon as all models without the schedule are unable to train (i.e., reaching only 11.35% accuracy at best).

We observe that all components contribute to obtaining a good final verified error rate. This is particularly visible for the small model, where adding elision improves the bound by 3.88% (12.9% relative improvement) when using a softplus loss. Using cross-entropy (instead of softplus) also results in a significant improvement of 4.42% (14.7% relative improvement). Ultimately, the combination of cross-entropy and elision shows an improvement of 5.15%. Although present for larger models, these improvements are less visible. This is another example of how models with larger capacity are able to adapt to get tight bounds. Finally, it is worth noting that elision tends to provide results that are more consistent (as shown by the 25th and 75th percentiles).

Appendix D Additional results

Table 6 provides complementary results to the ones in Table 4. It includes results for each individual model architecture, as well as results from the literature (for matching model architectures). The test error corresponds to the test set error rate when there is no adversarial perturbation. For models that we trained ourselves, the PGD error rate is calculated using 200 iterations of PGD and 10 random restarts. The verified bound on the error rate is obtained using the MIP/LP cascade described in Section 4.

We observe that with the exception of Cifar-10 with ϵ=2/255\epsilon=2/255, IBP outperforms all other models by a significant margin (even for equivalent model sizes). For Cifar-10 with ϵ=2/255\epsilon=2/255, it was important to increase the model size to obtain competitive results. However, since IBP is cheap to run (i.e., only two additional passes), we can afford to run it on much larger models.

Appendix E Runtime

When training the small network on Mnist with a Titan Xp GPU (where standard training takes 1.5 seconds per epoch), IBP only takes 3.5 seconds per epoch compared to 8.5 seconds for \NoHyperMadry et al.\endNoHyper and 2 minutes for \NoHyperWong et al.\endNoHyper (using random projection of 50 dimensions). Indeed, as detailed in Section 3 (under the paragraph “interval bound propagation”), IBP creates only two additional passes through the network compared to \NoHyperMadry et al.\endNoHyper for which we used seven PGD steps during training. \NoHyperXiao et al.\endNoHyper’s method takes the same amount of time as IBP as it needs to perform bound propagation too.

Appendix F When Projected Gradient Descent is not enough

For a given example in Mnist, this section compares the worst-case attack found by PGD with the one found using a complete solver. The underlying model is a medium sized network trained using IBP with ϵ=0.1\epsilon=0.1. The nominal image, visible in Figure 8(a), has the label “eight”, and corresponds to the 1365th image of the test set.

Figure 9 shows the untargeted adversarial loss (optimized by PGD) around the nominal image. In these loss landscapes, we vary the input along a linear space defined by the worse perturbations found by PGD and the MIP solver. The uu and vv axes represent the magnitude of the perturbation added in each of these directions respectively and the zz axis represents the loss. Typical cases where PGD is not optimal are often a combination of two factors that are qualitatively visible in this figure: