Provable Robustness of ReLU networks via Maximization of Linear Regions

Francesco Croce, Maksym Andriushchenko, Matthias Hein

Introduction

In recent years it has been highlighted that state-of-the-art neural networks are highly non-robust: small changes to an input image, which are almost non-perceivable for humans, change the classifier decision and the wrong decision has even high confidence . This calls into question the use of neural networks in safety-critical systems e.g. medical diagnosis systems or self-driving cars and opens up possibilities to actively attack an ML system in an adversarial way . Moreover, this non-robustness has also implications on follow-up processes like interpretability. How should we be able to interpret classifier decisions if very small changes of the input lead to different decisions?

The finding of the non-robustness initiated a competition where on the one hand increasingly more sophisticated attack procedures were proposed and on the other hand research was focused to develop stronger defenses against these attacks . In the end it turned out that for many proposed defenses there exists still a way to attack the classifier successfully , with the notable exception of which was shown to be empirically robust wrt the $l_{\infty}$ -norm. Thus considering the high importance of robustness in safety-critical machine learning applications, we need robustness guarantees, where one can provide for each test point the radius of a ball on which the classifier will not change the decision and thus no attack whatsoever will be able to create an adversarial example inside this ball. Therefore recent research has focused on such provable guarantees of a classifier with respect to the $l_{1}$ -norm , $l_{2}$ -norm and $l_{\infty}$ -norm . Some works try to solve the combinatorial problem of computing for each test instance the norm of the minimal perturbation necessary to change the decision . Unfortunately, these approaches do not scale with normal training to large networks. Another line of research thus focuses on lower bounds on the norm of the minimal perturbation necessary to change the decision . *Equal contribution.

Moreover, in recent years several new ways to regularize neural networks or new forms of losses have been proposed with the idea of enforcing a large margin, that is a large distance between training instances and decision boundaries. However, these papers do not directly optimize a robustness guarantee. In spirit our paper is closest to . All of them are aiming at providing robustness guarantees and at the same time they propose a new way how one can optimize the robustness guarantee during training. Currently, up to our knowledge only can optimize robustness wrt to multiple $p$ -norms, whereas is restricted to $l_{2}$ and to $l_{\infty}$ .

In this paper we propose a regularization scheme for the class of ReLU networks (feedforward networks with ReLU activation functions including convolutional and residual architectures with max- or sum-pooling layers) which provably increases the robustness of classifiers. We use the fact that these networks lead to continuous piecewise affine classifier functions and show how to get either the optimal minimal perturbation or a lower bound using the properties of the linear region in which the point lies. As a result of this analysis we propose a new regularization scheme which directly maximizes the lower bound on the robustness guarantee. This allows us to get classifiers with good test error and good robustness guarantees at the same time. While we focus on robustness with respect to $l_{2}$ - and $l_{\infty}$ -distance, our approach applies to all $l_{p}$ -norms. Finally, we show in experiments on four datasets that our approach improves lower bounds as well as upper bounds on the robust test error and can be successfully integrated with adversarial training . Our main observation is that our proposed regularizer significantly improves provable robustness evaluated with the certification method of , and especially with the combinatorial solver of . We note that at the same time, models with adversarial training alone are not certifiable, i.e. they often have vacuous upper bounds on the robust test error.

Local properties of ReLU networks

Feedforward neural networks which use piecewise affine activation functions (e.g. ReLU, leaky ReLU) and are linear in the output layer can be rewritten as continuous piecewise affine functions .

so that the resulting classifier is obtained as $f^{(L+1)}(x)=W^{(L+1)}g^{(L)}(x)+b^{(L+1)}$ .

This allows us to write $f^{(k)}(x)$ as composition of affine functions, that is

Note that a forward pass through the network is sufficient to compute $V^{(k)}$ and $b^{(k)}$ for every $k$ , which results in only a small overhead compared to the usual effort necessary to compute the output of $f$ at $x$ . We are then able to characterize the polytope $Q(x)$ as intersection of $N=\sum_{l=1}^{L}n_{l}$ half spaces given by

for $l=1,\ldots,L$ , $i=1,\ldots,n_{l}$ , namely

Note that $N$ is also the number of hidden units of the network. Finally, we can write

which represents the affine restriction of $f$ to $Q(x)$ . One can further distinguish the subset $Q_{c}(x)$ of $Q(x)$ assigned to a specific class $c$ , among the $K$ available ones, which is given by

where $V^{(L+1)}_{r}$ is the $r$ -th row of $V^{(L+1)}$ . The set $Q_{c}(x)$ is again a polytope as it is an intersection of polytopes and it holds $Q(x)=\bigcup_{c=1,...,K}Q_{c}(x)$ . We refer to how to integrate other operations/layer types e.g. max-pooling, residual and dense nets or other piecewise linear activation functions like leaky ReLU.

Robustness guarantees for ReLU networks

In the following we first define the problem of the minimal adversarial perturbation and then derive robustness guarantees for ReLU networks. We call a decision of a classifier $f$ robust at $x$ if small changes of the input do not alter the decision. Formally, this can be described as optimization problem (1) . If the classifier outputs class $c$ for input $x$ , assuming a unique decision, the robustness of $f$ at $x$ is given by

where $C$ is a constraint set which the generated point $x+\delta$ has to satisfy, e.g., an image has to be in $^{d}$ . The complexity of the optimization problem (1) depends on the classifier $f$ , but it is typically non-convex, see for a hardness result for neural networks. For standard neural networks $\left\|\delta\right\|_{p}$ is very small for almost any input $x$ of the data generating distribution, which questions the use of such classifiers in safety-critical systems. The solutions of (1), $x+\delta$ , are called adversarial samples.

where $\left\|\cdot\right\|_{q}$ is the dual norm of $\left\|\cdot\right\|_{p}$ , that is $\frac{1}{p}+\frac{1}{q}=1$ . In it has been shown that box constraints $C=[a,b]^{d}$ can be integrated for linear classifiers which results in simple convex optimization problems. In the following we use the intuition from linear classifiers and the particular structures derived in Section 2 to come up with robustness guarantees, that means lower bounds on the optimal solution of (1), for ReLU networks. Moreover, we show that it is possible to compute the minimal perturbation for some of the inputs $x$ even though the general problem is NP-hard .

Let us start analyzing how we can solve efficiently problem (1) inside each linear region $Q(x)$ . We first need the definition of two important quantities:

The $l_{p}$ -distance $d_{B}(x)=\mathop{\rm min}\nolimits_{z\in\partial Q(x)}\left\|z-x\right\|_{p}$ of $x$ to the boundary of the polytope $Q(x)$ is given by

where $V_{j}^{(l)}$ is the $j$ -th row of $V^{(l)}$ and $\left\|\cdot\right\|_{q}$ is the dual norm of $\left\|\cdot\right\|_{p}$ ( $\frac{1}{p}+\frac{1}{q}=1$ ).

Due to the polytope structure of $Q(x)$ it holds that $d_{B}(x)$ is the minimum distance to the hyperplanes, $\left\langle V^{(l)}_{j},z\right\rangle+a^{(l)}_{j}=0$ , which constitute the boundary of $Q(x)$ . It is straightforward to check that the minimum $l_{p}$ -distance of a hyperplane $H=\{z\,|\,\left\langle w,z\right\rangle+b=0\}$ to $x$ is given by $\frac{|\left\langle w,x\right\rangle+b|}{\left\|w\right\|_{q}}$ where $\left\|\cdot\right\|_{q}$ is the dual norm. This can be obtained as follows

Note that by Hölder inequality one has $-\left\|w\right\|_{q}\left\|\delta\right\|_{p}\leq\left\langle w,\delta\right\rangle\leq\left\|w\right\|_{q}\left\|\delta\right\|_{p}$ , which yields $\left\|\delta\right\|_{p}\geq\frac{|\left\langle w,\delta\right\rangle|}{\left\|w\right\|_{p}}$ . Noting that $\left\langle w,\delta\right\rangle=-(\left\langle w,x\right\rangle+b)$ we get $\left\|\delta\right\|_{p}\geq\frac{|\left\langle w,x\right\rangle+b|}{\left\|w\right\|_{p}}$ and equality is achieved when one has equality in the Hölder inequality. ∎

For the decision boundaries in $Q(x)$ , with $c=\mathop{\rm arg\,max}\limits_{r=1,\ldots,K}f^{(L+1)}_{r}(x)$ , we define the quantity $d_{D}(x)$ as

If $d_{D}(x)\leq d_{B}(x)$ , then $d_{D}(x)$ is the minimal $l_{p}$ -distance of $x$ to the decision boundary of the ReLU network $f^{(L+1)}$ .

First we note that $d_{D}(x)$ is the distance of $x$ to the decision boundary for the linear multi-class classifier $g(x)=V^{(L+1)}x+a^{(L+1)}$ which is equal to $f^{(L+1)}$ on $Q(x)$ . If $d_{D}(x)\leq d_{B}(x)$ then the point realizing the minimal distance to the decision boundary $d_{D}(x)$ is inside $Q(x)$ and as $d_{D}(x)<d_{B}(x)$ there cannot exist another point on the decision boundary of $f^{(L+1)}$ outside of $Q(x)$ having a smaller $l_{p}$ -distance to $x$ . Thus $d_{D}(x)$ is the minimal $l_{p}$ -distance of $x$ to the decision boundary of $f^{(L+1)}$ . ∎

The next theorem combines both results to give a lower bound or the optimal solution to the optimization problem (1).

We get the following robustness guarantees:

If $d_{B}(x)\leq d_{D}(x)$ , then $d_{B}(x)$ is a lower bound on the minimal $l_{p}$ -norm of the perturbation necessary to change the class (optimal solution of (1)).

If $d_{D}(x)\leq d_{B}(x)$ , then $d_{D}(x)$ is equal to the minimal $l_{p}$ -norm necessary to change the class (optimal solution of optimization problem (1)).

If $d_{B}(x)\leq d_{D}(x)$ , then the ReLU classifier does not change on $B_{p}(x,d_{B}(x))$ and thus $d_{B}(x)$ is a lower bound on the minimal $l_{p}$ -norm perturbation necessary to change the class. The second statement follows directly by Lemma 3.2. ∎

In Figure 1 we illustrate the different cases for $p=2$ . On the left hand side $d_{B}(x)<d_{D}(x)$ and thus we get that on the ball $B_{2}(x,d_{B}(x))$ the decision does not change, whereas in the rightmost plot we have $d_{D}(x)<d_{B}(x)$ and thus we obtain the minimum distance to the decision boundary. Using Theorem 3.1 we can provide robustness guarantees for every point and for some even compute the optimal robustness guarantees. Finally, we describe in the appendix (Section A.2) how one can improve the lower bounds by checking neighboring regions of $Q(x)$ . Compared to the bounds of ours are slightly worse (see Table 4 in the appendix). However, our bounds can be directly maximized and have a clear geometrical meaning and thus motivate directly a regularization scheme for ReLU networks which we propose in the next section.

Large margin and region boundary regularization

Using the results of Section 3, a classifier with guaranteed robustness requires large distance to the boundaries of the linear region as well as to the decision boundary. Even the optimal guarantee (solution of (1)) can be obtained in some cases. Unfortunately, as illustrated in Figures 2a and 2c for simple one hidden layer networks, the linear regions $Q(x)$ are small for normally-trained networks and thus no meaningful guarantees can be obtained. Thus we propose a new regularization scheme which simultaneously aims for each training point to achieve large distance to the boundary of the linear region it lies in, as well as to the decision boundary. Using Theorem 3.1 this directly leads to non-trivial robustness guarantees of the resulting classifier.

However, note that just maximizing the distance to the decision boundary might be misleading during training as this does not discriminate between points which are correctly (correct side of the decision hyperplane) or wrongly classified (wrong side of the decision hyperplane). Thus we introduce the signed version of $d_{D}(x)$ , where $y$ is the true label of the point $x$ ,

Please note that if $\overline{d_{D}}(x)\geq 0$ , then $x$ is correctly classified, whereas $\overline{d_{D}}(x)<0$ if $x$ is wrongly classified. If $|\overline{d_{D}}(x)|\leq d_{B}(x)$ , then it follows from Lemma 3.2 that $|\overline{d_{D}}(x)|$ is the distance to the decision hyperplane. If $|\overline{d_{D}}(x)|>d_{B}(x)$ this does not need to be any longer true, but $|\overline{d_{D}}(x)|$ is at least a good proxy as $\left\|V^{(L+1)}_{y}-V^{(L+1)}_{s}\right\|_{q}$ is an estimate of the local cross Lipschitz constant . Finally, we propose to use the following regularization scheme:

The MMR penalizes distances to the boundary of the polytope if $d_{B}(x)\leq\gamma_{B}$ and positive distances ( $x$ is correctly classified) if $d_{D}(x)\leq\gamma_{D}$ . Notice that wrongly classified points are always penalized. The part of the regularizer corresponding to $d_{D}(x)$ has been suggested in in a similar form as a loss function for general neural networks without motivation from robustness guarantees. They have an additional loss penalizing difference in the outputs with respect to changes at the hidden layers which is completely different from our geometrically motivated regularizer penalizing the distance to the boundary of the linear region. The choice of $\gamma_{B},\gamma_{D}$ allows different trade-off between the terms. In particular $\gamma_{D}<\gamma_{B}$ (stronger maximization of $d_{B}(x)$ ) leads to more points for which the optimal robustness guarantee (case $d_{D}(x)\leq d_{B}(x)$ ) can be proved. For practical reasons, we also propose a variation of our MMR regularizer in (5):

where $d_{B}^{i}(x)$ is the distance of $x$ to the $i$ -th closest hyperplane of the boundary of the polytope and $d_{D}^{i}(x)$ is the analogue for the decision boundaries. Basically, we are optimizing, instead of the closest decision hyperplane, the $k_{D}$ -closest ones and analogously the $k_{B}$ -closest hyperplanes defining the linear region $Q(x)$ of $x$ . This speeds up the training time as more hyperplanes are moved in each update. Moreover, when deriving lower bounds using more than one linear region, one needs to consider more than just the closest boundary hyperplane. Finally, many state-of-the-art schemes for proving lower bounds work well only if the activation status of most neurons is constant for small changes of the points. This basically amounts to ensure that all the hyperplanes are sufficiently far away, which is exactly what our regularization scheme is aiming at. Thus our regularization scheme also helps to achieve state-of-the-art provable robustness with other certification methods (). This is also the reason why the term pushing the polytope boundaries away is essential. Just penalizing the distance to the decision boundary is not sufficient to prove good lower bounds on the minimal distance to the decision boundary as we will show in our experiments (Table 2). Compared to the regularization scheme in using a dual feasible point of the robust loss, our approach has a direct geometric interpretation and allows to derive the exact minimal perturbation for some fraction of the test points varying from dataset to dataset but it can be as high as $99\%$ . In practice, we gradually decrease $k_{B}$ and $k_{D}$ in (6) during training so that only the closest hyperplanes of each training point influence the regularizer.

Thus, denoting the cross entropy loss $CE(f(x),y)$ , the final objective of our models is

Experiments

We provide a variety of experiments aiming to show the state-of-the-art performance of MMR to achieve provably robust classifiers. We make our code and models publicly availablehttps://github.com/max-andr/provable-robustness-max-linear-regions. We use four datasets: MNIST, German Traffic Signs (GTS) , Fashion MNIST and CIFAR-10. We consider in the paper robustness wrt to both $l_{2}$ and $l_{\infty}$ distances. We use two criteria. First, upper and lower bounds on the robust test error for a given threshold $\epsilon$ , that is the largest achievable test error if every test input can be modified with a perturbation $\delta$ such that $\left\|\delta\right\|_{p}\leq\epsilon$ and $\delta$ is chosen so that $x+\delta$ is classified wrongly. Second, we show in Section 5.2 and in the appendix results for lower and upper bounds on the minimal $\left\|\delta\right\|_{p}$ from (1).

Lower bounds on the robust test error for $l_{2}$ and $l_{\infty}$ are computed using the attack via Projected Gradient Descent (PGD) . Upper bounds on the robust test error are computed using the approach of . Additionally, yields upper and lower bounds on the robust test error via mixed integer programming (MIP). We use the solver for the MIP with a timeout of 120s per point, that is if the point has not been verified in this time the solver is stopped. This technique is currently effective for $l_{\infty}$ but not for $l_{2}$ , where basically in almost every case the timeout is reached and thus we discard for $l_{2}$ the MIP evaluation. Finally, we combine the upper bounds on the robust test error found by and by counting the fraction of points that can be certified with at least one of the two methods. We compare our own guarantees from Theorem 3.1 with the ones of in the appendix. Moreover, we also combine the lower bounds on the robust test error found by PGD, the MIP and the attack of .

We compare six methods: plain training, adversarial training of which has been shown to significantly increase robustness, the robust loss of which supports both $l_{2}$ and $l_{\infty}$ norms (denoted as KW), the training scheme of (denoted as Xiao et al), our regularization scheme MMR and MMR together with adversarial training again as in . All schemes are evaluated on a one hidden layer fully connected network with 1024 hidden units (FC1) and a convolutional network (CNN) with 2 convolutional and 2 dense layers as used in . For more details see Appendix B. Note that since code for is not available, we just show the $l_{\infty}$ results from their paper (evaluated on full test sets with PGD lower bounds, and MIP upper bounds), which are available only for MNIST and CIFAR-10 for the same CNN architecture.

Improvement of robustness: The results can be found in Table 1. For CIFAR-10 we show only CNN models as fully connected networks do not have good test performance. We report clean test error, lower and upper bounds on robust test error at the threshold indicated in Table 1, computed on 1000 points for $l_{\infty}$ , and on the full test sets for $l_{2}$ (as we do not do the MIP evaluation there). Both KW and MMR+at achieve similar performance regarding lower and upper bounds on the robust test error. For $l_{\infty}$ our MMR+at achieves overall better performance than KW, sometimes with significantly better clean test error like on F-MNIST. In some cases, e.g. on CIFAR-10, MMR+at provides slightly worse upper bounds, but instead preserves better test error. For $l_{2}$ KW performs better than MMR+at with the exception of GTS. This is to be expected as we use for the evaluation of the upper bounds on the robust test error the approach of which are directly optimized by their robust training procedure. Note that both KW and MMR/MMR+at outperform by large margin plain and adversarial training regarding provable robustness (upper bounds on the robust test error). With the exception of the CNN- $l_{2}$ models for MNIST and F-MNIST, the upper bound on robust test error of MMR+at is smaller than the lower bound of the plain model. Thus our models are provably better than the plain models regarding robust test error. Moreover, regarding robustness wrt to $l_{\infty}$ the gaps between lower and upper bounds are often very small for KW and MMR/MMR+at showing that both techniques lead to models which are easier to check via the MIP which we discuss next in more detail.

Enhancing verifiability: A key aspect of any robust training should be the ability of producing models both resistant to adversarial manipulation and being verifiable, in the sense of having guarantees on the minimal perturbation changing the decision for a test input. In fact, even if empirically model seems to be robust, only computing certificates allows to completely trust it. In our experiments, we could use successfully the method of to get meaningful guarantees, which so far, as noted in , could be achieved only on models provided by the specific training of . Moreover, the MIP is too slow to run on standard models, so that ad hoc techniques have been developed to train classifiers verifiable by MIP . Our MMR training produces also models which can be checked by MIP. This is due to the fact that the hyperplanes representing the boundary of the polytope are actually the boundary between the different regimes (identity or zero function) of ReLU units, so that pushing the hyperplanes implies that many inputs of ReLU units have constant sign in a wide region around the data points (and having ReLU units with unstable signs is the main slowdown factor for the MIP solver). In this context, develop a specific technique aiming at inducing stability of ReLU signs. Therefore, we provide a comparison of MMR to the ReLU-stability loss of in Table 1. On MNIST our MMR+at model of the same CNN architecture has much tighter upper bounds on the robust error – 3.6% compared to 5.7% of Xiao et al. Moreover, we have no gap between the lower and upper bounds on the robust test error, and our upper bounds are lower than the lower bounds of Xiao et al. This suggests that our training scheme is better at both improving robustness and enhancing verifiability than the approach of Xiao et al . Additionally, on CIFAR-10 we have similar upper bounds, but better test error and better lower bounds. To illustrate the speedup in verification time for MMR models, we show in Figure 3 the time MIP needs to run on 1000 points (with timeout of 120s per point) and lower and upper bounds on robust test error wrt $l_{\infty}$ -distance at $\epsilon=0.1$ . We use CNNs trained on MNIST with $\gamma_{B}=\gamma_{D}=0.15$ and different values of $\lambda$ representing the weight of our regularizer in the loss (7) (for $\lambda=0$ one gets the plain model). It is clear that MIP performs poorly both in runtime (almost 2000 minutes) and performance (LB=1%, UB=100%) on the plain model. In contrast, the MMR models are verified quickly (between 35 and 79 minutes) and almost completely (the rate of certified points is between 99.3% and 100%), that is lower and upper bounds are close or even equal. Please note that both statistics improve with increasing $\lambda$ up to 2, while runtime gets worse with a larger value (as at $\lambda=3$ the classifier becomes less robust and then requires a higher computational effort to be certified).

2 Further experiments

We present a series of experiments for a detailed understanding of how MMR works. For this section we consider models trained to be $l_{2}$ -robust and evaluate robustness as the average $l_{2}$ -norm of the perturbations necessary to change the class. Lower bounds on this perturbation are computed either with or our method (Theorem 3.1), while upper bounds are provided by Carlini-Wagner $l_{2}$ -attack (CW) . We also introduce a second fully connected architecture, FC10, with 10 hidden layers (see Appendix for details).

Importance of linear regions maximization: In order to highlight the importance of both parts of the MMR regularization, i) penalization of the distance to decision boundary and ii) penalization of the distance to boundary of the polytope, we train, for each dataset/architecture, models penalizing only the distance to the decision boundary, that is the second term in the r.h.s. of (5) and (6). We call this partial regularizer MMR- $d_{D}$ , in contrast to the full version MMR-full. Then we compare the lower and upper bounds on the solution of (1) for MMR- $d_{D}$ and MMR-full models. For a fair comparison we consider models with similar test error. We clearly see in Table 2 that the lower bounds, computed by the method presented in , are always significantly better when MMR-full is used, while the behavior of the upper bounds does not clearly favor one of the two. This result shows that in order to get good lower bounds one has to increase the distance of the points to the boundaries of the polytope.

Guaranteed optimal solutions via MMR: Theorem 3.1 provides a simple and efficient way to obtain in certain cases the solution of (1). Although for normally trained networks the conditions are rarely satisfied, we show in Table 3 that for the MMR-models for fully connected networks for a significant fraction of the test set we obtain the globally optimal solution of (1), that is the true $l_{2}$ -robustness. Moreover, we report how much better our globally optimal solutions are compared to the lower bounds of . Interestingly, we can provably get the true robustness for around $10\%$ of points for F-MNIST and for over $98\%$ of the points on GTS for the case of fully connected networks. For these cases the optimal solutions have roughly $7\%$ larger $l_{2}$ -norm for F-MNIST and $0.5\%$ larger for GTS than the lower bounds. Note that currently the MIP is not efficient for $l_{2}$ even though there is room for improvement. Globally optimal solutions for larger networks achieved via our method can serve as a test both for lower and upper bounds. This is an important issue as currently large parts of the community relies just on upper bounds of robustness using attack schemes like the CW-attack which we address in the the next paragraph.

Evaluation of CW-attack: The CW-attack which we use for our upper bounds in Tables 2 and 5 is considered state-of-the-art. Thus it is interesting to see how close it is to the globally optimal solution. On GTS FC1 we find for the MMR model with best test error from Table 5 the globally optimal solution of (1) for 12596 out of 12630 test points. We compare on this subset the optimal norm $\left\|\delta_{opt}\right\|_{2}$ to $\left\|\delta_{CW}\right\|_{2}$ obtained by the CW-attack and plot the ratios $\left\|\delta_{CW}\right\|_{2}$ / $\left\|\delta_{opt}\right\|_{2}$ in descending order in Figure 4 (note that we truncate at 4000 points). While the CW-attack performs in general well, there are 2330 points ( $18.5\%$ of the test set) where the CW-attack has at least 10% larger norms and 1145 points (around $9.1\%$ ) with at least 20% larger norms. The maximal relative difference is 235%. Thus at least on a pointwise basis evaluating robustness with respect to an attack scheme can significantly overestimate robustness, see also . This shows the importance of techniques to prove lower bounds. Moreover, the time to compute the adversarial examples by the CW-attack is 16327s, while our technique provides both lower bounds and optimal solutions in 1701s.

Conclusion

We have introduced a geometrically motivated regularization scheme which leads to models which are provably robust according to the current state-of-the-art certification methods of . In particular, it performs as well as the state-of-the-art models of in terms of certified robust test error, and better or similarly to . Finally, our method based on the linear regions allows to obtain the globally minimal adversarial perturbation in a significant fraction of cases for large fully connected networks which can be used to test lower and upper bounds.

Acknowledgements

We would like to thank Vincent Tjeng for helping to set up the MIP evaluation. Furthermore, we thank Eric Wong and Zico Kolter for adapting their code to the $l_{2}$ -norm, as well as for many helpful discussions.

References

Appendix A Improving lower bounds

The additional box constraints lead to the following optimization problem,

which is convex but has no analytical solution. However, its dual is just a one-dimensional convex optimization problem which can be solved efficiently. In fact a reformulation of this problem has been considered in , where fast solvers for $p\in\{1,2,\infty\}$ are proposed. Moreover, when computing the distance to the boundary of the polytope or the decision boundaries one does not need to solve always the box-constrained distance problem (9). It suffices to compute first the distances (8) as they are smaller or equal to the ones of (9) and sort them in ascending order. Then one computes the box-constrained distances in the given order and stops when the smallest computed box-constrained distance is smaller than the next original distance in the sorted list. In this way one typically just needs to solve a very small fraction of all box-constrained problems. The integration of the box constraints is important as the lower bounds improve on average by 20% and this can make the difference between having a certified optimal solution and just a lower bound.

A.2 Checking neighboring regions

In order to improve the lower bounds we can use not only the linear region $Q(x)$ where $x$ lies but also some neighboring regions. The following description is just a sketch as one has to handle several case distinctions.

Let $x$ be the original point and $H=\{\pi_{1},...,\pi_{n}\}$ the set of hyperplanes defining the polytope $Q(x)$ sorted so that $d_{C}(x,\pi_{i})<d_{C}(x,\pi_{j})$ if $i<j$ , where $d_{C}$ is the distance including box constraints. If we do not directly get the guaranteed optimal solution, we get an upper bound ( $u$ , namely the distance to the decision boundary inside $Q(x)$ ) and a lower bound for the norm of the adversarial perturbation ( $l=d_{C}(x,\pi_{1})$ ). If $l<u$ , we can check the region that we find on the other side of $\pi_{1}$ . In order to get the corresponding description of the polytope on the other side, we just have to change the corresponding entry in the activation matrix $\Sigma$ of the layer where $\pi_{1}$ belongs to and recompute the hyperplanes of the new linear region $R$ . Solving (1) on the second region we get a new upper bound if the distance of $x$ to the decision boundary in $R$ is smaller than $u$ . Moreover we update $H$ with the hyperplanes given by the second region. Finally, if $u<d_{C}(x,\pi_{2})$ then $u$ is the optimal solution, otherwise $l=d_{C}(x,\pi_{2})$ and we can repeat this process with the next closest hyperplane. At the moment we stop after checking maximally $5$ neighboring linear regions.

Appendix B Main experiments

By FC1 we denote a one hidden layer fully connected network with 1024 hidden units. By FC10 we denote a 10 hidden layers network that has 1 layer with 124 units, seven layers with 104 units and two layers with 86 units (so that the total number of units is again 1024). The convolutional architecture that we use is identical to , which consists of two convolutional layers with 16 and 32 filters of size $4\times 4$ and stride 2, followed by a fully connected layer with 100 hidden units. For all experiments we use batch size 128 and we train the models for 100 epochs. Moreover, we use Adam optimizer with the default learning rate 0.001 for all models except for the $l_{2}$ models on MNIST and F-MNIST where we use the learning rate of $10^{-4}$ for MMR and $5\cdot 10^{-5}$ for MMR+at. We also reduce the learning rate by a factor of 10 for the last 10 epochs. On CIFAR-10 dataset we apply random crops and random mirroring of the images as data augmentation. For training we use MMR regularizer in the formulation (6) with $k_{D}$ equal to the number of classes, and with $k_{B}$ linearly (wrt the epoch) decreasing from 10% to 2% of the total number of hidden units of the particular network architecture. We also use a training schedule for $\lambda$ where we linearly increase it from $\lambda/10$ to $\lambda$ during the first 10 epochs. We employ both schemes since they increase the stability of training with MMR. In order to determine the best set of hyperparameters $\lambda$ , $\gamma_{B}$ , and $\gamma_{D}$ of MMR, we perform a grid search over them for every dataset and network architecture. In particular, we empirically found that the optimal values of $\gamma_{B}$ and $\gamma_{D}$ are equal and are usually 1.5-2 times higher than the $\epsilon$ of robust error. All the reported MMR models and the final set of hyperparameters can be found at https://github.com/max-andr/provable-robustness-max-linear-regions. In order to make a comparison to the robust training of we either take their publicly available models or retrain them using their code. For the main experiments, we set the parameter for KW training equal to the $\epsilon$ for which we check the robust error. For experiments where robustness is evaluated as average lower bound (see Section 5.2), we performed a grid search over the radius of the $l_{2}$ -norm used in their robust loss, aiming at a model with non-trivial lower bounds with little or no loss in test error.

We perform adversarial training using the PGD attack of with 50% clean and 50% adversarial examples in every batch. For the $l_{2}$ -norm, we adapted the implementation from to perform the gradient update normalized by its $l_{2}$ norm, instead of the gradient sign (which corresponds to $l_{\infty}$ -norm and thus irrelevant for $l_{2}$ case) on every iteration. We use the same $l_{2}$ -bound on the perturbation as the $\epsilon$ used in robust error. During training, we perform 40 iterations of the PGD attack for MNIST and F-MNIST, and 7 for GTS and CIFAR-10. During evaluation, we use 40 iterations for all datasets. The step size is selected as $\epsilon$ divided by the number of iterations and multiplied by 2. We use the untargeted formulation of the Carlini-Wagner $l_{2}$ attack in order to evaluate the upper bounds on the $l_{2}$ -norm required to change the class. We use the settings provided in the original paper and in their code, including 20 iterations of the binary search, 10000 iterations of the optimizer, learning rate 0.01 and initial constant of 0.001. For the Mixed Integer Programming evaluation we use the library of with the settings of , which we obtained via private correspondence. Namely, we use Gurobi as the back-end, LP as the tightening algorithm, and we set 5s timeout for the presolver, and 120s timeout for the main solver.

Appendix C Further experiments

Comparison to Cross-Lipschitz regularization: We also consider models trained with Cross-Lipschitz regularization of since they also consider the robustness wrt the $l_{2}$ norm. We evaluated upper bounds on the robust error of MNIST-FC1 models using the method of , which gives tighter bounds than the original Cross-Lipschitz guarantee of . As a result, their most provably robust model obtained test error of 1.38%, and the robust error bounded between 2.5% given by the PGD attack and 7.2% by . Thus we can observe that Cross-Lipschitz regularization also enhances certifiability in this case since the upper bound is significantly better than the one for adversarial training, 16.9%. However, both KW and MMR+at provide a better upper bound on robust error, 5.2% and 6.4% respectively.

Comparison of lower bounds: In Table 4 we compare, for fully connected models, the lower bounds on the distance to the decision boundary computed by and our technique using Theorem 3.1 with integration of box constraints once just checking the initial linear region $Q(x)$ where the point $x$ lies versus also checking neighboring linear regions. We see that obtain better lower bounds, this is why we use their method for the main evaluation of provable robustness in Table 1. Nevertheless, the gap is not too large and while the lower bounds are worse, the achieved robustness using our MMR regularization is mostly better as shown in Table 1.

Analysing $l_{2}$ -robustness with different metrics: We want here to repeat the experiment of Section 5 wrt $l_{2}$ -norm but evaluating robustness as the average norm of the perturbation necessary to change the classification of a point. We can compute for every input a lower bound on the $l_{2}$ -norm of the minimal adversarial perturbation thanks to the method of and an upper bound with CW-attack . However, when our technique (Theorem 3.1) provides the optimal solution, we set both lower and upper bounds to this value. We report test error of the model and the average lower and upper bounds in Tables 5 and 6, computed on 1000 points of the test set. For KW, MMR and MMR + adversarial training we report the solutions which achieve similar test error than the plain model. There are several interesting observations. First of all, while adversarial training improves the upper bounds compared to the plain setting often quite significantly, the lower bounds almost never improve, often they get even worse. This is in contrast to the methods, KW and our MMR, which optimize the robustness guarantees. For MMR we see in all cases significant improvements of the lower bounds over plain and adversarial training, for KW this is also true but the improvements on GTS are much smaller. Notably, for the fully connected networks FC1 and FC10 on GTS and F-MNIST , the lower bounds achieved by MMR and/or MMR+at are larger than the upper bounds of the plain training for F-MNIST and better than plain and adversarial training as well as KW on GTS. Thus MMR is provably more robust than the competing methods in these configurations. Moreover, the achieved lower bounds of MMR are only worse than the ones of KW on MNIST for FC10. Also for the achieved upper bounds MMR is most of the time better than KW and always improves over adversarial training. For the CNNs the improvements of KW and MMR over plain and adversarial training in terms of lower and upper bounds are smaller than for the fully connected networks and it is harder to maintain similar test performance. The differences between KW and MMR for the lower bounds are very small so that for CNNs both robust methods perform on a similar level.

Appendix D Visualizing the structure of provably robust models

Here we analyse the effect of MMR regularization on the gradient of cross entropy loss wrt the input and on the structure of the convolutional filters. We focus on models trained for $l_{\infty}$ -robustness on MNIST, GTS, and CIFAR-10. We provide the plain and adversarially trained model as reference. In Figures 5, 7, 9 we visualize the gradients computed for 10 images of the corresponding test sets (shown in the first row) for the different training procedures. For MNIST, zero values are represented in white, negative in blue and positive in red, where every image is rescaled independently. For GTS and CIFAR-10, we follow and clip the gradient values to $\pm$ 3 standard deviations and then rescale them to $^{d}$ . We visualize models obtained with plain training (second row), adversarial training (third), KW robust training (fourth), MMR (fifth), MMR + adversarial training (last row). Figures 6, 8, 10 show the structure of the weights of the filters of the two convolutional layers. The top five rows are the filters from the first layer, either all of them for MNIST, or the first five for GTS and CIFAR-10 (reshaped from $4\times 4\times 3$ to $4\times 12$ ). The bottom five are the filters from the second convolution layer (reshaped from $4\times 4\times 16$ to $16\times 16$ ). We present the filters for plain training (first and sixth row), adversarial training (second and seventh), KW robust training (third and eighth), MMR (fourth and ninth), MMR + adversarial training (fifth and last row). The white color corresponds to zero weights, and the farther the value is from zero, the more intense is the color. Note that each row is also scaled independently.

MNIST (Figures 5, 6): As noted in , adversarial training leads to more interpretable gradients that focus on salient features of the digits. All robust training schemes have the effect of creating both interpretable and sparse gradients, which is reasonable considering that a sparse gradient implies that there are less directions along which abrupt variations in the value of the loss are possible. While KW model seems to highlight more the borders of the images, MMR models have the most concentrated gradients. In line with what has been reported in , the filters of adversarially trained and provably robust models (Figure 6) have significantly higher sparsity than the plain model. Interestingly, in contrast to the others, MMR models preserve sparsity also in the filters on the second convolutional layer. This seems to be important for obtaining tight certificates by the KW method and for fast verification with the MIP solver.

GTS (Figures 7, 8): For GTS we also observe that all robust training schemes lead to a qualitatively different behaviour of the gradient. In particular, the models become less sensitive to variations of the background, and instead they highlight more the traffic signs. We observe that the gradients for the MMR and adversarially trained models are the most interpretable. We note that the similarity between the gradients of the KW model and MMR+at is possibly due to their similar test error and robustness (Table 1). Similarly to MNIST, we observe that the convolutional filters are sparser for the adversarially trained model than for the plain model. The filters of KW and MMR models are the sparsest, while MMR+at is less sparse and resembles more the adversarially trained model.

CIFAR-10 (Figures 9, 10): First of all, we note that the considered shallow CNNs are not able to achieve very low clean test error (the best is 24.63% for the plain model and 34.61% for provably robust models). This may explain why we cannot observe interpretable gradients as clearly as in even for the adversarially trained model. However, we note that the gradients for the robustly trained models are still qualitatively different from the plain model. While the gradients of the plain model just consist of high-frequency noise, e.g. the MMR model concentrates more on the objects rather than on the background. For the convolutional filters of the CIFAR-10 models, we make conclusions similar to GTS. In particular, the filters of KW and MMR models are the sparsest, while MMR+at is less sparse and is more similar to the adversarially trained model.