Measuring Neural Net Robustness with Constraints

Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya Nori, Antonio Criminisi

Introduction

Recent work shows that it is often possible to construct an input mislabeled by a neural net by perturbing a correctly labeled input by a tiny amount in a carefully chosen direction. Lack of robustness can be problematic in a variety of settings, such as changing camera lens or lighting conditions, successive frames in a video, or adversarial attacks in security-critical applications .

A number of approaches have since been proposed to improve robustness . However, work in this direction has been handicapped by the lack of objective measures of robustness. A typical approach to improving the robustness of a neural net $f$ is to use an algorithm $\mathcal{A}$ to find adversarial examples, augment the training set with these examples, and train a new neural net $f^{\prime}$ . Robustness is then evaluated by using the same algorithm $\mathcal{A}$ to find adversarial examples for $f^{\prime}$ —if $\mathcal{A}$ discovers fewer adversarial examples for $f^{\prime}$ than for $f$ , then $f^{\prime}$ is concluded to be more robust than $f$ . However, $f^{\prime}$ may have overfit to adversarial examples generated by $\mathcal{A}$ —in particular, a different algorithm $\mathcal{A}^{\prime}$ may find as many adversarial examples for $f^{\prime}$ as for $f$ . Having an objective robustness measure is vital not only to reliably compare different algorithms, but also to understand robustness of production neural nets—e.g., when deploying a login system based on face recognition, a security team may need to evaluate the risk of an attack using adversarial examples.

In this paper, we study the problem of measuring robustness. We propose to use two statistics of the robustness $\rho(f,\mathbf{x}_{*})$ of $f$ at point $\mathbf{x}_{*}$ (i.e., the $L_{\infty}$ distance from $\mathbf{x}_{*}$ to the nearest adversarial example) . The first one measures the frequency with which adversarial examples occur; the other measures the severity of such adversarial examples. Both statistics depend on a parameter $\epsilon$ , which intuitively specifies the threshold below which adversarial examples should not exist (i.e., points $\mathbf{x}$ with $L_{\infty}$ distance to $\mathbf{x}_{*}$ less than $\epsilon$ should be assigned the same label as $\mathbf{x}_{*}$ ).

The key challenge is efficiently computing $\rho(f,\mathbf{x}_{*})$ . We give an exact formulation of this problem as an intractable optimization problem. To recover tractability, we approximate this optimization problem by constraining the search to a convex region $\mathcal{Z}(\mathbf{x}_{*})$ around $\mathbf{x}_{*}$ . Furthermore, we devise an iterative approach to solving the resulting linear program that produces an order of magnitude speed-up. Common neural nets (specifically, those using rectified linear units as activation functions) are in fact piecewise linear functions ; we choose $\mathcal{Z}(\mathbf{x}_{*})$ to be the region around $\mathbf{x}_{*}$ on which $f$ is linear. Since the linear nature of neural nets is often the cause of adversarial examples , our choice of $\mathcal{Z}(\mathbf{x}_{*})$ focuses the search where adversarial examples are most likely to exist.

We evaluate our approach on a deep convolutional neural network $f$ for MNIST. We estimate $\rho(f,\mathbf{x}_{*})$ using both our algorithm $\mathcal{A}_{\text{LP}}$ and (as a baseline) the algorithm $\mathcal{A}_{\text{L-BFGS}}$ introduced by . We show that $\mathcal{A}_{\text{LP}}$ produces a substantially more accurate estimate of $\rho(f,\mathbf{x}_{*})$ than $\mathcal{A}_{\text{L-BFGS}}$ . We then use data augmentation with each algorithm to improve the robustness of $f$ , resulting in fine-tuned neural nets $f_{\text{LP}}$ and $f_{\text{L-BFGS}}$ . According to $\mathcal{A}_{\text{L-BFGS}}$ , $f_{\text{L-BFGS}}$ is more robust than $f$ , but not according to $\mathcal{A}_{\text{LP}}$ . In other words, $f_{\text{L-BFGS}}$ overfits to adversarial examples computed using $\mathcal{A}_{\text{L-BFGS}}$ . In contrast, $f_{\text{LP}}$ is more robust according to both $\mathcal{A}_{\text{L-BFGS}}$ and $\mathcal{A}_{\text{LP}}$ . Furthermore, to demonstrate scalability, we apply our approach to evaluate the robustness of the 23-layer network-in-network (NiN) neural net for CIFAR-10, and reveal a surprising lack of robustness. We fine-tune NiN and show that robustness improves, albeit only by a small amount. In summary, our contributions are:

We formalize the notion of pointwise robustness studied in previous work and propose two statistics for measuring robustness based on this notion (§2).

We show how computing pointwise robustness can be encoded as a constraint system (§3). We approximate this constraint system with a tractable linear program and devise an optimization for solving this linear program an order of magnitude faster (§4).

We demonstrate experimentally that our algorithm produces substantially more accurate measures of robustness compared to algorithms based on previous work, and show evidence that neural nets fine-tuned to improve robustness (§5) can overfit to adversarial examples identified by a specific algorithm (§6).

Our formalization of the robustness $\rho(f,\mathbf{x}_{*})$ of $f$ at $\mathbf{x}_{*}$ corresponds to the notion in of finding the minimal $\|\mathbf{r}\|_{\infty}$ . We propose an exact algorithm for computing $\rho(f,\mathbf{x}_{*})$ as well as a tractable approximation. The algorithm in can also be used to approximate $\rho(f,\mathbf{x}_{*})$ ; we show experimentally that our algorithm is substantially more accurate than .

There has been a range of subsequent work studying robustness; devises an algorithm for finding purely synthetic adversarial examples (i.e., no initial image $\mathbf{x}_{*}$ ), searches for adversarial examples using random perturbations, showing that adversarial examples in fact exist in large regions of the pixel space, shows that even intermediate layers of neural nets are not robust to adversarial noise, and seeks to explain why neural nets may generalize well despite poor robustness properties.

A key shortcoming of these lines of work is that robustness is typically measured using the same algorithm used to find adversarial examples, in which case the resulting neural net may have overfit to adversarial examples generating using that algorithm. For example, shows improved accuracy to adversarial examples generated using their own signed gradient method, but do not consider whether robustness increases for adversarial examples generated using more precise approaches such as . Similarly, compares accuracy to adversarial examples generated using both itself and (but not ), and only considers accuracy on adversarial examples generated using their own approach on the baseline network. The aim of our paper is to provide metrics for evaluating robustness, and to demonstrate the importance of using such impartial measures to compare robustness.

Additionally, there has been work on designing neural network architectures and learning procedures that improve robustness to adversarial perturbations, though they do not obtain state-of-the-art accuracy on the unperturbed test sets. There has also been work using smoothness regularization related to to train neural nets, focusing on improving accuracy rather than robustness .

Robustness has also been studied in more general contexts; studies the connection between robustness and generalization, establishes theoretical lower bounds on the robustness of linear and quadratic classifiers, and seeks to improve robustness by promoting resiliance to deleting features during training. More broadly, robustness has been identified as a desirable property of classifiers beyond prediction accuracy. Traditional metrics such as (out-of-sample) accuracy, precision, and recall help users assess prediction accuracy of trained models; our work aims to develop analogous metrics for assessing robustness.

Robustness Metrics

Pointwise robustness. Intuitively, $f$ is robust at $\mathbf{x}_{*}\in\mathcal{X}$ if a “small” perturbation to $\mathbf{x}_{*}$ does not affect the assigned label. We are interested in perturbations sufficiently small that they do not affect human classification; an established condition is $\|\mathbf{x}-\mathbf{x}_{*}\|_{\infty}\leq\epsilon$ for some parameter $\epsilon$ . Formally, we say $f$ is $(\mathbf{x}_{*},\epsilon)$ -robust if for every $\mathbf{x}$ such that $\|\mathbf{x}-\mathbf{x}_{*}\|_{\infty}\leq\epsilon$ , $f(\mathbf{x})=f(\mathbf{x}_{*})$ . Finally, the pointwise robustness $\rho(f,\mathbf{x}_{*})$ of $f$ at $\mathbf{x}_{*}$ is the minimum $\epsilon$ for which $f$ fails to be $(\mathbf{x}_{*},\epsilon)$ -robust:

This definition formalizes the notion of robustness in .

Given a parameter $\epsilon$ , the adversarial frequency

measures how often $f$ fails to be $(\mathbf{x}_{*},\epsilon)$ -robust. In other words, if $f$ has high adversarial frequency, then it fails to be $(\mathbf{x}_{*},\epsilon)$ -robust for many inputs $\mathbf{x}_{*}$ .

Adversarial severity.

Given a parameter $\epsilon$ , the adversarial severity

measures the severity with which $f$ fails to be robust at $\mathbf{x}_{*}$ conditioned on $f$ not being $(\mathbf{x}_{*},\epsilon)$ -robust. We condition on pointwise robustness since once $f$ is $(\mathbf{x}_{*},\epsilon)$ -robust at $\mathbf{x}_{*}$ , then the degree to which $f$ is robust at $\mathbf{x}_{*}$ does not matter. Smaller $\mu(f,\epsilon)$ corresponds to worse adversarial severity, since $f$ is more susceptible to adversarial examples if the distances to the nearest adversarial example are small.

The frequency and severity capture different robustness behaviors. A neural net may have high adversarial frequency but low adversarial severity, indicating that most adversarial examples are about $\epsilon$ distance away from the original point $\mathbf{x}_{*}$ . Conversely, a neural net may have low adversarial frequency but high adversarial severity, indicating that it is typically robust, but occasionally severely fails to be robust. Frequency is typically the more important metric, since a neural net with low adversarial frequency is robust most of the time. Indeed, adversarial frequency corresponds to the accuracy on adversarial examples used to measure robustness in . Severity can be used to differentiate between neural nets with similar adversarial frequency.

Given a set of samples $X\subseteq\mathcal{X}$ drawn i.i.d. from $\mathcal{D}$ , we can estimate $\phi(f,\epsilon)$ and $\mu(f,\epsilon)$ using the following standard estimators, assuming we can compute $\rho$ :

An approximation $\hat{\rho}(f,\mathbf{x}_{*})\approx\rho(f,\mathbf{x}_{*})$ of $\rho$ , such as the one we describe in Section 4, can be used in place of $\rho$ . In practice, $X$ is taken to be the test set $X_{\text{test}}$ .

Computing Pointwise Robustness

2 Formulation as Optimization

We compute $\rho(f,\epsilon)$ by expressing (1) as constraints $\mathcal{C}$ , which consist of

Conjunctions $\mathcal{C}\equiv\mathcal{C}_{1}\wedge\mathcal{C}_{2}$ , where $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ are themselves constraints. Both constraints must be satisfied for the conjunction to be satisfied.

Disjunctions $\mathcal{C}\equiv\mathcal{C}_{1}\vee\mathcal{C}_{2}$ , where $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ are themselves constraints. One of the constraints must be satisfied for the disjunction to be satisfied.

The optimization problem is typically intractable; we describe a tractable approximation in §4.

3 Encoding a Neural Network

In this case, $\mathbf{x}^{(i)}=f^{(i)}(\mathbf{x}^{(i-1)})=W^{(i)}\mathbf{x}^{(i-1)}+\mathbf{b}^{(i)}$ , which we encode using the constraints $\mathcal{C}_{i}\equiv\bigwedge_{j=1}^{n_{i}}\left\{\mathbf{x}^{(i)}_{j}=W^{(i)}_{j}\mathbf{x}^{(i-1)}+\mathbf{b}^{(i)}_{j}\right\}$ , where $W^{(i)}_{j}$ is the $j$ -th row of $W^{(i)}$ .

ReLU layer.

In this case, $\mathbf{x}^{(i)}_{j}=\max~{}\{\mathbf{x}^{(i-1)}_{j},0\}$ (for each $1\leq j\leq n_{i}$ ), which we encode using the constraints $\mathcal{C}_{i}\equiv\bigwedge_{j=1}^{n_{i}}\mathcal{C}_{ij}$ , where $\mathcal{C}_{ij}=(\mathbf{x}^{(i-1)}_{j}{<}0\wedge\mathbf{x}^{(i)}_{j}{=}0)\vee(\mathbf{x}^{(i-1)}_{j}\geq 0\wedge\mathbf{x}^{(i)}_{j}{=}\mathbf{x}^{(i-1)}_{j})$ .

Approximate Computation of Pointwise Robustness

The objective is optimized over $\mathbf{x}\in\mathcal{Z}(\mathbf{x}_{*})$ , which approximates the optimum over $\mathbf{x}\in\mathcal{X}$ .

We construct $\mathcal{Z}(\mathbf{\mathbf{x}}_{*})$ as the feasible set of constraints $\mathcal{D}(\mathbf{x}_{*})$ ; i.e., $\mathcal{Z}(\mathbf{x}_{*})=\mathcal{F}(\mathcal{D}(\mathbf{x}_{*}))$ . We now describe how to construct $\mathcal{D}(\mathbf{x}_{*})$ .

Note that $\mathcal{F}(\mathbf{w}^{T}\mathbf{x}+b=0)$ and $\mathcal{F}(\mathbf{w}^{T}\mathbf{x}+b\geq 0)$ are convex sets. Furthermore, if $\mathcal{F}(\mathcal{C}_{1})$ and $\mathcal{F}(\mathcal{C}_{2})$ are convex, then so is their conjunction $\mathcal{F}(\mathcal{C}_{1}\wedge\mathcal{C}_{2})$ . However, their disjunction $\mathcal{F}(\mathcal{C}_{1}\vee\mathcal{C}_{2})$ may not be convex; for example, $\mathcal{F}((x\geq 0)\vee(y\geq 0))$ . The potential non-convexity of disjunctions makes (3) difficult to optimize.

We can eliminate disjunction operations by choosing one of the two disjuncts to hold. For example, note that for $\mathcal{C}_{1}\equiv\mathcal{C}_{2}\vee\mathcal{C}_{3}$ , we have both $\mathcal{F}(\mathcal{C}_{2})\subseteq\mathcal{F}(\mathcal{C}_{1})$ and $\mathcal{F}(\mathcal{C}_{3})\subseteq\mathcal{F}(\mathcal{C}_{1})$ . In other words, if we replace $\mathcal{C}_{1}$ with either $\mathcal{C}_{2}$ or $\mathcal{C}_{3}$ , the feasible set of the resulting constraints can only become smaller. Taking $\mathcal{D}(\mathbf{x}_{*})\equiv\mathcal{C}_{2}$ (resp., $\mathcal{D}(\mathbf{x}_{*})\equiv\mathcal{C}_{3}$ ) effectively replaces $\mathcal{C}_{1}$ with $\mathcal{C}_{2}$ (resp., $\mathcal{C}_{3}$ ).

For example, consider a rectified linear layer (as before, max pooling layers are similar). The original constraint added for unit $j$ of rectified linear layer $f^{(i)}$ is

To restrict this constraint, we evaluate the neural network on the seed input $\mathbf{x}_{*}$ and look at the input to $f^{(i)}$ , which equals $\mathbf{x}^{(i-1)}_{*}=f^{(i-1)}(...(f^{(1)}(\mathbf{x}_{*}))...)$ . Then, for each $1\leq j\leq n_{i}$ :

Iterative constraint solving.

We implement an optimization for solving LPs by lazily adding constraints as necessary. Given all constraints $\mathcal{C}$ , we start off solving the LP with the subset of equality constraints $\hat{\mathcal{C}}\subseteq\mathcal{C}$ , which yields a (possibly infeasible) solution $\mathbf{z}$ . If $\mathbf{z}$ is feasible, then $\mathbf{z}$ is also an optimal solution to the original LP; otherwise, we add to $\hat{\mathcal{C}}$ the constraints in $\mathcal{C}$ that are not satisfied by $\mathbf{z}$ and repeat the process. This process always yields the correct solution, since in the worst case $\hat{\mathcal{C}}$ becomes equal to $\mathcal{C}$ . In practice, this optimization is an order of magnitude faster than directly solving the LP with constraints $\mathcal{C}$ .

Single target label.

Approximate robustness statistics.

We can use $\hat{\rho}$ in our statistics $\hat{\phi}$ and $\hat{\mu}$ defined in §2. Because $\hat{\rho}$ is an overapproximation of $\rho$ (i.e., $\hat{\rho}(f,\mathbf{x}_{*})\geq\rho(f,\mathbf{x}_{*})$ ), the estimates $\hat{\phi}$ and $\hat{\mu}$ may not be unbiased (in particular, $\hat{\phi}(f,\epsilon)\leq\phi(f,\epsilon)$ ). In §6, we show empirically that our algorithm produces substantially less biased estimates than existing algorithms for finding adversarial examples.

Improving Neural Net Robustness

We can use our algorithm for estimating $\hat{\rho}(f,\mathbf{x}_{*})$ to compute adversarial examples. Given $\mathbf{x}_{*}$ , the value of $\mathbf{x}$ computed by the optimization procedure used to solve (5) is an adversarial example for $\mathbf{x}_{*}$ with $\|\mathbf{x}-\mathbf{x}_{*}\|_{\infty}=\hat{\rho}(f,\mathbf{x}_{*})$ .

Finetuning.

We use fine-tuning to reduce a neural net’s susceptability to adversarial examples. First, we use an algorithm $\mathcal{A}$ to compute adversarial examples for each $\mathbf{x}_{*}\in X_{\text{train}}$ and add them to the training set. Then, we continue training the network on a the augmented training set at a reduced training rate. We can repeat this process multiple rounds (denoted $T$ ); at each round, we only consider $\mathbf{x}_{*}$ in the original training set (rather than the augmented training set).

Rounding errors.

Experiments

We find adversarial examples for the neural net LeNet (modified to use ReLUs instead of sigmoids) trained to classify MNIST , and for the network-in-network (NiN) neural net trained to classify CIFAR-10 . Both neural nets are trained using Caffe . For MNIST, Figure 2 (b) shows an adversarial example (labeled 1) we find for the image in Figure 2 (a) labeled 3, and Figure 2 (c) shows the corresponding adversarial perturbation scaled so the difference is visible (it has $L_{\infty}$ norm $17$ ). For CIFAR-10, Figure 2 (e) shows an adversarial example labeled “truck” for the image in Figure 2 (d) labeled “automobile”, and Figure 2 (f) shows the corresponding scaled adversarial perturbation (which has $L_{\infty}$ norm $3$ ).

2 Comparison to Other Algorithms on MNIST

We performed a similar comparison to the signed gradient algorithm proposed by (with the signed gradient multiplied by $\epsilon=20$ pixels). For LeNet, this algorithm found only one adversarial example on the MNIST test set (out of 10,000) and four adversarial examples on the MNIST training set (out of 60,000), so we omit results Futhermore, the signed gradient algorithm cannot be used to estimate adversarial severity since all the adversarial examples it finds have $L_{\infty}$ norm $\epsilon$ ..

In Figure 3, we plot the number of test points $\mathbf{x}_{*}$ for which $\hat{\rho}(f,\mathbf{x}_{*})\leq\epsilon$ , as a function of $\epsilon$ , where $\hat{\rho}(f,\mathbf{x}_{*})$ is estimated using (a) the baseline and (b) our algorithm. These plots compare the robustness of each neural network as a function of $\epsilon$ . In Table 1, we show results evaluating the robustness of each neural net, including the adversarial frequency and the adversarial severity. The running time of our algorithm and the baseline algorithm are very similar; in both cases, computing $\hat{\rho}(f,\mathbf{x}_{*})$ for a single input $\mathbf{x}_{*}$ takes about 1.5 seconds. For comparison, without our iterative constraint solving optimization, our algorithm took more than two minutes to run.

Discussion.

For every neural net, our algorithm produces substantially higher estimates of the adversarial frequency. In other words, our algorithm estimates $\hat{\rho}(f,\mathbf{x}_{*})$ with substantially better accuracy compared to the baseline.

According to the baseline metrics shown in Figure 3 (a), the baseline neural net (red) is similarly robust to our neural net (blue), and both are more robust than the original LeNet (black). Our neural net is actually more robust than the baseline neural net for smaller values of $\epsilon$ , whereas the baseline neural net eventually becomes slightly more robust (i.e., where the red line dips below the blue line). This behavior is captured by our robustness statistics—the baseline neural net has lower adversarial frequency (so it has fewer adversarial examples with $\hat{\rho}(f,\mathbf{x}_{*})\leq\epsilon$ ) but also has worse adversarial severity (since its adversarial examples are on average closer to the original points $\mathbf{x}_{*}$ ).

However, according to our metrics shown in Figure 3 (b), our neural net is substantially more robust than the baseline neural net. Again, this is reflected by our statistics—our neural net has substantially lower adversarial frequency compared to the baseline neural net, while maintaining similar adversarial severity. Taken together, our results suggest that the baseline neural net is overfitting to the adversarial examples found by the baseline algorithm. In particular, the baseline neural net does not learn the adversarial examples found by our algorithm. On the other hand, our neural net learns both the adversarial examples found by our algorithm and those found by the baseline algorithm.

3 Scaling to CIFAR-10

We also implemented our approach for the for the CIFAR-10 network-in-network (NiN) neural net , which obtains 91.31% test set accuracy. Computing $\hat{\rho}(f,\mathbf{x}_{*})$ for a single input on NiN takes about 10-15 seconds on an 8-core CPU. Unlike LeNet, NiN suffers severely from adversarial examples—we measure a 61.5% adversarial frequency and an adversarial severity of 2.82 pixels. Our neural net (NiN fine-tuned using our algorithm and $T=1$ ) has test set accuracy 90.35%, which is similar to the test set accuracy of the original NiN. As can be seen in Figure 3 (c), our neural net improves slightly in terms of robustness, especially for smaller $\epsilon$ . As before, these improvements are reflected in our metrics—the adversarial frequency of our neural net drops slightly to 59.6%, and the adversarial severity improves to 3.88. Nevertheless, unlike LeNet, our fine-tuned version of NiN remains very prone to adversarial examples. In this case, we believe that new techniques are required to significantly improve robustness.

Conclusion

We have shown how to formulate, efficiently estimate, and improve the robustness of neural nets using an encoding of the robustness property as a constraint system. Future work includes devising better approaches to improving robustness on large neural nets such as NiN and studying properties beyond robustness.