Towards Fast Computation of Certified Robustness for ReLU Networks

Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S. Dhillon, Luca Daniel

Introduction

Since the discovery of adversarial examples in deep neural network (DNN) image classifiers (Szegedy et al., 2013), researchers have successfully found adversarial examples in many machine learning tasks applied to different areas, including object detection (Xie et al., 2017), image captioning (Chen et al., 2018a), speech recognition (Cisse et al., 2017), malware detection (Wang et al., 2017) and reading comprehension (Jia & Liang, 2017). Moreover, black-box attacks have also been shown to be possible, where an attacker can find adversarial examples without knowing the architecture and parameters of the DNN (Chen et al., 2017; Papernot et al., 2017; Liu et al., 2017b).

The existence of adversarial examples poses a huge threat to the application of DNNs in mission-critical tasks including security cameras, self-driving cars and aircraft control systems. Many researchers have thus proposed defensive or detection methods in order to increase the robustness of DNNs. Notable examples are defensive distillation (Papernot et al., 2016), adversarial retraining/training (Kurakin et al., 2017; Madry et al., 2018) and model ensembles (Tramèr et al., 2018; Liu et al., 2017a). Despite many published contributions that aim at increasing the robustness of DNNs, theoretical results are rarely given and there is no guarantee that the proposed defensive methods can reliably improve the robustness. Indeed, many of these defensive mechanism have been shown to be ineffective when more advanced attacks are used (Carlini & Wagner, 2017c, a, b; He et al., 2017).

On the other hand, a lower bound βL\beta_{L} of radius r0r_{0} can be given, which guarantees that no examples within a ball of radius βL\beta_{L} can ever change the network classification outcome. (Hein & Andriushchenko, 2017) is a pioneering work on giving such a lower bound for neural networks that are continuously differentiable, although only a 2-layer MLP network with differentiable activations is investigated. (Weng et al., 2018) has extended theoretical result to ReLU activation functions and proposed a robustness score, CLEVER, based on extreme value theory. Their approach is feasible for large state-of-the-art DNNs but CLEVER is an estimate of βL\beta_{L} without certificates. Ideally, we would like to obtain a certified (which guarantees that βLr0\beta_{L}\leq r_{0}) and non-trivial (a trivial βL\beta_{L} is 0) lower bound βL\beta_{L} that is reasonably close to r0r_{0} within reasonable amount of computational time.

In this paper, we develop two fast algorithms for obtaining a tight and certified lower bound βL\beta_{L} on ReLU networks. In addition, we also provide a complementary theoretical result to (Katz et al., 2017; Sinha et al., 2018) by further showing there does not even exist a polynomial time algorithm that can approximately find the minimum adversarial distortion with a 0.99lnn0.99\ln n approximation ratio. Our contributions are:

We fully exploit the ReLU networks to give two computationally efficient methods of computing tighter and guaranteed robustness lower bounds via (1) linear approximation on the ReLU units (see Sec 3.3, Algorithm 1, Fast-Lin) and (2) bounding network local Lipschitz constant (see Sec 3.4, Algorithm LABEL:alg:fast-lip, Fast-Lip). Unlike the per-layer operator-norm-based lower bounds which are often very loose (close to 0, as verified in our experiments) for deep networks, our bounds are much closer to the upper bound given by the best adversarial examples, and thus can be used to evaluate the robustness of DNNs with theoretical guarantee.

We show that our proposed method is at least four orders of magnitude faster than finding the exact minimum distortion (with Reluplex), and also around two orders of magnitude (or more) faster than linear programming (LP) based methods. We can compute a reasonable robustness lower bound within a minute for a ReLU network with up to 7 layers or over ten thousands neurons, which is so far the best available result in the literature to our best knowledge.

Background and related work

2 Computing lower bounds of minimum distortion

3 Hardness and approximation algorithms

NPP\mathsf{NP}\neq\mathsf{P} is the most important and popular assumption in computational complexity in the last several decades. It can be used to show that the decision of the exact case of a problem is hard. However, in several cases, solving one problem approximately is much easier than solving it exactly. For example, there is no polynomial time algorithm to solve the MAX\mathsf{MAX}-CUT\mathsf{CUT} problem, but there is a simple 0.50.5-approximation polynomial time algorithm. Previous works (Katz et al., 2017; Sinha et al., 2018) show that there is no polynomial time algorithm to find the minimum adversarial distortion r0r_{0} exactly. A natural question to ask is: does there exist a polynomial time algorithm to solve the robustness problem approximately? In other words, can we give a lower bound of r0r_{0} with a guaranteed approximation ratio?

From another perspective, NPP\mathsf{NP}\neq\mathsf{P} only rules out the polynomial running time. Some problems might not even have a sub-exponential time algorithm. To rule out that, the most well-known assumption used is the “Exponential Time Hypothesis” (Impagliazzo et al., 1998). The hypothesis states that 3SAT\mathsf{3SAT} cannot be solved in sub-exponential time in the worst case. Another example is that while tensor rank calculation is NP-hard (Håstad, 1990), a recent work (Song et al., 2017b) proved that there is no 2o(n1o(1))2^{o(n^{1-o(1)})} time algorithm to give a constant approximation of the rank of the tensor. There are also some stronger versions of the hypothesis than ETH, e.g., Strong ETH (Impagliazzo & Paturi, 2001), Gap ETH (Dinur, 2016; Manurangsi & Raghavendra, 2017), and average case ETH (Feige, 2002; Razenshteyn et al., 2016).

Robustness guarantees for ReLU networks

1 Finding the minimum distortion with a 0.99lnnfragments0.99n0.99\ln n approximation ratio is hard

2 ReLU Networks and Their Activation Patterns

3 Approach 1 (Fast-Lin): Certified lower bounds via linear approximations

In this section, we propose a methodology to directly derive upper bounds and lower bounds of the output of an mm-layer feed-forward ReLU network. The central idea is to derive an explicit upper/lower bound based on the linear approximations for the neurons in category (iii) and the signs of the weights associated with the activations.

We start with a 2-layers network and then extend it to mm layers. The jj-th output of a 2-layer network is:

For neurons rI1+r\in\mathcal{I}^{+}_{1}, we have σ(Wr,:(1)x+br(1))=Wr,:(1)x+br(1)\sigma(\mathbf{W}^{(1)}_{r,:}\bm{x}+\bm{b}^{(1)}_{r})=\mathbf{W}^{(1)}_{r,:}\bm{x}+\bm{b}^{(1)}_{r}; for neurons rI1r\in\mathcal{I}^{-}_{1}, we have σ(Wr,:(1)x+br(1))=0.\sigma(\mathbf{W}^{(1)}_{r,:}\bm{x}+\bm{b}^{(1)}_{r})=0. For the neurons in category (iii), we propose to use the following linear upper bound and a linear lower bound to replace the ReLU activation σ(y)\sigma(y):

Let dr(1):=ur(1)ur(1)lr(1)\bm{d}^{(1)}_{r}:=\frac{\bm{u}^{(1)}_{r}}{\bm{u}^{(1)}_{r}-\bm{l}^{(1)}_{r}}, we have

To obtain an upper bound and lower bound of fj(x)f_{j}(\bm{x}) with (1), set dr(1)=1\bm{d}^{(1)}_{r}=1 for rI1+r\in\mathcal{I}^{+}_{1}, and we have

where fjL(x)fj(x)fjU(x)f_{j}^{L}(\bm{x})\leq f_{j}(\bm{x})\leq f_{j}^{U}(\bm{x}). To obtain fjU(x)f_{j}^{U}(\bm{x}), we take the upper bound of σ(Wr,:(1)x+br(1))\sigma(\mathbf{W}^{(1)}_{r,:}\bm{x}+\bm{b}^{(1)}_{r}) for rI1,Wj,r(2)>0r\in\mathcal{I}_{1},\mathbf{W}^{(2)}_{j,r}>0 and its lower bound for rI1,Wj,r(2)0r\in\mathcal{I}_{1},\mathbf{W}^{(2)}_{j,r}\leq 0. Both cases share a common term of dr(1)(Wr,:(1)x+br(1))\bm{d}^{(1)}_{r}(\mathbf{W}^{(1)}_{r,:}\bm{x}+\bm{b}^{(1)}_{r}), which is combined into the first summation term in (3) with rI1r\in\mathcal{I}_{1}. Similarly we get the bound for fjL(x)f_{j}^{L}(\bm{x}).

Now, we are ready to state our main theorem,

We first formally define the global upper bound γjU\gamma_{j}^{U} and lower bound γjL\gamma_{j}^{L} of fj(x)f_{j}(\bm{x}), and then obtain Corollary 3.7.

where 1/p+1/q=11/p+1/q=1 and νj,μj+,μj\nu_{j},\mu_{j}^{+},\mu_{j}^{-} are defined as

3.2 Computing pre-ReLU activation bounds

Theorem 3.5 and Corollary 3.7 give us a global lower bound γjL\gamma_{j}^{L} and upper bound γjU\gamma_{j}^{U} of the jj-th neuron at the mm-th layer if we know all the pre-ReLU activation bounds l(k)\bm{l}^{(k)} and u(k)\bm{u}^{(k)}, from layer 11 to m1m-1, as the construction of D(k)\mathbf{D}^{(k)}, H(k)\mathbf{H}^{(k)} and T(k)\mathbf{T}^{(k)} requires l(k)\bm{l}^{(k)} and u(k)\bm{u}^{(k)} (see Definition 3.3). Here, we show how this can be done easily and layer-by-layer. We start from m=1m=1 where A(0)=W(1),fU(x)=fL(x)=A(0)x+b(1)\mathbf{A}^{(0)}=\mathbf{W}^{(1)},f^{U}(\bm{x})=f^{L}(\bm{x})=\mathbf{A}^{(0)}\bm{x}+\bm{b}^{(1)}. Then, we can apply Corollary 3.7 to get the output bounds of each neuron and set them as l(1)\bm{l}^{(1)} and u(1)\bm{u}^{(1)}. Then, we can proceed to m=2m=2 with l(1)\bm{l}^{(1)} and u(1)\bm{u}^{(1)} and compute the output bounds of second layer by Corollary 3.7 and set them as l(2)\bm{l}^{(2)} and u(2)\bm{u}^{(2)}. Repeating this procedure for all m1m-1 layers, we will get all the l(k)\bm{l}^{(k)} and u(k)\bm{u}^{(k)} needed to compute the output range of the mm-th layer.

Note that when computing l(k)\bm{l}^{(k)} and u(k)\bm{u}^{(k)}, the constructed W(k)D(k1)\mathbf{W}^{(k)}\mathbf{D}^{(k-1)} can be saved and reused for bounding the next layer, which facilitates efficient implementations. Moreover, the time complexity of computing the output bounds of an mm-layer ReLU network with Theorem 3.5 and Corollary 3.7 is polynomial time in contrast to the approaches in (Katz et al., 2017) and (Lomuscio & Maganti, 2017) where SMT solvers and MIO solvers have exponential time complexity. The major computation cost is to form A(0)\mathbf{A}^{(0)} for the mm-th layer, which involves multiplications of layer weights in a similar cost of forward propagation. See the “ComputeTwoSideBounds” procedure in Algorithm 1 in Appendix D.

3.3 Deriving maximum certified lower bounds of minimum adversarial distortion

Suppose cc is the predicted class of the input data point x0\bm{x_{0}} and the class is jj. With Theorem 3.5, the maximum possible lower bound for the targeted attacks ϵ~j\widetilde{\epsilon}_{j} and un-targeted attacks ϵ~\widetilde{\epsilon} are

Though it is hard to get analytic forms of γcL(ϵ)\gamma_{c}^{L}(\epsilon) and γjU(ϵ)\gamma_{j}^{U}(\epsilon) in terms of ϵ\epsilon, fortunately, we can still obtain ϵ~j\widetilde{\epsilon}_{j} via a binary search. This is because Corollary 3.7 allows us to efficiently compute the numerical values of γcL(ϵ)\gamma_{c}^{L}(\epsilon) and γjU(ϵ)\gamma_{j}^{U}(\epsilon) given ϵ\epsilon. It is worth noting that we can further improve the bound by considering g(x):=fc(x)fj(x)g(\bm{x}):=f_{c}(\bm{x})-f_{j}(\bm{x}) at the last layer and apply the same procedure to compute the lower bound of g(x)g(\bm{x}) (denoted as γ~L\widetilde{\gamma}^{L}); this can be done easily by redefining the last layer’s weights to be a row vector wˉWc,:(m)Wj,:(m)\bm{\bar{w}}\coloneqq\mathbf{W}^{(m)}_{c,:}-\mathbf{W}^{(m)}_{j,:}. The corresponding maximum possible lower bound for the targeted attacks is ϵ~j=maxϵ  s.t.  γ~L(ϵ)>0\widetilde{\epsilon}_{j}=\max\epsilon\;\text{s.t.}\;\widetilde{\gamma}^{L}(\epsilon)>0. We list our complete algorithm, Fast-Lin, in Appendix D.

3.4 Discussions

We have shown how to derive explicit output bounds of ReLU network (Theorem 3.5) with the proposed linear approximations and obtain analytical certified lower bounds (Corollary 3.7), which is the key of our proposed algorithm Fast-Lin. (Wong & Kolter, 2018) presents a similar algorithmic result on computing certified bounds, but our framework and theirs are entirely different – we use direct computation of layer-wise linear upper/lower bounds in Sec 3.3 with binary search on ϵ\epsilon, while their results is achieved via the lens of dual LP formulation with Newton’s method. Interestingly, when we choose a special set of lower and upper bounds as in (2) and they choose a special dual LP variable in their equation (8), the two different frameworks coincidentally produce the same procedure for computing layer-wise bounds (the “ComputeTwoSideBounds” procedure in Fast-Lin and Algorithm 1 in (Wong & Kolter, 2018)). However, our choice of bounds (2) is due to computation efficiency, while (Wong & Kolter, 2018) gives a quite different justification. We encourage the readers to read Appendix A.3 in their paper on the justifications for this specific selection of dual variables and understand this robustness verification problem from different perspectives.

4 Approach 2 (Fast-Lip): Certified lower bounds via bounding the local Lipschitz constant

(Weng et al., 2018) shows a non-trivial lower bound of minimum adversarial distortion for an input example x0\bm{x_{0}} in targeted attacks is min(g(x0)/Lq,x0j,ϵ)\min\left(g(\bm{x_{0}})/L_{q,x_{0}}^{j},\epsilon\right), where g(x)=fc(x)fj(x),Lq,x0jg(\bm{x})=f_{c}(\bm{x})-f_{j}(\bm{x}),\,L_{q,x_{0}}^{j} is the local Lipschitz constant of g(x)g(\bm{x}) in Bp(x0,ϵ)B_{p}(\bm{x_{0}},\epsilon), j\,j is the target class, cc is the original class, and 1/p+1/q=11/p+1/q=1. For un-targeted attacks, the lower bound can be presented in a similar form. (Weng et al., 2018) uses sampling techniques to estimate the local Lipschitz constant and compute an estimated lower bound without certificates.

Here, we propose a new algorithm to compute a certified lower bound of the minimum adversarial distortion by upper bounding the local Lipschitz constant. To start with, let us rewrite the relations of subsequent layers in the following form: ϕk(x)=Λ(k)(W(k)ϕk1(x)+b(k))\phi_{k}(\bm{x})=\bm{\Lambda}^{(k)}(\mathbf{W}^{(k)}\phi_{k-1}(\bm{x})+\bm{b}^{(k)}), where σ()\sigma(\cdot) is replaced by the diagonal activation pattern matrix Λ(k)\bm{\Lambda}^{(k)} that encodes the status of neurons rr in kk-th layer:

and Λ(m)=Inm\bm{\Lambda}^{(m)}=\bm{I}_{n_{m}}. With a slight abuse of notation, let us define Λa(k)\bm{\Lambda}^{(k)}_{a} as a diagonal activation matrix for neurons in the kk-th layer who are always activated, i.e. the rr-th diagonal is 11 if rIk+r\in\mathcal{I}^{+}_{k} and otherwise, and Λu(k)\bm{\Lambda}^{(k)}_{u} as the diagonal activation matrix for kk-th layer neurons whose status are uncertain, i.e. the rr-th diagonal is 11 or (to be determined) if rIkr\in\mathcal{I}_{k}, and otherwise. Therefore, we have Λ(k)=Λa(k)+Λu(k)\bm{\Lambda}^{(k)}=\bm{\Lambda}^{(k)}_{a}+\bm{\Lambda}^{(k)}_{u}. We can obtain Λ(k)\bm{\Lambda}^{(k)} for xBp(x0,ϵ)\bm{x}\in B_{p}(\bm{x_{0}},\epsilon) by applying Algorithm 1 and check the lower and upper bounds for each neuron rr in layer kk.

The central idea is to compute upper bounds of Lq,x0jL_{q,x_{0}}^{j} by exploiting the three categories of activation patterns in ReLU networks when the allowable inputs are in Bp(x0,ϵ)B_{p}(\bm{x_{0}},\epsilon). Lq,x0jL_{q,\bm{x}_{0}}^{j} can be defined as the maximum norm of directional derivative as shown in (Weng et al., 2018). For the ReLU network, the maximum directional derivative norm can be found by examining all the possible activation patterns and take the one (the worst-case) that results in the largest gradient norm. However, as all possible activation patterns grow exponentially with the number of the neurons, it is impossible to examine all of them in brute-force. Fortunately, computing the worst-case pattern on each element of g(x)\nabla g(\bm{x}) (i.e. [g(x)]k,k[n0][\nabla g(x)]_{k},\,k\in[n_{0}]) is much easier and more efficient. In addition, we apply a simple fact that the maximum norm of a vector (which is g(x),xBp(x0,ϵ)\nabla g(\bm{x}),\bm{x}\in B_{p}(\bm{x_{0}},\epsilon) in our case) is upper bounded by the norm of the maximum value for each components. By computing the worst-case pattern on [g(x)]k[\nabla g(\bm{x})]_{k} and its norm, we can obtain an upper bound of the local Lipschitz constant, which results in a certified lower bound of minimum distortion.

Below, we first show how to derive an upper bound of the Lipschitz constant by computing the worst-case activation pattern on [g(x)]k[\nabla g(\bm{x})]_{k} for 22 layers. Next, we will show how to apply it repeatedly for a general mm-layer network, and the algorithm is named Fast-Lip. Note that for simplicity, we will use [fj(x)]k[\nabla f_{j}(\bm{x})]_{k} to illustrate our derivation; however, it is easy to extend to [g(x)]k[\nabla g(\bm{x})]_{k} as g(x)=fc(x)fj(x)g(\bm{x})=f_{c}(\bm{x})-f_{j}(\bm{x}) by simply replacing last layer weight vector by Wc,:(m)Wj,:(m)\mathbf{W}^{(m)}_{c,:}-\mathbf{W}^{(m)}_{j,:}.

The first term Wj,:(2)Λa(1)W:,k(1)\mathbf{W}^{(2)}_{j,:}\bm{\Lambda}^{(1)}_{a}\mathbf{W}^{(1)}_{:,k} is a constant and all we need to bound is the second term Wj,:(2)Λu(1)W:,k(1)\mathbf{W}^{(2)}_{j,:}\bm{\Lambda}^{(1)}_{u}\mathbf{W}^{(1)}_{:,k}. Let Cj,k(1)=Wj,:(2)Λa(1)W:,k(1),\mathbf{C}^{(1)}_{j,k}=\mathbf{W}^{(2)}_{j,:}\bm{\Lambda}^{(1)}_{a}\mathbf{W}^{(1)}_{:,k},\, Lj,k(1)\mathbf{L}^{(1)}_{j,k} and Uj,k(1)\mathbf{U}^{(1)}_{j,k} be the lower and upper bounds of the second term, we have

For 3 or more layers, we can apply the above 2-layer results recursively, layer-by-layer. For example, for a 3-layer ReLU network,

if we let Y:,k(1)=W(2)Λ(1)W:,k(1)\mathbf{Y}^{(1)}_{:,k}=\mathbf{W}^{(2)}\bm{\Lambda}^{(1)}\mathbf{W}^{(1)}_{:,k}, then [fj(x)]k[\nabla f_{j}(\bm{x})]_{k} is reduced to the following form that is similar to 2 layers:

To obtain the bound in (9), we first need to obtain a lower bound and upper bound of Y:,k(1)\mathbf{Y}^{(1)}_{:,k}, where we can directly apply the 2-layer results to get an upper and an lower bound for each component ii as Ci,k(1)+Li,k(1)Yi,k(1)Ci,k(1)+Ui,k(1)\mathbf{C}^{(1)}_{i,k}+\mathbf{L}^{(1)}_{i,k}\leq\mathbf{Y}^{(1)}_{i,k}\leq\mathbf{C}^{(1)}_{i,k}+\mathbf{U}^{(1)}_{i,k}. Next, the first term Wj,:(3)Λa(2)Y:,k(1)\mathbf{W}^{(3)}_{j,:}\bm{\Lambda}^{(2)}_{a}\mathbf{Y}^{(1)}_{:,k} in (10) can be lower bounded and upper bounded respectively by

whereas the second term Wj,:(3)Λu(2)Y:,k(1)\mathbf{W}^{(3)}_{j,:}\bm{\Lambda}^{(2)}_{u}\mathbf{Y}^{(1)}_{:,k} in (10) is bounded by iPWj,i(3)(Ci,k(1)+Li,k(1))+iQWj,i(3)(Ci,k(1)+Ui,k(1))\sum_{i\in\mathcal{P}}\mathbf{W}^{(3)}_{j,i}(\mathbf{C}^{(1)}_{i,k}+\mathbf{L}^{(1)}_{i,k})+\sum_{i\in\mathcal{Q}}\mathbf{W}^{(3)}_{j,i}(\mathbf{C}^{(1)}_{i,k}+\mathbf{U}^{(1)}_{i,k}) with lower/upper bound index sets PL,QL\mathcal{P}_{L},\mathcal{Q}_{L} and PU,QU\mathcal{P}_{U},\mathcal{Q}_{U}:

Let Cj,k(2)=iI2+Wj,i(3)Ci,k(1)\mathbf{C}^{(2)}_{j,k}=\sum_{i\in\mathcal{I}^{+}_{2}}\mathbf{W}^{(3)}_{j,i}\mathbf{C}^{(1)}_{i,k}, Uj,k(2)+Cj,k(2)\mathbf{U}^{(2)}_{j,k}+\mathbf{C}^{(2)}_{j,k} and Lj,k(2)+Cj,k(2)\mathbf{L}^{(2)}_{j,k}+\mathbf{C}^{(2)}_{j,k} be the upper and lower bound of [fj(x)]k[\nabla f_{j}(\bm{x})]_{k}, we have

Thus, this technique can be used iteratively to solve maxxBp(x0,ϵ)[fj(x)]k\max_{\bm{x}\in B_{p}(\bm{x_{0}},\epsilon)}|[\nabla f_{j}(\bm{x})]_{k}| for a general mm-layer network, and we can easily bound any qq norm of fj(x)\nabla f_{j}(\bm{x}) by the qq norm of the vector of maximum values. For example,

We list our full procedure, Fast-Lip, in Appendix D.

Note that in the 3-layer example, we compute the bounds from right to left, i.e. we first get the bound of W(2)Λ(1)W:,k(1)\mathbf{W}^{(2)}\bm{\Lambda}^{(1)}\mathbf{W}^{(1)}_{:,k}, and then bound Wj,:(3)Λ(2)Y:,k(1)\mathbf{W}^{(3)}_{j,:}\bm{\Lambda}^{(2)}\mathbf{Y}^{(1)}_{:,k}. Similarly, we can compute the bounds from left to right – get the bound of Wj,:(3)Λ(2)W(2)\mathbf{W}^{(3)}_{j,:}\bm{\Lambda}^{(2)}\mathbf{W}^{(2)} first, and then bound Yj,:(2)Λ(1)W:,k(1)\mathbf{Y}^{(2)}_{j,:}\bm{\Lambda}^{(1)}\mathbf{W}^{(1)}_{:,k}, where Yj,:(2)=Wj,:(3)Λ(2)W(2)\mathbf{Y}^{(2)}_{j,:}=\mathbf{W}^{(3)}_{j,:}\bm{\Lambda}^{(2)}\mathbf{W}^{(2)}. Since the dimension of the output layer (nmn_{m}) is typically far less than the dimension of the input vector (n0n_{0}), computing the bounds from left to right is more efficient as the matrix Y\mathbf{Y} has a smaller dimension of nm×nkn_{m}\times n_{k} rather than nk×n0n_{k}\times n_{0}.

Experiments

In this section, we perform extensive experiments to evaluate the performance of our proposed two lower-bound based robustness certificates on networks with different sizes and with different defending techniques during training process. Specifically, we compare our proposed boundshttps://github.com/huanzhang12/CertifiedReLURobustness (Fast-Lin, Fast-Lip) with Linear Programming (LP) based methods (LP, LP-Full), formal verification methods (Reluplex), lower bound by global Lipschitz constant (Op-norm), estimated lower bounds (CLEVER) and attack algorithms (Attacks) for toy networks (2-3 layers with 20 neurons in each layer) and large networks (2-7 layers with 1024 or 2048 neurons in each layer) in Table 1. The evaluation on the effects of defending techniques is presented in Table 2. All bound numbers are the average of 100 random test images with random attack targets, and running time (per image) for all methods is measured on a single CPU core. We include detailed setup of experiments, descriptions of each method, additional experiments and discussions in Appendix LABEL:app:exp (See Tables LABEL:tb:smallnetwork and LABEL:tb:largenetwork_app). The results suggest that our proposed robustness certificates are of high qualities and are computationally efficient even in large networks up to 7 layers or more than 10,000 neurons. In particular, we show that:

Our certified lower bounds (Fast-Lin, Fast-Lip) are close to (gap is only 2-3X) the exact minimum distortion computed by Reluplex for small networks (Reluplex is only feasible for networks with less 100 neurons for MNIST), but our algorithm is more than 10,000 times faster than Reluplex. See Table 1(a) and Table LABEL:tb:smallnetwork.

Our certified lower bounds (Fast-Lin, Fast-Lip) give similar quality (the gap is within 35%, and usually around 10%; sometimes our bounds are even better) compared with the LP-based methods (LP, LP-Full); however, our algorithm is 33 - 14,000 times faster. The LP-based methods are infeasible for networks with more than 4,000 neurons. See Table 1(b) and Table LABEL:tb:largenetwork_app.

When the network goes larger and deeper, our proposed methods can still give non-trivial lower bounds comparing to the upper bounds founded by attack algorithms on large networks. See Table 1(b) and Table LABEL:tb:largenetwork_app.

For defended networks, especially for adversarial training (Madry et al., 2018), our methods give significantly larger bounds, validating the effectiveness of this defending method. Our algorithms can thus be used for evaluating defending techniques. See Table 2.

Conclusions

In this paper we have considered the problem of verifying the robustness property of ReLU networks. By exploiting the special properties of ReLU networks, we have here presented two computational efficient methods Fast-Lin and Fast-Lip for this problem. Our algorithms are two orders of magnitude (or more) faster than LP-based methods, while obtaining solutions with similar quality; meanwhile, our bounds qualities are much better than the previously proposed operator-norm based methods. Additionally, our methods are efficient and easy to implement: we compute the bounds layer-by-layer, and the computation cost for each layer is similar to the cost of matrix products in forward propagation; moreover, we do not need to solve any integer programming, linear programming problems or their duals. Future work could extend our algorithm to handle the structure of convolutional layers and apply our algorithm to evaluate the robustness property of large DNNs such as ResNet on the ImageNet dataset.

Acknowledgment

The authors sincerely thank Aviad Rubinstein for the suggestion of using set-cover to prove hardness. The authors sincerely thank Dana Moshkovitz for pointing out some references about the hardness result of set-cover. The authors would also like to thank Mika Göös, Rasmus Kyng, Zico Kolter, Jelani Nelson, Eric Price, Milan Rubinstein, Jacob Steinhardt, Zhengyu Wang, Eric Wong and David P. Woodruff for useful discussions. Luca Daniel and Tsui-Wei Weng acknowledge the partial support of MIT-Skoltech program and MIT-IBM Watson AI Lab. Huan Zhang and Cho-Jui Hsieh acknowledge the support of NSF via IIS-1719097 and the computing resources provided by Google Cloud and NVIDIA.

References

Appendix A Hardness

In this section we show that finding the minimum adversarial distortion with a certified approximation ratio is hard. We first introduce some basic definitions and theorems in Section A.1. We provide some backgrounds about in-approximability reduction in Section A.2. Section A.3 gives a warmup proof for boolean case and then Section A.4 provides the proof of our main hardness result (for network with real inputs).

We provide some basic definitions and theorems in this section. First, we define the classic 3SAT\mathsf{3SAT} problem.

Given nn variables and mm clauses in a conjunctive normal form CNF\mathsf{CNF} formula with the size of each clause at most 33, the goal is to decide whether there exists an assignment to the nn Boolean variables to make the CNF\mathsf{CNF} formula to be satisfied.

For the 3SAT\mathsf{3SAT} problem in Definition A.1, we introduce the Exponential Time Hypothesis (ETH), which is a common concept in complexity field.

There is a δ>0\delta>0 such that the 3SAT\mathsf{3SAT} problem defined in Definition A.1 cannot be solved in O(2δn)O(2^{\delta n}) time.

ETH had been used in many different problems, e.g. clustering (Ailon et al., 2018; Cohen-Addad et al., 2018), low-rank approximation (Razenshteyn et al., 2016; Song et al., 2017a, b, 2018). For more details, we refer the readers to a survey (Lokshtanov et al., 2013).

Then we define another classical question in complexity theory, the SET\mathsf{SET}-COVER\mathsf{COVER} problem, which we will use in our proof. The exact SET\mathsf{SET}-COVER\mathsf{COVER} problem is one of Karp’s 21 NP-complete problems known to be NP-complete in 1972:

The inputs are U,SU,S; U={1,2,,n}U=\{1,2,\cdots,n\} is a universe, P(U)P(U) is the power set of UU, and S={S1,,Sm}P(U)S=\{S_{1},\cdots,S_{m}\}\subseteq P(U) is a family of subsets, j[m]Sj=U\cup_{j\in[m]}S_{j}=U. The goal is to give a YES/NO answer to the follow decision problem:

Does there exist a set-cover of size tt, i.e., C[m]\exists C\subseteq[m], such that jCSj=U\cup_{j\in C}S_{j}=U with C=t|C|=t?

Alternatively, we can also state the problem as finding the minimum set cover size t0t_{0}, via a binary search on tt using the answers of the decision problem in A.3. The Approximate SET\mathsf{SET}-COVER\mathsf{COVER} problem is defined as follows.

The inputs are U,SU,S; U={1,2,,n}U=\{1,2,\cdots,n\} is a universe, P(U)P(U) is the power set of UU, and S={S1,,Sm}P(U)S=\{S_{1},\cdots,S_{m}\}\subseteq P(U) is a family of subsets, j[m]Sj=U\cup_{j\in[m]}S_{j}=U. The goal is to distinguish between the following two cases: (\@slowromancapi@): There exists a small set-cover, i.e., C[m]\exists C\subseteq[m], such that jCSj=U\cup_{j\in C}S_{j}=U with Ct|C|\leq t. (\@slowromancapii@): Every set-cover is large, i.e., every C[m]C\subseteq[m] with jCSj=U\cup_{j\in C}S_{j}=U satisfies that C>αt|C|>\alpha t, where α>1\alpha>1.

An oracle that solves the Approximate SET\mathsf{SET}-COVER\mathsf{COVER} problem outputs an answer tUt0t_{U}\geq t_{0} but tUαt0t_{U}\leq\alpha t_{0} using a binary search, where tUt_{U} is an upper bound of t0t_{0} with a guaranteed approximation ratio α\alpha. For example, we can use a greedy (rather than exact) algorithm to solve the SET\mathsf{SET}-COVER\mathsf{COVER} problem, which cannot always find the smallest size of set cover t0t_{0}, but the size tUt_{U} given by the greedy algorithm is at most α\alpha times as large as t0t_{0}.

In our setting, we want to investigate the hardness of finding the lower bound with a guaranteed approximation ration, but an approximate algorithm for SET\mathsf{SET}-COVER\mathsf{COVER} gives us an upper bound of t0t_{0} instead of an lower bound of t0t_{0}. However, in the following proposition, we show that finding an lower bound with an approximation ratio of α\alpha is as hard as finding an upper bound with an approximation ratio of α\alpha.

Finding a lower bound tLt_{L} for the size of the minimal set-cover (that has size t0t_{0}) with an approximation ratio α\alpha is as hard as finding an upper bound tUt_{U} with an approximation ratio α\alpha.

If we find a lower bound tLt_{L} with t0αtLt0\frac{t_{0}}{\alpha}\leq t_{L}\leq t_{0}, by multiplying both sides by α\alpha, we also find an upper bound tU=αtLt_{U}=\alpha t_{L} which satisfies that t0tUαt0t_{0}\leq t_{U}\leq\alpha t_{0}. So finding an lower bound with an approximation ratio α\alpha is at least as hard as finding an upper bound with an approximation ratio α\alpha. The converse is also true. ∎

SET\mathsf{SET}-COVER\mathsf{COVER} is a well-studied problem in the literature. Here we introduce a theorem from (Raz & Safra, 1997; Alon et al., 2006; Dinur & Steurer, 2014) which implies the hardness of approximating SET\mathsf{SET}-COVER\mathsf{COVER}.

Unless NP=P\mathsf{NP}=\mathsf{P}, there is no polynomial time algorithm that gives a (1o(1))lnn(1-o(1))\ln n-approximation to SET\mathsf{SET}-COVER\mathsf{COVER} problem with universe size nn.

We now formally define our neural network robustness verification problems.

Does there exist a yy with xy1r\|x-y\|_{1}\leq r such that F(y)>0F(y)>0?

With an oracle of the decision problem available, we can figure out the smallest rr (defined as r0r_{0}) such that there exists a vector yy with xy1r\|x-y\|_{1}\leq r and F(y)>0F(y)>0 via a binary search.

Given an nn hidden nodes ReLU neural network F(x):{0,1}d{0,1}F(x):\{0,1\}^{d}\rightarrow\{0,1\} where weights are all fixed, for a query input vector x{0,1}dx\in\{0,1\}^{d} with F(x)=0F(x)=0. The goal is to give a YES/NO answer to the following decision problem:

Does there exist a yy with xy1r\|x-y\|_{1}\leq r such that F(y)=1F(y)=1?

Then, we define the approximate version of our neural network robustness verification problems.

Given an nn hidden nodes ReLU neural network F(x):{0,1}d{0,1}F(x):\{0,1\}^{d}\rightarrow\{0,1\} where weights are all fixed, for a query input vector x{0,1}dx\in\{0,1\}^{d} with F(x)=0F(x)=0. The goal is to distinguish the following two cases : (\@slowromancapi@): There exists a point yy such that xy1r\|x-y\|_{1}\leq r and F(y)=1F(y)=1. (\@slowromancapii@): For all yy satisfies xy1αr\|x-y\|_{1}\leq\alpha r, the F(y)=0F(y)=0, where α>1\alpha>1.

A.2 Background of the PCP theorem

The famous Probabilistically Checkable Proofs (PCP) theorem is the cornerstone of the theory of computational hardness of approximation, which investigates the inherent difficulty in designing efficient approximation algorithms for various optimization problems.https://en.wikipedia.org/wiki/PCP_theorem The formal definition can be stated as follows,

Given a SAT\mathsf{SAT} formula ϕ\phi of size nn we can in time polynomial in nn construct a set of MM tests satisfying the following: (\@slowromancapi@) : Each test queries a constant number dd of bits from a proof, and based on the outcome of the queries it either acceptes or reject ϕ\phi. (\@slowromancapii@) : (Yes Case / Completeness) If ϕ\phi is satisfiable, then there exists a proof so that all tests accept ϕ\phi. (\@slowromancapiii@) : (No Case / Soundness) If ϕ\phi is not satifiable, then no proof will cause more than M/2M/2 tests to accept ϕ\phi.

Note that PCP kind of reduction is slightly different from NP reduction, for more examples (e.g. maximum edge biclique, sparsest cut) about how to use PCP theorem to prove inapproximibility results, we refer the readers to (Ambühl et al., 2011).

A.3 Warm-up

Consider a set-cover instance, let SS denote a set of sets {S1,S2,,Sd}\{S_{1},S_{2},\cdots,S_{d}\} where sj[n],j[d]s_{j}\subseteq[n],\forall j\in[d].

For each set SjS_{j} we create an input node uju_{j}. For each element i[n]i\in[n], we create a hidden node viv_{i}. For each i[n]i\in[n] and j[d]j\in[d], if iSji\in S_{j}, then we connect uju_{j} and viv_{i}. We also create an output node ww, for each i[n]i\in[n], we connect node viv_{i} and node ww.

Since uj{0,1}u_{j}\in\{0,1\}, ϕi\phi_{i} can be implemented in this way using ReLU activations:

Note that j=1d1iSj=j=1duj\sum_{j=1}^{d}{\bf 1}_{i\in S_{j}}=\sum_{j=1}^{d}u_{j}, because uj=1u_{j}=1 indicates choosing set SjS_{j} and uj=0u_{j}=0 otherwise.

For final output node ww, we define an activation function ψ\psi satisfies that

Since vi[n]v_{i}\in[n], ψ\psi can be implemented as

We use vector xx to denote {0}d\{0\}^{d} vector and it is to easy to see that F(x)=0F(x)=0. Let α>1\alpha>1 denote a fixed parameter. Also, we have F(y)>0F(y)>0 if and only if C={jyj=1}C=\{j|y_{j}=1\} is a set-cover. According to our construction, we can have the following two claims,

If there exists a set-cover C[d]C\subseteq[d] with jCSj=[n]\cup_{j\in C}S_{j}=[n] and Cr|C|\leq r, then there exists a point y{0,1}dy\in\{0,1\}^{d} such that xy1r\|x-y\|_{1}\leq r and F(y)>0F(y)>0.

If for every C[d]C\subseteq[d] with jCSj=U\cup_{j\in C}S_{j}=U satisfies that C>αt|C|>\alpha\cdot t, then for all y{0,1}dy\in\{0,1\}^{d} satisfies that xy1αr\|x-y\|_{1}\leq\alpha r, F(y)0F(y)\leq 0 holds.

Therefore, using Theorem A.11, Theorem A.6, Claim A.13 and Claim A.14 completes the proof. ∎

A.4 Main result

Consider a set-cover instance, let SS denote a set of sets {S1,S2,,Sd}\{S_{1},S_{2},\cdots,S_{d}\} where Sj[n],j[d]S_{j}\subseteq[n],\forall j\in[d]. For each set SjS_{j} we create an input node uju_{j}. For each j[d]j\in[d], we create a hidden node tjt_{j} and connect uju_{j} and tjt_{j}.

For each element i[n]i\in[n], we create a hidden node viv_{i}. For each i[n]i\in[n] and j[d]j\in[d], if iSji\in S_{j}, then we connect uju_{j} and viv_{i}. Finally, we create an output node ww and for each i[n]i\in[n], we connect node viv_{i} and node ww.

Let δ=1/d\delta=1/d. For each j[n]j\in[n], we apply an activation function ϕ1,j\phi_{1,j} on tjt_{j} such that

This can be implemented in the following way,

For the final output node ww, we define it as

We use vector xx to denote {0}d\{0\}^{d} vector and it is to easy to see that F(x)=δ<0F(x)=-\delta<0. Let α>1\alpha>1 denote a fixed parameter.

According to our construction, we can have the following two claims.

If there exists a set-cover C[d]C\subseteq[d] with jCSj=[n]\cup_{j\in C}S_{j}=[n] and Cr|C|\leq r, then there exists a point ydy\in^{d} such that xy1r\|x-y\|_{1}\leq r and F(y)>0F(y)>0.

Without loss of generality, we let the set cover to be {S1,S2,...,Sr}\{S_{1},S_{2},...,S_{r}\}. Let y1=y2==yr=1y_{1}=y_{2}=\cdots=y_{r}=1 and yr+1=yr+2=...=yd=0.y_{r+1}=y_{r+2}=...=y_{d}=0. By the definition of tjt_{j}, we have t1=t2==tr=δ.t_{1}=t_{2}=\cdots=t_{r}=\delta. Since {S1,S2,,Sr}\{S_{1},S_{2},\cdots,S_{r}\} is a set-cover, we know that vi=δv_{i}=\delta for all i[n]i\in[n]. Then F(y)=w=mini[n]vi=δ>0.F(y)=w=\min_{i\in[n]}v_{i}=\delta>0. Since we also have y1=r,\|y\|_{1}=r, the adversarial point is found. ∎

If for every C[d]C\subseteq[d] with jCSj=U\cup_{j\in C}S_{j}=U satisfies that C>αr|C|>\alpha\cdot r, then for all ydy\in^{d} satisfies that xy1αr(11/d)\|x-y\|_{1}\leq\alpha r(1-1/d), F(y)0F(y)\leq 0 holds.

Proof by contradiction. We assume that there exists yy such that F(y)>0F(y)>0 and y1αr(11/d)\|y\|_{1}\leq\alpha r(1-1/d). Since F(y)>0F(y)>0, we have for all ii, vi>0v_{i}>0. Thus there exists jTij\in T_{i} such that tj>0t_{j}>0. Let π:[n]Q\pi:[n]\rightarrow Q denote a mapping (Q[d]Q\subseteq[d] will be decided later). This means that for each i[n]i\in[n], there exists jTij\in T_{i}, such that 1δ<yj11-\delta<y_{j}\leq 1, and we let π(i)\pi(i) denote that jj.

Since j[d]yj=y1αr(11/d)\sum_{j\in[d]}|y_{j}|=\|y\|_{1}\leq\alpha r(1-1/d), we have

where the first step follows by Qd|Q|\leq d.

Because for all jQj\in Q, yj>1δ=11/d|y_{j}|>1-\delta=1-1/d, we have

So {Sj}jQ\{S_{j}\}_{j\in Q} is a set-cover with size less than or equal to αr\alpha\cdot r, which is a contradiction. ∎

Therefore, using Theorem A.11, Theorem A.6, Claim A.16 and Claim A.17 completes the proof. ∎

By making a stronger assumption of ETH\mathsf{ETH}, we can have the following stronger result which excludes all 2o(nc)2^{o(n^{c})} time algorithms, where c>0c>0 is some fixed constant:

Assuming Exponential Time Hypothesis (ETH\mathsf{ETH}, see Hypothesis A.2), there is no 2o(nc)2^{o(n^{c})} time algorithm that gives a (1o(1))lnn(1-o(1))\ln n-approximation to ROBUST\mathsf{ROBUST}-NET\mathsf{NET} problem with nn hidden nodes, where c>0c>0 is some fixed constant.

It follows by the construction in Theorem A.15 and (Moshkovitz, 2012a, b). ∎

Note that in (Moshkovitz, 2012a), an additional conjecture, Projection Games Conjecture (PGC\mathsf{PGC}) is required for the proof, but the result was improved in (Moshkovitz, 2012b) and PGC\mathsf{PGC} is not a requirement any more.

Appendix B Proof of Theorem 3.5

For a mm-layer ReLU network, assume we know all the pre-ReLU activation bounds l(k)\bm{l}^{(k)} and u(k)\bm{u}^{(k)}, k[m1]\forall k\in[m-1] for a mm-layer ReLU network and we want to compute the bounds of the the jj th output at mm th layer.

For neurons belonging to category (i), i.e., kIm1+k\in\mathcal{I}^{+}_{m-1},

For neurons belonging to category (ii), i.e., kIm1k\in\mathcal{I}^{-}_{m-1},

Finally, for neurons belonging to Category (iii), i.e., kIm1k\in\mathcal{I}_{m-1}, we bound their outputs. If we adopt the linear upper and lower bounds in (1) and let dk(m1):=uk(m1)uk(m1)lk(m1)\bm{d}^{(m-1)}_{k}:=\frac{\bm{u}^{(m-1)}_{k}}{\bm{u}^{(m-1)}_{k}-\bm{l}^{(m-1)}_{k}}, we have

The goal of this section is to prove Lemma B.1.

Notice that (18) can be used to construct an upper bound and lower bound of fj(x)f_{j}(\bm{x}) by considering the signs of the weights Wj,k(m)\mathbf{W}^{(m)}_{j,k}. Let fjU,m1(x)f_{j}^{U,m-1}(\bm{x}) be an upper bound of fj(x)f_{j}(\bm{x}); fjU,m1(x)f_{j}^{U,m-1}(\bm{x}) can be constructed by taking the right-hand-side (RHS) of (18) if Wj,k(m)>0\mathbf{W}^{(m)}_{j,k}>0 and taking the left-hand-side (LHS) of (18) if Wj,k(m)<0\mathbf{W}^{(m)}_{j,k}<0:

where we set dk(m1)=1\bm{d}^{(m-1)}_{k}=1 for kIm1+k\in\mathcal{I}^{+}_{m-1} and set dk(m1)=0\bm{d}^{(m-1)}_{k}=0 for kIm1k\in\mathcal{I}^{-}_{m-1} from (19) to (20) and collect the constant terms (independent of x\bm{x}) in the parenthesis from (20) to (21).

If we let A(m1)=W(m)D(m1)\mathbf{A}^{(m-1)}=\mathbf{W}^{(m)}\mathbf{D}^{(m-1)}, where D(m1)\mathbf{D}^{(m-1)} is a diagonal matrix with diagonals being dk(m1)\bm{d}^{(m-1)}_{k}, then we can rewrite fjU,m1(x)f_{j}^{U,m-1}(\bm{x}) into the following:

From (21) to (22), we rewrite the summation terms in the parenthesis into matrix-vector multiplications and for each j[nm]j\in[n_{m}] let

since 0dk(m1)10\leq\bm{d}^{(m-1)}_{k}\leq 1, Wj,k(m)>0\mathbf{W}^{(m)}_{j,k}>0 is equivalent to Aj,k(m1)>0\mathbf{A}^{(m-1)}_{j,k}>0.

From (22) to (23), we simply write out the inner product Wk,:(m1)ϕm2(x)\mathbf{W}^{(m-1)}_{k,:}\phi_{m-2}(\bm{x}) into a summation form, and from (23) to (24), we exchange the summation order of kk and rr. From (24) to (25), we let

and now we have (25) in the same form as (15).

Indeed, in (15), the running index is kk and we are looking at the mm th layer, with weights Wj,k(m)\mathbf{W}^{(m)}_{j,k}, activation functions ϕm1(x)\phi_{m-1}(\bm{x}) and bias term bj(m)\bm{b}^{(m)}_{j}; in (25), the running index is rr and we are looking at the m1m-1 th layer with equivalent weights W~j,r(m1)\widetilde{\bm{W}}^{(m-1)}_{j,r}, activation functions ϕm2(x)\phi_{m-2}(\bm{x}) and equivalent bias b~j(m1)\widetilde{\bm{b}}^{(m-1)}_{j}. Thus, we can use the same technique from (15) to (25) and obtain an upper bound on the fjU,m1(x)f_{j}^{U,m-1}(\bm{x}) and repeat this procedure until obtaining fjU,1(x)f_{j}^{U,1}(\bm{x}), where

Let the final upper bound fjU(x)=fjU,1(x)f_{j}^{U}(\bm{x})=f_{j}^{U,1}(\bm{x}), and now we have

where fjU(x)=[fU(x)]jf^{U}_{j}(\bm{x})=[f^{U}(\bm{x})]_{j},

B.2 Lower bound

The goal of this section is to prove Lemma B.2.

Similar to deriving the upper bound of fj(x)f_{j}(\bm{x}), we consider the signs of the weights Wj,k(m)\mathbf{W}^{(m)}_{j,k} to derive the lower bound. Let fjL,m1(x)f_{j}^{L,m-1}(\bm{x}) be a lower bound of fj(x)f_{j}(\bm{x}); fjL,m1(x)f_{j}^{L,m-1}(\bm{x}) can be constructed by taking the right-hand-side (RHS) of (18) if Wj,k(m)<0\mathbf{W}^{(m)}_{j,k}<0 and taking the left-hand-side (LHS) of (18) if Wj,k(m)>0\mathbf{W}^{(m)}_{j,k}>0. Following the procedure in (19) to (25) (except that now the additional bias term is from the set kIm1,Wj,k(m)<0k\in\mathcal{I}_{m-1},\mathbf{W}^{(m)}_{j,k}<0), the lower bound is similar to the upper bound we have derived but but replace T(m1)\mathbf{T}^{(m-1)} by H(m1)\mathbf{H}^{(m-1)}, where for each j[nm]j\in[n_{m}],

It is because the linear upper and lower bounds in (1) has the same slope uul\frac{u}{u-l} on both sides (i.e. σ(y)\sigma(y) is bounded by two lines with the same slope but different intercept), which gives the same A\mathbf{A} matrix and D\mathbf{D} matrix in computing the upper bound and lower bound of fj(x)f_{j}(\bm{x}). This is the key to facilitate a faster computation under this linear approximation (1). Thus, the lower bound for fj(x)f_{j}(\bm{x}) is:

where fjL(x)=[fL(x)]jf^{L}_{j}(\bm{x})=[f^{L}(\bm{x})]_{j},

Appendix C Proof of Corollary 3.7

By Theorem 3.5, for xBp(x0,ϵ)\bm{x}\in B_{p}(\bm{x_{0}},\epsilon), we have fjL(x)fj(x)fjU(x)f_{j}^{L}(\bm{x})\leq f_{j}(\bm{x})\leq f_{j}^{U}(\bm{x}). Thus,

Since fjU(x)=Aj,:(0)x+bj(m)+k=1m1Aj,:(k)(b(k)T:,j(k))f_{j}^{U}(\bm{x})=\mathbf{A}^{(0)}_{j,:}\bm{x}+\bm{b}^{(m)}_{j}+\sum_{k=1}^{m-1}\mathbf{A}^{(k)}_{j,:}(\bm{b}^{(k)}-\mathbf{T}^{(k)}_{:,j}),

From (30) to (31), we do a transformation of variable y:=xx0ϵ\bm{y}:=\frac{\bm{x}-\bm{x_{0}}}{\epsilon} and therefore yBp(0,1)\bm{y}\in B_{p}(\bm{0},1). By the definition of dual norm \|\cdot\|_{*}:

Again, from (33) to (34), we simply replace (maxyBp(0,1)Aj,:(0)y)\left(\max_{\bm{y}\in B_{p}(\bm{0},1)}-\mathbf{A}^{(0)}_{j,:}\bm{y}\right) by Aj,:(0)q=Aj,:(0)q\|-\mathbf{A}^{(0)}_{j,:}\|_{q}=\|\mathbf{A}^{(0)}_{j,:}\|_{q}. Thus, if we use νj\nu_{j} to denote the common term Aj,:(0)x0+bj(m)+k=1m1Aj,:(k)b(k)\mathbf{A}^{(0)}_{j,:}\bm{x_{0}}+\bm{b}^{(m)}_{j}+\sum_{k=1}^{m-1}\mathbf{A}^{(k)}_{j,:}\bm{b}^{(k)}, we have

Appendix D Algorithms

We present our full algorithms, Fast-Lin in Algorithm 1 and Fast-Lip in Algorithm LABEL:alg:fast-lip.