Improved Sample Complexities for Deep Networks and Robust Classification via an All-Layer Margin

Colin Wei, Tengyu Ma

Introduction

The most popular classification objectives for deep learning, such as cross entropy loss, encourage a larger output margin – the gap between predictions on the true label and and next most confident label. These objectives have been popular long before deep learning was prevalent, and there is a long line of work showing they enjoy strong statistical guarantees for linear and kernel methods (Bartlett and Mendelson, 2002; Koltchinskii et al., 2002; Hofmann et al., 2008; Kakade et al., 2009). These guarantees have been used to explain the successes of popular algorithms such as SVM (Boser et al., 1992; Cortes and Vapnik, 1995).

For linear classifiers, the relationship between output margin and generalization is simple and direct – generalization error is controlled by the output margins normalized by the classifier norm. Concretely, suppose we have nn training data points each with norm 1, and let γi\gamma_{i} be the output margin on the ii-th example. With high probability, if the classifier perfectly fits the training data, we obtainThis is a stronger version of the classical textbook bound which involves the min margin on the training examples. We present this stronger version because it motivates our work better. It can be derived from the results of Srebro et al. (2010).

For deeper models, the relationship between output margin and generalization is unfortunately less clear and interpretable. Known bounds for deep nets normalize the output margin by a quantity that either scales exponentially in depth or depends on complex properties of the network (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2017b; Golowich et al., 2017; Nagarajan and Kolter, 2019; Wei and Ma, 2019). This is evidently more complicated than the linear case in (1.1). These complications arise because for deep nets, it is unclear how to properly normalize the output margin. In this work, we remedy this issue by proposing a new notion of margin, called "all-layer margin", which we use to obtain simple guarantees like (1.1) for deep models. Let mim_{i} be the all-layer margin for the ii-th example. We can simply normalize it by the sum of the complexities of the weights (often measured by the norms or the covering number) and obtain a bound of the following form:

As the name suggests, the all-layer margin considers all layers of the network simultaneously, unlike the output margin which only considers the last layer. We note that the definition of the all-layer margin is the key insight for deriving (1.2) – given the definition, the rest of the proof follows naturally with standard tools. (Please see equation (2.2) for a formal definition of the margin, and Theorem 2.3 for a formal version of bound (1.2).) To further highlight the good statistical properties of the all-layer margin, we present three of its concrete applications in this paper.

1. By relating all-layer margin to output margin and other quantities, we obtain improved generalization bounds for neural nets. In Section 3, we derive an analytic lower bound on the all-layer margin for neural nets with smooth activations which depends on the output margin normalized by other data-dependent quantities. By substituting this lower bound into (1.2), we obtain a generalization bound in Theorem 3.1 which avoids the exponential depth dependency and has tighter data-dependent guarantees than (Nagarajan and Kolter, 2019; Wei and Ma, 2019) in several ways. First, their bounds use the same normalizing quantity for each example, whereas our bounds are tighter and more natural because we use a different normalizer for each training example – the local Lipschitzness for that particular example. Second, our bound depends on the empirical distribution of some complexity measure computed for each training example. When these complexities are small for each training example, we can obtain convergence rates faster than 1/n1/\sqrt{n}. We provide a more thorough comparison to prior work in Section 3.

Furthermore, for relu networks, we give a tighter generalization bound which removes the dependency on inverse pre-activations suffered by (Nagarajan and Kolter, 2019), which they showed to be large empirically (see Section B.1). The techniques of (Wei and Ma, 2019) could not remove this dependency because they relied on smooth activations.

2. We extend our tools to give generalization bounds for adversarially robust classification error which are analogous to our bounds in the standard setting. In Section 4, we provide a natural extension of our all-layer margin to adversarially robust classification. This allows us to translate our neural net generalization bound, Theorem 3.1, directly to adversarially robust classification (see Theorem 4.1). The resulting bound takes a very similar form as our generalization bound for clean accuracy – it simply replaces the data-dependent quantities in Theorem 3.1 with their worst-case values in the adversarial neighborhood of the training example. As a result, it also avoids explicit exponential dependencies on depth. As our bound is the first direct analysis of the robust test error, it presents a marked improvement over existing work which analyze loose relaxations of the adversarial error (Khim and Loh, 2018; Yin et al., 2018). Finally, our analysis of generalization for the clean setting translates directly to the adversarial setting with almost no additional steps. This is an additional advantage of our all-layer margin definition.

3. We design a training algorithm that encourages a larger all-layer margin and demonstrate that it improves empirical performance over strong baselines. In Section 5, we apply our regularizer to WideResNet models (Zagoruyko and Komodakis, 2016) trained on the CIFAR datasets and demonstrate improved generalization performance for both clean and adversarially robust classification. We hope that these promising empirical results can inspire researchers to develop new methods for optimizing the all-layer margin and related quantities. Our code is available online at the following link: https://github.com/cwein3/all-layer-margin-opt.

Zhang et al. (2016); Neyshabur et al. (2017a) note that deep learning often exhibits statistical properties that are counterintuitive to conventional wisdom. This has prompted a variety of new perspectives for studying generalization in deep learning, such as implicit or algorithmic regularization (Gunasekar et al., 2017; Li et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018), new analyses of interpolating classifiers (Belkin et al., 2018; Liang and Rakhlin, 2018; Hastie et al., 2019; Bartlett et al., 2019), and the noise and stability of SGD (Hardt et al., 2015; Keskar et al., 2016; Chaudhari et al., 2016). In this work, we adopt a different perspective for analyzing generalization by studying a novel definition of margin for deep models which differs from the well-studied notion of output margin. We hope that our generalization bounds can inspire the design of new regularizers tailored towards deep learning. Classical results have bounded generalization error in terms of the model’s output margin and the complexity of its prediction (Bartlett and Mendelson, 2002; Koltchinskii et al., 2002), but for deep models this complexity grows exponentially in depth (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2017b; Golowich et al., 2017). Recently, Nagarajan and Kolter (2019); Wei and Ma (2019) derived complexity measures in terms of hidden layer and Jacobian norms which avoid the exponential dependence on depth, but their proofs require complicated techniques for controlling the complexity of the output margin. Li et al. (2018) also derive Jacobian-based generalization bounds.

Dziugaite and Roy (2017) directly optimize a PAC-Bayes bound based on distance from initialization and stability to random perturbations, obtaining a nonvacuous bound on generalization error of a neural net. Neyshabur et al. (2017a); Arora et al. (2018) provide complexity measures related to the data-dependent stability of the network, but the resulting bounds only apply to a randomized or compressed version of the original classifier. We provide a simple framework which derives such bounds for the original classifier. Long and Sedghi (2019) prove bounds for convolutional networks which scale with the number of parameters and their distance from initialization. Novak et al. (2018); Javadi et al. (2019) study stability-related complexity measures empirically.

Keskar et al. (2016); Chaudhari et al. (2016); Yao et al. (2018); Jastrzebski et al. (2018) study the relationship between generalization and “sharpness” of local minima in the context of small v.s. large batch SGD. They propose measures of sharpness related to the second derivatives of the loss with respect to the model parameters. Our all-layer margin can also be viewed as a notion of “sharpness” of the local minima around the model, as it measures the stability of the model’s prediction.

A recent line of work establishes rigorous equivalences between logistic loss and output margin maximization. Soudry et al. (2018); Ji and Telgarsky (2018) show that gradient descent implicitly maximizes the margin for linearly separable data, and Lyu and Li (2019) prove gradient descent converges to a stationary point of the max-margin formulation for deep homogeneous networks. Other works show global minimizers of regularized logistic loss are equivalent to margin maximizers, in linear cases (Rosset et al., 2004) and for deep networks (Wei et al., 2018; Nacson et al., 2019). A number of empirical works also suggest alternatives to the logistic loss which optimize variants of the output margin (Sun et al., 2014; Wen et al., 2016; Liu, ; Liang et al., 2017; Cao et al., 2019). The neural net margin at intermediate and input layers has also been studied. Elsayed et al. (2018) design an algorithm to maximize a notion of margin at intermediate layers of the network, and Jiang et al. (2018) demonstrate that the generalization gap of popular architectures can empirically be predicted using statistics of intermediate margin distributions. Verma et al. (2018) propose a regularization technique which they empirically show improves the structure of the decision boundary at intermediate layers. Yan et al. (2019) use adversarial perturbations to input to optimize a notion of margin in the input space, and Xie et al. (2019) demonstrate that using adversarial examples as data augmentation can improve clean generalization performance. Sokolić et al. (2017) provide generalization bounds based on the input margin of the neural net, but these bounds depend exponentially on the dimension of the data manifold. These papers study margins defined for individual network layers, whereas our all-layer margin simultaneously considers all layers. This distinction is crucial for deriving our statistical guarantees.

A number of recent works provide negative results for adversarially robust generalization (Tsipras et al., 2018; Montasser et al., 2019; Yin et al., 2018; Raghunathan et al., 2019). We provide positive results stating that adversarial test accuracy can be good if the adversarial all-layer margin is large on the training data. Schmidt et al. (2018) demonstrate that more data may be required for generalization on adversarial inputs than on clean data. Montasser et al. (2019) provide impossiblity results for robust PAC learning with proper learning rules, even for finite VC dimension hypothesis classes. Zhang et al. (2019) consider the trade-off between the robust error and clean error. Farnia et al. (2018) analyze generalization for specific adversarial attacks and obtain bounds depending exponentially on depth. Yin et al. (2018); Khim and Loh (2018) give adversarially robust generalization bounds by upper bounding the robust loss via a transformed/relaxed loss function, and the bounds depend on the product of weight matrix norms. Yin et al. (2018) also show that the product of norms is inevitable if we go through the standard tools of Rademacher complexity and the output margin. Our adversarial all-layer margin circumvents this lower bound because it considers all layers of the network rather than just the output.

2 Notation

Warmup: Simplified All-Layer Margin and its Guarantees

For shallow models such as linear and kernel methods, the output margin maximization objective enjoys good statistical guarantees (Kakade et al., 2009; Hofmann et al., 2008). For deep networks, the statistical properties of this objective are less clear: until recently, statistical guarantees depending on the output margin also suffered an exponential dependency on depth (Bartlett et al., 2017; Neyshabur et al., 2017b). Recent work removed these dependencies but require technically involved proofs and result in complicated bounds depending on numerous properties of the training data (Nagarajan and Kolter, 2019; Wei and Ma, 2019).

In this section, we introduce a new objective with better statistical guarantees for deep models (Theorem 2.3) and outline the steps for proving these guarantees. Our objective is based on maximizing a notion of margin which measures the stability of a classifier to simultaneous perturbations at all layers. We first rigorously define a generalized margin function as follows:

To motivate our notion of margin, we observe that the following generalization bound holds for any generalized margin function with low complexity.

For a function class F\mathcal{F}, let G={gF:FF}\mathcal{G}=\{g_{F}:F\in\mathcal{F}\} be a class of generalized margin functions indexed by FFF\in\mathcal{F}. Suppose that the covering number of G\mathcal{G} with respect to \infty-norm scales as logN(ϵ,G)C2(G)ϵ2\log\mathcal{N}_{\infty}(\epsilon,\mathcal{G})\leq\lfloor\frac{\mathcal{C}_{\infty}^{2}(\mathcal{G})}{\epsilon^{2}}\rfloor for some complexity C(G)\mathcal{C}_{\infty}(\mathcal{G}). Then with probability 1δ1-\delta over the draw of the training data, all classifiers FFF\in\mathcal{F} which achieve training error 0 satisfy

where ζO(log(1/δ)+lognn)\zeta\triangleq O\left(\frac{\log(1/\delta)+\log n}{n}\right) is a low-order term.

The proof of Lemma 2.2 mainly follows standard proofs of margin-based generalization bounds. The main novelty is in achieving a generalization bound depending on the average inverse margin instead of largest inverse margin. To obtain this bound, we apply a generalization result of Srebro et al. (2010) to a smooth surrogate loss which bounds the 0-1 loss. More details are provided in Section A.

Note that Lemma 2.2 applies to any margin function gg, and motivates defining margin functions with low complexity. We can define the all-layer margin as follows. Suppose that the classifier F(x)=fkf1(x)F(x)=f_{k}\circ\cdots\circ f_{1}(x) is computed by composing kk functions fk,,f1f_{k},\ldots,f_{1}, and let δk,,δ1\delta_{k},\ldots,\delta_{1} denote perturbations intended to be applied at each hidden layer. We recursively define the perturbed network output F(x,δ1,,δk)F(x,\delta_{1},\ldots,\delta_{k}) by

The all-layer margin will now be defined as the minimum norm of δ\delta required to make the classifier misclassify the input. Formally, for classifier FF, input xx, and label yy, we define

Note that mFm_{F} is strictly positive if and only if the unperturbed prediction F(x)F(x) is correct, so mFm_{F} is a margin function according to Definition 2.1. Here multiplying δi\delta_{i} by the previous layer norm hi1(x,δ)2\|h_{i-1}(x,\delta)\|_{2} is important and intuitively balances the relative scale of the perturbations at each layer. We note that the definition above is simplified to convey the main intuition behind our results – to obtain the tightest possible bounds, in Sections 3 and 4, we use the slightly more general mFm_{F} defined in Section A.

Prior works have studied, both empirically and theoretically, the margin of a network with respect to single perturbations at an intermediate or input layer (Sokolić et al., 2017; Novak et al., 2018; Elsayed et al., 2018; Jiang et al., 2018). Our all-layer margin is better tailored towards handling the compositionality of deep networks because it considers simultaneous perturbations to all layers, which is crucial for achieving its statistical guarantees.

Formally, let F{fkf1:fiFi}\mathcal{F}\triangleq\{f_{k}\circ\cdots\circ f_{1}:f_{i}\in\mathcal{F}_{i}\} be the class of compositions of functions from function classes F1,,Fk\mathcal{F}_{1},\ldots,\mathcal{F}_{k}. We bound the population classification error for FFF\in\mathcal{F} based on the distribution of mFm_{F} on the training data and the sum of the complexities of each layer, measured via covering numbers. For simplicity, we assume the covering number of each layer scales as logNop(ϵ,Fi)Ci2/ϵ2\log\mathcal{N}_{\|\cdot\|_{\textup{op}}}(\epsilon,\mathcal{F}_{i})\leq\lfloor\mathcal{C}_{i}^{2}/\epsilon^{2}\rfloor for some complexity Ci\mathcal{C}_{i}, which is common for many function classes.

In the above setting, with probability 1δ1-\delta over the draw of the training data, all classifiers FFF\in\mathcal{F} which achieve training error 0 satisfy

where ζO(log(1/δ)+lognn)\zeta\triangleq O\left(\frac{\log(1/\delta)+\log n}{n}\right) is a low-order term.

In other words, generalization is controlled by the sum of the complexities of the layers and the quadratic mean of 1/mF1/m_{F} on the training set. Theorem A.2 generalizes this statement to provide bounds which depend on the qq-th moment of 1/mF1/m_{F} for any integer q>0q>0 and converge at rates faster than 1/n1/\sqrt{n}. For neural nets, Ci\mathcal{C}_{i} scales with weight matrix norms and 1/mF1/m_{F} can be upper bounded by a polynomial in the Jacobian and hidden layer norms and output margin, allowing us to avoid an exponential dependency on depth when these quantities are well-behaved on the training data.

The proof of Theorem 2.3 follows by showing that mFm_{F} has low complexity which scales with the sum of the complexities at each layer, as shown in the following lemma. We can then apply the generalization bound in Lemma 2.2 to mFm_{F}.

Let mF={mF:FF}m\circ\mathcal{F}=\{m_{F}:F\in\mathcal{F}\} denote the family of all-layer margins of function compositions in F\mathcal{F}. Then

The covering number of an individual layer commonly scales as logNop(ϵi,Fi)Ci2/ϵi2\log\mathcal{N}_{\|\cdot\|_{\textup{op}}}(\epsilon_{i},\mathcal{F}_{i})\leq\lfloor\mathcal{C}_{i}^{2}/\epsilon_{i}^{2}\rfloor. In this case, for all ϵ>0\epsilon>0, we obtain logN(ϵ,mF)(iCi)2ϵ2\log\mathcal{N}_{\infty}\left(\epsilon,m\circ\mathcal{F}\right)\leq\left\lfloor\frac{\left(\sum_{i}\mathcal{C}_{i}\right)^{2}}{\epsilon^{2}}\right\rfloor.

Lemma 2.4 shows that the complexity of mFm_{F} scales linearly in depth for any choice of layers Fi\mathcal{F}_{i}. In sharp contrast, lower bounds show that the complexity of the output margin scales exponentially in depth via a product of Lipschitz constants of all the layers (Bartlett et al., 2017; Golowich et al., 2017). Our proof only relies on basic properties of mFm_{F}, indicating that mFm_{F} is naturally better-equipped to handle the compositionality of F\mathcal{F}. In particular, we prove Lemma 2.4 by leveraging a uniform Lipschitz property of mFm_{F}. This uniform Lipschitz property does not hold for prior definitions of margin and reflects the key insight in our definition – it arises only because our margin depends on simultaneous perturbations to all layers.

For any two compositions F=fkf1F=f_{k}\circ\cdots\circ f_{1} and F^=f^kf^1\widehat{F}=\widehat{f}_{k}\circ\cdots\circ\widehat{f}_{1} and any (x,y)(x,y), we have mF(x,y)mF^(x,y)i=1kfif^iop2|m_{F}(x,y)-m_{\widehat{F}}(x,y)|\leq\sqrt{\sum_{i=1}^{k}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}.

Let δ\delta^{\star} be the optimal choice of δ\delta in the definition of mF(x,y)m_{F}(x,y). We will construct δ^\widehat{\delta} such that δ^2δ2+ifif^iop2\|\widehat{\delta}\|_{2}\leq\|\delta^{\star}\|_{2}+\sqrt{\sum_{i}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}} and γ(F^(x,δ^),y)=0\gamma(\widehat{F}(x,\widehat{\delta}),y)=0 as follows: define δ^iδi+Δi\widehat{\delta}_{i}\triangleq\delta_{i}^{\star}+\Delta_{i} for Δifi(hi1(x,δ))f^i(hi1(x,δ))hi1(x,δ)2\Delta_{i}\triangleq\frac{f_{i}(h_{i-1}(x,\delta^{\star}))-\widehat{f}_{i}(h_{i-1}(x,\delta^{\star}))}{\|h_{i-1}(x,\delta^{\star})\|_{2}}, where hh is defined as in (2.1) with respect to the classifier FF. Note that by our definition of op\|\cdot\|_{\textup{op}}, we have Δi2fif^iop\|\Delta_{i}\|_{2}\leq\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}. Now it is possible to check inductively that F^(x,δ^)=F(x,δ)\widehat{F}(x,\widehat{\delta})=F(x,\delta^{\star}). In particular, δ^\widehat{\delta} is satisfies the misclassification constraint in the all-layer margin objective for F^\widehat{F}. Thus, it follows that mF^(x,y)δ^2δ2+Δ2mF(x,y)+ifif^iop2m_{\widehat{F}}(x,y)\leq\|\widehat{\delta}\|_{2}\leq\|\delta^{\star}\|_{2}+\|\Delta\|_{2}\leq m_{F}(x,y)+\sqrt{\sum_{i}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}, where the last inequality followed from Δi2fif^iop\|\Delta_{i}\|_{2}\leq\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}. With the same reasoning,we obtain mF(x,y)mF^(x,y)+ifif^iop2m_{F}(x,y)\leq m_{\widehat{F}}(x,y)+\sqrt{\sum_{i}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}, so mF(x,y)mF^(x,y)ifif^iop2|m_{F}(x,y)-m_{\widehat{F}}(x,y)|\leq\sqrt{\sum_{i}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}. ∎

Given Claim 2.5, Lemma 2.4 follows simply by composing ϵi\epsilon_{i}-covers of Fi\mathcal{F}_{i}. We prove a more general version in Section A (see Lemmas A.3 and A.5.) To complete the proof of Theorem 2.3, we can apply Lemma 2.4 to bound the complexity of the family mFm\circ\mathcal{F}, allowing us to utilize the generalization bound in Lemma 2.2.

Finally, we check that when FF is a linear classifier, mFm_{F} recovers the standard output margin. Thus, we can view the all-layer margin as an extension of the output margin to deeper classifiers.

In the binary classification setting with a linear classifier F(x)=wxF(x)=w^{\top}x where the data xx has norm 11, we have mF(x,y)=max{0,ywx}=γ(F(x),y)m_{F}(x,y)=\max\{0,yw^{\top}x\}=\gamma(F(x),y).

For deeper models, the all-layer margin can be roughly bounded by a quantity which normalizes the output margin by Jacobian and hidden layer norms. We formalize this in Lemma 3.2 and use this to prove our main generalization bound for neural nets, Theorem 3.1.

Generalization Guarantees for Neural Networks

Although the all-layer margin is likely difficult to compute exactly, we can analytically lower bound it for neural nets with smooth activations. In this section, we obtain a generalization bound that depends on computable quantities by substituting this lower bound into Theorem 2.3. Our bound considers the Jacobian norms, hidden layer norms, and output margin on the training data, and avoids the exponential depth dependency when these quantities are well-behaved, as is the case in practice (Arora et al., 2018; Nagarajan and Kolter, 2019; Wei and Ma, 2019). Prior work (Nagarajan and Kolter, 2019; Wei and Ma, 2019) avoided the exponential depth dependency by considering these same quantities but required complicated proof frameworks. We obtain a simpler proof with tighter dependencies on these quantities by analyzing the all-layer margin. In Section B.1, we also provide a more general bound for neural networks with any activations (including relu) which only depends on the all-layer margins and weight matrix norms.

Assume that the activation ϕ\phi has a κϕ\kappa_{\phi}^{\prime}-Lipschitz derivative. Fix reference matrices {A(i),B(i)}i=1k\{A_{(i)},B_{(i)}\}_{i=1}^{k} and any integer q>0q>0. With probability 1δ1-\delta over the draw of the training sample PnP_{n}, all neural nets FF which achieve training error 0 satisfy

where κ(i)NN\kappa^{\textup{NN}}_{(i)} captures a local Lipschitz constant of perturbations at layer ii and is defined by

for a secondary term ψ(i)(x,y)\psi_{(i)}(x,y) given by

Here the second and third summations above are over the choices of j,jj,j^{\prime} satisfying the constraints specified in the summation. We define a(i)a_{(i)} by a(i)min{dW(i)A(i)fro,W(i)B(i)1,1}logd+poly(n1)a_{(i)}\triangleq\min\{\sqrt{d}\|W_{(i)}-A_{(i)}\|_{\textup{fro}},\|W_{(i)}-B_{(i)}\|_{1,1}\}\sqrt{\log d}+\textup{poly}(n^{-1}) and ζrlogn+log(1/δ)+ilog(a(i)+1)n\zeta\lesssim\frac{r\log n+\log(1/\delta)+\sum_{i}\log(a_{(i)}+1)}{n} is a low-order term.

For example, when q=2q=2, from (3.1) we obtain the following bound which depends on the second moment of κ(i)NN\kappa_{(i)}^{\textup{NN}} and features the familiar 1/n1/\sqrt{n} convergence rate in the training set size.

For larger qq, we obtain a faster convergence rate in nn, but the dependency on κ(i)NN\kappa^{\textup{NN}}_{(i)} gets larger. We will outline a proof sketch which obtains a variant of Theorem 3.1 with a slightly worse polynomial dependency on κ(i)NN\kappa^{\textup{NN}}_{(i)} and a(i)a_{(i)}. For simplicity we defer the proof of the full Theorem 3.1 to Sections B and C. First, we need to slightly redefine mFm_{F} so that perturbations are only applied at linear layers (formally, fix δ2i=0\delta_{2i}=0 for the even-indexed activation layers, and let δ(i)δ2i1\delta_{(i)}\triangleq\delta_{2i-1} index perturbations to the ii-th linear layer). It is possible to check that Lemma 2.4 still holds since activation layers correspond to a singleton function class {ϕ}\{\phi\} with log covering number 0. Thus, the conclusion of Theorem 2.3 also applies for this definition of mFm_{F}. Now the following lemma relates this all-layer margin to the output margin and Jacobian and hidden layer norms, showing that mF(x,y)m_{F}(x,y) can be lower bounded in terms of {κ(i)NN(x,y)}\{\kappa^{\textup{NN}}_{(i)}(x,y)\}.

In the setting above, we have the lower bound mF(x,y)1{κ(i)NN(x,y)}i=1r2m_{F}(x,y)\geq\frac{1}{\|\{\kappa^{\textup{NN}}_{(i)}(x,y)\}_{i=1}^{r}\|_{2}}.

Directly plugging the above lower bound into Theorem 2.3 and choosing C2i=0\mathcal{C}_{2i}=0, C2i1=a(i)\mathcal{C}_{2i-1}=a_{(i)} would give a variant of Theorem 3.1 that obtains a different polynomial in κ(i)NN\kappa^{\textup{NN}}_{(i)}, a(i)a_{(i)}.

We compute the derivative of F(x,δ)F(x,\delta) with respect to δ(i)\delta_{(i)}: Dδ(i)F(x,δ)=Dh2i1(x,δ)F(x,δ)h2i2(x,δ)2D_{\delta_{(i)}}F(x,\delta)=D_{h_{2i-1}(x,\delta)}F(x,\delta)\|h_{2i-2}(x,\delta)\|_{2}. We abused notation to let Dh2i1(x,δ)F(x,δ)D_{h_{2i-1}(x,\delta)}F(x,\delta) denote the derivative of FF with respect to the 2i12i-1-th perturbed layer evaluated on input (x,δ)(x,\delta). By definitions of κji\kappa_{j\leftarrow i}, s(i)s_{(i)} and the fact that the output margin is 1-Lipschitz, we obtain Dδ(i)γ(F(x,δ),y)δ=02Dh2i1(x,δ)F(x,0)oph2i2(x,0)2=κ2r12i(x)s(i1)(x)\|D_{\delta_{(i)}}\gamma(F(x,\delta),y)|_{\delta=0}\|_{2}\leq\|D_{h_{2i-1}(x,\delta)}F(x,0)\|_{\textup{op}}\|h_{2i-2}(x,0)\|_{2}=\kappa_{2r-1\leftarrow 2i}(x)s_{(i-1)}(x). With the first order approximation γ(F(x,δ),y)γ(F(x),y)+iDδ(i)γ(F(x,δ),y)δ=0δ(i)\gamma(F(x,\delta),y)\approx\gamma(F(x),y)+\sum_{i}D_{\delta_{(i)}}\gamma(F(x,\delta),y)|_{\delta=0}\delta_{(i)} around δ=0\delta=0, we obtain

The right hand side is nonnegative whenever δ2γ(F(x),y){κ2r12i(x)s(i1)(x)}i=1r2\|\delta\|_{2}\leq\frac{\gamma(F(x),y)}{\|\{\kappa_{2r-1\leftarrow 2i}(x)s_{(i-1)}(x)\}_{i=1}^{r}\|_{2}}, which would imply that mF(x,y)γ(F(x),y){κ2r12i(x)s(i1)(x)}i=1r2m_{F}(x,y)\geq\frac{\gamma(F(x),y)}{\|\{\kappa_{2r-1\leftarrow 2i}(x)s_{(i-1)}(x)\}_{i=1}^{r}\|_{2}}. However, this conclusion is imprecise and non-rigorous because of the first order approximation – to make the argument rigorous, we also control the smoothness of the network around xx in terms of the interlayer Jacobians, ultimately resulting in the bound of Lemma 3.2. We remark that the quantities κ(i)NN\kappa^{\textup{NN}}_{(i)} are not the only expressions with which we could lower bound mF(x,y)m_{F}(x,y). Rather, the role of κ(i)NN\kappa^{\textup{NN}}_{(i)} is to emphasize the key term s(i1)(x)κ2r12i(x)γ(F(x),y)\frac{s_{(i-1)}(x)\kappa_{2r-1\leftarrow 2i}(x)}{\gamma(F(x),y)}, which measures the first order stability of the network to perturbation δ(i)\delta_{(i)} and relates the all-layer margin to the output margin. As highlighted above, if this term is small, mFm_{F} will be large so long as the network is sufficiently smooth around (x,y)(x,y), as captured by ψ(i)(x,y)\psi_{(i)}(x,y).

Comparison to existing bounds

We can informally compare Theorem 3.1 to the existing bounds of (Nagarajan and Kolter, 2019; Wei and Ma, 2019) as follows. First, the leading term s(i1)(x)κ2r12i(x)γ(F(x),y)\frac{s_{(i-1)}(x)\kappa_{2r-1\leftarrow 2i}(x)}{\gamma(F(x),y)} of κ(i)NN\kappa^{\textup{NN}}_{(i)} depends on three quantities all evaluated on the same training example, whereas the analogous quantity in the bounds of (Nagarajan and Kolter, 2019; Wei and Ma, 2019) appears as maxPn1γ(F(x),y)maxPns(i1)(x)maxPnκ2r12i(x)\max_{P_{n}}\frac{1}{\gamma(F(x),y)}\cdot\max_{P_{n}}s_{(i-1)}(x)\cdot\max_{P_{n}}\kappa_{2r-1\leftarrow 2i}(x), where each maximum is taken over the entire training set.

As we have κ(i)NNLq(Pn)maxPnκ(i)NN(x,y)\|\kappa^{\textup{NN}}_{(i)}\|_{L_{q}(P_{n})}\leq\max_{P_{n}}\kappa^{\textup{NN}}_{(i)}(x,y), the term κ(i)NNLq(Pn)\|\kappa^{\textup{NN}}_{(i)}\|_{L_{q}(P_{n})} in our bound can be much smaller than its counterpart in the bounds of (Nagarajan and Kolter, 2019; Wei and Ma, 2019). An interpretation of the parameter qq is that we obtain fast (close to n1n^{-1}) convergence rates if the model fits every training example perfectly with large all-layer margin, or we could have slower convergence rates with better dependence on the all-layer margin distribution. It is unclear whether the techniques in other papers can achieve convergence rates faster than O(1/n)O(1/\sqrt{n}) because their proofs require the simultaneous convergence of multiple data-dependent quantities, whereas we bound everything using the single quantity mFm_{F}.

Additionally, we compare the dependence on the weight matrix norms relative to nn (as the degree of nn in our bound can vary). For simplicitly, assume that the reference matrices A(i)A_{(i)} are set to . Our dependence on the weight matrix norms relative to the training set size is, up to logarithmic factors, (min{dW(i)fro,W(i)1,1}n)2q/(q+2)\left(\frac{\min\{\sqrt{d}\|W_{(i)}\|_{\textup{fro}},\|W_{(i)}\|{1,1}\}}{\sqrt{n}}\right)^{2q/(q+2)}, which always matches or improves on the dependency obtained by PAC-Bayes methods such as (Neyshabur et al., 2017b; Nagarajan and Kolter, 2019). Wei and Ma (2019) obtain the dependency W(i)2,1n\frac{\|{W_{(i)}}^{\top}\|_{2,1}}{\sqrt{n}}, where W(i)2,1\|{W_{(i)}}^{\top}\|_{2,1} is the sum of the 2\|\cdot\|_{2} norms of the rows of W(i)W_{(i)}. This dependency on W(i)W_{(i)} is always smaller than ours. Finally, we note that Theorem 2.3 already gives tighter (but harder to compute) generalization guarantees for relu networks directly in terms of mFm_{F}. Existing work contains a term which depends on inverse pre-activations shown to be large in practice (Nagarajan and Kolter, 2019), whereas mFm_{F} avoids this dependency and is potentially much smaller. We explicitly state the bound in Section B.1.

Generalization Guarantees for Robust Classification

Assume that the activation ϕ\phi has a κϕ\kappa_{\phi}^{\prime}-Lipschitz derivative. Fix reference matrices {A(i),B(i)}i=1k\{A_{(i)},B_{(i)}\}_{i=1}^{k} and any integer q>0q>0. With probability 1δ1-\delta over the draw of the training sample PnP_{n}, all neural nets FF which achieve robust training error 0 satisfy

where κ(i)adv\kappa^{\textup{adv}}_{(i)} is defined by κ(i)adv(x,y)maxxBadv(x)κ(i)NN(x,y)\kappa^{\textup{adv}}_{(i)}(x,y)\triangleq\max_{x^{\prime}\in\mathcal{B}^{\textup{adv}}(x)}\kappa^{\textup{NN}}_{(i)}(x^{\prime},y) for κ(i)NN\kappa^{\textup{NN}}_{(i)} in (3.2), and a(i),ζa_{(i)},\zeta are defined the same as in Theorem 3.1.

Designing regularizers for robust classification based on the bound in Theorem 4.1 is a promising direction for future work. To prove Theorem 4.1, we simply define a natural extension to our all-layer margin, and the remaining steps follow in direct analogy to the clean classification setting. We define the adversarial all-layer margin as the smallest all-layer margin on the perturbed inputs: mFadv(x,y)minxBadv(x)mF(x,y)m^{\textup{adv}}_{F}(x,y)\triangleq\min_{x^{\prime}\in\mathcal{B}^{\textup{adv}}(x)}m_{F}(x,y). We note that mFadv(x,y)m^{\textup{adv}}_{F}(x,y) is nonzero if and only if FF correctly classifies all adversarial perturbations of xx. Furthermore, the adversarial all-layer margin also satisfies the uniform Lipschitz property in Claim 2.5. Thus, the remainder of the proof of Theorem 4.1 follows the same steps laid out in Section 2. As before, we note that Theorem 4.1 requires mFm_{F} to be the more general all-layer margin defined in Section A. We provide the full proofs in Section E.

Empirical Application of the All-Layer Margin

Inspired by the good statistical properties of the all-layer margin, we design an algorithm which encourages a larger all-layer margin during training. Our proposed algorithm is similar to adversarial training (Madry et al., 2017), except perturbations are applied at all hidden layers instead of the input. Xie et al. (2019) demonstrate that adversarial training can be used to improve generalization performance in image recognition. Yan et al. (2019) use adversarial perturbations to input to optimize a notion of margin in the input space.

We use Algorithm 1 to train a WideResNet architecture (Zagoruyko and Komodakis, 2016) on CIFAR10 and CIFAR100 in a variety of settings. Our implementation is available online.https://github.com/cwein3/all-layer-margin-opt For all of our experiments we use t=1t=1, ηperturb=0.01\eta_{\textup{perturb}}=0.01, and we apply perturbations following conv layers in the WideResNet basic blocks. In Table 1 we report the best validation error achieved during a single run of training, demonstrating that our algorithm indeed leads to improved generalization over the strong WideResNet baseline for a variety of settings. Additional parameter settings and experiments for the feedforward VGG (Simonyan and Zisserman, 2014) architecture are in Section F.1. In addition, in Section F.1, we show that dropout, another regularization method which perturbs each hidden layer, does not offer the same improvement as AMO.

Inspired by our robust generalization bound, we also apply AMO to robust classification by extending the robust training algorithm of (Madry et al., 2017). Madry et al. (2017) adversarially perturb the training input via several steps of projected gradient descent (PGD) and train using the loss computed on this perturbed input. At each update, our robust AMO algorithm initializes perturbations δ\delta to 0 for every training example in the batch. The updates to these δ\delta are performed simultaneously with PGD updates for the adversarial perturbations with the same update rule as Algorithm 1.

In Table 2, we report best validation performance for robust AMO and the baseline method of (Madry et al., 2017) on CIFAR10. Our results demonstrate that our robust AMO algorithm can also offer improvements in the robust accuracy. We provide parameter settings in Section F.2.

Conclusion

Many popular objectives in deep learning are based on maximizing a notion of output margin, but unfortunately it is difficult to obtain good statistical guarantees by analyzing this output margin. In this paper, we design a new all-layer margin which attains strong statistical guarantees for deep models. Our proofs for these guarantees follow very naturally from our definition of the margin. We apply the all-layer margin in several ways: 1) we obtain tighter data-dependent generalization bounds for neural nets 2) for adversarially robust classification, we directly bound the robust generalization error in terms of local Lipschitzness around the perturbed training examples, and 3) we design a new algorithm to encourage larger all-layer margins and demonstrate improved performance on real data in both clean and adversarially robust classification settings. We hope that our results prompt further study on maximizing all-layer margin as a new objective for deep learning.

Acknowledgements

CW acknowledges support from an NSF Graduate Research Fellowship. The work is also partially supported by SDSI and SAIL.

References

Appendix A Generalized All-Layer Margin and Missing Proofs for Section 2

In this section, we provide proofs for Section 2 in a more general and rigorous setting. We first formally introduce the setting, which considers functions composed of layers which map between arbitrary normed spaces.

As in Section 2, we will use F(x,δ)F(x,\delta) to denote the classifier output perturbed by δ\delta. It will be useful to define additional notation for the perturbed function between layers ii and jj, denoted by fji(h,δ)f_{j\leftarrow i}(h,\delta), recursively as follows:

where we choose fi1i(h,δ)hf_{i-1\leftarrow i}(h,\delta)\triangleq h. Note that F(x,δ)fk1(x,δ)F(x,\delta)\triangleq f_{k\leftarrow 1}(x,\delta), and the notation hi(x,δ)h_{i}(x,\delta) from Section 2 is equivalent to fi1(x,δ)f_{i\leftarrow 1}(x,\delta). We will use the simplified notation fji(x)fji(x,0)f_{j\leftarrow i}(x)\triangleq f_{j\leftarrow i}(x,0) when the perturbation δ\delta is at all layers.

The norm {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|} will have the following form:

This more general definition of mFm_{F} will be useful for obtaining Theorems 3.1 and 4.1. Note that by setting αi=1\alpha_{i}=1 for all ii and p=2p=2, we recover the simpler mFm_{F} defined in Section 2.

As before, it will be convenient for the analysis to assume that the ϵ\epsilon-covering number of Fi\mathcal{F}_{i} in operator norm scales with ϵ2\epsilon^{-2}. We formally state this condition for general function classes and norms below:

We say that a function class G\mathcal{G} satisfies the ϵ2\epsilon^{-2} covering condition with respect to norm \|\cdot\| with complexity C(G)\mathcal{C}_{\|\cdot\|}(\mathcal{G}) if for all ϵ>0\epsilon>0,

Now we provide the analogue of Theorem 2.3 for the generalized all-layer margin:

Fix any integer q>0q>0. Suppose that all layer functions Fi\mathcal{F}_{i} satisfy Condition A.1 with operator norm op\|\cdot\|_{\textup{op}} and complexity function Cop(Fi)\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i}). Let the all layer margin mFm_{F} be defined as in (A.1). Then with probability 1δ1-\delta over the draw of the training data, all classifiers FFF\in\mathcal{F} which achieve training error 0 satisfy

where C(F)(iαi2p/(p+2)Cop(Fi)2p/(p+2))(p+2)/2p\mathcal{C}_{{|\kern-0.75346pt|\kern-0.75346pt|\cdot|\kern-0.75346pt|\kern-0.75346pt|}}(\mathcal{F})\triangleq\left(\sum_{i}\alpha_{i}^{2p/(p+2)}\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})^{2p/(p+2)}\right)^{(p+2)/2p} is the complexity (in the sense of Condition A.1) for covering F\mathcal{F} in {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|} and ζO(log(1/δ)+lognn)\zeta\triangleq O\left(\frac{\log(1/\delta)+\log n}{n}\right) is a low-order term.

Note that this recovers Theorem 2.3 when αi=1\alpha_{i}=1 for all ii and p=2p=2. The proof of Theorem A.2 mirrors the plan laid out in Section 2. As before, the first step of the proof is showing that mFm_{F} has low complexity as measured by covering numbers.

Define mF{(x,y)mF(x,y):FF}m\circ\mathcal{F}\triangleq\{(x,y)\mapsto m_{F}(x,y):F\in\mathcal{F}\}. Then

As in Section 2, we prove Lemma A.3 by bounding the error between mFm_{F} and mF^m_{\widehat{F}} in terms of the {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}-norm of the difference between FF and F^\widehat{F}.

For any x,yD0×[l]x,y\in\mathcal{D}_{0}\times[l], and function sequences F={fi}i=1kF=\{f_{i}\}_{i=1}^{k}, F^={fi^}i=1k\widehat{F}=\{\widehat{f_{i}}\}_{i=1}^{k}, we have mF(x,y)mF^(x,y)FF^|m_{F}(x,y)-m_{\widehat{F}}(x,y)|\leq{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}.

Suppose that δ\delta^{\star} optimizes equation (A.1) used to define mF(x,y)m_{F}(x,y). Now we use the notation hifi1(x,δ)h^{\star}_{i}\triangleq f_{i\leftarrow 1}(x,\delta^{\star}). Define δ^\widehat{\delta} as follows:

We first argue via induction that f^i1(x,δ^)=hi\widehat{f}_{i\leftarrow 1}(x,\widehat{\delta})=h_{i}^{\star}. As the base case, we trivially have f^01(x,δ^)=x=h0\widehat{f}_{0\leftarrow 1}(x,\widehat{\delta})=x=h^{\star}_{0}.

Thus, we must have F^(x,δ^)=F(x,δ)\widehat{F}(x,\widehat{\delta})=F(x,\delta^{\star}), so it follows that γ(F^(x,δ^),y)0\gamma(\widehat{F}(x,\widehat{\delta}),y)\leq 0 as well. Furthermore, by triangle inequality

Now we note that as fi(hi1)fi^(hi1)hi1fif^iop\frac{\|f_{i}(h^{\star}_{i-1})-\widehat{f_{i}}(h^{\star}_{i-1})\|}{\|h^{\star}_{i-1}\|}\leq\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}, it follows that

Thus, using (A.2) and the definition of mF^m_{\widehat{F}}, we have

where we relied on the fact that δ=mF(x,y){|\kern-1.07639pt|\kern-1.07639pt|\delta^{\star}|\kern-1.07639pt|\kern-1.07639pt|}=m_{F}(x,y). Using the same reasoning, we also obtain the inequality mF(x,y)mF^(x,y)+FF^m_{F}(x,y)\leq m_{\widehat{F}}(x,y)+{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}. Combining gives mF(x,y)mF^(x,y)FF^|m_{F}(x,y)-m_{\widehat{F}}(x,y)|\leq{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}. ∎

As Claim A.4 holds for any choice of (x,y)D0×[l](x,y)\in\mathcal{D}_{0}\times[l], it follows that if F^\widehat{\mathcal{F}} covers F\mathcal{F} in norm {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}, then mF^m\circ\widehat{\mathcal{F}} will be a cover for mFm\circ\mathcal{F} in the functional \infty norm. ∎

Next, we show that when Fi\mathcal{F}_{i} satisfies Condition A.1 in operator norm, the class of compositions F\mathcal{F} satisfies Condition A.1 with respect to norm {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}.

Suppose that each Fi\mathcal{F}_{i} satisfies Condition A.1 with norm op\|\cdot\|_{\textup{op}} and complexity Cop(Fi)\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i}). Define the complexity measure C(F)\mathcal{C}_{{|\kern-0.75346pt|\kern-0.75346pt|\cdot|\kern-0.75346pt|\kern-0.75346pt|}}(\mathcal{F}) by

which by definition implies that F\mathcal{F} satisfies Condition A.1 with norm {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|} and complexity C(F)\mathcal{C}_{{|\kern-0.75346pt|\kern-0.75346pt|\cdot|\kern-0.75346pt|\kern-0.75346pt|}}(\mathcal{F}).

Let F^i\widehat{\mathcal{F}}_{i} be an ϵi\epsilon_{i}-cover of Fi\mathcal{F}_{i} in the operator norm op\|\cdot\|_{\textup{op}}. We will first show that F^{F^kF^1:F^iF^i}\widehat{\mathcal{F}}\triangleq\{\widehat{F}_{k}\circ\cdots\circ\widehat{F}_{1}:\widehat{F}_{i}\in\widehat{\mathcal{F}}_{i}\} is a {αiϵi}i=1kp\|\{\alpha_{i}\epsilon_{i}\}_{i=1}^{k}\|_{p}-cover of F\mathcal{F} in {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}. To see this, for any F=(fk,,f1)FF=(f_{k},\ldots,f_{1})\in\mathcal{F}, let f^iF^i\widehat{f}_{i}\in\widehat{\mathcal{F}}_{i} be the cover element for fif_{i}, and define F^(f^k,,f^1)\widehat{F}\triangleq(\widehat{f}_{k},\ldots,\widehat{f}_{1}). Then we have

as desired. Furthermore, we note that logF^iCop2(Fi)ϵi2\log|\widehat{\mathcal{F}}|\leq\sum_{i}\left\lfloor\frac{\mathcal{C}^{2}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})}{\epsilon_{i}^{2}}\right\rfloor by Condition A.1. Now we will choose

We first verify that this gives an ϵ\epsilon-cover of F\mathcal{F} in {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}:

Next, we check that the covering number is bounded by C2(F)ϵ2\left\lfloor\frac{\mathcal{C}^{2}_{{|\kern-0.5382pt|\kern-0.5382pt|\cdot|\kern-0.5382pt|\kern-0.5382pt|}}(\mathcal{F})}{\epsilon^{2}}\right\rfloor:

Finally, the following lemma bounds the 0-1 test error of classifier FFF\in\mathcal{F} in terms of the complexity of some general margin function gFg_{F} and is a generalization of Lemma 2.2 to obtain a bound in terms of arbitrary qq-th moments of the inverse margin. Theorem A.2 will immediately follow by applying this lemma to the all-layer margin.

For a function class F\mathcal{F}, let G\mathcal{G} be a class of margin functions indexed by FFF\in\mathcal{F}. Suppose that G\mathcal{G} satisfies Condition A.1 with respect to \infty-norm, i.e. logN(ϵ,G)C2(G)ϵ2\log\mathcal{N}_{\infty}(\epsilon,\mathcal{G})\leq\lfloor\frac{\mathcal{C}_{\infty}^{2}(\mathcal{G})}{\epsilon^{2}}\rfloor. Fix any integer q>0q>0. Then with probability 1δ1-\delta over the draw of the training data, all classifiers FFF\in\mathcal{F} which achieve training error 0 satisfy

where ζO(log(1/δ)+lognn)\zeta\triangleq O\left(\frac{\log(1/\delta)+\log n}{n}\right) is a low-order term.

We prove Lemma A.6 in Section A.1. We now complete the proof of Theorem A.2 (and as a result, Theorem 2.3).

Note that by selecting GmF\mathcal{G}\triangleq m\circ\mathcal{F}, the conditions of Lemma A.6 are satisfied with margin family G\mathcal{G} with C(G)=C(F)\mathcal{C}_{\infty}(\mathcal{G})=\mathcal{C}_{{|\kern-0.75346pt|\kern-0.75346pt|\cdot|\kern-0.75346pt|\kern-0.75346pt|}}(\mathcal{F}) defined in Lemma A.5. Plugging this value of C(G)\mathcal{C}_{\infty}(\mathcal{G}) into the bound in Lemma A.6 immediately gives the desired statement. ∎

The following claim is a straightforward adaption of the theorem in [Srebro et al., 2010] to covering numbers. For minor technical reasons we reprove the result below.

By applying Claim A.7 to mFm\circ\mathcal{F} along with Lemma A.3, we can also obtain the following corollary, which will be useful in later sections.

Choosing β\beta to minimize the above expression would give the desired bound – however, such a post-hoc analysis cannot be performed because the optimal β\beta depends on the training data, and the loss class has to be fixed before the training data is drawn.

Instead, we utilize the standard technique of union bounding over a grid of β^\widehat{\beta} in log-scale. Let ξC2(G)poly(n1)\xi\triangleq\mathcal{C}_{\infty}^{2}(\mathcal{G})\textup{poly}(n^{-1}) denote the minimum choice of β^\widehat{\beta} in this grid, and select in this grid all choices of β^\widehat{\beta} in the form ξ2j\xi 2^{j} for j0j\geq 0. For a given choice of β^\widehat{\beta}, we assign it failure probability δ^=δ2β^/ξ\widehat{\delta}=\frac{\delta}{2\widehat{\beta}/\xi}, such that by design δ^=δ\sum\widehat{\delta}=\delta. Thus, applying Claim A.7 for each choice of β^\widehat{\beta} with corresponding failure probability δ^\widehat{\delta}, we note with probability 1δ1-\delta,

holds for all β^\widehat{\beta} and FFF\in\mathcal{F}.

Thus, with probability 1δ1-\delta, for all FFF\in\mathcal{F}, we have

It remains to observe that setting βF=Θ(q(n1gFLq(Pn)qC(G)2log2n)2/(q+2))\beta^{\star}_{F}=\Theta\left(q\left(\frac{n\left\|\frac{1}{g_{F}}\right\|_{L_{q}(P_{n})}^{q}}{\mathcal{C}_{\infty}(\mathcal{G})^{2}\log^{2}n}\right)^{2/(q+2)}\right) gives us the theorem statement. ∎

It suffices the complete the proof of Claim A.7. For this, we closely follow the proof of Theorem 1 in [Srebro et al., 2010], with the only difference that we replace their Rademacher complexity term with our complexity function C(G)\mathcal{C}_{\infty}(\mathcal{G}). For completeness, we outline the steps here.

Now using the same steps as [Srebro et al., 2010] (which relies on applying Theorem 6.1 of [Bousquet, 2002]), we obtain for all gGg\in\mathcal{G}, with probability 1δ1-\delta

where rnr_{n}^{\star} is the largest solution of ψ(μ)=μ\psi(\mu)=\mu. We now plug in rnβlog2nC2(G)nr_{n}^{\star}\lesssim\frac{\beta\log^{2}n\mathcal{C}_{\infty}^{2}(\mathcal{G})}{n} and use the fact that c1c2(c1+c2)/2\sqrt{c_{1}c_{2}}\leq(c_{1}+c_{2})/2 for any c1,c2>0c_{1},c_{2}>0 to simplify the square root term in A.7.

In the setting above, for all μ>0\mu>0, we have

where {σi}i=1n\{\sigma_{i}\}_{i=1}^{n} are i.i.d. Rademacher random variables.

First, by Dudley’s Theorem [Dudley, 1967], we have

We obtained the last line via change of variables to ϵ=ϵ/48βμ\epsilon^{\prime}=\epsilon/\sqrt{48\beta\mu}. Now we substitute α=C(G)48βμn\alpha=\frac{\mathcal{C}_{\infty}(\mathcal{G})\sqrt{48\beta\mu}}{\sqrt{n}} and note that the integrand is for ϵ>C(G)\epsilon^{\prime}>\mathcal{C}_{\infty}(\mathcal{G}) to get

The following proposition bounds the covering number of H(μ)\mathcal{H}(\mu) in terms of C(G)\mathcal{C}_{\infty}(\mathcal{G}).

In the setting of Claim A.7, we have the covering number bound

Appendix B Proofs for Neural Net Generalization

This section will derive the generalization bounds for neural nets in Theorem 3.1 by invoking the more general results in Section C. Theorem 3.1 applies to all neural nets, but to obtain it, we first need to bound generalization for neural nets with fixed norm bounds on their weights (this is a standard step in deriving generalization bounds). The lemma below states the analogue of Theorem 3.1, for all neural nets satisfying fixed norm bounds on their weights.

In the neural network setting, suppose that the activation ϕ\phi has a κϕ\kappa^{\prime}_{\phi}-Lipschitz derivative. For parameters {a(i)}i=1r\{a_{(i)}\}_{i=1}^{r} meant to be norm constraints for the weights, define the class of neural nets with bounded weight norms with respect to reference matrices {A(i),B(i)}\{A_{(i)},B_{(i)}\} as follows:

Then with probability 1δ1-\delta, for any q>0q>0 and for all FFF\in\mathcal{F}, we have

where Sn\mathcal{S}_{n} denotes the subset of examples classified correctly by FF and κ(i)NN\kappa^{\textup{NN}}_{(i)} is defined as in (3.2).

We will identify the class of neural nets with matrix norm bounds {a(i)}i=1r\{a_{(i)}\}_{i=1}^{r} with a sequence of function families

and let FF2r1F1\mathcal{F}\triangleq\mathcal{F}_{2r-1}\circ\cdots\circ\mathcal{F}_{1} denote all possible parameterizations of neural nets with norm bounds {a(i)}i=1r\{a_{(i)}\}_{i=1}^{r}. Let op\|\cdot\|_{\textup{op}} be defined with respect to Euclidean norm 2\|\cdot\|_{2} on the input and output spaces, which coincides with matrix operator norm for linear operators. We first claim that

This is because we can construct two covers: one for {hWh:dWF/logda(i)}\{h\mapsto Wh:\sqrt{d}\|W\|_{F}/\sqrt{\log d}\leq a_{(i)}\}, and one for {hWh:W1,1/logda(i)}\{h\mapsto Wh:\|W\|_{1,1}/\sqrt{\log d}\leq a_{(i)}\}, each of which has log size bounded by O(a(i)2/ϵ2)O(\lfloor{a_{(i)}}^{2}/\epsilon^{2}\rfloor) by Lemma B.3 and Claim B.5. Now we offset the first cover by the linear operator A(i)A_{(i)} and the second by B(i)B_{(i)} and take the union of the two, obtaining an ϵ\epsilon-cover for F2i1\mathcal{F}_{2i-1} in operator norm. Furthermore, logNop(ϵ,F2i)=0\log\mathcal{N}_{\|\cdot\|_{\textup{op}}}(\epsilon,\mathcal{F}_{2i})=0 simply because F2i\mathcal{F}_{2i} is the singleton function.

Thus, F2i1,F2i\mathcal{F}_{2i-1},\mathcal{F}_{2i} satisfy Condition A.1 with norm op\|\cdot\|_{\textup{op}} and complexity functions Cop(F2i1)a(i)\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{2i-1})\lesssim a_{(i)} and Cop(F2i)=0\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{2i})=0, so we can apply Theorem C.1. It remains to argue that κ2i1(x,y)\kappa^{\star}_{2i-1}(x,y) as defined for Theorem C.1 using standard Euclidean norm 2\|\cdot\|_{2} is equivalent to κ(i)NN(x,y)\kappa^{\textup{NN}}_{(i)}(x,y) defined in (3.2). To see this, we note that functions in F2j1\mathcal{F}_{2j-1} have 0-Lipschitz derivative, leading those terms with a coefficient of κ2j1\kappa^{\prime}_{2j-1} to cancel in the definition of κi(x,y)\kappa_{i}^{\star}(x,y). There is a 1-1 correspondence between the remaining terms of κ2i1(x,y)\kappa^{\star}_{2i-1}(x,y) and κ(i)NN(x,y)\kappa^{\textup{NN}}_{(i)}(x,y), so we can substitute κ(i)NN(x,y)\kappa^{\textup{NN}}_{(i)}(x,y) into Theorem C.1 in place of κ2i1(x,y)\kappa^{\star}_{2i-1}(x,y). Furthermore, as we have Cop(F2i)=0\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{2i})=0, the corresponding terms disappear in the bound of Theorem C.1, finally giving the desired result. ∎

Now we obtain Theorem 3.1 by union bounding Lemma B.1 over choices of {a(i)}i=1r\{a_{(i)}\}_{i=1}^{r}.

We will use the standard technique of applying Lemma B.1 over many choices of {a(i)}\{a_{(i)}\}, and union bounding over the failure probability. Choose ξ=poly(n1)\xi=\textup{poly}(n^{-1}) and consider a grid of {α^(i)}\{\widehat{\alpha}_{(i)}\} with a^(i)=ξ2ji\widehat{a}_{(i)}=\xi 2^{j_{i}} for ji1j_{i}\geq 1. We apply Lemma B.1 with for all possible norm bounds {α^(i)}\{\widehat{\alpha}_{(i)}\} in the grid, using failure probability δ^=δ/(ia^(i)/ξ)\widehat{\delta}=\delta/(\prod_{i}\widehat{a}_{(i)}/\xi) for a given choice of {α^(i)}\{\widehat{\alpha}_{(i)}\}. By union bound, with probability 1δ^=1δ1-\sum\widehat{\delta}=1-\delta, the bound of Lemma B.1 holds simultaneously for all choices of {α^(i)}\{\widehat{\alpha}_{(i)}\}. In particular, for the neural net FF with parameters {W(i)}\{W_{(i)}\}, there is a choice of {α^(i)}\{\widehat{\alpha}_{(i)}\} satisfying

for all ii. The application of Lemma B.1 for this choice of α^(i)\widehat{\alpha}_{(i)} gives us the desired generalization bound. ∎

In the case where ϕ\phi is the relu activation, we can no longer lower bound the all-layer margin mF(x,y)m_{F}(x,y) using the techniques in Section C, which rely on smoothness. However, we can still obtain a generalization bound in terms of the distribution of 1/mF(x,y)1/m_{F}(x,y) on the training data. We can expect 1/mF(x,y)1/m_{F}(x,y) to be small in practice because relu networks typically exhibit stability to perturbations. Prior bounds for relu nets suffer from some source of looseness: the bounds of [Bartlett et al., 2017, Neyshabur et al., 2017b] depended on the product of weight norms divided by margin, and the bounds of [Nagarajan and Kolter, 2019] depended on the inverse of the pre-activations, observed to be large in practice. Our bound avoids these dependencies, and in fact, it is possible to upper bound our dependency on 1/mF(x,y)1/m_{F}(x,y) in terms of both these quantities.

For this setting, we choose a fixed {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|} defined as follows: if ii corresponds to a linear layer in the network, set αi=1\alpha_{i}=1, and for ii corresponding to activation layers, set αi=\alpha_{i}=\infty (in other words, we only allow perturbations after linear layers). We remark that we could use alternative definitions of {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}, but because we do not have a closed-form lower bound on mFm_{F}, the tradeoff between these formulations is unclear.

In the neural network setting, suppose that ϕ\phi is any activation (such as the relu function) and mFm_{F} is defined using {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|} as described above. Fix any integer q>0q>0. Then with probability 1δ1-\delta, for all relu networks FF parameterized by weight matrices {W(i)}i=1r\{W_{(i)}\}_{i=1}^{r} that achieve training error 0, we have

where a(i)a_{(i)} is defined as in Theorem 3.1, and ζO(log(1/δ)+rlogn+i(a(i)+1)n)\zeta\triangleq O\left(\frac{\log(1/\delta)+r\log n+\sum_{i}(a_{(i)}+1)}{n}\right) is a low-order term.

The proof follows via direct application of Theorem A.2 and the same arguments as Lemma B.1 relating matrix norms to covering numbers. We remark that in the case of relu networks, we can upper bound 1mF(x,y)\frac{1}{m_{F}(x,y)} via a quantity depending on the inverse pre-activations that mirrors the bound of Nagarajan and Kolter . However, as mentioned earlier, this is a pessimistic upper bound as Nagarajan and Kolter show that the inverse preactivations can be quite large in practice.

B.2 Matrix Covering Lemmas

In this section we present our spectral norm cover for the weight matrices, which is used in Section B to prove our neural net generalization bounds.

Let Mfro(B)\mathcal{M}_{\textup{fro}}(B) denote the set of d1×d2d_{1}\times d_{2} matrices with Frobenius norm bounded by BB, i.e.

Then letting dmax{d1,d2}d\triangleq\max\{d_{1},d_{2}\} denote the larger dimension, for all ϵ>0\epsilon>0, we have

The idea for this proof is that since the cover is in spectral norm, we only need to cover the top dB2/ϵ2d^{\prime}\triangleq\lfloor B^{2}/\epsilon^{2}\rfloor singular vectors of matrices MMM\in\mathcal{M}.

First, it suffices to work with square matrices, as a spectral norm cover of max{d1,d2}×max{d1,d2}\max\{d_{1},d_{2}\}\times\max\{d_{1},d_{2}\} matrices will also yield a cover of d1×d2d_{1}\times d_{2} matrices in spectral norm (as we can extend a d1×d2d_{1}\times d_{2} matrices to a larger square matrix by adding rows or columns with all 0). Thus, letting dmax{d1,d2}d\triangleq\max\{d_{1},d_{2}\}, we will cover Mfro(B)\mathcal{M}_{\textup{fro}}(B) defined with respect to d×dd\times d matrices.

Let d9B2/ϵ2d^{\prime}\triangleq\lfloor 9B^{2}/\epsilon^{2}\rfloor. We first work in the case when ddd^{\prime}\leq d. Let U^\widehat{\mathcal{U}} be a ϵU\epsilon_{U} Frobenius norm cover of d×dd\times d^{\prime} matrices with Frobenius norm bound dd^{\prime}. Let V^\widehat{\mathcal{V}} be the cover of d×dd^{\prime}\times d matrices with Frobenius norm bound BB in Frobenius norm with resolution ϵV\epsilon_{V}. We construct a cover M^\widehat{\mathcal{M}} for Mfro(B)\mathcal{M}_{\textup{fro}}(B) as follows: take all possible combinations of matrices U^,V^\widehat{U},\widehat{V} from U^,V^\widehat{\mathcal{U}},\widehat{\mathcal{V}}, and add U^V^\widehat{U}\widehat{V} to M^\widehat{\mathcal{M}}. First note that by Claim B.4, we have

Thus, setting ϵU=ϵ/3B\epsilon_{U}=\epsilon/3B, ϵV=ϵ/3d\epsilon_{V}=\epsilon/3d^{\prime}, then we get a ϵ\epsilon-cover of M\mathcal{M} with log cover size 9dB2/ϵ2(log81d2B2/ϵ2)\lfloor 9dB^{2}/\epsilon^{2}\rfloor(\log 81{d^{\prime}}^{2}B^{2}/\epsilon^{2}). As ddd^{\prime}\leq d, this simplifies to 36dB2log(9d)/ϵ2\lfloor 36dB^{2}\log(9d)/\epsilon^{2}\rfloor.

Now when ddd^{\prime}\geq d, we simply take a Frobenius norm cover of d×dd\times d matrices with Frobenius norm bound BB, which by Claim B.4 has log size at most d2log(3B/ϵ)36dB2log(9d)/ϵ2d^{2}\log(3B/\epsilon)\leq\lfloor 36dB^{2}\log(9d)/\epsilon^{2}\rfloor, where the inequality followed because 9B2/ϵ2d9B^{2}/\epsilon^{2}\geq d.

Combining both cases, we get for all ϵ>0\epsilon>0,

The following claims are straightforward and follow from standard covering number bounds for 2\|\cdot\|_{2} and 1\|\cdot\|_{1} balls.

Let Mfro(B)\mathcal{M}_{\textup{fro}}(B) denote the class of d1×d2d_{1}\times d_{2} matrices with Frobenius norm bounded by BB. Then for 0<ϵ<B0<\epsilon<B, logNfro(ϵ,Mfro(B))d1d2log(3B/ϵ)\log\mathcal{N}_{\|\cdot\|_{\textup{fro}}}(\epsilon,\mathcal{M}_{\textup{fro}}(B))\leq d_{1}d_{2}\log(3B/\epsilon).

Appendix C Generalization Bound for Smooth Function Compositions

In this section, we present the bound for general smooth function compositions used to prove Theorem 3.1.

We will work in the same general setting as Section A. Let Jji(x,δ)J_{j\leftarrow i}(x,\delta) denote the ii-to-jj Jacobian evaluated at fi11(x,δ)f_{i-1\leftarrow 1}(x,\delta), i.e. Jji(x,δ)Dhfji(h,δ)h=fi11(x,δ)J_{j\leftarrow i}(x,\delta)\triangleq D_{h}f_{j\leftarrow i}(h,\delta)|_{h=f_{i-1\leftarrow 1}(x,\delta)}. We will additionally define general notation for hidden layer and Jacobian norms which coincides with our notation for neural nets. Let si(x)fi1(x)s_{i}(x)\triangleq\|f_{i\leftarrow 1}(x)\| and s0(x)xs_{0}(x)\triangleq\|x\|. As the function DfjiDf_{j\leftarrow i} outputs operators mapping Di1\mathcal{D}_{i-1} to Dj\mathcal{D}_{j}, we can additionally define κji(x)Dfjifi11(x)op\kappa_{j\leftarrow i}(x)\triangleq\|Df_{j\leftarrow i}\circ f_{i-1\leftarrow 1}(x)\|_{\textup{op}}, with κjj+1(x)1\kappa_{j\leftarrow j+1}(x)\triangleq 1.

Let κi\kappa^{\prime}_{i} be an upper bound on the Lipschitz constant of DfiiDf_{i\leftarrow i} measured in operator norm:

Now we define the value κi(x,y)\kappa^{\star}_{i}(x,y), which can be thought of as a Lipschitz constant for perturbation δi\delta_{i} in the definition of mFm_{F}, as follows:

For this general setting, the following theorem implies that for any integer q>0q>0, if FF classifies all training examples correctly, then its error converges at a rate that scales with nq/(q+2)n^{-q/(q+2)} and the products κiLq(Pn)Cop(Fi)\|\kappa^{\star}_{i}\|_{L_{q}(P_{n})}\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i}).

Let F={fkf1:fiFi}\mathcal{F}=\{f_{k}\circ\cdots\circ f_{1}:f_{i}\in\mathcal{F}_{i}\} denote a class of compositions of functions from kk families {Fi}i=1k\{\mathcal{F}_{i}\}_{i=1}^{k}, each of which satisfies Condition A.1 with operator norm op\|\cdot\|_{\textup{op}} and complexity Cop(Fi)\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i}). For any choice of integer q>0q>0, with probability 1δ1-\delta for all FFF\in\mathcal{F} the following bound holds:

where Sn\mathcal{S}_{n} denotes the subset of training examples correctly classified by FF and κi\kappa^{\star}_{i} is defined in (C.1). In particular, if FF classifies all training samples correctly, i.e. Sn=n|\mathcal{S}_{n}|=n, with probability 1δ1-\delta we have

where ζO(klogn+log(1/δ)n)\zeta\triangleq O\left(\frac{k\log n+\log(1/\delta)}{n}\right) is a low order term.

In the setting of Theorem C.1, where each layer Fi\mathcal{F}_{i} is a class of smooth functions, if γ(F(x),y)>0\gamma(F(x),y)>0, we have

We prove Lemma C.2 in Section D by formalizing the intuition outlined in Section 3. With Lemma C.2 in hand, we can prove Theorem C.1. This proof will follow the same outline as the proof of Theorem A.2. The primary difference is that we optimize over kk values of αi\alpha_{i}, whereas Theorem A.2 only optimized over the smoothness β\beta.

Plugging this into (C.2), we get with probability 1δ1-\delta for all FFF\in\mathcal{F},

Thus, it suffices to upper bound EE. By Lemma C.2, we have mF(x,y){κi(x,y)/αi}i=1kp/(p1)1m_{F}(x,y)\geq\|\{\kappa^{\star}_{i}(x,y)/\alpha_{i}\}_{i=1}^{k}\|_{p/(p-1)}^{-1} for the choice of α,p\alpha,p used to define mFm_{F}. We will set p=q/(q1)p=q/(q-1) and union bound (C.3) over choices of α\alpha.

First, for a particular choice of α\alpha and p=q/(q1)p=q/(q-1), we apply our lower bound on mF(x,y)m_{F}(x,y) to simplify (C.4) as follows:

We use ξi\xi_{i} to denote the lower limit on αi\alpha_{i} in our search over α\alpha, setting ξi=Cop(Fi)1poly(k1n1)\xi_{i}=\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})^{-1}\textup{poly}(k^{-1}n^{-1}).If Cop(Fi)=0\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})=0, then we simply set αi=\alpha_{i}=\infty, which is equivalent to restricting the perturbations used in computing mFm_{F} to layers where Cop(Fi)>0\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})>0. Now we consider a grid of {α^i}i=1k\{\widehat{\alpha}_{i}\}_{i=1}^{k}, where α^\widehat{\alpha} has entries of the form α^i=ξi2j\widehat{\alpha}_{i}=\xi_{i}2^{j} for any j0j\geq 0. For a given choice of α^\widehat{\alpha}, we assign it failure probability

where δ\delta is the target failure probability after union bounding. First, note that

Therefore, with probability 1δ1-\delta, we get that (C.2) holds for mFm_{F} defined with respect to every α^\widehat{\alpha}. In particular, with probability 1δ1-\delta, for all FFF\in\mathcal{F} and α^\widehat{\alpha} in the grid,

where the last term was obtained by subsituting (C.6) for the failure probability.

Now we claim that there is some choice of α^\widehat{\alpha} in the grid such that either

To see this, we first consider α^\widehat{\alpha} in our grid such that α^i[αF,i,2αF,i+ξi]\widehat{\alpha}_{i}\in[\alpha^{\star}_{F,i},2\alpha^{\star}_{F,i}+\xi_{i}]. By construction of our grid of α^\widehat{\alpha}, such a choice always exists. Then we have

Thus, it follows that for all FFF\in\mathcal{F},

Finally, we can apply Lemma C.3 using zi=((c2q)q/2n(x,y)Snκi(x,y)q)1/q=(Snn)1/qκiLq(Sn)z_{i}=\left(\frac{(c_{2}q)^{q/2}}{n}\sum_{(x,y)\in\mathcal{S}_{n}}\kappa^{\star}_{i}(x,y)^{q}\right)^{1/q}=\left(\frac{|\mathcal{S}_{n}|}{n}\right)^{1/q}\|\kappa^{\star}_{i}\|_{L_{q}(\mathcal{S}_{n})} and bi=Cop(Fi)lognnb_{i}=\frac{\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})\log n}{\sqrt{n}} to get

Substituting into (C.9) gives the desired bound.

For coefficients {zi}i=1k,{bi}i=1k>0\{z_{i}\}_{i=1}^{k},\{b_{i}\}_{i=1}^{k}>0 and integer q>0q>0, define

with minimizer α\alpha^{\star} and minimum value EE^{\star}. Then

Choose {αi}i=1k\{\alpha_{i}\}_{i=1}^{k} as follows (we obtained this by solving for α\alpha for which αE(α)=0\nabla_{\alpha}E(\alpha)=0):

For this particular choice of α\alpha, we can compute

Finally, we note that (q2)2/(q+2)+(q2)qq+22\left(\frac{q}{2}\right)^{2/(q+2)}+\left(\frac{q}{2}\right)^{-\frac{q}{q+2}}\leq 2, so we obtain

In this section, we prove Lemma C.2, which states that when the function FF is a composition of functions with Lipschitz derivative, we will be able to lower bound mF(x,y)m_{F}(x,y) in terms of the intermediate Jacobians and layer norms evaluated at xx. To prove Lemma C.2, we rely on tools developed by [Wei and Ma, 2019] which control the change in the output of a composition of functions if all the intermediate Jacobians are bounded.

We also define the ramp loss TρT_{\rho} as follows:

for nonnegative parameters ρ,ti,τji\rho,t_{i},\tau_{j\leftarrow i} which we will later choose to be the margin, hidden layer norm, and Jacobian norms at the unperturbed input. Because the augmented indicator I(δ;x,y)\mathcal{I}(\delta;x,y) conditions on small Jacobian and hidden layer norms, it will turn out to be κi(x,y)\kappa^{\star}_{i}(x,y)-Lipschitz in the perturbation δi\delta_{i}. Furthermore, by construction, the value of the augmented indicator I(δ;x,y)\mathcal{I}(\delta;x,y) will equal 11 when δ=0\delta=0, and we will also have

This immediately gives a lower bound on the perturbation level required to create a negative margin. The lemma below formally bounds the Lipschitz constant of I(δ;x,y)\mathcal{I}(\delta;x,y) in δi\delta_{i}.

For nonnegative parameters ti,τji,ρt_{i},\tau_{j\leftarrow i},\rho, with τjj+1=1\tau_{j\leftarrow j+1}=1 for any jj and τjj=0\tau_{j\leftarrow j^{\prime}}=0 for jj+2j\leq j^{\prime}+2, define the function I(δ;x,y)\mathcal{I}(\delta;x,y) as in (D.9). Then in the setting of Lemma C.2, for a given i[k]i\in[k], for all choices of δi\delta_{i} and ν\nu, if δj=0\delta_{j}=0 for j>ij>i, we have

We prove Lemma D.1 in Section D.1. With Lemma D.1, we can formalize the proof of Lemma C.2.

Furthermore, by the definition of I(δ;x,y)\mathcal{I}(\delta;x,y), we have

Finally, by our choice of the parameters used to define I(δ;x,y)\mathcal{I}(\delta;x,y), we also have I(0;x,y)1\mathcal{I}(0;x,y)\geq 1. Combining everything with (D.11), we get

To see the core idea of the proof, consider differentiating I(δ;x,y)\mathcal{I}(\delta;x,y) with respect to δi\delta_{i} (ignoring for the moment that the soft indicators are technically not differentiable). Let the terms A1,,AqA_{1},\ldots,A_{q} represent the different indicators which the product I(δ;x,y)\mathcal{I}(\delta;x,y) is comprised of. Then by the product rule for differentiation, we would have

We use the following fact that the product of functions which are product-Lipschitz with respect to one another is in fact Lipschitz. We provide the proof in Section D.2.

Let A1,,Aq:DIA_{1},\ldots,A_{q}:\mathcal{D}_{I}\to be a set of Lipschitz functions such that AiA_{i} is τˉi\bar{\tau}_{i}-product-Lipschitz w.r.t jiAj\prod_{j\neq i}A_{j} for all ii. Then the product iAi\prod_{i}A_{i} is 2iτˉi2\sum_{i}\bar{\tau}_{i}-Lipschitz.

Now we proceed to formalize the intuition of product-rule differentiation presented above, by showing that the individual terms in I(δ;x,y)\mathcal{I}(\delta;x,y) are product-Lipschitz with respect to the other terms. For the following three lemmas, we require the technical assumption that for any fixed choice of x,δix,\delta_{-i}, the functions fj1(x,δ)f_{j\leftarrow 1}(x,\delta), Jjj(x,δ)J_{j^{\prime}\leftarrow j^{\prime\prime}}(x,\delta) are worst-case Lipschitz in δi\delta_{i} as measured in \|\cdot\|, op\|\cdot\|_{\textup{op}}, respectively, with Lipschitz constant CC^{\prime}. Our proof of Lemma D.1, however, can easily circumvent this assumption. The proofs of the following three lemmas are given in Section D.2.

Choose i,j1,j2i,j_{1},j_{2} with j1j2j_{1}\geq j_{2}, j1>ij_{1}>i. Set product-Lipschitz constant τˉ\bar{\tau} as follows:

Given the described steps, we will now complete the proof of Lemma D.1.

Now to remove the CC^{\prime} worst-case Lipschitzness assumption, we can follow the reasoning of Claim D.6 of [Wei and Ma, 2019] to note that such Lipschitz constants exist if we restrict δi\delta_{i} to some compact set, and thus conclude the lemma statement for δi\delta_{i} restricted to this compact set. Now we simply choose this compact set sufficiently large to include both δi\delta_{i} and δi+ν\delta_{i}+\nu. ∎

D.2 Proofs for Product-Lipschitz Lemmas

As each AiA_{i} is Lipschitz and there are a finite number of functions, there exists CC^{\prime} such that any possible product Ai1Ai2AijA_{i_{1}}A_{i_{2}}\cdots A_{i_{j}} is CC^{\prime}-Lipschitz. Furthermore, by the definition of product-Lipschitz, there are c,C>0c,C>0 such that for any νc\|\nu\|\leq c, xDIx\in\mathcal{D}_{I}, and 1iq1\leq i\leq q, we have

We used the fact that j=i+1qAj(x)1\prod_{j=i+1}^{q}A_{j}(x)\leq 1. Now we have Ai(x+ν)Ai(x)jiAj(x)τˉiν+Cν2|A_{i}(x+\nu)-A_{i}(x)|\prod_{j\neq i}A_{j}(x)\leq\bar{\tau}_{i}\|\nu\|+C\|\nu\|^{2}, and Ai(x+ν)Ai(x)CνC2ν2|A_{i}(x+\nu)-A_{i}(x)|C^{\prime}\|\nu\|\leq{C^{\prime}}^{2}\|\nu\|^{2} as AiA_{i} is CC^{\prime}-Lipschitz, so

Plugging this back into (D.12) and applying triangle inequality, we get

Define the constant Cmin{c,iτˉiq(C+C2)}C^{\prime\prime}\triangleq\min\{c,\frac{\sum_{i}\bar{\tau}_{i}}{q(C+C^{\prime 2})}\}. For any xx and all ν\nu satisfying νC\|\nu\|\leq C^{\prime\prime}, we have

Now for any x,yDIx,y\in\mathcal{D}_{I}, we wish to show iAi(y)iAi(x)2xyiτˉi|\prod_{i}A_{i}(y)-\prod_{i}A_{i}(x)|\leq 2\|x-y\|\sum_{i}\bar{\tau}_{i}. To this end, we divide xyx-y into segments of length at most CC^{\prime\prime} and apply (D.13) on each segment.

Define x(j)=x+jCxy(yx)x^{(j)}=x+j\frac{C^{\prime\prime}}{\|x-y\|}(y-x) for j=1,,xy/Cj=1,\ldots,\lfloor\|x-y\|/C^{\prime\prime}\rfloor. Then as x(j)x(j1)C\|x^{(j)}-x^{(j-1)}\|\leq C^{\prime\prime}, we have iAi(x(j))iAi(x(j1))x(j)x(j1)iτˉi|\prod_{i}A_{i}(x^{(j)})-\prod_{i}A_{i}(x^{(j-1)})|\leq\|x^{(j)}-x^{(j-1)}\|\sum_{i}\bar{\tau}_{i}. Furthermore, we note that the sum of all the segment lengths equals yx\|y-x\|. Thus, we can sum this inequality over pairs (x,x(1)),,(x(xy/C),y)(x,x^{(1)}),\ldots,(x^{(\lfloor\|x-y\|/C^{\prime\prime}\rfloor)},y) and apply triangle inequality to get

We first note that Dδifj1(x,δ)D_{\delta_{i}}f_{j\leftarrow 1}(x,\delta), the partial derivative of fj1f_{j\leftarrow 1} with respect to δi\delta_{i}, is given by Jji+1(x,δ)fi11(x,δ)J_{j\leftarrow i+1}(x,\delta)\|f_{i-1\leftarrow 1}(x,\delta)\| by Claim D.7. As Jji+1(x,δ)J_{j\leftarrow i+1}(x,\delta), fi11(x,δ)\|f_{i-1\leftarrow 1}(x,\delta)\| are both worst-case Lipschitz in δi\delta_{i} with some Lipschitz constant CC^{\prime}, we can apply Claim H.4 of [Wei and Ma, 2019] to obtain:

Now by definition of A(x,δ)A(x,\delta), we get that the right hand side equals if Jji+1(x,δ)op2τji+1\|J_{j\leftarrow i+1}(x,\delta)\|_{\textup{op}}\geq 2\tau_{j\leftarrow i+1} or fi11(x,δ)2ti1\|f_{i-1\leftarrow 1}(x,\delta)\|\geq 2t_{i-1}. Thus, the right hand side must be bounded by

which gives product-Lipschitzness with constant 4τji+1ti1tj\frac{4\tau_{j\leftarrow i+1}t_{i-1}}{t_{j}}. ∎

This proof follows in an identical manner to that of Lemma D.3. The only additional step is using the fact that γ(h,y)\gamma(h,y) is 1-Lipschitz in hh, so the composition Tρ(γ(h,y))T_{\rho}(\gamma(h,y)) is ρ1\rho^{-1}-Lipschitz in hh. ∎

Let CC^{\prime} be an upper bound on the Lipschitz constant in δi\delta_{i} of all the functions Jjj(x,δ)J_{j^{\prime}\leftarrow j^{\prime\prime}}(x,\delta). As we assumed that each fjf_{j} has κj\kappa^{\prime}_{j}-Lipschitz Jacobian, such an upper bound exists. We first argue that

for some Lipschitz constant CC^{\prime\prime}. The proof of this statement is nearly identical to the proof of Claim D.3 in [Wei and Ma, 2019], so we only sketch it here and point out the differences. We rely on the expansion

which follows from the chain rule. Now we note that we can compute the change in a single term Jjj(x,δ)J_{j^{\prime}\leftarrow j^{\prime}}(x,\delta) from perturbing δi\delta_{i} as follows:

Note that when j>ij^{\prime}>i, by assumption δj=0\delta_{j^{\prime}}=0, so Dhfjj(h,δ)=Dhfj(h)D_{h}f_{j^{\prime}\leftarrow j^{\prime}}(h,\delta)=D_{h}f_{j^{\prime}}(h). Thus, as fjf_{j^{\prime}} has κj\kappa^{\prime}_{j^{\prime}}-Lipschitz derivative, we get

We obtained the last line via Claim D.7. We note that the cases when j>ij^{\prime}>i contribute to the terms under the summation in (D.14).

When j=ij^{\prime}=i, we have Dhfii(h,δ)=Dhfi(h)+Dh[δih]D_{h}f_{i\leftarrow i}(h,\delta)=D_{h}f_{i}(h)+D_{h}[\delta_{i}\|h\|] for any hh, so Dhfii(h,δi,δi)Dhfii(h,δi+ν,δi)op=Dh[νh]opν\|D_{h}f_{i\leftarrow i}(h,\delta_{i},\delta_{-i})-D_{h}f_{i\leftarrow i}(h,\delta_{i}+\nu,\delta_{-i})\|_{\textup{op}}=\|D_{h}[\nu\|h\|]\|_{\textup{op}}\leq\|\nu\|. As this holds for any hh, it follows that Jii(x,δi,δi+ν)Jii(x,δi,δi)opν\|J_{i\leftarrow i}(x,\delta_{-i},\delta_{i}+\nu)-J_{i\leftarrow i}(x,\delta_{-i},\delta_{i})\|_{\textup{op}}\leq\|\nu\|. This term results in the last quantity in (D.14). Finally, when j<ij^{\prime}<i, we have Jjj(x,δi,δi+ν)=Jjj(x,δi,δi)J_{j^{\prime}\leftarrow j^{\prime}}(x,\delta_{-i},\delta_{i}+\nu)=J_{j^{\prime}\leftarrow j^{\prime}}(x,\delta_{-i},\delta_{i}) as JjjJ_{j^{\prime}\leftarrow j^{\prime}} does not depend on δi\delta_{i}.

To see how (D.14) follows, we would apply the above bounds in a telescoping sum over indices jj^{\prime} ranging from max{j2,i}\max\{j_{2},i\} to j1j_{1}. For a more detailed derivation, refer to the steps in Claim D.3 of [Wei and Ma, 2019].

Note that if any of the bounds set by the indicators in A(x,δ)A(x,\delta) are violated, then A(x,δ)=0A(x,\delta)=0, and thus

In the other case, we have fi11(x,δ)2ti1\|f_{i-1\leftarrow 1}(x,\delta)\|\leq 2t_{i-1}, and Jjj(x,δ)op2τjj\|J_{j^{\prime}\leftarrow j^{\prime\prime}}(x,\delta)\|_{\textup{op}}\leq 2\tau_{j^{\prime}\leftarrow j^{\prime\prime}}, in which case (D.14) can be bounded by

Let A1,A2,A3:DIA_{1},A_{2},A_{3}:\mathcal{D}_{I}\to be functions where A1A_{1} is τˉ\bar{\tau}-product-Lipschitz w.r.t. A2A_{2}. Then A1A_{1} is also τˉ\bar{\tau}-product Lipschitz w.r.t. A2A3A_{2}A_{3}.

This statement follows from the definition of product-Lipschitzness and the fact that

The partial derivative of fj1f_{j\leftarrow 1} with respect to variable δi\delta_{i} evaluated at x,δx,\delta can be computed as

By definition, fj1(x,δ)=fji+1(fi1(x,δ),δ)f_{j\leftarrow 1}(x,\delta)=f_{j\leftarrow i+1}(f_{i\leftarrow 1}(x,\delta),\delta). Thus, we note that fj1(x,δ)f_{j\leftarrow 1}(x,\delta) only depends on δi\delta_{i} through fi1(x,δ)f_{i\leftarrow 1}(x,\delta), so by chain rule we have

In the second line, we invoked the definition of Jji+1(x,δ)J_{j\leftarrow i+1}(x,\delta). ∎

This is a result of the chain rule, but for completeness we state the proof here. We have

Thus, plugging in h=fi11(x,δ)h=f_{i-1\leftarrow 1}(x,\delta), we get

Now we can apply identical steps to expand Jj1i(x,δ)J_{j-1\leftarrow i}(x,\delta), giving the desired result. ∎

Appendix E Proofs for Adversarially Robust Classification

In this section, we derive the generalization bounds for adversarial classification presented in Section 4. Recall the adversarial all-layer margin mFadv(x,y)minxBadv(x)mF(x,y)m^{\textup{adv}}_{F}(x,y)\triangleq\min_{x^{\prime}\in\mathcal{B}^{\textup{adv}}(x)}m_{F}(x^{\prime},y) defined in Section 4. In this section, we will use the general definition of mFm_{F} in (A.1).

We will sketch the proof of Theorem E.1 using the same steps as those laid out in Sections 2 and A. We will rely on the following general analogue of Theorem C.1 with the exact same proof, but with κi\kappa^{\star}_{i} (defined in Section C) replaced by κiadv(x,y)maxxBadv(x)κi(x,y)\kappa^{\textup{adv}}_{i}(x,y)\triangleq\max_{x^{\prime}\in\mathcal{B}^{\textup{adv}}(x)}\kappa^{\star}_{i}(x^{\prime},y) everywhere:

Let F={fkf1:fiFi}\mathcal{F}=\{f_{k}\circ\cdots\circ f_{1}:f_{i}\in\mathcal{F}_{i}\} denote a class of compositions of functions from kk families {Fi}i=1k\{\mathcal{F}_{i}\}_{i=1}^{k}, each of which satisfies Condition A.1 with operator norm op\|\cdot\|_{\textup{op}} and complexity Cop(Fi)\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i}). For any choice of integer q>0q>0, with probability 1δ1-\delta for all FFF\in\mathcal{F} the following bound holds:

where Snadv\mathcal{S}_{n}^{\textup{adv}} denotes the subset of training examples correctly classified by FF with respect to adversarial perturbations. In particular, if FF classifies all training samples with adversarial error 0, i.e. Snadv=n|\mathcal{S}_{n}^{\textup{adv}}|=n, with probability 1δ1-\delta we have

where ζO(klogn+log(1/δ)n)\zeta\triangleq O\left(\frac{k\log n+\log(1/\delta)}{n}\right) is a low order term.

Given Theorem E.1, Theorem 4.1 follows with the same proof as the proof of Theorem 3.1 given in Section B. To prove Theorem E.1, we first have the following analogue of Lemma A.3 bounding the covering number of madvFm^{\textup{adv}}\circ\mathcal{F}.

Define madvF{(x,y)mFadv(x,y):FF}m^{\textup{adv}}\circ\mathcal{F}\triangleq\{(x,y)\mapsto m^{\textup{adv}}_{F}(x,y):F\in\mathcal{F}\}. Then

This lemma allows us to invoke Claim A.7 on a smooth loss composed with madvFm^{\textup{adv}}\circ\mathcal{F}, as we did for the clean classification setting. The lemma is proven the exact same way as Lemma A.3, given the Lipschitz-ness of mFadvm^{\textup{adv}}_{F} below:

For any x,yD0×[l]x,y\in\mathcal{D}_{0}\times[l], and function sequences F={fi}i=1k,F^={f^i}i=1kF=\{f_{i}\}_{i=1}^{k},\widehat{F}=\{\widehat{f}_{i}\}_{i=1}^{k}, we have mFadv(x,y)mF^adv(x,y)FF^|m^{\textup{adv}}_{F}(x,y)-m^{\textup{adv}}_{\widehat{F}}(x,y)|\leq{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}.

Let xBadv(x)x^{\star}\in\mathcal{B}^{\textup{adv}}(x) be such that mFadv(x,y)=mF(x,y)m^{\textup{adv}}_{F}(x,y)=m_{F}(x^{\star},y). By Claim A.4, we have

We can apply the reverse reasoning to also obtain mFadv(x,y)mF^adv(x,y)+FF^m^{\textup{adv}}_{F}(x,y)\leq m^{\textup{adv}}_{\widehat{F}}(x,y)+{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}. Combining the two gives us the desired result. ∎

Next, we lower bound mFadvm^{\textup{adv}}_{F} when each function in FF is smooth.

Appendix F Additional Experimental Details

For the experiments presented in Table 1, the other hyperparameters besides tt and ηperturb\eta_{\textup{perturb}} are set to their defaults for WideResNet architectures. Our code is inspired by the following PyTorch WideResNet implementation: https://github.com/xternalz/WideResNet-pytorch, and we use the default hyperparameters in this implementation. Although we tried a larger choice of tt, the number of updates to the perturbations δ\delta, our results did not depend much on tt.

In Table 3, we demonstrate that our AMO algorithm can also improve performance for conventional feedforward architectures such as VGG [Simonyan and Zisserman, 2014]. We report the best validation error on the clean classification setting. For both methods, we train VGG-19 for 350 epochs using weight decay 0.00050.0005 and an initial learning rate of 0.05 annealing by a factor of 0.1 at the 150-th and 250-th epochs. For the AMO algorithm, we optimize perturbations δ\delta for t=1t=1 steps and use learning rate ηperturb=0.01\eta_{\textup{perturb}}=0.01.

In Table 4, we demonstrate that our AMO algorithm can offer improvements over dropout, which is also a regularization method based on perturbing the hidden layers. This demonstrates the value of regularizing stability with respect to worst-case perturbations rather than random perturbations. For the combinations of CIFAR dataset and WideResNet architecture, we tuned the level of dropout and display the best tuned results with the dropout probability pp in Table 4. We use the same hyperparameter settings as described above.

F.2 Additional Details for Robust AMO

For both models trained with robust AMO, the perturbations δ\delta have a step size of 6.4e-3, and we shrink the norm of δ\delta by a factor of 0.92 every iteration. For the WideResNet16 model trained with robust AMO, we use a batch size of 128 with initial learning rate of 0.1 and weight decay 5e-4. For the WideResNet28 model trained using robust AMO, we use a batch size of 64 (chosen so that the δ\delta vectors fit in GPU memory) with initial learning rate of 0.707 (chosen to keep the covariance of the gradient update the same with the decreased batch size). We use weight decay 2e-4. We chose these parameters by tuning.