Improved Sample Complexities for Deep Networks and Robust Classification via an All-Layer Margin

Colin Wei, Tengyu Ma

Introduction

The most popular classification objectives for deep learning, such as cross entropy loss, encourage a larger output margin – the gap between predictions on the true label and and next most confident label. These objectives have been popular long before deep learning was prevalent, and there is a long line of work showing they enjoy strong statistical guarantees for linear and kernel methods (Bartlett and Mendelson, 2002; Koltchinskii et al., 2002; Hofmann et al., 2008; Kakade et al., 2009). These guarantees have been used to explain the successes of popular algorithms such as SVM (Boser et al., 1992; Cortes and Vapnik, 1995).

For linear classifiers, the relationship between output margin and generalization is simple and direct – generalization error is controlled by the output margins normalized by the classifier norm. Concretely, suppose we have $n$ training data points each with norm 1, and let $\gamma_{i}$ be the output margin on the $i$ -th example. With high probability, if the classifier perfectly fits the training data, we obtainThis is a stronger version of the classical textbook bound which involves the min margin on the training examples. We present this stronger version because it motivates our work better. It can be derived from the results of Srebro et al. (2010).

For deeper models, the relationship between output margin and generalization is unfortunately less clear and interpretable. Known bounds for deep nets normalize the output margin by a quantity that either scales exponentially in depth or depends on complex properties of the network (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2017b; Golowich et al., 2017; Nagarajan and Kolter, 2019; Wei and Ma, 2019). This is evidently more complicated than the linear case in (1.1). These complications arise because for deep nets, it is unclear how to properly normalize the output margin. In this work, we remedy this issue by proposing a new notion of margin, called "all-layer margin", which we use to obtain simple guarantees like (1.1) for deep models. Let $m_{i}$ be the all-layer margin for the $i$ -th example. We can simply normalize it by the sum of the complexities of the weights (often measured by the norms or the covering number) and obtain a bound of the following form:

As the name suggests, the all-layer margin considers all layers of the network simultaneously, unlike the output margin which only considers the last layer. We note that the definition of the all-layer margin is the key insight for deriving (1.2) – given the definition, the rest of the proof follows naturally with standard tools. (Please see equation (2.2) for a formal definition of the margin, and Theorem 2.3 for a formal version of bound (1.2).) To further highlight the good statistical properties of the all-layer margin, we present three of its concrete applications in this paper.

1. By relating all-layer margin to output margin and other quantities, we obtain improved generalization bounds for neural nets. In Section 3, we derive an analytic lower bound on the all-layer margin for neural nets with smooth activations which depends on the output margin normalized by other data-dependent quantities. By substituting this lower bound into (1.2), we obtain a generalization bound in Theorem 3.1 which avoids the exponential depth dependency and has tighter data-dependent guarantees than (Nagarajan and Kolter, 2019; Wei and Ma, 2019) in several ways. First, their bounds use the same normalizing quantity for each example, whereas our bounds are tighter and more natural because we use a different normalizer for each training example – the local Lipschitzness for that particular example. Second, our bound depends on the empirical distribution of some complexity measure computed for each training example. When these complexities are small for each training example, we can obtain convergence rates faster than $1/\sqrt{n}$ . We provide a more thorough comparison to prior work in Section 3.

Furthermore, for relu networks, we give a tighter generalization bound which removes the dependency on inverse pre-activations suffered by (Nagarajan and Kolter, 2019), which they showed to be large empirically (see Section B.1). The techniques of (Wei and Ma, 2019) could not remove this dependency because they relied on smooth activations.

2. We extend our tools to give generalization bounds for adversarially robust classification error which are analogous to our bounds in the standard setting. In Section 4, we provide a natural extension of our all-layer margin to adversarially robust classification. This allows us to translate our neural net generalization bound, Theorem 3.1, directly to adversarially robust classification (see Theorem 4.1). The resulting bound takes a very similar form as our generalization bound for clean accuracy – it simply replaces the data-dependent quantities in Theorem 3.1 with their worst-case values in the adversarial neighborhood of the training example. As a result, it also avoids explicit exponential dependencies on depth. As our bound is the first direct analysis of the robust test error, it presents a marked improvement over existing work which analyze loose relaxations of the adversarial error (Khim and Loh, 2018; Yin et al., 2018). Finally, our analysis of generalization for the clean setting translates directly to the adversarial setting with almost no additional steps. This is an additional advantage of our all-layer margin definition.

3. We design a training algorithm that encourages a larger all-layer margin and demonstrate that it improves empirical performance over strong baselines. In Section 5, we apply our regularizer to WideResNet models (Zagoruyko and Komodakis, 2016) trained on the CIFAR datasets and demonstrate improved generalization performance for both clean and adversarially robust classification. We hope that these promising empirical results can inspire researchers to develop new methods for optimizing the all-layer margin and related quantities. Our code is available online at the following link: https://github.com/cwein3/all-layer-margin-opt.

Zhang et al. (2016); Neyshabur et al. (2017a) note that deep learning often exhibits statistical properties that are counterintuitive to conventional wisdom. This has prompted a variety of new perspectives for studying generalization in deep learning, such as implicit or algorithmic regularization (Gunasekar et al., 2017; Li et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018), new analyses of interpolating classifiers (Belkin et al., 2018; Liang and Rakhlin, 2018; Hastie et al., 2019; Bartlett et al., 2019), and the noise and stability of SGD (Hardt et al., 2015; Keskar et al., 2016; Chaudhari et al., 2016). In this work, we adopt a different perspective for analyzing generalization by studying a novel definition of margin for deep models which differs from the well-studied notion of output margin. We hope that our generalization bounds can inspire the design of new regularizers tailored towards deep learning. Classical results have bounded generalization error in terms of the model’s output margin and the complexity of its prediction (Bartlett and Mendelson, 2002; Koltchinskii et al., 2002), but for deep models this complexity grows exponentially in depth (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2017b; Golowich et al., 2017). Recently, Nagarajan and Kolter (2019); Wei and Ma (2019) derived complexity measures in terms of hidden layer and Jacobian norms which avoid the exponential dependence on depth, but their proofs require complicated techniques for controlling the complexity of the output margin. Li et al. (2018) also derive Jacobian-based generalization bounds.

Dziugaite and Roy (2017) directly optimize a PAC-Bayes bound based on distance from initialization and stability to random perturbations, obtaining a nonvacuous bound on generalization error of a neural net. Neyshabur et al. (2017a); Arora et al. (2018) provide complexity measures related to the data-dependent stability of the network, but the resulting bounds only apply to a randomized or compressed version of the original classifier. We provide a simple framework which derives such bounds for the original classifier. Long and Sedghi (2019) prove bounds for convolutional networks which scale with the number of parameters and their distance from initialization. Novak et al. (2018); Javadi et al. (2019) study stability-related complexity measures empirically.

Keskar et al. (2016); Chaudhari et al. (2016); Yao et al. (2018); Jastrzebski et al. (2018) study the relationship between generalization and “sharpness” of local minima in the context of small v.s. large batch SGD. They propose measures of sharpness related to the second derivatives of the loss with respect to the model parameters. Our all-layer margin can also be viewed as a notion of “sharpness” of the local minima around the model, as it measures the stability of the model’s prediction.

A recent line of work establishes rigorous equivalences between logistic loss and output margin maximization. Soudry et al. (2018); Ji and Telgarsky (2018) show that gradient descent implicitly maximizes the margin for linearly separable data, and Lyu and Li (2019) prove gradient descent converges to a stationary point of the max-margin formulation for deep homogeneous networks. Other works show global minimizers of regularized logistic loss are equivalent to margin maximizers, in linear cases (Rosset et al., 2004) and for deep networks (Wei et al., 2018; Nacson et al., 2019). A number of empirical works also suggest alternatives to the logistic loss which optimize variants of the output margin (Sun et al., 2014; Wen et al., 2016; Liu, ; Liang et al., 2017; Cao et al., 2019). The neural net margin at intermediate and input layers has also been studied. Elsayed et al. (2018) design an algorithm to maximize a notion of margin at intermediate layers of the network, and Jiang et al. (2018) demonstrate that the generalization gap of popular architectures can empirically be predicted using statistics of intermediate margin distributions. Verma et al. (2018) propose a regularization technique which they empirically show improves the structure of the decision boundary at intermediate layers. Yan et al. (2019) use adversarial perturbations to input to optimize a notion of margin in the input space, and Xie et al. (2019) demonstrate that using adversarial examples as data augmentation can improve clean generalization performance. Sokolić et al. (2017) provide generalization bounds based on the input margin of the neural net, but these bounds depend exponentially on the dimension of the data manifold. These papers study margins defined for individual network layers, whereas our all-layer margin simultaneously considers all layers. This distinction is crucial for deriving our statistical guarantees.

A number of recent works provide negative results for adversarially robust generalization (Tsipras et al., 2018; Montasser et al., 2019; Yin et al., 2018; Raghunathan et al., 2019). We provide positive results stating that adversarial test accuracy can be good if the adversarial all-layer margin is large on the training data. Schmidt et al. (2018) demonstrate that more data may be required for generalization on adversarial inputs than on clean data. Montasser et al. (2019) provide impossiblity results for robust PAC learning with proper learning rules, even for finite VC dimension hypothesis classes. Zhang et al. (2019) consider the trade-off between the robust error and clean error. Farnia et al. (2018) analyze generalization for specific adversarial attacks and obtain bounds depending exponentially on depth. Yin et al. (2018); Khim and Loh (2018) give adversarially robust generalization bounds by upper bounding the robust loss via a transformed/relaxed loss function, and the bounds depend on the product of weight matrix norms. Yin et al. (2018) also show that the product of norms is inevitable if we go through the standard tools of Rademacher complexity and the output margin. Our adversarial all-layer margin circumvents this lower bound because it considers all layers of the network rather than just the output.

2 Notation

Warmup: Simplified All-Layer Margin and its Guarantees

For shallow models such as linear and kernel methods, the output margin maximization objective enjoys good statistical guarantees (Kakade et al., 2009; Hofmann et al., 2008). For deep networks, the statistical properties of this objective are less clear: until recently, statistical guarantees depending on the output margin also suffered an exponential dependency on depth (Bartlett et al., 2017; Neyshabur et al., 2017b). Recent work removed these dependencies but require technically involved proofs and result in complicated bounds depending on numerous properties of the training data (Nagarajan and Kolter, 2019; Wei and Ma, 2019).

In this section, we introduce a new objective with better statistical guarantees for deep models (Theorem 2.3) and outline the steps for proving these guarantees. Our objective is based on maximizing a notion of margin which measures the stability of a classifier to simultaneous perturbations at all layers. We first rigorously define a generalized margin function as follows:

To motivate our notion of margin, we observe that the following generalization bound holds for any generalized margin function with low complexity.

For a function class $\mathcal{F}$ , let $\mathcal{G}=\{g_{F}:F\in\mathcal{F}\}$ be a class of generalized margin functions indexed by $F\in\mathcal{F}$ . Suppose that the covering number of $\mathcal{G}$ with respect to $\infty$ -norm scales as $\log\mathcal{N}_{\infty}(\epsilon,\mathcal{G})\leq\lfloor\frac{\mathcal{C}_{\infty}^{2}(\mathcal{G})}{\epsilon^{2}}\rfloor$ for some complexity $\mathcal{C}_{\infty}(\mathcal{G})$ . Then with probability $1-\delta$ over the draw of the training data, all classifiers $F\in\mathcal{F}$ which achieve training error 0 satisfy

where $\zeta\triangleq O\left(\frac{\log(1/\delta)+\log n}{n}\right)$ is a low-order term.

The proof of Lemma 2.2 mainly follows standard proofs of margin-based generalization bounds. The main novelty is in achieving a generalization bound depending on the average inverse margin instead of largest inverse margin. To obtain this bound, we apply a generalization result of Srebro et al. (2010) to a smooth surrogate loss which bounds the 0-1 loss. More details are provided in Section A.

Note that Lemma 2.2 applies to any margin function $g$ , and motivates defining margin functions with low complexity. We can define the all-layer margin as follows. Suppose that the classifier $F(x)=f_{k}\circ\cdots\circ f_{1}(x)$ is computed by composing $k$ functions $f_{k},\ldots,f_{1}$ , and let $\delta_{k},\ldots,\delta_{1}$ denote perturbations intended to be applied at each hidden layer. We recursively define the perturbed network output $F(x,\delta_{1},\ldots,\delta_{k})$ by

The all-layer margin will now be defined as the minimum norm of $\delta$ required to make the classifier misclassify the input. Formally, for classifier $F$ , input $x$ , and label $y$ , we define

Note that $m_{F}$ is strictly positive if and only if the unperturbed prediction $F(x)$ is correct, so $m_{F}$ is a margin function according to Definition 2.1. Here multiplying $\delta_{i}$ by the previous layer norm $\|h_{i-1}(x,\delta)\|_{2}$ is important and intuitively balances the relative scale of the perturbations at each layer. We note that the definition above is simplified to convey the main intuition behind our results – to obtain the tightest possible bounds, in Sections 3 and 4, we use the slightly more general $m_{F}$ defined in Section A.

Prior works have studied, both empirically and theoretically, the margin of a network with respect to single perturbations at an intermediate or input layer (Sokolić et al., 2017; Novak et al., 2018; Elsayed et al., 2018; Jiang et al., 2018). Our all-layer margin is better tailored towards handling the compositionality of deep networks because it considers simultaneous perturbations to all layers, which is crucial for achieving its statistical guarantees.

Formally, let $\mathcal{F}\triangleq\{f_{k}\circ\cdots\circ f_{1}:f_{i}\in\mathcal{F}_{i}\}$ be the class of compositions of functions from function classes $\mathcal{F}_{1},\ldots,\mathcal{F}_{k}$ . We bound the population classification error for $F\in\mathcal{F}$ based on the distribution of $m_{F}$ on the training data and the sum of the complexities of each layer, measured via covering numbers. For simplicity, we assume the covering number of each layer scales as $\log\mathcal{N}_{\|\cdot\|_{\textup{op}}}(\epsilon,\mathcal{F}_{i})\leq\lfloor\mathcal{C}_{i}^{2}/\epsilon^{2}\rfloor$ for some complexity $\mathcal{C}_{i}$ , which is common for many function classes.

In the above setting, with probability $1-\delta$ over the draw of the training data, all classifiers $F\in\mathcal{F}$ which achieve training error 0 satisfy

where $\zeta\triangleq O\left(\frac{\log(1/\delta)+\log n}{n}\right)$ is a low-order term.

In other words, generalization is controlled by the sum of the complexities of the layers and the quadratic mean of $1/m_{F}$ on the training set. Theorem A.2 generalizes this statement to provide bounds which depend on the $q$ -th moment of $1/m_{F}$ for any integer $q>0$ and converge at rates faster than $1/\sqrt{n}$ . For neural nets, $\mathcal{C}_{i}$ scales with weight matrix norms and $1/m_{F}$ can be upper bounded by a polynomial in the Jacobian and hidden layer norms and output margin, allowing us to avoid an exponential dependency on depth when these quantities are well-behaved on the training data.

The proof of Theorem 2.3 follows by showing that $m_{F}$ has low complexity which scales with the sum of the complexities at each layer, as shown in the following lemma. We can then apply the generalization bound in Lemma 2.2 to $m_{F}$ .

Let $m\circ\mathcal{F}=\{m_{F}:F\in\mathcal{F}\}$ denote the family of all-layer margins of function compositions in $\mathcal{F}$ . Then

The covering number of an individual layer commonly scales as $\log\mathcal{N}_{\|\cdot\|_{\textup{op}}}(\epsilon_{i},\mathcal{F}_{i})\leq\lfloor\mathcal{C}_{i}^{2}/\epsilon_{i}^{2}\rfloor$ . In this case, for all $\epsilon>0$ , we obtain $\log\mathcal{N}_{\infty}\left(\epsilon,m\circ\mathcal{F}\right)\leq\left\lfloor\frac{\left(\sum_{i}\mathcal{C}_{i}\right)^{2}}{\epsilon^{2}}\right\rfloor$ .

Lemma 2.4 shows that the complexity of $m_{F}$ scales linearly in depth for any choice of layers $\mathcal{F}_{i}$ . In sharp contrast, lower bounds show that the complexity of the output margin scales exponentially in depth via a product of Lipschitz constants of all the layers (Bartlett et al., 2017; Golowich et al., 2017). Our proof only relies on basic properties of $m_{F}$ , indicating that $m_{F}$ is naturally better-equipped to handle the compositionality of $\mathcal{F}$ . In particular, we prove Lemma 2.4 by leveraging a uniform Lipschitz property of $m_{F}$ . This uniform Lipschitz property does not hold for prior definitions of margin and reflects the key insight in our definition – it arises only because our margin depends on simultaneous perturbations to all layers.

For any two compositions $F=f_{k}\circ\cdots\circ f_{1}$ and $\widehat{F}=\widehat{f}_{k}\circ\cdots\circ\widehat{f}_{1}$ and any $(x,y)$ , we have $|m_{F}(x,y)-m_{\widehat{F}}(x,y)|\leq\sqrt{\sum_{i=1}^{k}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}$ .

Let $\delta^{\star}$ be the optimal choice of $\delta$ in the definition of $m_{F}(x,y)$ . We will construct $\widehat{\delta}$ such that $\|\widehat{\delta}\|_{2}\leq\|\delta^{\star}\|_{2}+\sqrt{\sum_{i}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}$ and $\gamma(\widehat{F}(x,\widehat{\delta}),y)=0$ as follows: define $\widehat{\delta}_{i}\triangleq\delta_{i}^{\star}+\Delta_{i}$ for $\Delta_{i}\triangleq\frac{f_{i}(h_{i-1}(x,\delta^{\star}))-\widehat{f}_{i}(h_{i-1}(x,\delta^{\star}))}{\|h_{i-1}(x,\delta^{\star})\|_{2}}$ , where $h$ is defined as in (2.1) with respect to the classifier $F$ . Note that by our definition of $\|\cdot\|_{\textup{op}}$ , we have $\|\Delta_{i}\|_{2}\leq\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}$ . Now it is possible to check inductively that $\widehat{F}(x,\widehat{\delta})=F(x,\delta^{\star})$ . In particular, $\widehat{\delta}$ is satisfies the misclassification constraint in the all-layer margin objective for $\widehat{F}$ . Thus, it follows that $m_{\widehat{F}}(x,y)\leq\|\widehat{\delta}\|_{2}\leq\|\delta^{\star}\|_{2}+\|\Delta\|_{2}\leq m_{F}(x,y)+\sqrt{\sum_{i}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}$ , where the last inequality followed from $\|\Delta_{i}\|_{2}\leq\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}$ . With the same reasoning,we obtain $m_{F}(x,y)\leq m_{\widehat{F}}(x,y)+\sqrt{\sum_{i}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}$ , so $|m_{F}(x,y)-m_{\widehat{F}}(x,y)|\leq\sqrt{\sum_{i}\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}^{2}}$ . ∎

Given Claim 2.5, Lemma 2.4 follows simply by composing $\epsilon_{i}$ -covers of $\mathcal{F}_{i}$ . We prove a more general version in Section A (see Lemmas A.3 and A.5.) To complete the proof of Theorem 2.3, we can apply Lemma 2.4 to bound the complexity of the family $m\circ\mathcal{F}$ , allowing us to utilize the generalization bound in Lemma 2.2.

Finally, we check that when $F$ is a linear classifier, $m_{F}$ recovers the standard output margin. Thus, we can view the all-layer margin as an extension of the output margin to deeper classifiers.

In the binary classification setting with a linear classifier $F(x)=w^{\top}x$ where the data $x$ has norm $1$ , we have $m_{F}(x,y)=\max\{0,yw^{\top}x\}=\gamma(F(x),y)$ .

For deeper models, the all-layer margin can be roughly bounded by a quantity which normalizes the output margin by Jacobian and hidden layer norms. We formalize this in Lemma 3.2 and use this to prove our main generalization bound for neural nets, Theorem 3.1.

Generalization Guarantees for Neural Networks

Although the all-layer margin is likely difficult to compute exactly, we can analytically lower bound it for neural nets with smooth activations. In this section, we obtain a generalization bound that depends on computable quantities by substituting this lower bound into Theorem 2.3. Our bound considers the Jacobian norms, hidden layer norms, and output margin on the training data, and avoids the exponential depth dependency when these quantities are well-behaved, as is the case in practice (Arora et al., 2018; Nagarajan and Kolter, 2019; Wei and Ma, 2019). Prior work (Nagarajan and Kolter, 2019; Wei and Ma, 2019) avoided the exponential depth dependency by considering these same quantities but required complicated proof frameworks. We obtain a simpler proof with tighter dependencies on these quantities by analyzing the all-layer margin. In Section B.1, we also provide a more general bound for neural networks with any activations (including relu) which only depends on the all-layer margins and weight matrix norms.

Assume that the activation $\phi$ has a $\kappa_{\phi}^{\prime}$ -Lipschitz derivative. Fix reference matrices $\{A_{(i)},B_{(i)}\}_{i=1}^{k}$ and any integer $q>0$ . With probability $1-\delta$ over the draw of the training sample $P_{n}$ , all neural nets $F$ which achieve training error 0 satisfy

where $\kappa^{\textup{NN}}_{(i)}$ captures a local Lipschitz constant of perturbations at layer $i$ and is defined by

for a secondary term $\psi_{(i)}(x,y)$ given by

Here the second and third summations above are over the choices of $j,j^{\prime}$ satisfying the constraints specified in the summation. We define $a_{(i)}$ by $a_{(i)}\triangleq\min\{\sqrt{d}\|W_{(i)}-A_{(i)}\|_{\textup{fro}},\|W_{(i)}-B_{(i)}\|_{1,1}\}\sqrt{\log d}+\textup{poly}(n^{-1})$ and $\zeta\lesssim\frac{r\log n+\log(1/\delta)+\sum_{i}\log(a_{(i)}+1)}{n}$ is a low-order term.

For example, when $q=2$ , from (3.1) we obtain the following bound which depends on the second moment of $\kappa_{(i)}^{\textup{NN}}$ and features the familiar $1/\sqrt{n}$ convergence rate in the training set size.

For larger $q$ , we obtain a faster convergence rate in $n$ , but the dependency on $\kappa^{\textup{NN}}_{(i)}$ gets larger. We will outline a proof sketch which obtains a variant of Theorem 3.1 with a slightly worse polynomial dependency on $\kappa^{\textup{NN}}_{(i)}$ and $a_{(i)}$ . For simplicity we defer the proof of the full Theorem 3.1 to Sections B and C. First, we need to slightly redefine $m_{F}$ so that perturbations are only applied at linear layers (formally, fix $\delta_{2i}=0$ for the even-indexed activation layers, and let $\delta_{(i)}\triangleq\delta_{2i-1}$ index perturbations to the $i$ -th linear layer). It is possible to check that Lemma 2.4 still holds since activation layers correspond to a singleton function class $\{\phi\}$ with log covering number 0. Thus, the conclusion of Theorem 2.3 also applies for this definition of $m_{F}$ . Now the following lemma relates this all-layer margin to the output margin and Jacobian and hidden layer norms, showing that $m_{F}(x,y)$ can be lower bounded in terms of $\{\kappa^{\textup{NN}}_{(i)}(x,y)\}$ .

In the setting above, we have the lower bound $m_{F}(x,y)\geq\frac{1}{\|\{\kappa^{\textup{NN}}_{(i)}(x,y)\}_{i=1}^{r}\|_{2}}$ .

Directly plugging the above lower bound into Theorem 2.3 and choosing $\mathcal{C}_{2i}=0$ , $\mathcal{C}_{2i-1}=a_{(i)}$ would give a variant of Theorem 3.1 that obtains a different polynomial in $\kappa^{\textup{NN}}_{(i)}$ , $a_{(i)}$ .

We compute the derivative of $F(x,\delta)$ with respect to $\delta_{(i)}$ : $D_{\delta_{(i)}}F(x,\delta)=D_{h_{2i-1}(x,\delta)}F(x,\delta)\|h_{2i-2}(x,\delta)\|_{2}$ . We abused notation to let $D_{h_{2i-1}(x,\delta)}F(x,\delta)$ denote the derivative of $F$ with respect to the $2i-1$ -th perturbed layer evaluated on input $(x,\delta)$ . By definitions of $\kappa_{j\leftarrow i}$ , $s_{(i)}$ and the fact that the output margin is 1-Lipschitz, we obtain $\|D_{\delta_{(i)}}\gamma(F(x,\delta),y)|_{\delta=0}\|_{2}\leq\|D_{h_{2i-1}(x,\delta)}F(x,0)\|_{\textup{op}}\|h_{2i-2}(x,0)\|_{2}=\kappa_{2r-1\leftarrow 2i}(x)s_{(i-1)}(x)$ . With the first order approximation $\gamma(F(x,\delta),y)\approx\gamma(F(x),y)+\sum_{i}D_{\delta_{(i)}}\gamma(F(x,\delta),y)|_{\delta=0}\delta_{(i)}$ around $\delta=0$ , we obtain

The right hand side is nonnegative whenever $\|\delta\|_{2}\leq\frac{\gamma(F(x),y)}{\|\{\kappa_{2r-1\leftarrow 2i}(x)s_{(i-1)}(x)\}_{i=1}^{r}\|_{2}}$ , which would imply that $m_{F}(x,y)\geq\frac{\gamma(F(x),y)}{\|\{\kappa_{2r-1\leftarrow 2i}(x)s_{(i-1)}(x)\}_{i=1}^{r}\|_{2}}$ . However, this conclusion is imprecise and non-rigorous because of the first order approximation – to make the argument rigorous, we also control the smoothness of the network around $x$ in terms of the interlayer Jacobians, ultimately resulting in the bound of Lemma 3.2. We remark that the quantities $\kappa^{\textup{NN}}_{(i)}$ are not the only expressions with which we could lower bound $m_{F}(x,y)$ . Rather, the role of $\kappa^{\textup{NN}}_{(i)}$ is to emphasize the key term $\frac{s_{(i-1)}(x)\kappa_{2r-1\leftarrow 2i}(x)}{\gamma(F(x),y)}$ , which measures the first order stability of the network to perturbation $\delta_{(i)}$ and relates the all-layer margin to the output margin. As highlighted above, if this term is small, $m_{F}$ will be large so long as the network is sufficiently smooth around $(x,y)$ , as captured by $\psi_{(i)}(x,y)$ .

Comparison to existing bounds

We can informally compare Theorem 3.1 to the existing bounds of (Nagarajan and Kolter, 2019; Wei and Ma, 2019) as follows. First, the leading term $\frac{s_{(i-1)}(x)\kappa_{2r-1\leftarrow 2i}(x)}{\gamma(F(x),y)}$ of $\kappa^{\textup{NN}}_{(i)}$ depends on three quantities all evaluated on the same training example, whereas the analogous quantity in the bounds of (Nagarajan and Kolter, 2019; Wei and Ma, 2019) appears as $\max_{P_{n}}\frac{1}{\gamma(F(x),y)}\cdot\max_{P_{n}}s_{(i-1)}(x)\cdot\max_{P_{n}}\kappa_{2r-1\leftarrow 2i}(x)$ , where each maximum is taken over the entire training set.

As we have $\|\kappa^{\textup{NN}}_{(i)}\|_{L_{q}(P_{n})}\leq\max_{P_{n}}\kappa^{\textup{NN}}_{(i)}(x,y)$ , the term $\|\kappa^{\textup{NN}}_{(i)}\|_{L_{q}(P_{n})}$ in our bound can be much smaller than its counterpart in the bounds of (Nagarajan and Kolter, 2019; Wei and Ma, 2019). An interpretation of the parameter $q$ is that we obtain fast (close to $n^{-1}$ ) convergence rates if the model fits every training example perfectly with large all-layer margin, or we could have slower convergence rates with better dependence on the all-layer margin distribution. It is unclear whether the techniques in other papers can achieve convergence rates faster than $O(1/\sqrt{n})$ because their proofs require the simultaneous convergence of multiple data-dependent quantities, whereas we bound everything using the single quantity $m_{F}$ .

Additionally, we compare the dependence on the weight matrix norms relative to $n$ (as the degree of $n$ in our bound can vary). For simplicitly, assume that the reference matrices $A_{(i)}$ are set to . Our dependence on the weight matrix norms relative to the training set size is, up to logarithmic factors, $\left(\frac{\min\{\sqrt{d}\|W_{(i)}\|_{\textup{fro}},\|W_{(i)}\|{1,1}\}}{\sqrt{n}}\right)^{2q/(q+2)}$ , which always matches or improves on the dependency obtained by PAC-Bayes methods such as (Neyshabur et al., 2017b; Nagarajan and Kolter, 2019). Wei and Ma (2019) obtain the dependency $\frac{\|{W_{(i)}}^{\top}\|_{2,1}}{\sqrt{n}}$ , where $\|{W_{(i)}}^{\top}\|_{2,1}$ is the sum of the $\|\cdot\|_{2}$ norms of the rows of $W_{(i)}$ . This dependency on $W_{(i)}$ is always smaller than ours. Finally, we note that Theorem 2.3 already gives tighter (but harder to compute) generalization guarantees for relu networks directly in terms of $m_{F}$ . Existing work contains a term which depends on inverse pre-activations shown to be large in practice (Nagarajan and Kolter, 2019), whereas $m_{F}$ avoids this dependency and is potentially much smaller. We explicitly state the bound in Section B.1.

Generalization Guarantees for Robust Classification

where $\kappa^{\textup{adv}}_{(i)}$ is defined by $\kappa^{\textup{adv}}_{(i)}(x,y)\triangleq\max_{x^{\prime}\in\mathcal{B}^{\textup{adv}}(x)}\kappa^{\textup{NN}}_{(i)}(x^{\prime},y)$ for $\kappa^{\textup{NN}}_{(i)}$ in (3.2), and $a_{(i)},\zeta$ are defined the same as in Theorem 3.1.

Designing regularizers for robust classification based on the bound in Theorem 4.1 is a promising direction for future work. To prove Theorem 4.1, we simply define a natural extension to our all-layer margin, and the remaining steps follow in direct analogy to the clean classification setting. We define the adversarial all-layer margin as the smallest all-layer margin on the perturbed inputs: $m^{\textup{adv}}_{F}(x,y)\triangleq\min_{x^{\prime}\in\mathcal{B}^{\textup{adv}}(x)}m_{F}(x,y)$ . We note that $m^{\textup{adv}}_{F}(x,y)$ is nonzero if and only if $F$ correctly classifies all adversarial perturbations of $x$ . Furthermore, the adversarial all-layer margin also satisfies the uniform Lipschitz property in Claim 2.5. Thus, the remainder of the proof of Theorem 4.1 follows the same steps laid out in Section 2. As before, we note that Theorem 4.1 requires $m_{F}$ to be the more general all-layer margin defined in Section A. We provide the full proofs in Section E.

Empirical Application of the All-Layer Margin

Inspired by the good statistical properties of the all-layer margin, we design an algorithm which encourages a larger all-layer margin during training. Our proposed algorithm is similar to adversarial training (Madry et al., 2017), except perturbations are applied at all hidden layers instead of the input. Xie et al. (2019) demonstrate that adversarial training can be used to improve generalization performance in image recognition. Yan et al. (2019) use adversarial perturbations to input to optimize a notion of margin in the input space.

We use Algorithm 1 to train a WideResNet architecture (Zagoruyko and Komodakis, 2016) on CIFAR10 and CIFAR100 in a variety of settings. Our implementation is available online.https://github.com/cwein3/all-layer-margin-opt For all of our experiments we use $t=1$ , $\eta_{\textup{perturb}}=0.01$ , and we apply perturbations following conv layers in the WideResNet basic blocks. In Table 1 we report the best validation error achieved during a single run of training, demonstrating that our algorithm indeed leads to improved generalization over the strong WideResNet baseline for a variety of settings. Additional parameter settings and experiments for the feedforward VGG (Simonyan and Zisserman, 2014) architecture are in Section F.1. In addition, in Section F.1, we show that dropout, another regularization method which perturbs each hidden layer, does not offer the same improvement as AMO.

Inspired by our robust generalization bound, we also apply AMO to robust classification by extending the robust training algorithm of (Madry et al., 2017). Madry et al. (2017) adversarially perturb the training input via several steps of projected gradient descent (PGD) and train using the loss computed on this perturbed input. At each update, our robust AMO algorithm initializes perturbations $\delta$ to 0 for every training example in the batch. The updates to these $\delta$ are performed simultaneously with PGD updates for the adversarial perturbations with the same update rule as Algorithm 1.

In Table 2, we report best validation performance for robust AMO and the baseline method of (Madry et al., 2017) on CIFAR10. Our results demonstrate that our robust AMO algorithm can also offer improvements in the robust accuracy. We provide parameter settings in Section F.2.

Conclusion

Many popular objectives in deep learning are based on maximizing a notion of output margin, but unfortunately it is difficult to obtain good statistical guarantees by analyzing this output margin. In this paper, we design a new all-layer margin which attains strong statistical guarantees for deep models. Our proofs for these guarantees follow very naturally from our definition of the margin. We apply the all-layer margin in several ways: 1) we obtain tighter data-dependent generalization bounds for neural nets 2) for adversarially robust classification, we directly bound the robust generalization error in terms of local Lipschitzness around the perturbed training examples, and 3) we design a new algorithm to encourage larger all-layer margins and demonstrate improved performance on real data in both clean and adversarially robust classification settings. We hope that our results prompt further study on maximizing all-layer margin as a new objective for deep learning.

Acknowledgements

CW acknowledges support from an NSF Graduate Research Fellowship. The work is also partially supported by SDSI and SAIL.

References

Appendix A Generalized All-Layer Margin and Missing Proofs for Section 2

In this section, we provide proofs for Section 2 in a more general and rigorous setting. We first formally introduce the setting, which considers functions composed of layers which map between arbitrary normed spaces.

As in Section 2, we will use $F(x,\delta)$ to denote the classifier output perturbed by $\delta$ . It will be useful to define additional notation for the perturbed function between layers $i$ and $j$ , denoted by $f_{j\leftarrow i}(h,\delta)$ , recursively as follows:

where we choose $f_{i-1\leftarrow i}(h,\delta)\triangleq h$ . Note that $F(x,\delta)\triangleq f_{k\leftarrow 1}(x,\delta)$ , and the notation $h_{i}(x,\delta)$ from Section 2 is equivalent to $f_{i\leftarrow 1}(x,\delta)$ . We will use the simplified notation $f_{j\leftarrow i}(x)\triangleq f_{j\leftarrow i}(x,0)$ when the perturbation $\delta$ is at all layers.

The norm ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ will have the following form:

This more general definition of $m_{F}$ will be useful for obtaining Theorems 3.1 and 4.1. Note that by setting $\alpha_{i}=1$ for all $i$ and $p=2$ , we recover the simpler $m_{F}$ defined in Section 2.

As before, it will be convenient for the analysis to assume that the $\epsilon$ -covering number of $\mathcal{F}_{i}$ in operator norm scales with $\epsilon^{-2}$ . We formally state this condition for general function classes and norms below:

We say that a function class $\mathcal{G}$ satisfies the $\epsilon^{-2}$ covering condition with respect to norm $\|\cdot\|$ with complexity $\mathcal{C}_{\|\cdot\|}(\mathcal{G})$ if for all $\epsilon>0$ ,

Now we provide the analogue of Theorem 2.3 for the generalized all-layer margin:

Fix any integer $q>0$ . Suppose that all layer functions $\mathcal{F}_{i}$ satisfy Condition A.1 with operator norm $\|\cdot\|_{\textup{op}}$ and complexity function $\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})$ . Let the all layer margin $m_{F}$ be defined as in (A.1). Then with probability $1-\delta$ over the draw of the training data, all classifiers $F\in\mathcal{F}$ which achieve training error 0 satisfy

where $\mathcal{C}_{{|\kern-0.75346pt|\kern-0.75346pt|\cdot|\kern-0.75346pt|\kern-0.75346pt|}}(\mathcal{F})\triangleq\left(\sum_{i}\alpha_{i}^{2p/(p+2)}\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})^{2p/(p+2)}\right)^{(p+2)/2p}$ is the complexity (in the sense of Condition A.1) for covering $\mathcal{F}$ in ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ and $\zeta\triangleq O\left(\frac{\log(1/\delta)+\log n}{n}\right)$ is a low-order term.

Note that this recovers Theorem 2.3 when $\alpha_{i}=1$ for all $i$ and $p=2$ . The proof of Theorem A.2 mirrors the plan laid out in Section 2. As before, the first step of the proof is showing that $m_{F}$ has low complexity as measured by covering numbers.

Define $m\circ\mathcal{F}\triangleq\{(x,y)\mapsto m_{F}(x,y):F\in\mathcal{F}\}$ . Then

As in Section 2, we prove Lemma A.3 by bounding the error between $m_{F}$ and $m_{\widehat{F}}$ in terms of the ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ -norm of the difference between $F$ and $\widehat{F}$ .

For any $x,y\in\mathcal{D}_{0}\times[l]$ , and function sequences $F=\{f_{i}\}_{i=1}^{k}$ , $\widehat{F}=\{\widehat{f_{i}}\}_{i=1}^{k}$ , we have $|m_{F}(x,y)-m_{\widehat{F}}(x,y)|\leq{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}$ .

Suppose that $\delta^{\star}$ optimizes equation (A.1) used to define $m_{F}(x,y)$ . Now we use the notation $h^{\star}_{i}\triangleq f_{i\leftarrow 1}(x,\delta^{\star})$ . Define $\widehat{\delta}$ as follows:

We first argue via induction that $\widehat{f}_{i\leftarrow 1}(x,\widehat{\delta})=h_{i}^{\star}$ . As the base case, we trivially have $\widehat{f}_{0\leftarrow 1}(x,\widehat{\delta})=x=h^{\star}_{0}$ .

Thus, we must have $\widehat{F}(x,\widehat{\delta})=F(x,\delta^{\star})$ , so it follows that $\gamma(\widehat{F}(x,\widehat{\delta}),y)\leq 0$ as well. Furthermore, by triangle inequality

Now we note that as $\frac{\|f_{i}(h^{\star}_{i-1})-\widehat{f_{i}}(h^{\star}_{i-1})\|}{\|h^{\star}_{i-1}\|}\leq\|f_{i}-\widehat{f}_{i}\|_{\textup{op}}$ , it follows that

Thus, using (A.2) and the definition of $m_{\widehat{F}}$ , we have

where we relied on the fact that ${|\kern-1.07639pt|\kern-1.07639pt|\delta^{\star}|\kern-1.07639pt|\kern-1.07639pt|}=m_{F}(x,y)$ . Using the same reasoning, we also obtain the inequality $m_{F}(x,y)\leq m_{\widehat{F}}(x,y)+{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}$ . Combining gives $|m_{F}(x,y)-m_{\widehat{F}}(x,y)|\leq{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}$ . ∎

As Claim A.4 holds for any choice of $(x,y)\in\mathcal{D}_{0}\times[l]$ , it follows that if $\widehat{\mathcal{F}}$ covers $\mathcal{F}$ in norm ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ , then $m\circ\widehat{\mathcal{F}}$ will be a cover for $m\circ\mathcal{F}$ in the functional $\infty$ norm. ∎

Next, we show that when $\mathcal{F}_{i}$ satisfies Condition A.1 in operator norm, the class of compositions $\mathcal{F}$ satisfies Condition A.1 with respect to norm ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ .

Suppose that each $\mathcal{F}_{i}$ satisfies Condition A.1 with norm $\|\cdot\|_{\textup{op}}$ and complexity $\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})$ . Define the complexity measure $\mathcal{C}_{{|\kern-0.75346pt|\kern-0.75346pt|\cdot|\kern-0.75346pt|\kern-0.75346pt|}}(\mathcal{F})$ by

which by definition implies that $\mathcal{F}$ satisfies Condition A.1 with norm ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ and complexity $\mathcal{C}_{{|\kern-0.75346pt|\kern-0.75346pt|\cdot|\kern-0.75346pt|\kern-0.75346pt|}}(\mathcal{F})$ .

Let $\widehat{\mathcal{F}}_{i}$ be an $\epsilon_{i}$ -cover of $\mathcal{F}_{i}$ in the operator norm $\|\cdot\|_{\textup{op}}$ . We will first show that $\widehat{\mathcal{F}}\triangleq\{\widehat{F}_{k}\circ\cdots\circ\widehat{F}_{1}:\widehat{F}_{i}\in\widehat{\mathcal{F}}_{i}\}$ is a $\|\{\alpha_{i}\epsilon_{i}\}_{i=1}^{k}\|_{p}$ -cover of $\mathcal{F}$ in ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ . To see this, for any $F=(f_{k},\ldots,f_{1})\in\mathcal{F}$ , let $\widehat{f}_{i}\in\widehat{\mathcal{F}}_{i}$ be the cover element for $f_{i}$ , and define $\widehat{F}\triangleq(\widehat{f}_{k},\ldots,\widehat{f}_{1})$ . Then we have

as desired. Furthermore, we note that $\log|\widehat{\mathcal{F}}|\leq\sum_{i}\left\lfloor\frac{\mathcal{C}^{2}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})}{\epsilon_{i}^{2}}\right\rfloor$ by Condition A.1. Now we will choose

We first verify that this gives an $\epsilon$ -cover of $\mathcal{F}$ in ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ :

Next, we check that the covering number is bounded by $\left\lfloor\frac{\mathcal{C}^{2}_{{|\kern-0.5382pt|\kern-0.5382pt|\cdot|\kern-0.5382pt|\kern-0.5382pt|}}(\mathcal{F})}{\epsilon^{2}}\right\rfloor$ :

Finally, the following lemma bounds the 0-1 test error of classifier $F\in\mathcal{F}$ in terms of the complexity of some general margin function $g_{F}$ and is a generalization of Lemma 2.2 to obtain a bound in terms of arbitrary $q$ -th moments of the inverse margin. Theorem A.2 will immediately follow by applying this lemma to the all-layer margin.

For a function class $\mathcal{F}$ , let $\mathcal{G}$ be a class of margin functions indexed by $F\in\mathcal{F}$ . Suppose that $\mathcal{G}$ satisfies Condition A.1 with respect to $\infty$ -norm, i.e. $\log\mathcal{N}_{\infty}(\epsilon,\mathcal{G})\leq\lfloor\frac{\mathcal{C}_{\infty}^{2}(\mathcal{G})}{\epsilon^{2}}\rfloor$ . Fix any integer $q>0$ . Then with probability $1-\delta$ over the draw of the training data, all classifiers $F\in\mathcal{F}$ which achieve training error 0 satisfy

where $\zeta\triangleq O\left(\frac{\log(1/\delta)+\log n}{n}\right)$ is a low-order term.

We prove Lemma A.6 in Section A.1. We now complete the proof of Theorem A.2 (and as a result, Theorem 2.3).

Note that by selecting $\mathcal{G}\triangleq m\circ\mathcal{F}$ , the conditions of Lemma A.6 are satisfied with margin family $\mathcal{G}$ with $\mathcal{C}_{\infty}(\mathcal{G})=\mathcal{C}_{{|\kern-0.75346pt|\kern-0.75346pt|\cdot|\kern-0.75346pt|\kern-0.75346pt|}}(\mathcal{F})$ defined in Lemma A.5. Plugging this value of $\mathcal{C}_{\infty}(\mathcal{G})$ into the bound in Lemma A.6 immediately gives the desired statement. ∎

The following claim is a straightforward adaption of the theorem in [Srebro et al., 2010] to covering numbers. For minor technical reasons we reprove the result below.

By applying Claim A.7 to $m\circ\mathcal{F}$ along with Lemma A.3, we can also obtain the following corollary, which will be useful in later sections.

Choosing $\beta$ to minimize the above expression would give the desired bound – however, such a post-hoc analysis cannot be performed because the optimal $\beta$ depends on the training data, and the loss class has to be fixed before the training data is drawn.

Instead, we utilize the standard technique of union bounding over a grid of $\widehat{\beta}$ in log-scale. Let $\xi\triangleq\mathcal{C}_{\infty}^{2}(\mathcal{G})\textup{poly}(n^{-1})$ denote the minimum choice of $\widehat{\beta}$ in this grid, and select in this grid all choices of $\widehat{\beta}$ in the form $\xi 2^{j}$ for $j\geq 0$ . For a given choice of $\widehat{\beta}$ , we assign it failure probability $\widehat{\delta}=\frac{\delta}{2\widehat{\beta}/\xi}$ , such that by design $\sum\widehat{\delta}=\delta$ . Thus, applying Claim A.7 for each choice of $\widehat{\beta}$ with corresponding failure probability $\widehat{\delta}$ , we note with probability $1-\delta$ ,

holds for all $\widehat{\beta}$ and $F\in\mathcal{F}$ .

Thus, with probability $1-\delta$ , for all $F\in\mathcal{F}$ , we have

It remains to observe that setting $\beta^{\star}_{F}=\Theta\left(q\left(\frac{n\left\|\frac{1}{g_{F}}\right\|_{L_{q}(P_{n})}^{q}}{\mathcal{C}_{\infty}(\mathcal{G})^{2}\log^{2}n}\right)^{2/(q+2)}\right)$ gives us the theorem statement. ∎

It suffices the complete the proof of Claim A.7. For this, we closely follow the proof of Theorem 1 in [Srebro et al., 2010], with the only difference that we replace their Rademacher complexity term with our complexity function $\mathcal{C}_{\infty}(\mathcal{G})$ . For completeness, we outline the steps here.

Now using the same steps as [Srebro et al., 2010] (which relies on applying Theorem 6.1 of [Bousquet, 2002]), we obtain for all $g\in\mathcal{G}$ , with probability $1-\delta$

where $r_{n}^{\star}$ is the largest solution of $\psi(\mu)=\mu$ . We now plug in $r_{n}^{\star}\lesssim\frac{\beta\log^{2}n\mathcal{C}_{\infty}^{2}(\mathcal{G})}{n}$ and use the fact that $\sqrt{c_{1}c_{2}}\leq(c_{1}+c_{2})/2$ for any $c_{1},c_{2}>0$ to simplify the square root term in A.7.

In the setting above, for all $\mu>0$ , we have

where $\{\sigma_{i}\}_{i=1}^{n}$ are i.i.d. Rademacher random variables.

First, by Dudley’s Theorem [Dudley, 1967], we have

We obtained the last line via change of variables to $\epsilon^{\prime}=\epsilon/\sqrt{48\beta\mu}$ . Now we substitute $\alpha=\frac{\mathcal{C}_{\infty}(\mathcal{G})\sqrt{48\beta\mu}}{\sqrt{n}}$ and note that the integrand is for $\epsilon^{\prime}>\mathcal{C}_{\infty}(\mathcal{G})$ to get

The following proposition bounds the covering number of $\mathcal{H}(\mu)$ in terms of $\mathcal{C}_{\infty}(\mathcal{G})$ .

In the setting of Claim A.7, we have the covering number bound

Appendix B Proofs for Neural Net Generalization

This section will derive the generalization bounds for neural nets in Theorem 3.1 by invoking the more general results in Section C. Theorem 3.1 applies to all neural nets, but to obtain it, we first need to bound generalization for neural nets with fixed norm bounds on their weights (this is a standard step in deriving generalization bounds). The lemma below states the analogue of Theorem 3.1, for all neural nets satisfying fixed norm bounds on their weights.

In the neural network setting, suppose that the activation $\phi$ has a $\kappa^{\prime}_{\phi}$ -Lipschitz derivative. For parameters $\{a_{(i)}\}_{i=1}^{r}$ meant to be norm constraints for the weights, define the class of neural nets with bounded weight norms with respect to reference matrices $\{A_{(i)},B_{(i)}\}$ as follows:

Then with probability $1-\delta$ , for any $q>0$ and for all $F\in\mathcal{F}$ , we have

where $\mathcal{S}_{n}$ denotes the subset of examples classified correctly by $F$ and $\kappa^{\textup{NN}}_{(i)}$ is defined as in (3.2).

We will identify the class of neural nets with matrix norm bounds $\{a_{(i)}\}_{i=1}^{r}$ with a sequence of function families

and let $\mathcal{F}\triangleq\mathcal{F}_{2r-1}\circ\cdots\circ\mathcal{F}_{1}$ denote all possible parameterizations of neural nets with norm bounds $\{a_{(i)}\}_{i=1}^{r}$ . Let $\|\cdot\|_{\textup{op}}$ be defined with respect to Euclidean norm $\|\cdot\|_{2}$ on the input and output spaces, which coincides with matrix operator norm for linear operators. We first claim that

This is because we can construct two covers: one for $\{h\mapsto Wh:\sqrt{d}\|W\|_{F}/\sqrt{\log d}\leq a_{(i)}\}$ , and one for $\{h\mapsto Wh:\|W\|_{1,1}/\sqrt{\log d}\leq a_{(i)}\}$ , each of which has log size bounded by $O(\lfloor{a_{(i)}}^{2}/\epsilon^{2}\rfloor)$ by Lemma B.3 and Claim B.5. Now we offset the first cover by the linear operator $A_{(i)}$ and the second by $B_{(i)}$ and take the union of the two, obtaining an $\epsilon$ -cover for $\mathcal{F}_{2i-1}$ in operator norm. Furthermore, $\log\mathcal{N}_{\|\cdot\|_{\textup{op}}}(\epsilon,\mathcal{F}_{2i})=0$ simply because $\mathcal{F}_{2i}$ is the singleton function.

Thus, $\mathcal{F}_{2i-1},\mathcal{F}_{2i}$ satisfy Condition A.1 with norm $\|\cdot\|_{\textup{op}}$ and complexity functions $\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{2i-1})\lesssim a_{(i)}$ and $\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{2i})=0$ , so we can apply Theorem C.1. It remains to argue that $\kappa^{\star}_{2i-1}(x,y)$ as defined for Theorem C.1 using standard Euclidean norm $\|\cdot\|_{2}$ is equivalent to $\kappa^{\textup{NN}}_{(i)}(x,y)$ defined in (3.2). To see this, we note that functions in $\mathcal{F}_{2j-1}$ have 0-Lipschitz derivative, leading those terms with a coefficient of $\kappa^{\prime}_{2j-1}$ to cancel in the definition of $\kappa_{i}^{\star}(x,y)$ . There is a 1-1 correspondence between the remaining terms of $\kappa^{\star}_{2i-1}(x,y)$ and $\kappa^{\textup{NN}}_{(i)}(x,y)$ , so we can substitute $\kappa^{\textup{NN}}_{(i)}(x,y)$ into Theorem C.1 in place of $\kappa^{\star}_{2i-1}(x,y)$ . Furthermore, as we have $\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{2i})=0$ , the corresponding terms disappear in the bound of Theorem C.1, finally giving the desired result. ∎

Now we obtain Theorem 3.1 by union bounding Lemma B.1 over choices of $\{a_{(i)}\}_{i=1}^{r}$ .

We will use the standard technique of applying Lemma B.1 over many choices of $\{a_{(i)}\}$ , and union bounding over the failure probability. Choose $\xi=\textup{poly}(n^{-1})$ and consider a grid of $\{\widehat{\alpha}_{(i)}\}$ with $\widehat{a}_{(i)}=\xi 2^{j_{i}}$ for $j_{i}\geq 1$ . We apply Lemma B.1 with for all possible norm bounds $\{\widehat{\alpha}_{(i)}\}$ in the grid, using failure probability $\widehat{\delta}=\delta/(\prod_{i}\widehat{a}_{(i)}/\xi)$ for a given choice of $\{\widehat{\alpha}_{(i)}\}$ . By union bound, with probability $1-\sum\widehat{\delta}=1-\delta$ , the bound of Lemma B.1 holds simultaneously for all choices of $\{\widehat{\alpha}_{(i)}\}$ . In particular, for the neural net $F$ with parameters $\{W_{(i)}\}$ , there is a choice of $\{\widehat{\alpha}_{(i)}\}$ satisfying

for all $i$ . The application of Lemma B.1 for this choice of $\widehat{\alpha}_{(i)}$ gives us the desired generalization bound. ∎

In the case where $\phi$ is the relu activation, we can no longer lower bound the all-layer margin $m_{F}(x,y)$ using the techniques in Section C, which rely on smoothness. However, we can still obtain a generalization bound in terms of the distribution of $1/m_{F}(x,y)$ on the training data. We can expect $1/m_{F}(x,y)$ to be small in practice because relu networks typically exhibit stability to perturbations. Prior bounds for relu nets suffer from some source of looseness: the bounds of [Bartlett et al., 2017, Neyshabur et al., 2017b] depended on the product of weight norms divided by margin, and the bounds of [Nagarajan and Kolter, 2019] depended on the inverse of the pre-activations, observed to be large in practice. Our bound avoids these dependencies, and in fact, it is possible to upper bound our dependency on $1/m_{F}(x,y)$ in terms of both these quantities.

For this setting, we choose a fixed ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ defined as follows: if $i$ corresponds to a linear layer in the network, set $\alpha_{i}=1$ , and for $i$ corresponding to activation layers, set $\alpha_{i}=\infty$ (in other words, we only allow perturbations after linear layers). We remark that we could use alternative definitions of ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ , but because we do not have a closed-form lower bound on $m_{F}$ , the tradeoff between these formulations is unclear.

In the neural network setting, suppose that $\phi$ is any activation (such as the relu function) and $m_{F}$ is defined using ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ as described above. Fix any integer $q>0$ . Then with probability $1-\delta$ , for all relu networks $F$ parameterized by weight matrices $\{W_{(i)}\}_{i=1}^{r}$ that achieve training error 0, we have

where $a_{(i)}$ is defined as in Theorem 3.1, and $\zeta\triangleq O\left(\frac{\log(1/\delta)+r\log n+\sum_{i}(a_{(i)}+1)}{n}\right)$ is a low-order term.

The proof follows via direct application of Theorem A.2 and the same arguments as Lemma B.1 relating matrix norms to covering numbers. We remark that in the case of relu networks, we can upper bound $\frac{1}{m_{F}(x,y)}$ via a quantity depending on the inverse pre-activations that mirrors the bound of Nagarajan and Kolter . However, as mentioned earlier, this is a pessimistic upper bound as Nagarajan and Kolter show that the inverse preactivations can be quite large in practice.

B.2 Matrix Covering Lemmas

In this section we present our spectral norm cover for the weight matrices, which is used in Section B to prove our neural net generalization bounds.

Let $\mathcal{M}_{\textup{fro}}(B)$ denote the set of $d_{1}\times d_{2}$ matrices with Frobenius norm bounded by $B$ , i.e.

Then letting $d\triangleq\max\{d_{1},d_{2}\}$ denote the larger dimension, for all $\epsilon>0$ , we have

The idea for this proof is that since the cover is in spectral norm, we only need to cover the top $d^{\prime}\triangleq\lfloor B^{2}/\epsilon^{2}\rfloor$ singular vectors of matrices $M\in\mathcal{M}$ .

First, it suffices to work with square matrices, as a spectral norm cover of $\max\{d_{1},d_{2}\}\times\max\{d_{1},d_{2}\}$ matrices will also yield a cover of $d_{1}\times d_{2}$ matrices in spectral norm (as we can extend a $d_{1}\times d_{2}$ matrices to a larger square matrix by adding rows or columns with all 0). Thus, letting $d\triangleq\max\{d_{1},d_{2}\}$ , we will cover $\mathcal{M}_{\textup{fro}}(B)$ defined with respect to $d\times d$ matrices.

Let $d^{\prime}\triangleq\lfloor 9B^{2}/\epsilon^{2}\rfloor$ . We first work in the case when $d^{\prime}\leq d$ . Let $\widehat{\mathcal{U}}$ be a $\epsilon_{U}$ Frobenius norm cover of $d\times d^{\prime}$ matrices with Frobenius norm bound $d^{\prime}$ . Let $\widehat{\mathcal{V}}$ be the cover of $d^{\prime}\times d$ matrices with Frobenius norm bound $B$ in Frobenius norm with resolution $\epsilon_{V}$ . We construct a cover $\widehat{\mathcal{M}}$ for $\mathcal{M}_{\textup{fro}}(B)$ as follows: take all possible combinations of matrices $\widehat{U},\widehat{V}$ from $\widehat{\mathcal{U}},\widehat{\mathcal{V}}$ , and add $\widehat{U}\widehat{V}$ to $\widehat{\mathcal{M}}$ . First note that by Claim B.4, we have

Thus, setting $\epsilon_{U}=\epsilon/3B$ , $\epsilon_{V}=\epsilon/3d^{\prime}$ , then we get a $\epsilon$ -cover of $\mathcal{M}$ with log cover size $\lfloor 9dB^{2}/\epsilon^{2}\rfloor(\log 81{d^{\prime}}^{2}B^{2}/\epsilon^{2})$ . As $d^{\prime}\leq d$ , this simplifies to $\lfloor 36dB^{2}\log(9d)/\epsilon^{2}\rfloor$ .

Now when $d^{\prime}\geq d$ , we simply take a Frobenius norm cover of $d\times d$ matrices with Frobenius norm bound $B$ , which by Claim B.4 has log size at most $d^{2}\log(3B/\epsilon)\leq\lfloor 36dB^{2}\log(9d)/\epsilon^{2}\rfloor$ , where the inequality followed because $9B^{2}/\epsilon^{2}\geq d$ .

Combining both cases, we get for all $\epsilon>0$ ,

The following claims are straightforward and follow from standard covering number bounds for $\|\cdot\|_{2}$ and $\|\cdot\|_{1}$ balls.

Let $\mathcal{M}_{\textup{fro}}(B)$ denote the class of $d_{1}\times d_{2}$ matrices with Frobenius norm bounded by $B$ . Then for $0<\epsilon<B$ , $\log\mathcal{N}_{\|\cdot\|_{\textup{fro}}}(\epsilon,\mathcal{M}_{\textup{fro}}(B))\leq d_{1}d_{2}\log(3B/\epsilon)$ .

Appendix C Generalization Bound for Smooth Function Compositions

In this section, we present the bound for general smooth function compositions used to prove Theorem 3.1.

We will work in the same general setting as Section A. Let $J_{j\leftarrow i}(x,\delta)$ denote the $i$ -to- $j$ Jacobian evaluated at $f_{i-1\leftarrow 1}(x,\delta)$ , i.e. $J_{j\leftarrow i}(x,\delta)\triangleq D_{h}f_{j\leftarrow i}(h,\delta)|_{h=f_{i-1\leftarrow 1}(x,\delta)}$ . We will additionally define general notation for hidden layer and Jacobian norms which coincides with our notation for neural nets. Let $s_{i}(x)\triangleq\|f_{i\leftarrow 1}(x)\|$ and $s_{0}(x)\triangleq\|x\|$ . As the function $Df_{j\leftarrow i}$ outputs operators mapping $\mathcal{D}_{i-1}$ to $\mathcal{D}_{j}$ , we can additionally define $\kappa_{j\leftarrow i}(x)\triangleq\|Df_{j\leftarrow i}\circ f_{i-1\leftarrow 1}(x)\|_{\textup{op}}$ , with $\kappa_{j\leftarrow j+1}(x)\triangleq 1$ .

Let $\kappa^{\prime}_{i}$ be an upper bound on the Lipschitz constant of $Df_{i\leftarrow i}$ measured in operator norm:

Now we define the value $\kappa^{\star}_{i}(x,y)$ , which can be thought of as a Lipschitz constant for perturbation $\delta_{i}$ in the definition of $m_{F}$ , as follows:

For this general setting, the following theorem implies that for any integer $q>0$ , if $F$ classifies all training examples correctly, then its error converges at a rate that scales with $n^{-q/(q+2)}$ and the products $\|\kappa^{\star}_{i}\|_{L_{q}(P_{n})}\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})$ .

Let $\mathcal{F}=\{f_{k}\circ\cdots\circ f_{1}:f_{i}\in\mathcal{F}_{i}\}$ denote a class of compositions of functions from $k$ families $\{\mathcal{F}_{i}\}_{i=1}^{k}$ , each of which satisfies Condition A.1 with operator norm $\|\cdot\|_{\textup{op}}$ and complexity $\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})$ . For any choice of integer $q>0$ , with probability $1-\delta$ for all $F\in\mathcal{F}$ the following bound holds:

where $\mathcal{S}_{n}$ denotes the subset of training examples correctly classified by $F$ and $\kappa^{\star}_{i}$ is defined in (C.1). In particular, if $F$ classifies all training samples correctly, i.e. $|\mathcal{S}_{n}|=n$ , with probability $1-\delta$ we have

where $\zeta\triangleq O\left(\frac{k\log n+\log(1/\delta)}{n}\right)$ is a low order term.

In the setting of Theorem C.1, where each layer $\mathcal{F}_{i}$ is a class of smooth functions, if $\gamma(F(x),y)>0$ , we have

We prove Lemma C.2 in Section D by formalizing the intuition outlined in Section 3. With Lemma C.2 in hand, we can prove Theorem C.1. This proof will follow the same outline as the proof of Theorem A.2. The primary difference is that we optimize over $k$ values of $\alpha_{i}$ , whereas Theorem A.2 only optimized over the smoothness $\beta$ .

Plugging this into (C.2), we get with probability $1-\delta$ for all $F\in\mathcal{F}$ ,

Thus, it suffices to upper bound $E$ . By Lemma C.2, we have $m_{F}(x,y)\geq\|\{\kappa^{\star}_{i}(x,y)/\alpha_{i}\}_{i=1}^{k}\|_{p/(p-1)}^{-1}$ for the choice of $\alpha,p$ used to define $m_{F}$ . We will set $p=q/(q-1)$ and union bound (C.3) over choices of $\alpha$ .

First, for a particular choice of $\alpha$ and $p=q/(q-1)$ , we apply our lower bound on $m_{F}(x,y)$ to simplify (C.4) as follows:

We use $\xi_{i}$ to denote the lower limit on $\alpha_{i}$ in our search over $\alpha$ , setting $\xi_{i}=\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})^{-1}\textup{poly}(k^{-1}n^{-1})$ .If $\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})=0$ , then we simply set $\alpha_{i}=\infty$ , which is equivalent to restricting the perturbations used in computing $m_{F}$ to layers where $\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})>0$ . Now we consider a grid of $\{\widehat{\alpha}_{i}\}_{i=1}^{k}$ , where $\widehat{\alpha}$ has entries of the form $\widehat{\alpha}_{i}=\xi_{i}2^{j}$ for any $j\geq 0$ . For a given choice of $\widehat{\alpha}$ , we assign it failure probability

where $\delta$ is the target failure probability after union bounding. First, note that

Therefore, with probability $1-\delta$ , we get that (C.2) holds for $m_{F}$ defined with respect to every $\widehat{\alpha}$ . In particular, with probability $1-\delta$ , for all $F\in\mathcal{F}$ and $\widehat{\alpha}$ in the grid,

where the last term was obtained by subsituting (C.6) for the failure probability.

Now we claim that there is some choice of $\widehat{\alpha}$ in the grid such that either

To see this, we first consider $\widehat{\alpha}$ in our grid such that $\widehat{\alpha}_{i}\in[\alpha^{\star}_{F,i},2\alpha^{\star}_{F,i}+\xi_{i}]$ . By construction of our grid of $\widehat{\alpha}$ , such a choice always exists. Then we have

Thus, it follows that for all $F\in\mathcal{F}$ ,

Finally, we can apply Lemma C.3 using $z_{i}=\left(\frac{(c_{2}q)^{q/2}}{n}\sum_{(x,y)\in\mathcal{S}_{n}}\kappa^{\star}_{i}(x,y)^{q}\right)^{1/q}=\left(\frac{|\mathcal{S}_{n}|}{n}\right)^{1/q}\|\kappa^{\star}_{i}\|_{L_{q}(\mathcal{S}_{n})}$ and $b_{i}=\frac{\mathcal{C}_{\|\cdot\|_{\textup{op}}}(\mathcal{F}_{i})\log n}{\sqrt{n}}$ to get

Substituting into (C.9) gives the desired bound.

For coefficients $\{z_{i}\}_{i=1}^{k},\{b_{i}\}_{i=1}^{k}>0$ and integer $q>0$ , define

with minimizer $\alpha^{\star}$ and minimum value $E^{\star}$ . Then

Choose $\{\alpha_{i}\}_{i=1}^{k}$ as follows (we obtained this by solving for $\alpha$ for which $\nabla_{\alpha}E(\alpha)=0$ ):

For this particular choice of $\alpha$ , we can compute

Finally, we note that $\left(\frac{q}{2}\right)^{2/(q+2)}+\left(\frac{q}{2}\right)^{-\frac{q}{q+2}}\leq 2$ , so we obtain

In this section, we prove Lemma C.2, which states that when the function $F$ is a composition of functions with Lipschitz derivative, we will be able to lower bound $m_{F}(x,y)$ in terms of the intermediate Jacobians and layer norms evaluated at $x$ . To prove Lemma C.2, we rely on tools developed by [Wei and Ma, 2019] which control the change in the output of a composition of functions if all the intermediate Jacobians are bounded.

We also define the ramp loss $T_{\rho}$ as follows:

for nonnegative parameters $\rho,t_{i},\tau_{j\leftarrow i}$ which we will later choose to be the margin, hidden layer norm, and Jacobian norms at the unperturbed input. Because the augmented indicator $\mathcal{I}(\delta;x,y)$ conditions on small Jacobian and hidden layer norms, it will turn out to be $\kappa^{\star}_{i}(x,y)$ -Lipschitz in the perturbation $\delta_{i}$ . Furthermore, by construction, the value of the augmented indicator $\mathcal{I}(\delta;x,y)$ will equal $1$ when $\delta=0$ , and we will also have

This immediately gives a lower bound on the perturbation level required to create a negative margin. The lemma below formally bounds the Lipschitz constant of $\mathcal{I}(\delta;x,y)$ in $\delta_{i}$ .

For nonnegative parameters $t_{i},\tau_{j\leftarrow i},\rho$ , with $\tau_{j\leftarrow j+1}=1$ for any $j$ and $\tau_{j\leftarrow j^{\prime}}=0$ for $j\leq j^{\prime}+2$ , define the function $\mathcal{I}(\delta;x,y)$ as in (D.9). Then in the setting of Lemma C.2, for a given $i\in[k]$ , for all choices of $\delta_{i}$ and $\nu$ , if $\delta_{j}=0$ for $j>i$ , we have

We prove Lemma D.1 in Section D.1. With Lemma D.1, we can formalize the proof of Lemma C.2.

Furthermore, by the definition of $\mathcal{I}(\delta;x,y)$ , we have

Finally, by our choice of the parameters used to define $\mathcal{I}(\delta;x,y)$ , we also have $\mathcal{I}(0;x,y)\geq 1$ . Combining everything with (D.11), we get

To see the core idea of the proof, consider differentiating $\mathcal{I}(\delta;x,y)$ with respect to $\delta_{i}$ (ignoring for the moment that the soft indicators are technically not differentiable). Let the terms $A_{1},\ldots,A_{q}$ represent the different indicators which the product $\mathcal{I}(\delta;x,y)$ is comprised of. Then by the product rule for differentiation, we would have

We use the following fact that the product of functions which are product-Lipschitz with respect to one another is in fact Lipschitz. We provide the proof in Section D.2.

Let $A_{1},\ldots,A_{q}:\mathcal{D}_{I}\to$ be a set of Lipschitz functions such that $A_{i}$ is $\bar{\tau}_{i}$ -product-Lipschitz w.r.t $\prod_{j\neq i}A_{j}$ for all $i$ . Then the product $\prod_{i}A_{i}$ is $2\sum_{i}\bar{\tau}_{i}$ -Lipschitz.

Now we proceed to formalize the intuition of product-rule differentiation presented above, by showing that the individual terms in $\mathcal{I}(\delta;x,y)$ are product-Lipschitz with respect to the other terms. For the following three lemmas, we require the technical assumption that for any fixed choice of $x,\delta_{-i}$ , the functions $f_{j\leftarrow 1}(x,\delta)$ , $J_{j^{\prime}\leftarrow j^{\prime\prime}}(x,\delta)$ are worst-case Lipschitz in $\delta_{i}$ as measured in $\|\cdot\|$ , $\|\cdot\|_{\textup{op}}$ , respectively, with Lipschitz constant $C^{\prime}$ . Our proof of Lemma D.1, however, can easily circumvent this assumption. The proofs of the following three lemmas are given in Section D.2.

Choose $i,j_{1},j_{2}$ with $j_{1}\geq j_{2}$ , $j_{1}>i$ . Set product-Lipschitz constant $\bar{\tau}$ as follows:

Given the described steps, we will now complete the proof of Lemma D.1.

Now to remove the $C^{\prime}$ worst-case Lipschitzness assumption, we can follow the reasoning of Claim D.6 of [Wei and Ma, 2019] to note that such Lipschitz constants exist if we restrict $\delta_{i}$ to some compact set, and thus conclude the lemma statement for $\delta_{i}$ restricted to this compact set. Now we simply choose this compact set sufficiently large to include both $\delta_{i}$ and $\delta_{i}+\nu$ . ∎

D.2 Proofs for Product-Lipschitz Lemmas

As each $A_{i}$ is Lipschitz and there are a finite number of functions, there exists $C^{\prime}$ such that any possible product $A_{i_{1}}A_{i_{2}}\cdots A_{i_{j}}$ is $C^{\prime}$ -Lipschitz. Furthermore, by the definition of product-Lipschitz, there are $c,C>0$ such that for any $\|\nu\|\leq c$ , $x\in\mathcal{D}_{I}$ , and $1\leq i\leq q$ , we have

We used the fact that $\prod_{j=i+1}^{q}A_{j}(x)\leq 1$ . Now we have $|A_{i}(x+\nu)-A_{i}(x)|\prod_{j\neq i}A_{j}(x)\leq\bar{\tau}_{i}\|\nu\|+C\|\nu\|^{2}$ , and $|A_{i}(x+\nu)-A_{i}(x)|C^{\prime}\|\nu\|\leq{C^{\prime}}^{2}\|\nu\|^{2}$ as $A_{i}$ is $C^{\prime}$ -Lipschitz, so

Plugging this back into (D.12) and applying triangle inequality, we get

Define the constant $C^{\prime\prime}\triangleq\min\{c,\frac{\sum_{i}\bar{\tau}_{i}}{q(C+C^{\prime 2})}\}$ . For any $x$ and all $\nu$ satisfying $\|\nu\|\leq C^{\prime\prime}$ , we have

Now for any $x,y\in\mathcal{D}_{I}$ , we wish to show $|\prod_{i}A_{i}(y)-\prod_{i}A_{i}(x)|\leq 2\|x-y\|\sum_{i}\bar{\tau}_{i}$ . To this end, we divide $x-y$ into segments of length at most $C^{\prime\prime}$ and apply (D.13) on each segment.

Define $x^{(j)}=x+j\frac{C^{\prime\prime}}{\|x-y\|}(y-x)$ for $j=1,\ldots,\lfloor\|x-y\|/C^{\prime\prime}\rfloor$ . Then as $\|x^{(j)}-x^{(j-1)}\|\leq C^{\prime\prime}$ , we have $|\prod_{i}A_{i}(x^{(j)})-\prod_{i}A_{i}(x^{(j-1)})|\leq\|x^{(j)}-x^{(j-1)}\|\sum_{i}\bar{\tau}_{i}$ . Furthermore, we note that the sum of all the segment lengths equals $\|y-x\|$ . Thus, we can sum this inequality over pairs $(x,x^{(1)}),\ldots,(x^{(\lfloor\|x-y\|/C^{\prime\prime}\rfloor)},y)$ and apply triangle inequality to get

We first note that $D_{\delta_{i}}f_{j\leftarrow 1}(x,\delta)$ , the partial derivative of $f_{j\leftarrow 1}$ with respect to $\delta_{i}$ , is given by $J_{j\leftarrow i+1}(x,\delta)\|f_{i-1\leftarrow 1}(x,\delta)\|$ by Claim D.7. As $J_{j\leftarrow i+1}(x,\delta)$ , $\|f_{i-1\leftarrow 1}(x,\delta)\|$ are both worst-case Lipschitz in $\delta_{i}$ with some Lipschitz constant $C^{\prime}$ , we can apply Claim H.4 of [Wei and Ma, 2019] to obtain:

Now by definition of $A(x,\delta)$ , we get that the right hand side equals if $\|J_{j\leftarrow i+1}(x,\delta)\|_{\textup{op}}\geq 2\tau_{j\leftarrow i+1}$ or $\|f_{i-1\leftarrow 1}(x,\delta)\|\geq 2t_{i-1}$ . Thus, the right hand side must be bounded by

which gives product-Lipschitzness with constant $\frac{4\tau_{j\leftarrow i+1}t_{i-1}}{t_{j}}$ . ∎

This proof follows in an identical manner to that of Lemma D.3. The only additional step is using the fact that $\gamma(h,y)$ is 1-Lipschitz in $h$ , so the composition $T_{\rho}(\gamma(h,y))$ is $\rho^{-1}$ -Lipschitz in $h$ . ∎

Let $C^{\prime}$ be an upper bound on the Lipschitz constant in $\delta_{i}$ of all the functions $J_{j^{\prime}\leftarrow j^{\prime\prime}}(x,\delta)$ . As we assumed that each $f_{j}$ has $\kappa^{\prime}_{j}$ -Lipschitz Jacobian, such an upper bound exists. We first argue that

for some Lipschitz constant $C^{\prime\prime}$ . The proof of this statement is nearly identical to the proof of Claim D.3 in [Wei and Ma, 2019], so we only sketch it here and point out the differences. We rely on the expansion

which follows from the chain rule. Now we note that we can compute the change in a single term $J_{j^{\prime}\leftarrow j^{\prime}}(x,\delta)$ from perturbing $\delta_{i}$ as follows:

Note that when $j^{\prime}>i$ , by assumption $\delta_{j^{\prime}}=0$ , so $D_{h}f_{j^{\prime}\leftarrow j^{\prime}}(h,\delta)=D_{h}f_{j^{\prime}}(h)$ . Thus, as $f_{j^{\prime}}$ has $\kappa^{\prime}_{j^{\prime}}$ -Lipschitz derivative, we get

We obtained the last line via Claim D.7. We note that the cases when $j^{\prime}>i$ contribute to the terms under the summation in (D.14).

When $j^{\prime}=i$ , we have $D_{h}f_{i\leftarrow i}(h,\delta)=D_{h}f_{i}(h)+D_{h}[\delta_{i}\|h\|]$ for any $h$ , so $\|D_{h}f_{i\leftarrow i}(h,\delta_{i},\delta_{-i})-D_{h}f_{i\leftarrow i}(h,\delta_{i}+\nu,\delta_{-i})\|_{\textup{op}}=\|D_{h}[\nu\|h\|]\|_{\textup{op}}\leq\|\nu\|$ . As this holds for any $h$ , it follows that $\|J_{i\leftarrow i}(x,\delta_{-i},\delta_{i}+\nu)-J_{i\leftarrow i}(x,\delta_{-i},\delta_{i})\|_{\textup{op}}\leq\|\nu\|$ . This term results in the last quantity in (D.14). Finally, when $j^{\prime}<i$ , we have $J_{j^{\prime}\leftarrow j^{\prime}}(x,\delta_{-i},\delta_{i}+\nu)=J_{j^{\prime}\leftarrow j^{\prime}}(x,\delta_{-i},\delta_{i})$ as $J_{j^{\prime}\leftarrow j^{\prime}}$ does not depend on $\delta_{i}$ .

To see how (D.14) follows, we would apply the above bounds in a telescoping sum over indices $j^{\prime}$ ranging from $\max\{j_{2},i\}$ to $j_{1}$ . For a more detailed derivation, refer to the steps in Claim D.3 of [Wei and Ma, 2019].

Note that if any of the bounds set by the indicators in $A(x,\delta)$ are violated, then $A(x,\delta)=0$ , and thus

In the other case, we have $\|f_{i-1\leftarrow 1}(x,\delta)\|\leq 2t_{i-1}$ , and $\|J_{j^{\prime}\leftarrow j^{\prime\prime}}(x,\delta)\|_{\textup{op}}\leq 2\tau_{j^{\prime}\leftarrow j^{\prime\prime}}$ , in which case (D.14) can be bounded by

Let $A_{1},A_{2},A_{3}:\mathcal{D}_{I}\to$ be functions where $A_{1}$ is $\bar{\tau}$ -product-Lipschitz w.r.t. $A_{2}$ . Then $A_{1}$ is also $\bar{\tau}$ -product Lipschitz w.r.t. $A_{2}A_{3}$ .

This statement follows from the definition of product-Lipschitzness and the fact that

The partial derivative of $f_{j\leftarrow 1}$ with respect to variable $\delta_{i}$ evaluated at $x,\delta$ can be computed as

By definition, $f_{j\leftarrow 1}(x,\delta)=f_{j\leftarrow i+1}(f_{i\leftarrow 1}(x,\delta),\delta)$ . Thus, we note that $f_{j\leftarrow 1}(x,\delta)$ only depends on $\delta_{i}$ through $f_{i\leftarrow 1}(x,\delta)$ , so by chain rule we have

In the second line, we invoked the definition of $J_{j\leftarrow i+1}(x,\delta)$ . ∎

This is a result of the chain rule, but for completeness we state the proof here. We have

Thus, plugging in $h=f_{i-1\leftarrow 1}(x,\delta)$ , we get

Now we can apply identical steps to expand $J_{j-1\leftarrow i}(x,\delta)$ , giving the desired result. ∎

Appendix E Proofs for Adversarially Robust Classification

In this section, we derive the generalization bounds for adversarial classification presented in Section 4. Recall the adversarial all-layer margin $m^{\textup{adv}}_{F}(x,y)\triangleq\min_{x^{\prime}\in\mathcal{B}^{\textup{adv}}(x)}m_{F}(x^{\prime},y)$ defined in Section 4. In this section, we will use the general definition of $m_{F}$ in (A.1).

We will sketch the proof of Theorem E.1 using the same steps as those laid out in Sections 2 and A. We will rely on the following general analogue of Theorem C.1 with the exact same proof, but with $\kappa^{\star}_{i}$ (defined in Section C) replaced by $\kappa^{\textup{adv}}_{i}(x,y)\triangleq\max_{x^{\prime}\in\mathcal{B}^{\textup{adv}}(x)}\kappa^{\star}_{i}(x^{\prime},y)$ everywhere:

where $\mathcal{S}_{n}^{\textup{adv}}$ denotes the subset of training examples correctly classified by $F$ with respect to adversarial perturbations. In particular, if $F$ classifies all training samples with adversarial error 0, i.e. $|\mathcal{S}_{n}^{\textup{adv}}|=n$ , with probability $1-\delta$ we have

where $\zeta\triangleq O\left(\frac{k\log n+\log(1/\delta)}{n}\right)$ is a low order term.

Given Theorem E.1, Theorem 4.1 follows with the same proof as the proof of Theorem 3.1 given in Section B. To prove Theorem E.1, we first have the following analogue of Lemma A.3 bounding the covering number of $m^{\textup{adv}}\circ\mathcal{F}$ .

Define $m^{\textup{adv}}\circ\mathcal{F}\triangleq\{(x,y)\mapsto m^{\textup{adv}}_{F}(x,y):F\in\mathcal{F}\}$ . Then

This lemma allows us to invoke Claim A.7 on a smooth loss composed with $m^{\textup{adv}}\circ\mathcal{F}$ , as we did for the clean classification setting. The lemma is proven the exact same way as Lemma A.3, given the Lipschitz-ness of $m^{\textup{adv}}_{F}$ below:

For any $x,y\in\mathcal{D}_{0}\times[l]$ , and function sequences $F=\{f_{i}\}_{i=1}^{k},\widehat{F}=\{\widehat{f}_{i}\}_{i=1}^{k}$ , we have $|m^{\textup{adv}}_{F}(x,y)-m^{\textup{adv}}_{\widehat{F}}(x,y)|\leq{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}$ .

Let $x^{\star}\in\mathcal{B}^{\textup{adv}}(x)$ be such that $m^{\textup{adv}}_{F}(x,y)=m_{F}(x^{\star},y)$ . By Claim A.4, we have

We can apply the reverse reasoning to also obtain $m^{\textup{adv}}_{F}(x,y)\leq m^{\textup{adv}}_{\widehat{F}}(x,y)+{|\kern-1.07639pt|\kern-1.07639pt|F-\widehat{F}|\kern-1.07639pt|\kern-1.07639pt|}$ . Combining the two gives us the desired result. ∎

Next, we lower bound $m^{\textup{adv}}_{F}$ when each function in $F$ is smooth.

Appendix F Additional Experimental Details

For the experiments presented in Table 1, the other hyperparameters besides $t$ and $\eta_{\textup{perturb}}$ are set to their defaults for WideResNet architectures. Our code is inspired by the following PyTorch WideResNet implementation: https://github.com/xternalz/WideResNet-pytorch, and we use the default hyperparameters in this implementation. Although we tried a larger choice of $t$ , the number of updates to the perturbations $\delta$ , our results did not depend much on $t$ .

In Table 3, we demonstrate that our AMO algorithm can also improve performance for conventional feedforward architectures such as VGG [Simonyan and Zisserman, 2014]. We report the best validation error on the clean classification setting. For both methods, we train VGG-19 for 350 epochs using weight decay $0.0005$ and an initial learning rate of 0.05 annealing by a factor of 0.1 at the 150-th and 250-th epochs. For the AMO algorithm, we optimize perturbations $\delta$ for $t=1$ steps and use learning rate $\eta_{\textup{perturb}}=0.01$ .

In Table 4, we demonstrate that our AMO algorithm can offer improvements over dropout, which is also a regularization method based on perturbing the hidden layers. This demonstrates the value of regularizing stability with respect to worst-case perturbations rather than random perturbations. For the combinations of CIFAR dataset and WideResNet architecture, we tuned the level of dropout and display the best tuned results with the dropout probability $p$ in Table 4. We use the same hyperparameter settings as described above.

F.2 Additional Details for Robust AMO

For both models trained with robust AMO, the perturbations $\delta$ have a step size of 6.4e-3, and we shrink the norm of $\delta$ by a factor of 0.92 every iteration. For the WideResNet16 model trained with robust AMO, we use a batch size of 128 with initial learning rate of 0.1 and weight decay 5e-4. For the WideResNet28 model trained using robust AMO, we use a batch size of 64 (chosen so that the $\delta$ vectors fit in GPU memory) with initial learning rate of 0.707 (chosen to keep the covariance of the gradient update the same with the decreased batch size). We use weight decay 2e-4. We chose these parameters by tuning.