Spectrally-normalized margin bounds for neural networks

Peter Bartlett, Dylan J. Foster, Matus Telgarsky

Overview

Neural networks owe their astonishing success not only to their ability to fit any data set: they also generalize well, meaning they provide a close fit on unseen data. A classical statistical adage is that models capable of fitting too much will generalize poorly; what’s going on here?

Let’s navigate the many possible explanations provided by statistical theory. A first observation is that any analysis based solely on the number of possible labellings on a finite training set — as is the case with VC dimension — is doomed: if the function class can fit all possible labels (as is the case with neural networks in standard configurations (Zhang et al., 2017)), then this analysis can not distinguish it from the collection of all possible functions!

Next let’s consider scale-sensitive measures of complexity, such as Rademacher complexity and covering numbers, which (can) work directly with real-valued function classes, and moreover are sensitive to their magnitudes. Figure 1 plots the excess risk (the test error minus the training error) across training epochs against one candidate scale-sensitive complexity measure, the Lipschitz constant of the network (the product of the spectral norms of the weight matrices), and demonstrates that they are tightly correlated (which is not the case for, say, the $l_{2}$ norm of the weights). The data considered in Figure 1 is the standard cifar10 dataset, both with original and with random labels, which has been used as a sanity check when investigating neural network generalization (Zhang et al., 2017).

There is still an issue with basing a complexity measure purely on the Lipschitz constant (although it has already been successfully employed to regularize neural networks (Cisse et al., 2017)): as depicted in Figure 1, the measure grows over time, despite the excess risk plateauing. Fortunately, there is a standard resolution to this issue: investigating the margins (a precise measure of confidence) of the outputs of the network. This tool has been used to study the behavior of 2-layer networks, boosting methods, SVMs, and many others (Bartlett, 1996; Schapire et al., 1997; Boucheron et al., 2005); in boosting, for instance, there is a similar growth in complexity over time (each training iteration adds a weak learner), whereas margin bounds correctly stay flat or even decrease. This behavior is recovered here: as depicted in Figure 1, even though standard networks exhibit growing Lipschitz constants, normalizing these Lipschitz constants by the margin instead gives a decaying curve.

This work investigates a complexity measure for neural networks that is based on the Lipschitz constant, but normalized by the margin of the predictor. The two central contributions are as follows.

Theorem 1.1 below will give the rigorous statement of the generalization bound that is the basis of this work. In contrast to prior work, this bound: (a) scales with the Lipschitz constant (product of spectral norms of weight matrices) divided by the margin; (b) has no dependence on combinatorial parameters (e.g., number of layers or nodes) outside of log factors; (c) is multiclass (with no explicit dependence on the number of classes); (d) measures complexity against a reference network (e.g., for the ResNet (He et al., 2016), the reference network has identity mappings at each layer). The bound is stated below, with a general form and analysis summary appearing in Section 3 and the full details relegated to the appendix.

An empirical investigation, in Section 2, of neural network generalization on the standard datasets cifar10, cifar100, and mnist using the preceding bound. Rather than using the bound to provide a single number, it can be used to form a margin distribution as in Figure 2. These margin distributions will illuminate the following intuitive observations: (a) cifar10 is harder than mnist; (b) random labels make cifar10 and mnist much more difficult; (c) the margin distributions (and bounds) converge during training, even though the weight matrices continue to grow; (d) $l_{2}$ regularization (“weight decay”) does not significantly impact margins or generalization.

Unfortunately, margins alone do not seem to say much; see for instance Figure 2(a), where the collections of all margins for all data points — the unnormalized margin distribution — are similar for cifar10 with and without random labels. What is missing is an appropriate normalization, as in Figure 2(b). This normalization is provided by Theorem 1.1, which can now be explained in detail.

The following theorem provides a generalization bound for neural networks whose nonlinearities are fixed but whose weight matrices $\mathcal{A}$ have bounded spectral complexity $R_{\mathcal{A}}$ .

where $\widehat{\mathcal{R}}_{\gamma}(f)\leq n^{-1}\sum_{i}\mathds{1}\mathinner{\left[f(x_{i})_{y_{i}}\leq\gamma+\max_{j\neq y_{i}}f(x_{i})_{j}\right]}$ and $\|X\|_{2}=\sqrt{\sum_{i}\|x_{i}\|_{2}^{2}}$ .

The full proof and a generalization beyond spectral norms is relegated to the appendix, but a sketch is provided in Section 3, along with a lower bound. Section 3 also gives a discussion of related work: briefly, it’s essential to note that margin and Lipschitz-sensitive bounds have a long history in the neural networks literature (Bartlett, 1996; Anthony and Bartlett, 1999; Neyshabur et al., 2015); the distinction here is the sensitivity to the spectral norm, and that there is no explicit appearance of combinatorial quantities such as numbers of parameters or layers (outside of log terms, and indices to summations and products).

To close, miscellaneous observations and open problems are collected in Section 4.

Generalization case studies via margin distributions

In this section, we empirically study the generalization behavior of neural networks, via margin distributions and the generalization bound stated in Theorem 1.1.

where the spectral complexity $R_{\mathcal{A}}$ is from eq. 1.2. The normalization is thus derived from the bound in Theorem 1.1, but ignoring log terms.

Taken this way, the two margin distributions for two datasets can be interpreted as follows. Considering any fixed point on the horizontal axis, if the cumulative distribution of one density is lower than the other, then it corresponds to a lower right hand side in Theorem 1.1. For no reason other than visual interpretability, the plots here will instead depict a density estimate of the margin distribution. The vertical and horizontal axes are rescaled in different plots, but the random and true cifar10 margin distributions are always the same.

A little more detail about the experimental setup is as follows. All experiments were implemented in Keras (Chollet et al., 2015). In order to minimize conflating effects of optimization and regularization, the optimization method was vanilla SGD with step size $0.01$ , and all regularization (weight decay, batch normalization, etc.) were disabled. “cifar” in general refers to cifar10, however cifar100 will also be explicitly mentioned. The network architecture is essentially AlexNet (Krizhevsky et al., 2012) with all normalization/regularization removed, and with no adjustments of any kind (even to the learning rate) across the different experiments.

Comparing datasets. A first comparison is of cifar10 and the standard mnist digit data. mnist is considered “easy”, since any of a variety of methods can achieve roughly 1% test error. The “easiness” is corroborated by Figure 3(a), where the margin distribution for mnist places all its mass far to the right of the mass for cifar10. Interestingly, randomizing the labels of mnist, as in Figure 3(b), results in a margin distribution to the left of not only cifar10, but also slightly to the left of (but close to) cifar10 with randomized labels.

Next, Figure 3(c) compares cifar10 and cifar100, where cifar100 uses the same input images as cifar10; indeed, cifar10 is obtained from cifar100 by collapsing the original 100 categories into 10 groups. Interestingly, cifar100, from the perspective of margin bounds, is just as difficult as cifar10 with random labels. This is consistent with the large observed test error on cifar100 (which has not been “optimized” in any way via regularization).

Lastly, Figure 3(d) replaces the cifar10 input images with random images sampled from Gaussians matching the first- and second-order image statistics (see (Zhang et al., 2017) for similar experiments).

Convergence of margins. As was pointed out in Section 1, the weights of the neural networks do not seem to converge in the usual sense during training (the norms grow continually). However, as depicted in Figure 4(a), the sequence of (normalized) margin distributions is itself converging.

Regularization. As remarked in (Zhang et al., 2017), regularization only seems to bring minor benefits to test error (though adequate to be employed in all cutting edge results). This observation is certainly consistent with the margin distributions in Figure 4(b), which do not improve (e.g., by shifting to the right) in any visible way under regularization. An open question, discussed further in Section 4, is to design regularization that improves margins.

Analysis of margin bound

This section will sketch the proof of Theorem 1.1, give a lower bound, and discuss related work.

With this notation in place, the basic bound is as follows.

Then, with probability at least $1-\delta$ over a sample $S$ of size $n$ , every $f\in\mathcal{F}$ satisfies

This bound is a direct consequence of standard tools in Rademacher complexity. In order to instantiate this bound, covering numbers will be used to directly upper bound the Rademacher complexity term $\mathfrak{R}((\mathcal{F}_{\gamma})_{|S})$ . Interestingly, the choice of directly working in terms of covering numbers seems essential to providing a bound with no explicit dependence on $k$ ; by contrast, prior work primarily handles multiclass via a Rademacher complexity analysis on each coordinate of a $k$ -tuple of functions, and pays a factor of $\sqrt{k}$ (Zhang, 2004).

2 Covering number complexity upper bounds

This subsection proves Theorem 1.1 via Lemma 3.1 by controlling, via covering numbers, the Rademacher complexity $\mathfrak{R}((\mathcal{F}_{\gamma})_{|S})$ for networks with bounded spectral complexity.

The notation here for (proper) covering numbers is as follows. Let $\mathcal{N}(U,\epsilon,\|\cdot\|)$ denote the least cardinality of any subset $V\subseteq U$ that covers $U$ at scale $\epsilon$ with norm $\|\cdot\|$ , meaning

Choices of $U$ that will be used in the present work include both the image $\mathcal{F}_{|S}$ of data $S$ under some function class $\mathcal{F}$ , as well as the conceptually simpler choice of a family of matrix products.

The full proof has the following steps. (I) A matrix covering bound for the affine transformation of each layer is provided in Lemma 3.2; handling whole layers at once allows for more flexible norms. (II) An induction on layers then gives a covering number bound for entire networks; this analysis is only sketched here for the special case of norms used in Theorem 1.1, but the full proof in the appendix culminates in a bound for more general norms (cf. Lemma A.7). (III) The preceding whole-network covering number leads to Theorem 1.1 via Lemma 3.1 and standard techniques.

Step (I), matrix covering, is handled by the following lemma. The covering number considers the matrix product $XA$ , where $A$ will be instantiated as the weight matrix for a layer, and $X$ is the data passed through all layers prior to the present layer.

The proof relies upon the Maurey sparsification lemma (Pisier, 1980), which is stated in terms of sparsifying convex hulls, and in its use here is inspired by covering number bounds for linear predictors (Zhang, 2002). To prove Theorem 1.1, this matrix covering bound will be instantiated for the case of $\left\|A\right\|_{2,1}$ . It is possible to instead scale with $\|A\|_{2}$ and $\|X\|_{2}$ , but even for the case of the identity matrix $X=I$ , this incurs an extra dimension factor. The use of $\|A\|_{2,1}$ here thus helps Theorem 1.1 avoid any appearance of $W$ and $L$ outside of log terms; indeed, the goal of covering a whole matrix at a time (rather than the more standard vector covering) was to allow this greater sensitivity and avoid combinatorial parameters.

Step (II), the induction on layers, proceeds as follows. Let $X_{i}$ denote the output of layer $i$ but with images of examples of columns (thus $X_{0}=X^{\top}$ ), and inductively suppose there exists a cover element $\widehat{X}_{i}$ for $X_{i}$ which depends on covering matrices $(\widehat{A}_{1},\ldots,\widehat{A}_{i-1})$ chosen to cover weight matrices in earlier layers. Thanks to Lemma 3.2, there also exists $\widehat{A}_{i}$ so that $\|A_{i}\widehat{X}_{i}-\widehat{A}_{i}\widehat{X}_{i}\|_{2}\leq\epsilon_{i}$ . The desired cover element is thus $\widehat{X}_{i+1}=\sigma_{i}(\widehat{A}_{i}\widehat{X}_{i})$ where $\sigma_{i}$ is the nonlinearity in layer $i$ ; indeed, supposing $\sigma_{i}$ is $\rho_{i}$ -Lipschitz,

where the first term is controlled with the inductive hypothesis. Since $\widehat{X}_{i+1}$ depends on each choice $(\widehat{A}_{i},\ldots,\widehat{A}_{i})$ , the cardinality of the full network cover is the product of the individual matrix covers.

The preceding proof had no sensitivity to the particular choice of norms; it merely required an operator norm on $A_{i}$ , as well as some other norm that allows matrix covering. Such an analysis is presented in full generality in Section A.5. Specializing to the particular case of spectral norms and $(2,1)$ group norms leads to the following full-network covering bound.

where each matrix has dimension at most $W$ along each axis. Then for any $\epsilon>0$ ,

What remains is (III): Theorem 3.3 can be combined with the standard Dudley entropy integral upper bound on Rademacher complexity (see e.g. Mohri et al. (2012)), which together with Lemma 3.1 gives Theorem 1.1.

3 Rademacher complexity lower bounds

By reduction to the linear case (i.e., removing all nonlinearities), it is easy to provide a lower bound on the Rademacher complexity of the networks studied here. Unfortunately, this bound only scales with the product of spectral norms, and not the other terms in $R_{\mathcal{A}}$ (cf. eq. 1.2).

Note that, due to the nonlinearity, the lower bound should indeed depend on $\prod_{i}\|A_{i}\|_{\sigma}$ and not $\|\prod_{i}A_{i}\|_{\sigma}$ ; as a simple sanity check, there exist networks for which the latter quantity is 0, but the network does not compute the zero function.

4 Related work

To close this section on proofs, it is a good time to summarize connections to existing literature.

The algorithmic idea of large margin classifiers was introduced in the linear case by Vapnik (1982) (see also (Boser et al., 1992; Cortes and Vapnik, 1995)). Vapnik (1995) gave an intuitive explanation of the performance of these methods based on a sample-dependent VC-dimension calculation, but without generalization bounds. The first rigorous generalization bounds for large margin linear classifiers (Shawe-Taylor et al., 1998) required a scale-sensitive complexity analysis of real-valued function classes. At the same time, a large margin analysis was developed for two-layer networks (Bartlett, 1996), indeed with a proof technique that inspired the layer-wise induction used to prove Theorem 1.1 in the present work. Margin theory was quickly extended to many other settings (see for instance the survey by Boucheron et al. (2005)), one major success being an explanation of the generalization ability of boosting methods, which exhibit an explicit growth in the size of the function class over time, but a stable excess risk (Schapire et al., 1997). The contribution of the present work is to provide a margin bound (and corresponding Rademacher analysis) that can be adapted to various operator norms at each layer. Additionally, the present work operates in the multiclass setting, and avoids an explicit dependence on the number of classes $k$ , which seems to appear in prior work (Zhang, 2004; Tewari and Bartlett, 2007).

There are numerous generalization bounds for neural networks, including VC-dimension and fat-shattering bounds (many of these can be found in (Anthony and Bartlett, 1999)). Scale-sensitive analysis of neural networks started with (Bartlett, 1996), which can be interpreted in the present setting as utilizing data norm $\|\cdot\|_{\infty}$ and operator norm $\|\cdot\|_{\infty\to\infty}$ (equivalently, the norm $\|A_{i}^{\top}\|_{1,\infty}$ on weight matrix $A_{i}$ ). This analysis can be adapted to give a Rademacher complexity analysis (Bartlett and Mendelson, 2002), and has been adapted to other norms (Neyshabur et al., 2015), although the $\|\cdot\|_{\infty}$ setting appears to be necessary to avoid extra combinatorial factors. More work is still needed to develop complexity analyses that have matching upper and lower bounds, and also to determine which norms are well-adapted to neural networks as used in practice.

The present analysis utilizes covering numbers, and is most closely connected to earlier covering number bounds (Anthony and Bartlett, 1999, Chapter 12), themselves based on the earlier fat-shattering analysis (Bartlett, 1996), however the technique here of pushing an empirical cover through layers is akin to VC dimension proofs for neural networks (Anthony and Bartlett, 1999). The use of Maurey’s sparsification lemma was inspired by linear predictor covering number bounds (Zhang, 2002).

Further observations and open problems

Adversarial examples are a phenomenon where the neural network predictions can be altered by adding seemingly imperceptible noise to an input (Goodfellow et al., 2014). This phenomenon can be connected to margins as follows. The margin is nothing more than the distance an input must traverse before its label is flipped; consequently, low margin points are more susceptible to adversarial noise than high margin points. Concretely, taking the 100 lowest margin inputs from cifar10 and adding uniform noise at scale $0.15$ yielded flipped labels on 5.86% of the images, whereas the same level of noise on high margin points yielded 0.04% flipped labels. Can the bounds here suggest a way to defend against adversarial examples?

Regularization.

It was observed in (Zhang et al., 2017) that explicit regularization contributes little to the generalization performance of neural networks. In the margin framework, standard weight decay ( $l_{2}$ ) regularization seemed to have little impact on margin distributions in Section 2. On the other hand, in the boosting literature, special types of regularization were developed to maximize margins (Shalev-Shwartz and Singer, 2008); perhaps a similar development can be performed here?

SGD.

The present analysis applies to predictors that have large margins; what is missing is an analysis verifying that SGD applied to standard neural networks returns large margin predictors! Indeed, perhaps SGD returns not simply large margin predictors, but predictors that are well-behaved in a variety of other ways that can be directly translated into refined generalization bounds.

Improvements to Theorem 1.1.

There are several directions in which Theorem 1.1 might be improved. Can a better choice of layer geometries (norms) yield better bounds on practical networks? Can the nonlinearities’ worst-case Lipschitz constant be replaced with an (empirically) averaged quantity? Alternatively, can better lower bounds rule out these directions?

Rademacher vs. covering.

Is it possible to prove Theorem 1.1 solely via Rademacher complexity, with no invocation of covering numbers?

Acknowledgements

The authors thank Srinadh Bhojanapalli, Ryan Jian, Behnam Neyshabur, Maxim Raginsky, Andrew J. Risteski, and Belinda Tzen for useful conversations and feedback. The authors thank Ben Recht for giving a provocative lecture at the Simons Institute, stressing the need for understanding of both generalization and optimization of neural networks. M.T. and D.F. acknowledge the use of a GPU machine provided by Karthik Sridharan and made possible by an NVIDIA GPU grant. D.F. acknowledges the support of the NDSEG fellowship. P.B. gratefully acknowledges the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the ARC Centre of Excellence for Mathematical and Statistical Frontiers. The authors thank the Simons Institute for the Theory of Computing Spring 2017 program on the Foundations of Machine Learning. Lastly, the authors are grateful to La Burrita (both the north and the south Berkeley campus locations) for upholding the glorious tradition of the California Burrito.

References

Appendix A Proofs

This appendix collects various proofs omitted from the main text.

The standard ReLU (“Rectified Linear Unit”) is the univariate mapping

When applied to a vector or a matrix, it operates coordinate-wise. While the ReLU is currently the most popular choice of univariate nonlinearity, another common choice is the sigmoid $r\mapsto 1/(1+\exp(-r))$ . More generally, these univariate nonlinearities are Lipschitz, and this carries over to their vector and matrix forms as follows.

Define a max-pooling operator $\mathcal{P}$ as follows. Given an input and output pair of finite-dimensional vector spaces $\mathcal{T}$ and $\mathcal{T}^{\prime}$ (possibly arranged as matrices or tensors), the max-pooling operator iterates over a collection of sets of indices $\mathcal{Z}$ (whose cardinality is equal to the dimension of $\mathcal{T}$ ’), and for each element of $Z_{i}\in\mathcal{Z}$ sets the corresponding coordinate $i$ in the output to the maximum entry of the input over $Z_{i}$ : given $T\in\mathcal{T}$ ,

The following Lipschitz constant of pooling operators will depend on the number of times each coordinate is accessed across elements of $\mathcal{Z}$ ; when this operator is used in computer vision, the number of times is typically a small constant, for instance 5 or 9 (Krizhevsky et al., 2012).

Suppose that each coordinate $j$ of the input appears in at most $m$ elements of the collection $\mathcal{Z}$ . Then the max-pooling operator $\mathcal{P}$ is $m^{1/p}$ -Lipschitz wrt $\|\cdot\|_{p}$ for any $p\geq 1$ . In particular, the max-pooling operator is $1$ -Lipschitz whenever $\mathcal{Z}$ forms a partition.

Let $T,T^{\prime}\in\mathcal{T}$ be given. First consider any fixed set of indices $Z\in\mathcal{Z}$ , and suppose without loss of generality that $\mathcal{P}(T)_{Z}=\max_{j\in Z}T_{j}\geq\max_{j\in Z}T^{\prime}_{j}$ . Then

A.2 Margin properties in Section 3.1

For every $j$ and every $p\geq 1$ , $\mathcal{M}(\cdot,j)$ is 2-Lipschitz wrt $\|\cdot\|_{p}$ .

Let $v,v^{\prime},j$ be given, and suppose (without loss of generality) $\mathcal{M}(v,j)\geq\mathcal{M}(v^{\prime},j)$ . Choose coordinate $i\neq j$ so that $\mathcal{M}(v^{\prime},j)=v^{\prime}_{j}-v^{\prime}_{i}$ . Then

Next, recall the definition of the ramp loss

(These quantities are standard; see for instance (Boucheron et al., 2005; Zhang, 2004; Tewari and Bartlett, 2007).)

where the $\operatorname*{arg\,max}$ follows any deterministic tie-breaking strategy.

With these tools in place, the proof of Lemma 3.1 is straightforward.

The bound now follows by applying Lemma A.4 to the left hand side.

A.3 Dudley Entropy Integral

This section contains a slight variant of the standard Dudley entropy integral bound on the empirical Rademacher complexity (e.g. Mohri et al. (2012)), which is used in the proof of Theorem 1.1. The presentation here diverges from standard presentations because the data metric (as in eq. A.1) is not normalized by $\sqrt{n}$ . The proof itself is entirely standard however — even up to constants — and is included only for completeness.

Let $\mathcal{F}$ be a real-valued function class taking values in $ $, and assume that$ \mathbf{0}\in\mathcal{F}$. Then

and $\mathinner{\!\left\lvert V_{i}\right\rvert}=\mathcal{N}(\mathcal{F}_{|S},\varepsilon_{i},\|\cdot\|_{2})$ . For a fixed $f\in\mathcal{F}$ , let $v^{i}[f]$ denote the nearest element in $V_{{}_{i}}$ . Then

For the third term, observe that it suffices to take $V_{1}=\left\{\boldsymbol{0}\right\}$ , which implies

The first term may be handled using Cauchy-Schwarz as follows:

Last to take care of are the terms of the form

For each $i$ , let $W_{i}=\left\{v^{i}[f]-v^{i+1}[f]\mid{}f\in\mathcal{F}\right\}$ . Then $\mathinner{\!\left\lvert W_{i}\right\rvert}\leq{}\mathinner{\!\left\lvert V_{i}\right\rvert}\mathinner{\!\left\lvert V_{i+1}\right\rvert}\leq{}\mathinner{\!\left\lvert V_{i+1}\right\rvert}^{2}$ ,

With this observation, the standard Massart finite class lemma (Mohri et al., 2012) implies

Finally, select any $\alpha>0$ and take $N$ be the largest integer with $\varepsilon_{N+1}>\alpha$ . Then $\varepsilon_{N}=4\varepsilon_{N+2}<4\alpha$ , and so

A.4 Proof of matrix covering (Lemma 3.2)

First recall the Maurey sparsification lemma.

Set $\beta\mathrel{\mathop{\ordinarycolon}}=\|\alpha\|_{1}$ for convenience, and let $(W_{1},\ldots,W_{k})$ denote $k$ iid random variables where $\textup{Pr}[W_{1}=\beta V_{i}]\mathrel{\mathop{\ordinarycolon}}=\alpha_{i}/\beta$ . Define $W\mathrel{\mathop{\ordinarycolon}}=k^{-1}\sum_{i=1}^{k}W_{i}$ , whereby

To finish, by the probabilistic method, there exists integers $(j_{1},\ldots,j_{k})\in\{1,\ldots,d\}^{k}$ and an assignment $\widehat{W}_{i}\mathrel{\mathop{\ordinarycolon}}=\beta V_{j_{i}}$ and $\widehat{W}\mathrel{\mathop{\ordinarycolon}}=k^{-1}\sum_{i=1}^{k}\widehat{W}_{i}$ such that

The result now follows by defining integers $(k_{1},\ldots,k_{d})$ according to $k_{i}\mathrel{\mathop{\ordinarycolon}}=\sum_{l=1}^{k}\mathds{1}[j_{l}=i]$ . ∎

As stated, the Maurey sparsification lemma seems to only grant bounds in terms of $l_{1}$ norms. As developed by Zhang (2002) in the vector covering case, however, it is easy to handle other norms by rescaling the cover elements. With slightly more care, these proofs generalize to the matrix case, thus yielding the proof of Lemma 3.2.

where the $k_{i}$ ’s are integers. Now $p\leq 2$ combined with the definition of $V_{i}$ and $Y$ implies

It will now be shown that $\mathcal{C}$ is the desired cover. Firstly, $|\mathcal{C}|\leq N^{k}$ by construction, namely by the final equality of eq. A.2. Secondly, let $A$ with $\|A\|_{q,s}\leq a$ be given, and construct a cover element within $\mathcal{C}$ using the following technique, which follows the approach developed by Zhang (2002) for linear prediction in which the basic Maurey lemma is applied to non- $l_{1}$ balls simply by rescaling.

Define $B\mathrel{\mathop{\ordinarycolon}}=\alpha\odot A$ , whereby using conjugacy of $\|\cdot\|_{p,r}$ and $\|\cdot\|_{q,s}$ gives

where $\textup{conv}(\mathinner{\left\{V_{1},\ldots,V_{N}\right\}})$ is the convex hull of $\mathinner{\left\{V_{1},\ldots,V_{N}\right\}}$ .

Combining the preceding constructions with Lemma A.6, there exist nonnegative integers $(k_{1},\ldots,k_{N})$ with $\sum_{i}k_{i}=k$ with

The desired cover element is thus $\frac{\bar{a}}{k}\sum_{i}k_{i}V_{i}\in\mathcal{C}$ .

A.5 A whole-network covering bound for general norms

As stated in the text, the construction of a whole-network cover via induction on layers does not demand much structure from the norms placed on the weight matrices. This subsection develops this general analysis. A tantalizing direction for future work is to specialize the general bound in other ways, namely ones that are better adapted to the geometry of neural networks as encountered in practice.

The structure of the networks is the same as before; namely, given matrices $\mathcal{A}=(A_{1},\ldots,A_{L})$ , define the mapping $F_{\mathcal{A}}$ as (1.1), and more generally for $i\leq L$ define $\mathcal{A}_{1}^{i}\mathrel{\mathop{\ordinarycolon}}=(A_{1},\ldots,A_{i})$ and

with the convention $F_{\emptyset}(Z)=Z$ .

Define two sequences of vector spaces $\mathcal{V}_{1},\ldots,\mathcal{V}_{L}$ and $\mathcal{W}_{2},\ldots,\mathcal{W}_{L+1}$ , where $\mathcal{V}_{i}$ has a norm $|\cdot|_{i}$ and $\mathcal{W}_{i}$ has norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}_{i}$ .

The linear operators $A_{i}\mathrel{\mathop{\ordinarycolon}}\mathcal{V}_{i}\to\mathcal{W}_{i+1}$ are associated with some operator norm $|A_{i}|_{i\to i+1}\leq c_{i}$ :

As stated before, these linear operators $\mathcal{A}=(A_{1},\ldots,A_{L})$ vary across functions $F_{\mathcal{A}}$ . When used to prove Theorem 1.1, $Z$ is a matrix (the forward image of data matrix $X^{\top}$ across layers), and these norms are all matrix norms.

The $\rho_{i}$ -Lipschitz mappings $\sigma_{i}\mathrel{\mathop{\ordinarycolon}}\mathcal{W}_{i+1}\to\mathcal{V}_{i+1}$ have $\rho_{i}$ measured with respect to norms $|\cdot|_{i+1}$ and $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}_{i+1}$ : for any $z,z^{\prime}\in\mathcal{W}_{i+1}$ ,

These Lipschitz mappings are considered fixed within $F_{\mathcal{A}}$ . Note again that these operations, when applied to prove Theorem 1.1, operate on matrices that represent the forward images of all data points together. Lipschitz properties of the standard coordinate-wise ReLU and max-pooling operators can be found in Section A.1.

Let $(\epsilon_{1},\ldots,\epsilon_{L})$ be given, along with fixed Lipschitz mappings $(\sigma_{1},\ldots,\sigma_{L})$ (where $\sigma_{i}$ is $\rho_{i}$ -Lipschitz), and operator norm bounds $(c_{1},\ldots,c_{L})$ . Suppose the matrices $\mathcal{A}=(A_{1},\ldots,A_{L})$ lie within $\mathcal{B}_{1}\times\cdots\times\mathcal{B}_{L}$ where $\mathcal{B}_{i}$ are arbitrary classes with the property that each $A_{i}\in\mathcal{B}_{i}$ has $|A_{i}|_{i\to i+1}\leq c_{i}$ . Lastly, let data $Z$ be given with $|Z|_{1}\leq B$ . Then, letting $\tau\mathrel{\mathop{\ordinarycolon}}=\sum_{j\leq L}\epsilon_{j}\rho_{j}\prod_{l=j+1}^{L}\rho_{l}c_{l}$ , the neural net images $\mathcal{H}_{Z}\mathrel{\mathop{\ordinarycolon}}=\{F_{\mathcal{A}}(Z)\mathrel{\mathop{\ordinarycolon}}\mathcal{A}\in\mathcal{B}_{1}\times\cdots\times\mathcal{B}_{L}\}$ have covering number bound

Inductively construct covers $\mathcal{F}_{1},\ldots,\mathcal{F}_{L}$ of $\mathcal{W}_{2},\ldots,\mathcal{W}_{L+1}$ as follows.

Choose an $\epsilon_{1}$ -cover $\mathcal{F}_{1}$ of $\mathinner{\left\{A_{1}Z\mathrel{\mathop{\ordinarycolon}}A_{1}\in\mathcal{B}_{1}\right\}}$ , thus

For every element $F\in\mathcal{F}_{i}$ , construct an $\epsilon_{i+1}$ -cover $\mathcal{G}_{i+1}(F)$ of

Since the covers are proper, meaning $F=A_{i}F_{(A_{1},\ldots,A_{i-1})}(Z)$ for some matrices $(A_{1},\ldots,A_{i})\in\mathcal{B}_{1}\times\cdots\times\mathcal{B}_{i}$ , it follows that

Define $\mathcal{F}\mathrel{\mathop{\ordinarycolon}}=\mathinner{\left\{\sigma_{L}(F)\mathrel{\mathop{\ordinarycolon}}F\in\mathcal{F}_{L}\right\}}$ ; by construction, $\mathcal{F}$ satisfies the desired cardinality constraint. to show that it is indeed a cover, fix any $(A_{1},\ldots,A_{L})$ satisfying the above constraints, and for convenience define recursively the mapped elements

The goal is to exhibit $\widehat{G}_{L}\in\mathcal{F}$ satisfying $|G_{L}-\widehat{G}_{L}|_{L+1}\leq\tau$ . To this end, inductively construct approximating elements $(\widehat{F}_{i},\widehat{G}_{i})$ as follows.

Choose $\widehat{F}_{i}\in\mathcal{F}_{i}$ with $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}A_{i}\widehat{G}_{i-1}-\widehat{F}_{i}\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}_{i+1}\leq\epsilon_{i}$ , and set $\widehat{G}_{i}\mathrel{\mathop{\ordinarycolon}}=\sigma_{i}(\widehat{F}_{i})$ .

To complete the proof, it will be shown inductively that

The core of the proof rests upon inequalities which break the task of covering a layer into a cover term for the previous layer (handled by induction) and another cover term for the present layer’s weights (handled by matrix covering). These inequalities are similar to those in an existing covering number proof (Anthony and Bartlett, 1999, Chapter 12) (itself rooted in the earlier work of Bartlett (1996)); however that proof (a) operates node by node, and can not take advantage of special norms on $\mathcal{A}$ , and (b) does not maintain an empirical cover across layers, instead explicitly covering the parameters of all weight matrices, which incurs the number of parameters as a multiplicative factor. The idea here to push an empirical cover through layers, meanwhile, is reminiscent of VC dimension proofs for neural networks (Anthony and Bartlett, 1999, Chapter 8).

A.6 Proof of spectral covering bound (Theorem 3.3)

The whole-network covering bound in terms of spectral and $(2,1)$ norms now follows by the general norm covering number in Lemma A.7, and the matrix covering lemma in Lemma 3.2.

First dispense with the parenthetical statement regarding coordinate-wise ReLU and max-pooling operaters, which are Lipschitz by Lemmas A.1 and A.2. The rest of the proof is now a consequence of Lemma A.7 with all data norms set to the $l_{2}$ norm ( $|\cdot|_{i}=\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}_{i}=\|\cdot\|_{2}$ ), all operator norms set to the spectral norm ( $|\cdot|_{i\to i+1}=\|\cdot\|_{\sigma}$ ), the matrix constraint sets set to $\mathcal{B}_{i}=\mathinner{\left\{A_{i}\mathrel{\mathop{\ordinarycolon}}\|A_{i}\|_{\sigma}\leq s_{i},\|A_{i}^{\top}-M_{i}^{\top}\|_{2,1}\leq b_{i}\right\}}$ , and lastly the per-layer cover resolutions $(\epsilon_{1},\ldots,\epsilon_{L})$ set according to

By this choice, it follows that the final cover resolution $\tau$ provided by Lemma A.7 satisfies

where $(*)$ follows firstly since $l_{2}$ covering a matrix and its transpose is the same, and secondly since the cover can be translated by $F_{(A_{1},\ldots,A_{i-1})}(X^{\top})^{\top}M_{i}^{\top}$ without changing its cardinality. In order to simplify this expression, note for any $(A_{1},\ldots,A_{i-1})$ that

Combining eqs. A.3 and A.4, then expanding the choice of $\epsilon_{i}$ and collecting terms,

A.7 Proof of Theorem 1.1

As an intermediate step to Theorem 1.1, a bound is first produced which has constraints on matrix and data norms provided in advance.

What remains is to relate covering numbers and Rademacher complexity via a Dudley entropy integral; note that most presentations of this technique place $1/n$ inside the covering number norm, and thus the application here is the result of a tiny amount of massaging. Continuing with this in mind, the Dudley entropy integral bound on Rademacher complexity from Lemma A.5 grants

The $\inf$ is uniquely minimized at $\alpha\mathrel{\mathop{\ordinarycolon}}=3\sqrt{R/n}$ , but the desired bound may be obtained by the simple choice $\alpha\mathrel{\mathop{\ordinarycolon}}=1/n$ , and plugging the resulting Rademacher complexity estimate into Lemma 3.1. ∎

The proof of Theorem 1.1 now follows by instantiating Lemma A.8 for many choices of its various parameters, and applying a union bound. There are many ways to cut up this parameter space and organize the union bound; the following lemma makes one such choice, whereby Theorem 1.1 is easily proved. A slightly better bound is possible by invoking positive homogeneity of $(\sigma_{1},\ldots,\sigma_{L})$ to balance the spectral norms of the matrices $(A_{1},\ldots,A_{L})$ , however these rebalanced matrices are then used in the comparison to $(M_{1},\ldots,M_{L})$ , which is harder to interpret when $M_{i}\neq 0$ .

Given positive integers $(\vec{j},\vec{k},\vec{l})=(j_{1},j_{2},j_{3},k_{1},\ldots,k_{L},l_{1},\ldots,l_{L})$ , define a set of instances (a set of triples $(\gamma,X,\mathcal{A})$ )

Fix any $(\vec{j},\vec{k},\vec{l})$ . By Lemma A.8, with probability at least $1-\delta(\vec{j},\vec{k},\vec{l})$ , every $(\gamma,X,\mathcal{A})\in\mathcal{B}(\vec{j},\vec{k},\vec{l})$ satisfies

Since $\sum_{\vec{j},\vec{k},\vec{l}}\delta(\vec{j},\vec{k},\vec{l})=\delta$ , by a union bound, the preceding bound holds simultaneously over all $\mathcal{B}(\vec{j},\vec{k},\vec{l})$ with probability at least $1-\delta$ .

Thus, to finish the proof, discard the preceding failure event, and let an arbitrary $(\gamma,X,\mathcal{A})$ be given. Choose the smallest $(\vec{j},\vec{k},\vec{l})$ so that $(\gamma,X,\mathcal{A})\in\mathcal{B}(\vec{j},\vec{k},\vec{l})$ ; by the preceding union bound, eq. A.6 holds for this $(\vec{j},\vec{k},\vec{l})$ . The remainder of the proof will massage eq. A.6 into the form in the statement of Theorem 1.1.

As such, first consider the case $j_{1}=1$ , meaning $\gamma<2/n$ ; then

where the last expression lower bounds the right hand side of eq. A.5, thus completing the proof in the case $j_{1}=1$ . Suppose henceforth that $j_{1}\geq 2$ (and $\gamma\geq 2/n$ ).

Combining the preceding bound $j_{2}\geq 2$ with the definition of $\mathcal{B}(\vec{j},\vec{k},\vec{l})$ , the elements of $(\vec{j},\vec{k},\vec{l})$ satisfy

For the term $\heartsuit$ , the factors with $(\vec{j},\vec{k},\vec{l})$ are bounded as

For the term $\clubsuit$ , the factors with $(\vec{j},\vec{k},\vec{l})$ are bounded as

Plugging these bounds on $\heartsuit$ and $\clubsuit$ into eq. A.6 gives eq. A.5. ∎

The proof of Theorem 1.1 is now a consequence of Lemma A.9, simplifying the bound with a $\widetilde{}\mathcal{O}(\cdot)$ . Before proceeding, it is useful to pin down the asymptotic notation $\widetilde{}\mathcal{O}(\cdot)$ , as it is not completely standard in the multivariate case. The notation can be understood via the $\limsup$ view of $\mathcal{O}(\cdot)$ ; namely, $f=\widetilde{}\mathcal{O}(g)$ if there exists a constant $C$ so that any sequence $((n^{(j)},\gamma^{(j)},X^{(j)},A_{1}^{(j)},\ldots,A_{L}^{(j)}))_{j=1}^{\infty}$ with $n^{(j)}\to\infty$ , $\gamma^{(j)}\to\infty$ , $\|X^{(j)}\|_{2}\to\infty$ , $\|A_{i}^{(j)}\|_{1}\to\infty$ satisfies

Let $f=f_{0}+f_{1}+f_{2}$ denote the three excess risk terms of the upper bound from Lemma A.9, and $g=g_{1}+g_{2}$ denote the two excess risk terms of the upper bound from Theorem 1.1; as discussed above, the goal is to show that there exists a universal constant $C$ so that for any sequence of tuples $((n^{(j)},\gamma^{(j)},X^{(j)},A_{1}^{(j)},\ldots,A_{L}^{(j)}))_{j=1}^{\infty}$ increasing as above, $\limsup_{j\to\infty}f/(g\ \textup{poly\,log}(g))\leq C$ .

It is immediate that $\limsup_{j\to\infty}f_{0}/g=0$ and $\limsup_{j\to\infty}f_{1}/(g_{1}\ln(g))\leq 144$ . The only trickiness arises when studying $f_{2}/(g_{2}\ln(g))$ , namely the term $\sum_{i}\ln(2+L\|A_{i}^{\top}-M_{i}^{\top}\|_{2,1})$ , since $g_{2}$ instead has the term $\ln(\sum_{i}\|A_{i}^{\top}-M_{i}^{\top}\|_{2,1}^{2/3})$ , and the ratio of these two can scale with $L$ . A solution however is to compare to $\ln(\prod_{i}\|A_{i}\|_{\sigma})$ , noting that $\|(A_{i})^{\top}\|_{2,1}\leq W^{1/2}\|A_{i}\|_{2}\leq W\|A_{i}\|_{\sigma}$ :

A.8 Proof of lower bound (Theorem 3.4)

Define a new class $\mathcal{G}(r)=\left\{x\mapsto{}\left\langle a,x\right\rangle\mid{}\left\|w\right\|_{2}\leq{}r\right\}$ . It will be shown that $\mathcal{G}(r)\subseteq{}\mathcal{F}(C\cdot{}r)$ for some $C>0$ , whereby the result easily follows from a standard lower bound on $\mathfrak{R}(\mathcal{G}(r)_{|S})$ .

Given any linear function $x\mapsto{}\left\langle a,x\right\rangle$ with $\|a\|_{2}\leq{}r$ , construct a network $f=A_{L}\sigma_{L-1}(A_{L-1}\cdots\sigma_{2}(A_{2}\sigma_{1}(A_{1}x)))$ as follows:

$A_{1}=(\mathbf{e}_{1}-\mathbf{e}_{2})a^{\top}$ .

$A_{k}=\mathbf{e}_{1}\mathbf{e}_{1}^{\top}+\mathbf{e}_{2}\mathbf{e}_{2}^{\top}$ for each $k\in\left\{2,\ldots,L-1\right\}$ .

It is now shown that $f(x)=\left\langle a,x\right\rangle$ pointwise. First, observe $\sigma(A_{1}x)=(\sigma(\left\langle a,x\right\rangle),\sigma(-\left\langle a,x\right\rangle),0,\ldots,0)$ . Since $\sigma$ is positive homogeneous, $\sigma_{L-1}(A_{L_{1}}\cdots\sigma_{2}(A_{2}{}y)=A_{L-1}A_{L-2}\cdots A_{2}{}y=(y_{1},y_{2},0,\ldots,0)$ for any $y$ in the non-negative orthant. Because $\sigma(A_{1}x)$ lies in the non-negative orthant, this means $\sigma_{L-1}(A_{L-1}\cdots\sigma_{2}(A_{2}\sigma_{1}(A_{1}x)))=(\sigma(\left\langle a,x\right\rangle),\sigma(-\left\langle a,x\right\rangle),0,\ldots,0)$ . Finally, the choice of $A_{L}=\mathbf{e}_{1}-\mathbf{e}_{2}$ gives $f(x)=\sigma(\left\langle a,x\right\rangle)-\sigma(-\left\langle a,x\right\rangle)=\left\langle a,x\right\rangle$ .

Observe that for all $k\in\left\{2,\ldots,L-1\right\}$ , $\|A_{k}\|_{\sigma}=1$ . For the other layers, $\|A_{L}\|_{\sigma}=\|A_{L}\|_{2}=\sqrt{2}$ and $\left\|A_{1}\right\|_{\sigma}=\sqrt{2}\cdot{}r$ , which implies $f\in\mathcal{F}(2r)$ .

Finally, by the Khintchine-Kahane inequality there exists $c>0$ such that