Size-Independent Sample Complexity of Neural Networks

Noah Golowich, Alexander Rakhlin, Ohad Shamir

Introduction

One of the major challenges involving neural networks is explaining their ability to generalize well, even if they are very large and have the potential to overfit the training data (Neyshabur et al., 2014; Zhang et al., 2016). Learning theory teaches us that this must be due to some inductive bias, which constrains one to learn networks of specific configurations (either explicitly, e.g., via regularization, or implicitly, via the algorithm used to train them). However, understanding the nature of this inductive bias is still largely an open problem.

A useful starting point is to consider the much more restricted class of linear predictors ( $\mathbf{x}\mapsto\mathbf{w}^{\top}\mathbf{x}$ ). For this class, we have a very good understanding of how its generalization behavior is dictated by the norm of $\mathbf{w}$ . In particular, assuming that $\|\mathbf{w}\|\leq M$ (where $\|\cdot\|$ signifies Euclidean norm), and the distribution is such that $\|\mathbf{x}\|\leq B$ almost surely, it is well-known that the generalization error (w.r.t. Lipschitz losses) given $m$ training examples scales as $\mathcal{O}(MB/\sqrt{m})$ , completely independent of the dimension of $\mathbf{w}$ . Thus, it is very natural to ask whether in the more general case of neural networks, one can obtain similar “size-independent” results (independent of the networks’ depth and width), under appropriate norm constraints on the parameters. This is also a natural question, considering that the size of modern neural networks is often much larger than the number of training examples.

Classical results on the sample complexity of neural networks do not satisfy this desideratum, and have a strong explicit dependence on the network size. For example, bounds relying on the VC dimension (see Anthony and Bartlett (2009)) strongly depend on both the depth and the width of the network, and can be trivial once the number of parameters exceeds the number of training examples. Scale-sensitive bounds, which rely on the magnitude of the network parameters, can alleviate the dependence on the width (see Bartlett (1998); Anthony and Bartlett (2009)). However, most such bounds in the literature have a strong (often exponential) dependence on the network depth. To give one recent example, Neyshabur et al. (2015) use Rademacher complexity tools to show that if the parameter matrices $W_{1},\ldots,W_{d}$ in each of the $d$ layers have Frobenius norms $\|\cdot\|_{F}$ upper-bounded by $M_{F}(1),\ldots,M_{F}(d)$ respectively, and under suitable assumptions on the activation functions, the generalization error scales (with high probability) as

Although this bound has no explicit dependence on the network width (that is, the dimensions of $W_{1},\ldots,W_{d}$ ), it has a very strong, exponential dependence on the network depth $d$ , even if $M_{F}(j)\leq 1$ for all $j$ . Neyshabur et al. (2015) also showed that this dependence can sometimes be avoided for anti-symmetric activations, but unfortunately this is a non-trivial assumption, which is not satisfied for common activations such as the ReLU. Bartlett et al. (2017) use a covering numbers argument to show a bound scaling as

where $\|W\|$ denotes the spectral norm of $W$ , $\|W^{T}\|_{2,1}:=\sum_{l}\sqrt{\sum_{k}W(l,k)^{2}}$ denotes the $1$ -norm of the 2-norms of the rows of $W$ , and where we ignore factors logarithmic in $m$ and the network width. Unlike Eq. (1), here there is no explicit exponential dependence on the depth. However, there is still a strong and unavoidable polynomial dependence: To see this, note that for any $W_{j}$ , $\frac{\sum_{l}\sqrt{\sum_{k}W_{j}(l,k)^{2}}}{\|W_{j}\|}\geq\frac{\|W_{j}\|_{F}}{\|W_{j}\|}\geq 1$ , so the bound above can never be smaller than

In particular, even if we assume that $B\left(\prod_{j=1}^{d}\|W_{j}\|\right)$ is a constant, the bound becomes trivial once $d\geq\Omega(m^{1/3})$ . Finally, and using the same notation, Neyshabur et al. (2017) utilize a PAC-Bayesian analysis to prove a bound scaling as

Can this depth dependency be avoided, assuming the norms are sufficiently constrained? We argue that in some cases, it must be true. To see this, let us return to the well-understood case of linear predictors, and consider generalized linear predictors of the form

where $\sigma(z)=\max\{0,z\}$ is the ReLU function. Like plain-vanilla linear predictors, the generalization error of this class is well-known to be $\mathcal{O}(MB/\sqrt{m})$ , assuming the inputs satisfy $\|\mathbf{x}\|\leq B$ almost surely. However, it is not difficult to show that this class can be equivalently written as a class of “ultra-thin” ReLU networks of the form

(where $\mathbf{w}_{1}$ is a vector and $w_{2},\ldots,w_{d}$ are scalars), where the depth $d$ is arbitrary. Therefore, the sample complexity of this class must also scale as $\mathcal{O}(MB/\sqrt{m})$ : This depends on the norm product $M$ , but is completely independent of the network depth $d$ as well as the dimension of $\mathbf{w}_{1}$ . We argue that a “satisfactory” sample complexity analysis should have similar independence properties when applied on this class.

In more general neural networks, the vector $\mathbf{w}_{1}$ and scalars $w_{2},\ldots,w_{d}$ become matrices $W_{1},\ldots,W_{d}$ , and the simple observation above no longer applies. However, using the same intuition, it is natural to try and derive generalization bounds by controlling $\prod_{j=1}^{d}\|W_{j}\|$ , where $\|\cdot\|$ is a suitable matrix norm. Perhaps the simplest choice is the spectral norm (and indeed, a product of spectral norms was utilized in some of the previous results mentioned earlier). However, as we formally show in Sec. 5, the spectral norm alone is too weak to get size-independent bounds, even if the network depth is small. Instead, we show that controlling other suitable norms can indeed lead to better depth dependence, or even fully size-independent bounds, improving on earlier works. Specifically, we make the following contributions:

In Sec. 3, we show that the exponential depth dependence in Rademacher complexity-based analysis (e.g. Neyshabur et al. (2015)) can be avoided by applying contraction to a slightly different object than what has become standard since the work of Bartlett and Mendelson (2002). For example, for networks with parameter matrices of Frobenius norm at most $M_{F}(1),\ldots,M_{F}(d)$ , the bound in Eq. (1) can be improved to

where $n$ is the input dimension. Again, the dependence on $d$ is polynomial and quite mild. In contrast, Neyshabur et al. (2015) studied a similar setup and only managed to obtain an exponential dependence on $d$ .

In Sec. 4, we develop a generic technique to convert depth-dependent bounds to depth-independent bounds, assuming some control over any Schatten norm of the parameter matrices (which includes, for instance, the Frobenius norm and the trace norm as special cases). The key observation we utilize is that the prediction function computed by such networks can be approximated by the composition of a shallow network and univariate Lipschitz functions. For example, again assuming that the Frobenius norms of the layers are bounded by $M_{F}(1),\ldots,M_{F}(d)$ , we can further improve Eq. (5) to

In contrast, we show the following bound for any $p\geq 1$ (ignoring some lower-order logarithmic factors):

where $M_{p}(j)$ is an upper bound on the Schatten $p$ -norm of $W_{j}$ , and $\Gamma$ is a lower bound on $\prod_{j=1}^{d}\|W_{j}\|$ . Again, by upper bounding the $\min$ by its first argument, we get a bound independent of the depth $d$ , assuming the norms are suitably constrained.

In Sec. 5, we provide a lower bound, showing that for any $p$ , the class of depth- $d$ , width- $h$ neural networks, where each parameter matrix $W_{j}$ has Schatten $p$ -norm at most $M_{p}(j)$ , can have Rademacher complexity of at least

This somewhat improves on Bartlett et al. (2017, Theorem 3.6), which only showed such a result for $p=\infty$ (i.e. with spectral norm control), and without the $h$ term. For $p=2$ , it matches the upper bound in Eq. (6) in terms of the norm dependencies and $B$ . Moreover, it establishes that controlling the spectral norm alone (and indeed, any Schatten $p$ -norm control with $p>2$ ) cannot lead to bounds independent of the size of the network. Finally, the bound shows (similar to Bartlett et al. (2017)) that a dependence on products of norms across layers is generally inevitable.

Besides the above, we provide some additional remarks and observations in Sec. 6. Most of our technical proofs are presented in Sec. 7.

Finally, we emphasize that the bounds of Sec. 4 are independent of the network size, only under the assumption that products of norms (or at least ratios of norms) across all layers are bounded by a constant, which is quite restrictive in practice. For example, it is enough that the Frobenius norm of each layer matrix is at least $2$ , to get that Eq. (6) scales as $2^{d}$ where $d$ is the number of layers. However, our focus here is to show that at least some norm-based assumptions lead to size-independent bounds, and hope these can be weakened in future work.

Preliminaries

Neural Networks. Given the domain $\mathcal{X}=\{\mathbf{x}:\|\mathbf{x}\|\leq B\}$ in Euclidean space, we consider (scalar or vector-valued) standard neural networks, of the form

Rademacher Complexity. The results in this paper focus on Rademacher complexity, which is a standard tool to control the uniform convergence (and hence the sample complexity) of given classes of predictors (see Bartlett and Mendelson (2002); Shalev-Shwartz and Ben-David (2014) for more details). Formally, given a real-valued function class $\mathcal{H}$ and some set of data points $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}\in\mathcal{X}$ , we define the (empirical) Rademacher complexity $\hat{\mathcal{R}}_{m}(\mathcal{H})$ as

where $\boldsymbol{\varepsilon}=(\varepsilon_{1},\ldots,\varepsilon_{m})$ is a vector uniformly distributed in $\{-1,+1\}^{m}$ . Our main results provide bounds on the Rademacher complexity (sometimes independent of $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}$ , as long as they are assumed to have norm at most $B$ ), with respect to classes of neural networks with various norm constraints. Using standard arguments, such bounds can be converted to bounds on the generalization error, assuming access to a sample of $m$ i.i.d. training examples.

From Exponential to Polynomial Depth Dependence

To get bounds on the Rademacher complexity of deep neural networks, a reasonable approach (employed in Neyshabur et al. (2015)) is to use a “peeling” argument, where the complexity bound for depth $d$ networks is reduced to a complexity bound for depth $d-1$ networks, and then applying the reduction $d$ times. For example, consider the class $\mathcal{H}_{d}$ of depth- $d$ ReLU real-valued neural networks, with each layer’s parameter matrix with Frobenius norm at most $M_{F}(j)$ . Using some straightforward manipulations, it is possible to show that $\hat{\mathcal{R}}_{m}(\mathcal{H}_{d})$ , which by definition equals

Iterating this inequality $d$ times, one ends up with a bound scaling as $2^{d}\prod_{j=1}^{d}M_{F}(j)$ (as in Neyshabur et al. (2015), see also Eq. (1)). The exponential $2^{d}$ factor follows from the $2$ factor in Eq. (8), which in turn follows from applying the Rademacher contraction principle to get rid of the $\sigma$ function. Unfortunately, this $2$ factor is generally unavoidable (see the discussion in Ledoux and Talagrand (1991) following Theorem 4.12).

where $\lambda>0$ is an arbitrary parameter. We then perform a “peeling” argument similar to before, resulting in a multiplicative $2$ factor after every peeling step. Crucially, these factors accumulate inside the log factor, so that the end result contains only a $\log(2^{d})=d$ factor, which by appropriate tuning of $\lambda$ , can be further reduced to $\sqrt{d}$ .

The formalization of this argument depends on the matrix norm we are using, and we will begin with the case of the Frobenius norm. A key technical condition for the argument to work is that we can perform the “peeling” inside the $\exp$ function. This is captured by the following lemma:

Letting $\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{h}$ be the rows of the matrix $W$ , we have

The supremum of this over all $\mathbf{w}_{1},\ldots,\mathbf{w}_{h}$ such that $\|W\|_{F}^{2}=\sum_{j=1}^{h}\|\mathbf{w}_{j}\|^{2}\leq R^{2}$ must be attained when $\|\mathbf{w}_{j}\|=R$ for some $j$ , and $\|\mathbf{w}_{i}\|=0$ for all $i\neq j$ . Therefore,

Since $g(|z|)\leq g(z)+g(-z)$ , this can be upper bounded by

where the equality follows from the symmetry in the distribution of the $\epsilon_{i}$ random variables. The right hand side in turn can be upper bounded by

(see equation 4.20 in Ledoux and Talagrand (1991)). ∎

With this lemma in hand, we can provide a bound on the Rademacher complexity of Frobnius-norm-bounded neural networks, which is as clean as Eq. (1), but where the $2^{d}$ factor is replaced by $\sqrt{d}$ :

Let $\mathcal{H}_{d}$ be the class of real-valued networks of depth $d$ over the domain $\mathcal{X}$ , where each parameter matrix $W_{j}$ has Frobenius norm at most $M_{F}(j)$ , and with activation functions satisfying Lemma 1. Then

Fix $\lambda>0$ , to be chosen later. The Rademacher complexity can be upper bounded as

where $f$ ranges over all possible functions $\sigma_{d-2}\circ N_{W_{1}^{d-2}}(\mathbf{x})$ . Here we applied Lemma 1 with $g(\alpha)=\exp\{M_{F}(d)\lambda\cdot\alpha\}$ . Repeating the process, we arrive at

where $M=\prod_{j=1}^{d}M_{F}(j)$ . Define a random variable

(random as a function of the random variables $\epsilon_{1},\ldots,\epsilon_{m}$ ). Then

This means that $Z$ satisfies a bounded-difference condition, which by the proof of Theorem 6.2 in (Boucheron et al., 2013), implies that $Z$ is sub-Gaussian, with variance factor

Choosing $\lambda=\frac{\sqrt{2\log(2)d}}{M\sqrt{\sum_{i=1}^{m}\|\mathbf{x}_{i}\|^{2}}}$ and using the above, we get that Eq. (9) can be upper bounded as follows:

We note that for simplicity, the bound in Thm. 1 is stated for real-valued networks, but the argument easily carries over to vector-valued networks, composed with some real-valued Lipschitz loss function. In that case, one uses a variant of Lemma 1 to peel off the losses, and then proceed in the same manner as in the proof of Thm. 1. We omit the precise details for brevity.

A result similar to the above can also be derived for other matrix norms. For example, given a matrix $W$ , let $\|W\|_{1,\infty}$ denote the maximal $1$ -norm of its rows, and consider the class $\mathcal{H}_{d}$ of depth- $d$ networks, where each parameter matrix $W_{j}$ satisfies $\|W_{j}\|_{1,\infty}\leq M(j)$ for all $j$ (this corresponds to a setting, also studied in Neyshabur et al. (2015), where the $1$ -norm of the weights of each neuron in the network is bounded). In this case, we can derive a variant of Lemma 1, which in fact does not require positive-homogeneity of the activation function:

where $\|\cdot\|_{\infty}$ denotes the vector infinity norm.

Using the same technique as before, we can use this lemma to get a bound on the Rademacher complexity for $\mathcal{H}_{d}$ :

Let $\mathcal{H}_{d}$ be the class of real-valued networks of depth $d$ over the domain $\mathcal{X}$ , where $\|W_{j}\|_{1,\infty}\leq M(j)$ for all $j\in\{1,\ldots,d\}$ , and with activation functions satisfying the condition of Lemma 2. Then

where $x_{i,j}$ is the $j$ -th coordinate of the vector $\mathbf{x}_{i}$ .

The proofs of the theorem, as well as Lemma 2, appear in Sec. 7.

From Depth Dependence to Depth Independence

In this section, we develop a general result, which allows one to convert any depth-dependent bound on the Rademacher complexity of neural networks, to a depth-independent one, assuming that the Schatten $p$ -norms of the parameter matrices (for any $p\in[1,\infty)$ ) is sufficiently controlled. We develop and formalize the main result in Subsection 4.1, and provide applications in Subsection 4.2. The proofs the results in this section appear in Sec. 7.

To motivate our approach, let us consider a special case of depth- $d$ networks, where

Each parameter matrix $W_{1},\ldots,W_{d-1}$ is constrained to be diagonal and of size $h\times h$ .

The Frobenius norm of every $W_{1},\ldots,W_{d}$ is at most $1$ .

All activation functions are the identity (so the network computes a linear function).

Letting $\mathbf{w}_{i}$ be the diagonal of $W_{i}$ , such networks are equivalent to

where $\circ$ denotes element-wise product. Therefore, if we would like the network to compute a non-trivial function, we clearly need that $\mathbf{w}_{d}\circ\ldots\circ\mathbf{w}_{1}$ be bounded away from zero (e.g., not exponentially small in $d$ ), while still satisfying the constraint $\|\mathbf{w}_{j}\|\leq 1$ for all $j$ . In fact, the only way to satisfy both requirements simultaneously is if $\mathbf{w}_{1},\ldots,\mathbf{w}_{d}$ are all close to some 1-sparse unit vector, which implies that the matrices $W_{1},\ldots,W_{d}$ must be close to being rank-1.

It turns out that this intuition holds much more generally, even if we do not restrict ourselves to identity activations and diagonal parameter matrices as above. Essentially, what we can show is that if some network computes a non-trivial function, and the product of its Schatten $p$ -norms (for any $p<\infty$ ) is bounded, then there must be at least one parameter matrix, which is not far from being rank-1. Therefore, if we replace that parameter matrix by an appropriate rank-1 matrix, the function computed by the network does not change much. This is captured by the following theorem:

We now make the following crucial observation: A real-valued network with a rank-1 parameter matrix $W_{r^{\prime}}=s\mathbf{u}\mathbf{v}^{\top}$ computes the function

This can be seen as the composition of the depth- $r^{\prime}$ network

Moreover, the norm constraints imply that the latter function is Lipschitz. Therefore, the class of networks we are considering is a subset of the class of depth- $r^{\prime}$ networks composed with univariate Lipschitz functions. Fortunately, given any class with bounded complexity, one can effectively bound the Rademacher complexity of its composition with univariate Lipschitz functions, as formalized in the following theorem.

The $\log^{3/2}(m)\cdot\hat{\mathcal{R}}_{m}(\mathcal{H})$ can be replaced by $\log(m)\cdot\hat{\mathcal{G}}_{m}(\mathcal{H})$ , where $\hat{\mathcal{G}}_{m}(\mathcal{H})$ is the empirical Gaussian complexity of $\mathcal{H}$ – see the proof in Sec. 7 for details.

Combining these ideas, our plan of attack is the following: Given some class of depth- $d$ networks, and an arbitrary parameter $r\in\{1,\ldots,d\}$ , we use Thm. 3 to relate their Rademacher complexity to the complexity of similar networks, but where for some $1\leq r^{\prime}\leq r$ , the $r^{\prime}$ -th parameter matrix is of rank-1. We then use Thm. 4 to bound that complexity in turn using the Rademacher complexity of depth- $r^{\prime}$ networks. Crucially, the resulting bound has no explicit dependence on the original depth $d$ , only on the new parameter $r$ . Formally, we have the following theorem, which is the main result of this section:

Consider the following hypothesis class of networks on $\mathcal{X}=\{\mathbf{x}:\|\mathbf{x}\|\leq B\}$ :

for some parameters $p,\Gamma\geq 1,\{M(j),M_{p}(j),\mathcal{W}_{j}\}_{j=1}^{d}$ . Also, for any $r\in\{1,\ldots,d\}$ , define

In particular, one can upper bound this result by any choice of $r$ in $\{1,\ldots,d\}$ . By tuning $r$ appropriately, we can get bounds independent of the depth $d$ . In the next subsection, we provide some concrete applications for specific choices of $\mathcal{H}$ .

The parameters $\Gamma$ and $\gamma$ , which divide the norm terms in Thm. 5, are both closely related to the notion of a margin. Indeed, if we consider binary or multi-class classification, then bounds as above w.r.t. $\frac{1}{\gamma}$ -Lipschitz losses can be converted to a bound on the misclassification error rate in terms of the average $\gamma$ -margin error on the training data (see Bartlett et al. (2017, Section 3.1) for a discussion). Also, $B\Gamma$ can be viewed as the “maximal” margin attainable over the input domain, since $\sup_{\mathbf{x}\in\mathcal{X}}\|N_{W_{1}^{d}}(\mathbf{x})\|\leq B\prod_{j=1}^{d}\|W_{j}\|=B\Gamma$ .

2 Applications of Thm. 5

In this section we exemplify how Thm. 5 can be used to obtain depth-independent bounds on the sample complexity of various classes of neural networks. The general technique is as follows: First, we prove a bound on $\hat{R}_{m}(\mathcal{H}_{r})$ , which generally depends on the depth $r$ , and scales as $r^{\alpha}/\sqrt{m}$ for some $\alpha>0$ . Then, we plug it into Thm. 5, and utilize the following lemma to tune $r$ appropriately:

For any $\alpha>0$ , $\beta\in(0,1]$ and $b,c,n\geq 1$ , it holds that

We begin with proving a depth-independent version of Thm. 1. That theorem implies that for the class $\mathcal{H}_{r}$ of depth- $r$ neural networks with Frobenius norm bounds $M_{F}(1),\ldots,M_{F}(r)$ (up to and including $r=d$ ),

Plugging this into Thm. 5, and using Lemma 3, it is straightforward to derive the following corollary (see Sec. 7 for a formal derivation):

where $\bar{\log}(z):=\max\{1,\log(z)\}$ .

Ignoring logarithmic factors and replacing the $\min$ by its first argument, the bound in the corollary is at most

Assuming $\prod_{j}M_{F}(j)$ and $\prod_{j}M_{F}(j)/\Gamma$ are bounded by a constant, we get a bound which does not depend on the width or depth of the network: In other words, it is possible to make this bound smaller than any fixed $\epsilon$ , with a sample size $m$ independent of the network’s size. On the other hand, the bound in Corollary 1 is also bounded by

which is the bound one would get from an immediate application of Thm. 1, and implies that the asymptotic rate (as a function of $m$ ) is still maintained. As discussed in the introduction, the assumption that $\prod_{j}M_{F}(j)$ is a constant is certainly a strong one in practice, but to the best of our knowledge, is the first norm-based assumption which leads to size independence.

Next, we apply Thm. 5 to the results in Bartlett et al. (2017), which as discussed in the introduction, provide a depth-dependent bound using a different set of norms. Specifically, they obtain the following intermediary result in deriving their generalization bound:

Let $\mathcal{H}$ be the hypothesis class of depth- $d$ , width- $h$ real-valued networks on $\mathcal{X}=\{\mathbf{x}:\|\mathbf{x}\|\leq B\}$ , using $1$ -Lipschitz activation functions, given by

for some fixed parameters $\{L(j),M(j)\}_{j=1}^{d}$ . Then the Rademacher complexity $\hat{\mathcal{R}}_{m}(\mathcal{H})$ is upper-bounded by

As discussed in the introduction, $L(j)$ can never be smaller than $1$ , hence the bound scales at least as $\sqrt{d^{3}/m}$ . However, using the bound above together with Thm. 5 and Lemma 3, we can get the following corollary (where for simplicity, we assume that $L(j)$ for all $j$ are uniformly bounded by some $L$ ):

where $\bar{\log}(z):=\max\{1,\log(z)\}$ .

As before, by replacing the $\min$ by its first argument, we get a bound which is fully independent of the network size, assuming the norms are suitably bounded. To give a concrete example, if we take $p=2$ (so that the $M_{p}(j)$ constraints correspond to the Frobenius norm), and ignore lower-order logarithmic factors, we get a bound scaling as

In contrast, a direct application of Thm. 6 in the same setting leads to a bound of

Finally, we note that since the Eq. (3), based on a PAC-Bayes analysis, is always weaker than Eq. (2) (as noted in Bartlett et al. (2017)), Corollary 2 also gives a size-independent version of Eq. (3).

A Lower Bound for Schatten Norms

In this section, we present a lower bound on the Rademacher complexity, for the class of neural networks with parameter matrices of bounded Schatten norms. The formal result is the following:

This theorem strengthens Theorem 3.6 in Bartlett et al. (2017), in the sense that they only considered the case $p=\infty$ , and did not have a dependence on $h$ . On the other hand, they consider bounds which hold for any choice of $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}$ , while we consider bounds uniform over $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}$ for simplicity. Moreover, for network depths larger than $1$ , our construction requires a non element-wise activation function. The lower bound has the following implications:

Like Bartlett et al. (2017), the theorem implies that by controlling just the norms of each parameter matrix, a dependence on the product of the norms is generally inevitable.

For $p=\infty$ , we see that there is an inevitable $h^{1/2}$ factor in the bound, which implies that controlling the spectral norm is insufficient to get size-independent bounds (at least, independent of the width $h$ ). More generally, any Schatten $p$ -norm control with $p>2$ will be insufficient to get such size independence.

For $p=2$ (i.e. Frobenius norm bounds $M_{F}(1),\ldots,M_{F}(d)$ ), the lower bound becomes size-independent, and on the order of $\frac{B\prod_{j=1}^{d}M_{F}(j)}{\sqrt{m}}$ . The dependence on $M_{F}(j)$ is similar to our upper bound in Corollary 1 up to logarithmic factors (although the dependence on $m$ is worse).

Additional Remarks

So far we have proved upper bounds on the empirical Rademacher complexity of a fixed class of neural networks, of the form

for some complexity measure $C(\cdot)$ and a parameter $L$ . These imply high-probability learning guarantees for algorithms which return predictors in $\mathcal{H}_{L}$ . However, in the context of norm-based constraints, practical algorithms for neural networks usually perform unconstrained optimization, and therefore are not guaranteed a-priori to return a predictor in some fixed $\mathcal{H}_{L}$ . Fortunately, it is straightforward to convert these bounds into probabilistic guarantees for any neural network $h$ , with the bound scaling appropriately with the complexity $C(h)$ of the particular network. We note that such post-hoc guarantees have also been stated in the context of some previous sample complexity bounds for neural networks (e.g., Bartlett et al. (2017); Neyshabur et al. (2017)). It is achieved, for instance, by a union bound over, say, a doubling scale of the complexity. We refer to the proof of the margin bound of Koltchinskii and Panchenko (2002, Theorem 2) for an example of this technique.

2 Complexity of Lipschitz Networks

Taking this to the extreme, we can also bound the complexity of neural networks computing Lipschitz functions, by studying the complexity of all Lipschitz functions over the domain $\mathcal{X}$ . It is easily verified that in our setting, if we consider the class $\mathcal{H}$ of depth- $d$ networks, where each parameter matrix has spectral norm at most $M(j)$ , then the network must be $\prod_{j=1}^{d}M(j)$ -Lipschitz. Using well-known estimates of the covering numbers of Lipschitz functions over $\mathcal{X}$ , we get that

Proofs

We first prove Lemma 2. Letting $\mathbf{w}_{j}$ denote the $j$ -th row of a matrix $W$ , we have

Since $g(|z|)\leq g(z)+g(-z)$ , the left-hand side of Eq. (11) is upper bounded by

and the proof is concluded exactly as in Lemma 1 by appealing to Ledoux and Talagrand (1991).

We now turn to Thm. 2, whose proof is rather similar to that of Thm. 1. Fixing $\lambda>0$ to be chosen later, the Rademacher complexity can be upper bounded as

Applying the same argument as in the proof of Thm. 1, and using Lemma 2, we can upper bound the above by

where $M=\prod_{j=1}^{d}M(j)$ . Letting $x_{i,j}$ denote the $j$ -th coordinate of $\mathbf{x}_{i}$ , and using symmetry, the expectation inside the log can be re-written as

where in the last step we used the fact that $\frac{\exp(z)+\exp(-z)}{2}\leq\exp(z^{2}/2)$ . Further upper bounding this by $2n\max_{j}\exp\left(M^{2}\lambda^{2}\sum_{i=1}^{d}x_{i,j}^{2}\right)$ and plugging back to Eq. (13), we get

Choosing $\lambda=\sqrt{\frac{d+1+\log(n)}{M^{2}\max_{j}\sum_{i=1}^{m}x_{i,j}^{2}}}$ , we can upper bound the above by

2 Proof of Thm. 3

The proof will build on the following few technical lemmas.

which equals $\left(\|W\|_{p}^{p}-\|W\|^{p}\right)^{1/p}$ . ∎

By a simple calculation, we have that the Lipschitz constant of the function $N_{W_{b}^{s}}$ is at most $\prod_{j=b}^{s}\|W_{j}\|$ .

Assume for now that $2\leq s\leq d-1$ . By definition, we have

The Lipschitz constant of the function $N_{W_{s+1}^{d}}$ is at most $\prod_{j=s+1}^{d}\|W_{j}\|$ , and the norm of $N_{W_{1}^{s-1}}(\mathbf{x})$ is at most $\|\mathbf{x}\|\prod_{j=1}^{s-1}\|W_{j}\|$ . Therefore, for any $\mathbf{x}\in\mathcal{X}$ ,

from which the result follows after a simplification. The cases $s=1$ and $s=d$ are handled in exactly the same manner. ∎

Suppose that $N_{W_{1}^{d}}$ is such that $\prod_{j=1}^{d}\|W_{j}\|\geq\Gamma$ and $\prod_{j=1}^{d}\|W_{j}\|_{p}\leq M$ . Then for any $r\in\{1,\ldots,d\}$ ,

Fixing some $r$ , and using the stated assumptions as well as the fact that $\|W\|_{p}\geq\|W\|$ for any $p$ , we have

Taking the $r$ -th root from both sides, the result follows. ∎

With these lemmas in hand, we can now turn to prove Thm. 3. A direct application of Lemma 6 implies that for any $r\in\{1,\ldots,d\}$ ,

Substituting Eq. (14) into Eq. (15), we get that

Suppose for now that $r$ is such that $r\geq p\log(M/\Gamma)$ . Using the fact that $\exp(z)\leq 1+2z$ for any $z\in$ , it follows that the above is at most

3 Proof of Thm. 4

To prove the theorem, we use a straightforward covering number argument, beginning with a few definitions.

Given any function class $\mathcal{F}$ , a metric $d$ on the elements of $\mathcal{F}$ , and $\epsilon>0$ , we let the covering number $\mathcal{N}(\mathcal{F},d,\epsilon)$ denote the minimal number $n$ of functions $f_{1},f_{2},\ldots,f_{n}$ in $\mathcal{F}$ , such that for all $f\in\mathcal{F}$ , $\min_{i=1,\ldots,n}d(f_{i},f)\leq\epsilon$ . In particular, fix some set of data points $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}$ , and define the empirical $L_{2}$ distance

Also, given a function class $\mathcal{F}$ and $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}$ , we let

denote the (empirical) Gaussian complexity of $\mathcal{F}$ , where $\eta_{1},\ldots,\eta_{n}$ are i.i.d. standard Gaussian random variables. It is well known that $\hat{\mathcal{R}}_{m}(\mathcal{H})$ and $\hat{\mathcal{G}}_{m}(\mathcal{H})$ are equivalent up to a $c\sqrt{\log(m)}$ factor (Ledoux and Talagrand, 1991, pg. 97). By Sudakov’s minoration theorem (see Theorem 3.18 in Ledoux and Talagrand (1991)), we have that for all $\alpha>0$

With these definitions in hand, we turn to prove the theorem. We first note that $\mathcal{F}_{L,a}\circ\mathcal{H}$ is equivalent to the class $\{Lg(\cdot)+a:g\in\mathcal{F}_{1,0}\circ\mathcal{H}\}$ , and therefore

again for some numerical constant $c^{\prime}>0$ . The first inequality follows from the fact that for any functions $f,f^{\prime}$ , it holds that $\hat{d}_{m}(f,f^{\prime})\leq\hat{d}_{\infty}(f,f^{\prime}):=\sup_{\mathbf{x}}|f(\mathbf{x})-f^{\prime}(\mathbf{x}^{\prime})|$ , so it is enough to upper bound $\mathcal{N}(\mathcal{F},\hat{d}_{\infty},\epsilon)$ . To do so, we first notice that the range of any $f\in\mathcal{F}$ is in $[-R,R]$ . Discretize $[-R,R]\times[-R,R]$ into a two-dimensional grid $\mathcal{U}_{x}\times\mathcal{U}_{y}$ , where

Given any $f\in\mathcal{F}$ , construct the piecewise-linear function $f^{\prime}$ as follows: For any input $x\in\mathcal{U}_{x}$ , let $f^{\prime}(x)$ be the point in $\mathcal{U}_{y}$ nearest to $f(x)$ (breaking ties arbitrarily), and let the rest of $f^{\prime}$ be constructed as a linear interpolation of these points on $\mathcal{U}_{x}$ . It is easily verified that $\sup_{x\in[-R,R]}|f(x)-f^{\prime}(x)|\leq\epsilon$ . Moreover, note that for two neighboring points $x,x^{\prime}$ in $\mathcal{U}_{x}$ , the points $f^{\prime}(x),f^{\prime}(x^{\prime})$ on $\mathcal{U}_{y}$ must be neighboring or the same. Therefore, each such function $f^{\prime}$ can be parameterized by a vector of the form $\{-,0,+\}^{|\mathcal{U}_{x}|-1}$ , which specifies whether (starting from the origin) $f^{\prime}$ goes up, down, or remains the same on each of its linear segments. The number of such functions is at most $3^{|\mathcal{U}_{x}|-1}\leq 3^{2R/\epsilon+1}$ , and therefore $\mathcal{N}(\mathcal{F},\hat{d}_{\infty},\epsilon)\leq 3^{2R/\epsilon+1}$ . Recalling that $\hat{d}_{\infty}(f,f^{\prime})$ majorizes $\hat{d}_{m}(f,f^{\prime})$ , we get Eq. (18).

To see this, pick any $f\in\mathcal{F}$ and $h\in\mathcal{H}$ , and let $f^{\prime},h^{\prime}$ be the respective closest functions in the cover of $\mathcal{F}$ and $\mathcal{H}$ (at scale $\epsilon/2$ ). By the triangle inequality and the easily verified fact that $f^{\prime}$ is $1$ -Lipschitz, we have

Therefore, we can cover $\mathcal{F}\circ\mathcal{H}$ (at scale $\epsilon$ ) by taking $f^{\prime}(h^{\prime}(\cdot))$ for all possible choices of $f^{\prime},h^{\prime}$ from the covers of $\mathcal{F},\mathcal{H}$ (at scale $\epsilon/2$ ), leading to Eq. (19).

Combining Eq. (16), Eq. (18) and Eq. (19), we get that

for some numerical constant $c^{\prime\prime}>0$ . Finally, we use Dudley’s entropy integral, which together with the equation above, implies the following for some numerical constant $c>0$ (possibly changing from row to row):

Choosing in particular $\alpha=R/\sqrt{m}$ , we get the upper bound

for some $c>0$ . Plugging this into Eq. (7.3), and upper bounding $\hat{\mathcal{G}}_{m}(\mathcal{H})$ by $c\sqrt{\log(m)}\hat{\mathcal{R}}_{m}(\mathcal{H})$ (see (Ledoux and Talagrand, 1991)), the result follows.

4 Proof of Thm. 5

We first need the following lemma, which bounds the empirical Rademacher complexity of the union of a collection of bounded function classes.

so $\phi$ satisfies a bounded differences assumption with variance factor $A^{2}/m$ . By McDiarmid’s inequality (Theorem 6.2, Boucheron et al. (2013)) it follows that for $t>0$ ,

(The same inequality, up to a constant factor, also follows directly from Theorem 4.7 in Ledoux and Talagrand (1991).) A union bound then shows that for $t>0$ ,

We next continue with the proof of Thm. 5. It is enough to prove the bound for any fixed $r\in\{1,\ldots,d\}$ , and then take the infimum over any such $r$ .

This function is equivalent to the composition of the function

Notice that for each $r^{\prime}\in\{1,\ldots,r\}$ , all functions of $\mathcal{H}_{r^{\prime}}$ have output is bounded in $\pm B\prod_{j=1}^{r^{\prime}}M(j)$ ). Recall also that $\mathcal{F}_{\frac{1}{\gamma}\prod_{j=r^{\prime}+1}^{d}M(j),a}$ is the class consisting of real-valued $\prod_{j=r^{\prime}+1}^{d}M(j)$ -Lipschitz functions $f$ such that $f(0)=a$ . As a result, we can apply Thm. 4 and obtain that for $r^{\prime}\in\{1,\ldots,r\}$ ,

Notice that all functions in $\bigcup_{r^{\prime}\in\{1,\ldots,r\}}\mathcal{F}_{\frac{1}{\gamma}\prod_{j=r^{\prime}+1}^{d}M(j),a}\circ\mathcal{H}_{r^{\prime}}$ have output bounded in $\pm A$ , where

By Lemma 7, for an appropriate constant $c^{\prime}$ ,

for an appropriate constant $c^{\prime\prime}$ . As mentioned at the beginning of the proof, this upper bound holds for any fixed $r\in\{1,\ldots,d\}$ , from which the result follows using the assumption that $|a|\leq\frac{B\prod_{j=1}^{d}M(j)}{\gamma}$ .

5 Proof of Lemma 3

We will show that for any $\alpha,\beta,b,c,n$ as stated in the lemma, there always exists a choice of $r\in\{1,\ldots,d\}$ such that

Since the left hand side is also trivially at most $\frac{d^{\alpha}}{n}$ , the result follows. We prove this inequality by a case analysis:

If $(bn/c)^{\frac{1}{\alpha+\beta}}\in[1,d]$ , pick $r=\left\lfloor(bn/c)^{\frac{1}{\alpha+\beta}}\right\rfloor\in\{1,2,\ldots,d\}$ , in which case

If $(bn/c)^{\frac{1}{\alpha+\beta}}>d$ , it follows that

6 Proof of Corollary 1

Next, we plug Eq. (12) into Thm. 5. We use $p=2$ , let each $\mathcal{W}_{j}$ be the space of all matrices, and choose $M(j)=M_{F}(j)$ for all $j$ , noting that $\frac{\|W_{j}\|}{M(j)}\leq\frac{\|W_{j}\|_{F}}{M_{F}(j)}\leq 1$ . Since $\sqrt{r}\geq\sqrt{r^{\prime}}$ for all $r^{\prime}\leq r$ , it follows that

7 Proof of Corollary 2

where $c=\bar{\log}^{3/2}(m)$ , $n=\sqrt{m}$ , and $b=\bar{\log}\left(\frac{1}{\Gamma}\prod_{j=1}^{d}M_{p}(j)\right)^{1/p}$ (note that we use here $\bar{\log}$ instead of $\log$ so the conditions of Lemma 3, to be used shortly, will be satisfied). As in Corollary 1, the $\mathcal{O}\left(\frac{1+\sqrt{\log r}}{\sqrt{m}}\right)$ term from Thm. 5 is absorbed into the $\mathcal{O}\left(\frac{cr^{3/2}}{n}\right)$ term in the minimization over $r$ in the above equation.

8 Proof of Thm. 7

In particular, by choosing $w_{k}=h^{-1/p}\cdot\text{sign}\left(\sum_{i\in A_{k}}\epsilon_{i}\right)$ for all $k$ , we can lower bound the above by

and since $|A_{k}|\geq\lfloor m/h\rfloor$ by its definition, we get a lower bound of

Taking the best of this lower bound and the lower bound in Eq. (23), the result follows.

Acknowledgements

We thank Ziwei Ji and Matus Telgarsky for pointing that an additional union bound is required in the proof of Thm. 5, resulting in an extra logarithmic factor, as well as Pritish Kamath for pointing out a bug in the proof of Theorem 7, resulting in a slight change to the construction. Noah Golowich is grateful for research funding from Harvard University, including the Herchel Smith and PRISE fellowships.