Size-Independent Sample Complexity of Neural Networks

Noah Golowich, Alexander Rakhlin, Ohad Shamir

Introduction

One of the major challenges involving neural networks is explaining their ability to generalize well, even if they are very large and have the potential to overfit the training data (Neyshabur et al., 2014; Zhang et al., 2016). Learning theory teaches us that this must be due to some inductive bias, which constrains one to learn networks of specific configurations (either explicitly, e.g., via regularization, or implicitly, via the algorithm used to train them). However, understanding the nature of this inductive bias is still largely an open problem.

A useful starting point is to consider the much more restricted class of linear predictors (xwx\mathbf{x}\mapsto\mathbf{w}^{\top}\mathbf{x}). For this class, we have a very good understanding of how its generalization behavior is dictated by the norm of w\mathbf{w} . In particular, assuming that wM\|\mathbf{w}\|\leq M (where \|\cdot\| signifies Euclidean norm), and the distribution is such that xB\|\mathbf{x}\|\leq B almost surely, it is well-known that the generalization error (w.r.t. Lipschitz losses) given mm training examples scales as O(MB/m)\mathcal{O}(MB/\sqrt{m}), completely independent of the dimension of w\mathbf{w}. Thus, it is very natural to ask whether in the more general case of neural networks, one can obtain similar “size-independent” results (independent of the networks’ depth and width), under appropriate norm constraints on the parameters. This is also a natural question, considering that the size of modern neural networks is often much larger than the number of training examples.

Classical results on the sample complexity of neural networks do not satisfy this desideratum, and have a strong explicit dependence on the network size. For example, bounds relying on the VC dimension (see Anthony and Bartlett (2009)) strongly depend on both the depth and the width of the network, and can be trivial once the number of parameters exceeds the number of training examples. Scale-sensitive bounds, which rely on the magnitude of the network parameters, can alleviate the dependence on the width (see Bartlett (1998); Anthony and Bartlett (2009)). However, most such bounds in the literature have a strong (often exponential) dependence on the network depth. To give one recent example, Neyshabur et al. (2015) use Rademacher complexity tools to show that if the parameter matrices W1,,WdW_{1},\ldots,W_{d} in each of the dd layers have Frobenius norms F\|\cdot\|_{F} upper-bounded by MF(1),,MF(d)M_{F}(1),\ldots,M_{F}(d) respectively, and under suitable assumptions on the activation functions, the generalization error scales (with high probability) as

Although this bound has no explicit dependence on the network width (that is, the dimensions of W1,,WdW_{1},\ldots,W_{d}), it has a very strong, exponential dependence on the network depth dd, even if MF(j)1M_{F}(j)\leq 1 for all jj. Neyshabur et al. (2015) also showed that this dependence can sometimes be avoided for anti-symmetric activations, but unfortunately this is a non-trivial assumption, which is not satisfied for common activations such as the ReLU. Bartlett et al. (2017) use a covering numbers argument to show a bound scaling as

where W\|W\| denotes the spectral norm of WW, WT2,1:=lkW(l,k)2\|W^{T}\|_{2,1}:=\sum_{l}\sqrt{\sum_{k}W(l,k)^{2}} denotes the 11-norm of the 2-norms of the rows of WW, and where we ignore factors logarithmic in mm and the network width. Unlike Eq. (1), here there is no explicit exponential dependence on the depth. However, there is still a strong and unavoidable polynomial dependence: To see this, note that for any WjW_{j}, lkWj(l,k)2WjWjFWj1\frac{\sum_{l}\sqrt{\sum_{k}W_{j}(l,k)^{2}}}{\|W_{j}\|}\geq\frac{\|W_{j}\|_{F}}{\|W_{j}\|}\geq 1, so the bound above can never be smaller than

In particular, even if we assume that B(j=1dWj)B\left(\prod_{j=1}^{d}\|W_{j}\|\right) is a constant, the bound becomes trivial once dΩ(m1/3)d\geq\Omega(m^{1/3}). Finally, and using the same notation, Neyshabur et al. (2017) utilize a PAC-Bayesian analysis to prove a bound scaling as

Can this depth dependency be avoided, assuming the norms are sufficiently constrained? We argue that in some cases, it must be true. To see this, let us return to the well-understood case of linear predictors, and consider generalized linear predictors of the form

where σ(z)=max{0,z}\sigma(z)=\max\{0,z\} is the ReLU function. Like plain-vanilla linear predictors, the generalization error of this class is well-known to be O(MB/m)\mathcal{O}(MB/\sqrt{m}), assuming the inputs satisfy xB\|\mathbf{x}\|\leq B almost surely. However, it is not difficult to show that this class can be equivalently written as a class of “ultra-thin” ReLU networks of the form

(where w1\mathbf{w}_{1} is a vector and w2,,wdw_{2},\ldots,w_{d} are scalars), where the depth dd is arbitrary. Therefore, the sample complexity of this class must also scale as O(MB/m)\mathcal{O}(MB/\sqrt{m}): This depends on the norm product MM, but is completely independent of the network depth dd as well as the dimension of w1\mathbf{w}_{1}. We argue that a “satisfactory” sample complexity analysis should have similar independence properties when applied on this class.

In more general neural networks, the vector w1\mathbf{w}_{1} and scalars w2,,wdw_{2},\ldots,w_{d} become matrices W1,,WdW_{1},\ldots,W_{d}, and the simple observation above no longer applies. However, using the same intuition, it is natural to try and derive generalization bounds by controlling j=1dWj\prod_{j=1}^{d}\|W_{j}\|, where \|\cdot\| is a suitable matrix norm. Perhaps the simplest choice is the spectral norm (and indeed, a product of spectral norms was utilized in some of the previous results mentioned earlier). However, as we formally show in Sec. 5, the spectral norm alone is too weak to get size-independent bounds, even if the network depth is small. Instead, we show that controlling other suitable norms can indeed lead to better depth dependence, or even fully size-independent bounds, improving on earlier works. Specifically, we make the following contributions:

In Sec. 3, we show that the exponential depth dependence in Rademacher complexity-based analysis (e.g. Neyshabur et al. (2015)) can be avoided by applying contraction to a slightly different object than what has become standard since the work of Bartlett and Mendelson (2002). For example, for networks with parameter matrices of Frobenius norm at most MF(1),,MF(d)M_{F}(1),\ldots,M_{F}(d), the bound in Eq. (1) can be improved to

where nn is the input dimension. Again, the dependence on dd is polynomial and quite mild. In contrast, Neyshabur et al. (2015) studied a similar setup and only managed to obtain an exponential dependence on dd.

In Sec. 4, we develop a generic technique to convert depth-dependent bounds to depth-independent bounds, assuming some control over any Schatten norm of the parameter matrices (which includes, for instance, the Frobenius norm and the trace norm as special cases). The key observation we utilize is that the prediction function computed by such networks can be approximated by the composition of a shallow network and univariate Lipschitz functions. For example, again assuming that the Frobenius norms of the layers are bounded by MF(1),,MF(d)M_{F}(1),\ldots,M_{F}(d), we can further improve Eq. (5) to

In contrast, we show the following bound for any p1p\geq 1 (ignoring some lower-order logarithmic factors):

where Mp(j)M_{p}(j) is an upper bound on the Schatten pp-norm of WjW_{j}, and Γ\Gamma is a lower bound on j=1dWj\prod_{j=1}^{d}\|W_{j}\|. Again, by upper bounding the min\min by its first argument, we get a bound independent of the depth dd, assuming the norms are suitably constrained.

In Sec. 5, we provide a lower bound, showing that for any pp, the class of depth-dd, width-hh neural networks, where each parameter matrix WjW_{j} has Schatten pp-norm at most Mp(j)M_{p}(j), can have Rademacher complexity of at least

This somewhat improves on Bartlett et al. (2017, Theorem 3.6), which only showed such a result for p=p=\infty (i.e. with spectral norm control), and without the hh term. For p=2p=2, it matches the upper bound in Eq. (6) in terms of the norm dependencies and BB. Moreover, it establishes that controlling the spectral norm alone (and indeed, any Schatten pp-norm control with p>2p>2) cannot lead to bounds independent of the size of the network. Finally, the bound shows (similar to Bartlett et al. (2017)) that a dependence on products of norms across layers is generally inevitable.

Besides the above, we provide some additional remarks and observations in Sec. 6. Most of our technical proofs are presented in Sec. 7.

Finally, we emphasize that the bounds of Sec. 4 are independent of the network size, only under the assumption that products of norms (or at least ratios of norms) across all layers are bounded by a constant, which is quite restrictive in practice. For example, it is enough that the Frobenius norm of each layer matrix is at least 22, to get that Eq. (6) scales as 2d2^{d} where dd is the number of layers. However, our focus here is to show that at least some norm-based assumptions lead to size-independent bounds, and hope these can be weakened in future work.

Preliminaries

Neural Networks. Given the domain X={x:xB}\mathcal{X}=\{\mathbf{x}:\|\mathbf{x}\|\leq B\} in Euclidean space, we consider (scalar or vector-valued) standard neural networks, of the form

Rademacher Complexity. The results in this paper focus on Rademacher complexity, which is a standard tool to control the uniform convergence (and hence the sample complexity) of given classes of predictors (see Bartlett and Mendelson (2002); Shalev-Shwartz and Ben-David (2014) for more details). Formally, given a real-valued function class H\mathcal{H} and some set of data points x1,,xmX\mathbf{x}_{1},\ldots,\mathbf{x}_{m}\in\mathcal{X}, we define the (empirical) Rademacher complexity R^m(H)\hat{\mathcal{R}}_{m}(\mathcal{H}) as

where ε=(ε1,,εm)\boldsymbol{\varepsilon}=(\varepsilon_{1},\ldots,\varepsilon_{m}) is a vector uniformly distributed in {1,+1}m\{-1,+1\}^{m}. Our main results provide bounds on the Rademacher complexity (sometimes independent of x1,,xm\mathbf{x}_{1},\ldots,\mathbf{x}_{m}, as long as they are assumed to have norm at most BB), with respect to classes of neural networks with various norm constraints. Using standard arguments, such bounds can be converted to bounds on the generalization error, assuming access to a sample of mm i.i.d. training examples.

From Exponential to Polynomial Depth Dependence

To get bounds on the Rademacher complexity of deep neural networks, a reasonable approach (employed in Neyshabur et al. (2015)) is to use a “peeling” argument, where the complexity bound for depth dd networks is reduced to a complexity bound for depth d1d-1 networks, and then applying the reduction dd times. For example, consider the class Hd\mathcal{H}_{d} of depth-dd ReLU real-valued neural networks, with each layer’s parameter matrix with Frobenius norm at most MF(j)M_{F}(j). Using some straightforward manipulations, it is possible to show that R^m(Hd)\hat{\mathcal{R}}_{m}(\mathcal{H}_{d}), which by definition equals

Iterating this inequality dd times, one ends up with a bound scaling as 2dj=1dMF(j)2^{d}\prod_{j=1}^{d}M_{F}(j) (as in Neyshabur et al. (2015), see also Eq. (1)). The exponential 2d2^{d} factor follows from the 22 factor in Eq. (8), which in turn follows from applying the Rademacher contraction principle to get rid of the σ\sigma function. Unfortunately, this 22 factor is generally unavoidable (see the discussion in Ledoux and Talagrand (1991) following Theorem 4.12).

where λ>0\lambda>0 is an arbitrary parameter. We then perform a “peeling” argument similar to before, resulting in a multiplicative 22 factor after every peeling step. Crucially, these factors accumulate inside the log factor, so that the end result contains only a log(2d)=d\log(2^{d})=d factor, which by appropriate tuning of λ\lambda, can be further reduced to d\sqrt{d}.

The formalization of this argument depends on the matrix norm we are using, and we will begin with the case of the Frobenius norm. A key technical condition for the argument to work is that we can perform the “peeling” inside the exp\exp function. This is captured by the following lemma:

Letting w1,w2,,wh\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{h} be the rows of the matrix WW, we have

The supremum of this over all w1,,wh\mathbf{w}_{1},\ldots,\mathbf{w}_{h} such that WF2=j=1hwj2R2\|W\|_{F}^{2}=\sum_{j=1}^{h}\|\mathbf{w}_{j}\|^{2}\leq R^{2} must be attained when wj=R\|\mathbf{w}_{j}\|=R for some jj, and wi=0\|\mathbf{w}_{i}\|=0 for all iji\neq j. Therefore,

Since g(z)g(z)+g(z)g(|z|)\leq g(z)+g(-z), this can be upper bounded by

where the equality follows from the symmetry in the distribution of the ϵi\epsilon_{i} random variables. The right hand side in turn can be upper bounded by

(see equation 4.20 in Ledoux and Talagrand (1991)). ∎

With this lemma in hand, we can provide a bound on the Rademacher complexity of Frobnius-norm-bounded neural networks, which is as clean as Eq. (1), but where the 2d2^{d} factor is replaced by d\sqrt{d}:

Let Hd\mathcal{H}_{d} be the class of real-valued networks of depth dd over the domain X\mathcal{X}, where each parameter matrix WjW_{j} has Frobenius norm at most MF(j)M_{F}(j), and with activation functions satisfying Lemma 1. Then

Fix λ>0\lambda>0, to be chosen later. The Rademacher complexity can be upper bounded as

where ff ranges over all possible functions σd2NW1d2(x)\sigma_{d-2}\circ N_{W_{1}^{d-2}}(\mathbf{x}). Here we applied Lemma 1 with g(α)=exp{MF(d)λα}g(\alpha)=\exp\{M_{F}(d)\lambda\cdot\alpha\}. Repeating the process, we arrive at

where M=j=1dMF(j)M=\prod_{j=1}^{d}M_{F}(j). Define a random variable

(random as a function of the random variables ϵ1,,ϵm\epsilon_{1},\ldots,\epsilon_{m}). Then

This means that ZZ satisfies a bounded-difference condition, which by the proof of Theorem 6.2 in (Boucheron et al., 2013), implies that ZZ is sub-Gaussian, with variance factor

Choosing λ=2log(2)dMi=1mxi2\lambda=\frac{\sqrt{2\log(2)d}}{M\sqrt{\sum_{i=1}^{m}\|\mathbf{x}_{i}\|^{2}}} and using the above, we get that Eq. (9) can be upper bounded as follows:

We note that for simplicity, the bound in Thm. 1 is stated for real-valued networks, but the argument easily carries over to vector-valued networks, composed with some real-valued Lipschitz loss function. In that case, one uses a variant of Lemma 1 to peel off the losses, and then proceed in the same manner as in the proof of Thm. 1. We omit the precise details for brevity.

A result similar to the above can also be derived for other matrix norms. For example, given a matrix WW, let W1,\|W\|_{1,\infty} denote the maximal 11-norm of its rows, and consider the class Hd\mathcal{H}_{d} of depth-dd networks, where each parameter matrix WjW_{j} satisfies Wj1,M(j)\|W_{j}\|_{1,\infty}\leq M(j) for all jj (this corresponds to a setting, also studied in Neyshabur et al. (2015), where the 11-norm of the weights of each neuron in the network is bounded). In this case, we can derive a variant of Lemma 1, which in fact does not require positive-homogeneity of the activation function:

where \|\cdot\|_{\infty} denotes the vector infinity norm.

Using the same technique as before, we can use this lemma to get a bound on the Rademacher complexity for Hd\mathcal{H}_{d}:

Let Hd\mathcal{H}_{d} be the class of real-valued networks of depth dd over the domain X\mathcal{X}, where Wj1,M(j)\|W_{j}\|_{1,\infty}\leq M(j) for all j{1,,d}j\in\{1,\ldots,d\}, and with activation functions satisfying the condition of Lemma 2. Then

where xi,jx_{i,j} is the jj-th coordinate of the vector xi\mathbf{x}_{i}.

The proofs of the theorem, as well as Lemma 2, appear in Sec. 7.

From Depth Dependence to Depth Independence

In this section, we develop a general result, which allows one to convert any depth-dependent bound on the Rademacher complexity of neural networks, to a depth-independent one, assuming that the Schatten pp-norms of the parameter matrices (for any p[1,)p\in[1,\infty)) is sufficiently controlled. We develop and formalize the main result in Subsection 4.1, and provide applications in Subsection 4.2. The proofs the results in this section appear in Sec. 7.

To motivate our approach, let us consider a special case of depth-dd networks, where

Each parameter matrix W1,,Wd1W_{1},\ldots,W_{d-1} is constrained to be diagonal and of size h×hh\times h.

The Frobenius norm of every W1,,WdW_{1},\ldots,W_{d} is at most 11.

All activation functions are the identity (so the network computes a linear function).

Letting wi\mathbf{w}_{i} be the diagonal of WiW_{i}, such networks are equivalent to

where \circ denotes element-wise product. Therefore, if we would like the network to compute a non-trivial function, we clearly need that wdw1\mathbf{w}_{d}\circ\ldots\circ\mathbf{w}_{1} be bounded away from zero (e.g., not exponentially small in dd), while still satisfying the constraint wj1\|\mathbf{w}_{j}\|\leq 1 for all jj. In fact, the only way to satisfy both requirements simultaneously is if w1,,wd\mathbf{w}_{1},\ldots,\mathbf{w}_{d} are all close to some 1-sparse unit vector, which implies that the matrices W1,,WdW_{1},\ldots,W_{d} must be close to being rank-1.

It turns out that this intuition holds much more generally, even if we do not restrict ourselves to identity activations and diagonal parameter matrices as above. Essentially, what we can show is that if some network computes a non-trivial function, and the product of its Schatten pp-norms (for any p<p<\infty) is bounded, then there must be at least one parameter matrix, which is not far from being rank-1. Therefore, if we replace that parameter matrix by an appropriate rank-1 matrix, the function computed by the network does not change much. This is captured by the following theorem:

We now make the following crucial observation: A real-valued network with a rank-1 parameter matrix Wr=suvW_{r^{\prime}}=s\mathbf{u}\mathbf{v}^{\top} computes the function

This can be seen as the composition of the depth-rr^{\prime} network

Moreover, the norm constraints imply that the latter function is Lipschitz. Therefore, the class of networks we are considering is a subset of the class of depth-rr^{\prime} networks composed with univariate Lipschitz functions. Fortunately, given any class with bounded complexity, one can effectively bound the Rademacher complexity of its composition with univariate Lipschitz functions, as formalized in the following theorem.

The log3/2(m)R^m(H)\log^{3/2}(m)\cdot\hat{\mathcal{R}}_{m}(\mathcal{H}) can be replaced by log(m)G^m(H)\log(m)\cdot\hat{\mathcal{G}}_{m}(\mathcal{H}), where G^m(H)\hat{\mathcal{G}}_{m}(\mathcal{H}) is the empirical Gaussian complexity of H\mathcal{H} – see the proof in Sec. 7 for details.

Combining these ideas, our plan of attack is the following: Given some class of depth-dd networks, and an arbitrary parameter r{1,,d}r\in\{1,\ldots,d\}, we use Thm. 3 to relate their Rademacher complexity to the complexity of similar networks, but where for some 1rr1\leq r^{\prime}\leq r, the rr^{\prime}-th parameter matrix is of rank-1. We then use Thm. 4 to bound that complexity in turn using the Rademacher complexity of depth-rr^{\prime} networks. Crucially, the resulting bound has no explicit dependence on the original depth dd, only on the new parameter rr. Formally, we have the following theorem, which is the main result of this section:

Consider the following hypothesis class of networks on X={x:xB}\mathcal{X}=\{\mathbf{x}:\|\mathbf{x}\|\leq B\}:

for some parameters p,Γ1,{M(j),Mp(j),Wj}j=1dp,\Gamma\geq 1,\{M(j),M_{p}(j),\mathcal{W}_{j}\}_{j=1}^{d}. Also, for any r{1,,d}r\in\{1,\ldots,d\}, define

In particular, one can upper bound this result by any choice of rr in {1,,d}\{1,\ldots,d\}. By tuning rr appropriately, we can get bounds independent of the depth dd. In the next subsection, we provide some concrete applications for specific choices of H\mathcal{H}.

The parameters Γ\Gamma and γ\gamma, which divide the norm terms in Thm. 5, are both closely related to the notion of a margin. Indeed, if we consider binary or multi-class classification, then bounds as above w.r.t. 1γ\frac{1}{\gamma}-Lipschitz losses can be converted to a bound on the misclassification error rate in terms of the average γ\gamma-margin error on the training data (see Bartlett et al. (2017, Section 3.1) for a discussion). Also, BΓB\Gamma can be viewed as the “maximal” margin attainable over the input domain, since supxXNW1d(x)Bj=1dWj=BΓ\sup_{\mathbf{x}\in\mathcal{X}}\|N_{W_{1}^{d}}(\mathbf{x})\|\leq B\prod_{j=1}^{d}\|W_{j}\|=B\Gamma.

2 Applications of Thm. 5

In this section we exemplify how Thm. 5 can be used to obtain depth-independent bounds on the sample complexity of various classes of neural networks. The general technique is as follows: First, we prove a bound on R^m(Hr)\hat{R}_{m}(\mathcal{H}_{r}), which generally depends on the depth rr, and scales as rα/mr^{\alpha}/\sqrt{m} for some α>0\alpha>0. Then, we plug it into Thm. 5, and utilize the following lemma to tune rr appropriately:

For any α>0\alpha>0, β(0,1]\beta\in(0,1] and b,c,n1b,c,n\geq 1, it holds that

We begin with proving a depth-independent version of Thm. 1. That theorem implies that for the class Hr\mathcal{H}_{r} of depth-rr neural networks with Frobenius norm bounds MF(1),,MF(r)M_{F}(1),\ldots,M_{F}(r) (up to and including r=dr=d),

Plugging this into Thm. 5, and using Lemma 3, it is straightforward to derive the following corollary (see Sec. 7 for a formal derivation):

where logˉ(z):=max{1,log(z)}\bar{\log}(z):=\max\{1,\log(z)\}.

Ignoring logarithmic factors and replacing the min\min by its first argument, the bound in the corollary is at most

Assuming jMF(j)\prod_{j}M_{F}(j) and jMF(j)/Γ\prod_{j}M_{F}(j)/\Gamma are bounded by a constant, we get a bound which does not depend on the width or depth of the network: In other words, it is possible to make this bound smaller than any fixed ϵ\epsilon, with a sample size mm independent of the network’s size. On the other hand, the bound in Corollary 1 is also bounded by

which is the bound one would get from an immediate application of Thm. 1, and implies that the asymptotic rate (as a function of mm) is still maintained. As discussed in the introduction, the assumption that jMF(j)\prod_{j}M_{F}(j) is a constant is certainly a strong one in practice, but to the best of our knowledge, is the first norm-based assumption which leads to size independence.

Next, we apply Thm. 5 to the results in Bartlett et al. (2017), which as discussed in the introduction, provide a depth-dependent bound using a different set of norms. Specifically, they obtain the following intermediary result in deriving their generalization bound:

Let H\mathcal{H} be the hypothesis class of depth-dd, width-hh real-valued networks on X={x:xB}\mathcal{X}=\{\mathbf{x}:\|\mathbf{x}\|\leq B\}, using 11-Lipschitz activation functions, given by

for some fixed parameters {L(j),M(j)}j=1d\{L(j),M(j)\}_{j=1}^{d}. Then the Rademacher complexity R^m(H)\hat{\mathcal{R}}_{m}(\mathcal{H}) is upper-bounded by

As discussed in the introduction, L(j)L(j) can never be smaller than 11, hence the bound scales at least as d3/m\sqrt{d^{3}/m}. However, using the bound above together with Thm. 5 and Lemma 3, we can get the following corollary (where for simplicity, we assume that L(j)L(j) for all jj are uniformly bounded by some LL):

where logˉ(z):=max{1,log(z)}\bar{\log}(z):=\max\{1,\log(z)\}.

As before, by replacing the min\min by its first argument, we get a bound which is fully independent of the network size, assuming the norms are suitably bounded. To give a concrete example, if we take p=2p=2 (so that the Mp(j)M_{p}(j) constraints correspond to the Frobenius norm), and ignore lower-order logarithmic factors, we get a bound scaling as

In contrast, a direct application of Thm. 6 in the same setting leads to a bound of

Finally, we note that since the Eq. (3), based on a PAC-Bayes analysis, is always weaker than Eq. (2) (as noted in Bartlett et al. (2017)), Corollary 2 also gives a size-independent version of Eq. (3).

A Lower Bound for Schatten Norms

In this section, we present a lower bound on the Rademacher complexity, for the class of neural networks with parameter matrices of bounded Schatten norms. The formal result is the following:

This theorem strengthens Theorem 3.6 in Bartlett et al. (2017), in the sense that they only considered the case p=p=\infty, and did not have a dependence on hh. On the other hand, they consider bounds which hold for any choice of x1,,xm\mathbf{x}_{1},\ldots,\mathbf{x}_{m}, while we consider bounds uniform over x1,,xm\mathbf{x}_{1},\ldots,\mathbf{x}_{m} for simplicity. Moreover, for network depths larger than 11, our construction requires a non element-wise activation function. The lower bound has the following implications:

Like Bartlett et al. (2017), the theorem implies that by controlling just the norms of each parameter matrix, a dependence on the product of the norms is generally inevitable.

For p=p=\infty, we see that there is an inevitable h1/2h^{1/2} factor in the bound, which implies that controlling the spectral norm is insufficient to get size-independent bounds (at least, independent of the width hh). More generally, any Schatten pp-norm control with p>2p>2 will be insufficient to get such size independence.

For p=2p=2 (i.e. Frobenius norm bounds MF(1),,MF(d)M_{F}(1),\ldots,M_{F}(d)), the lower bound becomes size-independent, and on the order of Bj=1dMF(j)m\frac{B\prod_{j=1}^{d}M_{F}(j)}{\sqrt{m}}. The dependence on MF(j)M_{F}(j) is similar to our upper bound in Corollary 1 up to logarithmic factors (although the dependence on mm is worse).

Additional Remarks

So far we have proved upper bounds on the empirical Rademacher complexity of a fixed class of neural networks, of the form

for some complexity measure C()C(\cdot) and a parameter LL. These imply high-probability learning guarantees for algorithms which return predictors in HL\mathcal{H}_{L}. However, in the context of norm-based constraints, practical algorithms for neural networks usually perform unconstrained optimization, and therefore are not guaranteed a-priori to return a predictor in some fixed HL\mathcal{H}_{L}. Fortunately, it is straightforward to convert these bounds into probabilistic guarantees for any neural network hh, with the bound scaling appropriately with the complexity C(h)C(h) of the particular network. We note that such post-hoc guarantees have also been stated in the context of some previous sample complexity bounds for neural networks (e.g., Bartlett et al. (2017); Neyshabur et al. (2017)). It is achieved, for instance, by a union bound over, say, a doubling scale of the complexity. We refer to the proof of the margin bound of Koltchinskii and Panchenko (2002, Theorem 2) for an example of this technique.

2 Complexity of Lipschitz Networks

Taking this to the extreme, we can also bound the complexity of neural networks computing Lipschitz functions, by studying the complexity of all Lipschitz functions over the domain X\mathcal{X}. It is easily verified that in our setting, if we consider the class H\mathcal{H} of depth-dd networks, where each parameter matrix has spectral norm at most M(j)M(j), then the network must be j=1dM(j)\prod_{j=1}^{d}M(j)-Lipschitz. Using well-known estimates of the covering numbers of Lipschitz functions over X\mathcal{X}, we get that

Proofs

We first prove Lemma 2. Letting wj\mathbf{w}_{j} denote the jj-th row of a matrix WW, we have

Since g(z)g(z)+g(z)g(|z|)\leq g(z)+g(-z), the left-hand side of Eq. (11) is upper bounded by

and the proof is concluded exactly as in Lemma 1 by appealing to Ledoux and Talagrand (1991).

We now turn to Thm. 2, whose proof is rather similar to that of Thm. 1. Fixing λ>0\lambda>0 to be chosen later, the Rademacher complexity can be upper bounded as

Applying the same argument as in the proof of Thm. 1, and using Lemma 2, we can upper bound the above by

where M=j=1dM(j)M=\prod_{j=1}^{d}M(j). Letting xi,jx_{i,j} denote the jj-th coordinate of xi\mathbf{x}_{i}, and using symmetry, the expectation inside the log can be re-written as

where in the last step we used the fact that exp(z)+exp(z)2exp(z2/2)\frac{\exp(z)+\exp(-z)}{2}\leq\exp(z^{2}/2). Further upper bounding this by 2nmaxjexp(M2λ2i=1dxi,j2)2n\max_{j}\exp\left(M^{2}\lambda^{2}\sum_{i=1}^{d}x_{i,j}^{2}\right) and plugging back to Eq. (13), we get

Choosing λ=d+1+log(n)M2maxji=1mxi,j2\lambda=\sqrt{\frac{d+1+\log(n)}{M^{2}\max_{j}\sum_{i=1}^{m}x_{i,j}^{2}}}, we can upper bound the above by

2 Proof of Thm. 3

The proof will build on the following few technical lemmas.

which equals (WppWp)1/p\left(\|W\|_{p}^{p}-\|W\|^{p}\right)^{1/p}. ∎

By a simple calculation, we have that the Lipschitz constant of the function NWbsN_{W_{b}^{s}} is at most j=bsWj\prod_{j=b}^{s}\|W_{j}\|.

Assume for now that 2sd12\leq s\leq d-1. By definition, we have

The Lipschitz constant of the function NWs+1dN_{W_{s+1}^{d}} is at most j=s+1dWj\prod_{j=s+1}^{d}\|W_{j}\|, and the norm of NW1s1(x)N_{W_{1}^{s-1}}(\mathbf{x}) is at most xj=1s1Wj\|\mathbf{x}\|\prod_{j=1}^{s-1}\|W_{j}\|. Therefore, for any xX\mathbf{x}\in\mathcal{X},

from which the result follows after a simplification. The cases s=1s=1 and s=ds=d are handled in exactly the same manner. ∎

Suppose that NW1dN_{W_{1}^{d}} is such that j=1dWjΓ\prod_{j=1}^{d}\|W_{j}\|\geq\Gamma and j=1dWjpM\prod_{j=1}^{d}\|W_{j}\|_{p}\leq M. Then for any r{1,,d}r\in\{1,\ldots,d\},

Fixing some rr, and using the stated assumptions as well as the fact that WpW\|W\|_{p}\geq\|W\| for any pp, we have

Taking the rr-th root from both sides, the result follows. ∎

With these lemmas in hand, we can now turn to prove Thm. 3. A direct application of Lemma 6 implies that for any r{1,,d}r\in\{1,\ldots,d\},

Substituting Eq. (14) into Eq. (15), we get that

Suppose for now that rr is such that rplog(M/Γ)r\geq p\log(M/\Gamma). Using the fact that exp(z)1+2z\exp(z)\leq 1+2z for any zz\in, it follows that the above is at most

3 Proof of Thm. 4

To prove the theorem, we use a straightforward covering number argument, beginning with a few definitions.

Given any function class F\mathcal{F}, a metric dd on the elements of F\mathcal{F}, and ϵ>0\epsilon>0, we let the covering number N(F,d,ϵ)\mathcal{N}(\mathcal{F},d,\epsilon) denote the minimal number nn of functions f1,f2,,fnf_{1},f_{2},\ldots,f_{n} in F\mathcal{F}, such that for all fFf\in\mathcal{F}, mini=1,,nd(fi,f)ϵ\min_{i=1,\ldots,n}d(f_{i},f)\leq\epsilon. In particular, fix some set of data points x1,,xm\mathbf{x}_{1},\ldots,\mathbf{x}_{m}, and define the empirical L2L_{2} distance

Also, given a function class F\mathcal{F} and x1,,xm\mathbf{x}_{1},\ldots,\mathbf{x}_{m}, we let

denote the (empirical) Gaussian complexity of F\mathcal{F}, where η1,,ηn\eta_{1},\ldots,\eta_{n} are i.i.d. standard Gaussian random variables. It is well known that R^m(H)\hat{\mathcal{R}}_{m}(\mathcal{H}) and G^m(H)\hat{\mathcal{G}}_{m}(\mathcal{H}) are equivalent up to a clog(m)c\sqrt{\log(m)} factor (Ledoux and Talagrand, 1991, pg. 97). By Sudakov’s minoration theorem (see Theorem 3.18 in Ledoux and Talagrand (1991)), we have that for all α>0\alpha>0

With these definitions in hand, we turn to prove the theorem. We first note that FL,aH\mathcal{F}_{L,a}\circ\mathcal{H} is equivalent to the class {Lg()+a:gF1,0H}\{Lg(\cdot)+a:g\in\mathcal{F}_{1,0}\circ\mathcal{H}\}, and therefore

again for some numerical constant c>0c^{\prime}>0. The first inequality follows from the fact that for any functions f,ff,f^{\prime}, it holds that d^m(f,f)d^(f,f):=supxf(x)f(x)\hat{d}_{m}(f,f^{\prime})\leq\hat{d}_{\infty}(f,f^{\prime}):=\sup_{\mathbf{x}}|f(\mathbf{x})-f^{\prime}(\mathbf{x}^{\prime})|, so it is enough to upper bound N(F,d^,ϵ)\mathcal{N}(\mathcal{F},\hat{d}_{\infty},\epsilon). To do so, we first notice that the range of any fFf\in\mathcal{F} is in [R,R][-R,R]. Discretize [R,R]×[R,R][-R,R]\times[-R,R] into a two-dimensional grid Ux×Uy\mathcal{U}_{x}\times\mathcal{U}_{y}, where

Given any fFf\in\mathcal{F}, construct the piecewise-linear function ff^{\prime} as follows: For any input xUxx\in\mathcal{U}_{x}, let f(x)f^{\prime}(x) be the point in Uy\mathcal{U}_{y} nearest to f(x)f(x) (breaking ties arbitrarily), and let the rest of ff^{\prime} be constructed as a linear interpolation of these points on Ux\mathcal{U}_{x}. It is easily verified that supx[R,R]f(x)f(x)ϵ\sup_{x\in[-R,R]}|f(x)-f^{\prime}(x)|\leq\epsilon. Moreover, note that for two neighboring points x,xx,x^{\prime} in Ux\mathcal{U}_{x}, the points f(x),f(x)f^{\prime}(x),f^{\prime}(x^{\prime}) on Uy\mathcal{U}_{y} must be neighboring or the same. Therefore, each such function ff^{\prime} can be parameterized by a vector of the form {,0,+}Ux1\{-,0,+\}^{|\mathcal{U}_{x}|-1}, which specifies whether (starting from the origin) ff^{\prime} goes up, down, or remains the same on each of its linear segments. The number of such functions is at most 3Ux132R/ϵ+13^{|\mathcal{U}_{x}|-1}\leq 3^{2R/\epsilon+1}, and therefore N(F,d^,ϵ)32R/ϵ+1\mathcal{N}(\mathcal{F},\hat{d}_{\infty},\epsilon)\leq 3^{2R/\epsilon+1}. Recalling that d^(f,f)\hat{d}_{\infty}(f,f^{\prime}) majorizes d^m(f,f)\hat{d}_{m}(f,f^{\prime}), we get Eq. (18).

To see this, pick any fFf\in\mathcal{F} and hHh\in\mathcal{H}, and let f,hf^{\prime},h^{\prime} be the respective closest functions in the cover of F\mathcal{F} and H\mathcal{H} (at scale ϵ/2\epsilon/2). By the triangle inequality and the easily verified fact that ff^{\prime} is 11-Lipschitz, we have

Therefore, we can cover FH\mathcal{F}\circ\mathcal{H} (at scale ϵ\epsilon) by taking f(h())f^{\prime}(h^{\prime}(\cdot)) for all possible choices of f,hf^{\prime},h^{\prime} from the covers of F,H\mathcal{F},\mathcal{H} (at scale ϵ/2\epsilon/2), leading to Eq. (19).

Combining Eq. (16), Eq. (18) and Eq. (19), we get that

for some numerical constant c>0c^{\prime\prime}>0. Finally, we use Dudley’s entropy integral, which together with the equation above, implies the following for some numerical constant c>0c>0 (possibly changing from row to row):

Choosing in particular α=R/m\alpha=R/\sqrt{m}, we get the upper bound

for some c>0c>0. Plugging this into Eq. (7.3), and upper bounding G^m(H)\hat{\mathcal{G}}_{m}(\mathcal{H}) by clog(m)R^m(H)c\sqrt{\log(m)}\hat{\mathcal{R}}_{m}(\mathcal{H}) (see (Ledoux and Talagrand, 1991)), the result follows.

4 Proof of Thm. 5

We first need the following lemma, which bounds the empirical Rademacher complexity of the union of a collection of bounded function classes.

so ϕ\phi satisfies a bounded differences assumption with variance factor A2/mA^{2}/m. By McDiarmid’s inequality (Theorem 6.2, Boucheron et al. (2013)) it follows that for t>0t>0,

(The same inequality, up to a constant factor, also follows directly from Theorem 4.7 in Ledoux and Talagrand (1991).) A union bound then shows that for t>0t>0,

We next continue with the proof of Thm. 5. It is enough to prove the bound for any fixed r{1,,d}r\in\{1,\ldots,d\}, and then take the infimum over any such rr.

This function is equivalent to the composition of the function

Notice that for each r{1,,r}r^{\prime}\in\{1,\ldots,r\}, all functions of Hr\mathcal{H}_{r^{\prime}} have output is bounded in ±Bj=1rM(j)\pm B\prod_{j=1}^{r^{\prime}}M(j)). Recall also that F1γj=r+1dM(j),a\mathcal{F}_{\frac{1}{\gamma}\prod_{j=r^{\prime}+1}^{d}M(j),a} is the class consisting of real-valued j=r+1dM(j)\prod_{j=r^{\prime}+1}^{d}M(j)-Lipschitz functions ff such that f(0)=af(0)=a. As a result, we can apply Thm. 4 and obtain that for r{1,,r}r^{\prime}\in\{1,\ldots,r\},

Notice that all functions in r{1,,r}F1γj=r+1dM(j),aHr\bigcup_{r^{\prime}\in\{1,\ldots,r\}}\mathcal{F}_{\frac{1}{\gamma}\prod_{j=r^{\prime}+1}^{d}M(j),a}\circ\mathcal{H}_{r^{\prime}} have output bounded in ±A\pm A, where

By Lemma 7, for an appropriate constant cc^{\prime},

for an appropriate constant cc^{\prime\prime}. As mentioned at the beginning of the proof, this upper bound holds for any fixed r{1,,d}r\in\{1,\ldots,d\}, from which the result follows using the assumption that aBj=1dM(j)γ|a|\leq\frac{B\prod_{j=1}^{d}M(j)}{\gamma}.

5 Proof of Lemma 3

We will show that for any α,β,b,c,n\alpha,\beta,b,c,n as stated in the lemma, there always exists a choice of r{1,,d}r\in\{1,\ldots,d\} such that

Since the left hand side is also trivially at most dαn\frac{d^{\alpha}}{n}, the result follows. We prove this inequality by a case analysis:

If (bn/c)1α+β[1,d](bn/c)^{\frac{1}{\alpha+\beta}}\in[1,d], pick r=(bn/c)1α+β{1,2,,d}r=\left\lfloor(bn/c)^{\frac{1}{\alpha+\beta}}\right\rfloor\in\{1,2,\ldots,d\}, in which case

If (bn/c)1α+β>d(bn/c)^{\frac{1}{\alpha+\beta}}>d, it follows that

6 Proof of Corollary 1

Next, we plug Eq. (12) into Thm. 5. We use p=2p=2, let each Wj\mathcal{W}_{j} be the space of all matrices, and choose M(j)=MF(j)M(j)=M_{F}(j) for all jj, noting that WjM(j)WjFMF(j)1\frac{\|W_{j}\|}{M(j)}\leq\frac{\|W_{j}\|_{F}}{M_{F}(j)}\leq 1. Since rr\sqrt{r}\geq\sqrt{r^{\prime}} for all rrr^{\prime}\leq r, it follows that

7 Proof of Corollary 2

where c=logˉ3/2(m)c=\bar{\log}^{3/2}(m), n=mn=\sqrt{m}, and b=logˉ(1Γj=1dMp(j))1/pb=\bar{\log}\left(\frac{1}{\Gamma}\prod_{j=1}^{d}M_{p}(j)\right)^{1/p} (note that we use here logˉ\bar{\log} instead of log\log so the conditions of Lemma 3, to be used shortly, will be satisfied). As in Corollary 1, the O(1+logrm)\mathcal{O}\left(\frac{1+\sqrt{\log r}}{\sqrt{m}}\right) term from Thm. 5 is absorbed into the O(cr3/2n)\mathcal{O}\left(\frac{cr^{3/2}}{n}\right) term in the minimization over rr in the above equation.

8 Proof of Thm. 7

In particular, by choosing wk=h1/psign(iAkϵi)w_{k}=h^{-1/p}\cdot\text{sign}\left(\sum_{i\in A_{k}}\epsilon_{i}\right) for all kk, we can lower bound the above by

and since Akm/h|A_{k}|\geq\lfloor m/h\rfloor by its definition, we get a lower bound of

Taking the best of this lower bound and the lower bound in Eq. (23), the result follows.

Acknowledgements

We thank Ziwei Ji and Matus Telgarsky for pointing that an additional union bound is required in the proof of Thm. 5, resulting in an extra logarithmic factor, as well as Pritish Kamath for pointing out a bug in the proof of Theorem 7, resulting in a slight change to the construction. Noah Golowich is grateful for research funding from Harvard University, including the Herchel Smith and PRISE fellowships.

References