Uniform convergence may be unable to explain generalization in deep learning

Vaishnavh Nagarajan, J. Zico Kolter

Introduction

Explaining why overparameterized deep networks generalize well has become an important open question in deep learning. How is it possible that a large network can be trained to perfectly fit randomly labeled data (essentially by memorizing the labels), and yet, the same network when trained to perfectly fit real training data, generalizes well to unseen data? This called for a “rethinking” of conventional, algorithm-independent techniques to explain generalization. Specifically, it was argued that learning-theoretic approaches must be reformed by identifying and incorporating the implicit bias/regularization of stochastic gradient descent (SGD) . Subsequently, a huge variety of novel and refined, algorithm-dependent generalization bounds for deep networks have been developed, all based on uniform convergence, the most widely used tool in learning theory. The ultimate goal of this ongoing endeavor is to derive bounds on the generalization error that (a) are small, ideally non-vacuous (i.e., <1<1), (b) reflect the same width/depth dependence as the generalization error (e.g., become smaller with increasing width, as has been surprisingly observed in practice), (c) apply to the network learned by SGD (without any modification or explicit regularization) and (d) increase with the proportion of randomly flipped training labels (i.e., increase with memorization).

While every bound meets some of these criteria (and sheds a valuable but partial insight into generalization in deep learning), there is no known bound that meets all of them simultaneously. While most bounds apply to the original network, they are neither numerically small for realistic dataset sizes, nor exhibit the desired width/depth dependencies (in fact, these bounds grow exponentially with the depth). The remaining bounds hold either only on a compressed network or a stochastic network or a network that has been further modified via optimization or more than one of the above . Extending these bounds to the original network is understood to be highly non-trivial . While strong width-independent bounds have been derived for two-layer ReLU networks , these rely on a carefully curated, small learning rate and/or large batch size. (We refer the reader to Appendix A for a tabular summary of these bounds.)

In our paper, we bring to light another fundamental issue with existing bounds. We demonstrate that these bounds violate another natural but largely overlooked criterion for explaining generalization: (e) the bounds should decrease with the dataset size at the same rate as the generalization error. In fact, we empirically observe that these bounds can increase with dataset size, which is arguably a more concerning observation than the fact that they are large for a specific dataset size.

Motivated by the seemingly insurmountable hurdles towards developing bounds satisfying all the above five necessary criteria, we take a step back and examine how the underlying technique of uniform convergence may itself be inherently limited in the overparameterized regime. Specifically, we present examples of overparameterized linear classifiers and neural networks trained by GD (or SGD) where uniform convergence can provably fail to explain generalization. Intuitively, our examples highlight that overparameterized models trained by gradient descent can learn decision boundaries that are largely “simple” – and hence generalize well – but have “microscopic complexities” which cannot be explained away by uniform convergence. Thus our results call into question the active ongoing pursuit of using uniform convergence to fully explain generalization in deep learning.

We first show that in practice certain weight norms of deep ReLU networks, such as the distance from initialization, increase polynomially with the number of training examples (denoted by mm). We then show that as a result, existing generalization bounds – all of which depend on such weight norms – fail to reflect even a dependence on mm even reasonably similar to the actual test error, violating criterion (e); for sufficiently small batch sizes, these bounds even grow with the number of examples. This observation uncovers a conceptual gap in our understanding of the puzzle, by pointing towards a source of vacuity unrelated to parameter count.

As our second contribution, we consider three example setups of overparameterized models trained by (stochastic) gradient descent – a linear classifier, a sufficiently wide neural network with ReLUs and an infinite width neural network with exponential activations (with the hidden layer weights frozen) – that learn some underlying data distribution with small generalization error (say, at most ϵ\epsilon). These settings also simulate our observation that norms such as distance from initialization grow with dataset size mm. More importantly, we prove that, in these settings, any two-sided uniform convergence bound would yield a (nearly) vacuous generalization bound.

Notably, this vacuity holds even if we “aggressively” take implicit regularization into account while applying uniform convergence – described more concretely as follows. Recall that roughly speaking a uniform convergence bound essentially evaluates the complexity of a hypothesis class (see Definition 3.2). As suggested by Zhang et al. , one can tighten uniform convergence bounds by pruning the hypothesis class to remove extraneous hypotheses never picked by the learning algorithm for the data distribution of interest. In our setups, even if we apply uniform convergence on the set of only those hypotheses picked by the learner whose test errors are all negligible (at most ϵ\epsilon), one can get no better than a nearly vacuous bound on the generalization error (that is at least 1ϵ1-\epsilon). In this sense, we say that uniform convergence provably cannot explain generalization in our settings. Finally, we note that while nearly all existing uniform convergence-based techniques are two-sided, we show that even PAC-Bayesian bounds, which are typically presented only as one-sided convergence, also boil down to nearly vacuous guarantees in our settings.

1 Related Work

Prior works like Neyshabur et al. and Nagarajan and Kolter have studied the behavior of weight norms in deep learning. Although these works do not explicitly study the dependence of these norms on training set size mm, one can infer from their plots that weight norms of deep networks show some increase with mm. Belkin et al. reported a similar paradox in kernel learning, observing that norms that appear in kernel generalization bounds increase with mm, and that this is due to noise in the labels. Kawaguchi et al. showed that there exist linear models with arbitrarily large weight norms that can generalize well, although such weights are not necessarily found by gradient descent. We crucially supplement these observations in three ways. First, we empirically and theoretically demonstrate how, even with zero label noise (unlike ) and by gradient descent (unlike ), a significant level of mm-dependence can arise in the weight norms – significant enough to make even the generalization bound grow with mm. Next, we identify uniform convergence as the root cause behind this issue, and thirdly and most importantly, we provably demonstrate this is so.

Weaknesses of Uniform Convergence.

Traditional wisdom is that uniform convergence bounds are a bad choice for complex classifiers like k-nearest neighbors because these hypotheses classes have infinite VC-dimension (which motivated the need for stability based generalization bounds in these cases ). However, this sort of an argument against uniform convergence may still leave one with the faint hope that, by aggressively pruning the hypothesis class (depending on the algorithm and the data distribution), one can achieve meaningful uniform convergence. In contrast, we seek to rigorously and thoroughly rule out uniform convergence in the settings we study. We do this by first defining the tightest form of uniform convergence in Definition 3.3 – one that lower bounds any uniform convergence bound – and then showing that even this bound is vacuous in our settings. Additionally, we note that we show this kind of failure of uniform convergence for linear classifiers, which is a much simpler model compared to k-nearest neighbors.

For deep networks, Zhang et al. showed that applying uniform convergence on the whole hypothesis class fails, and that it should instead be applied in an algorithm-dependent way. Ours is a much different claim – that uniform convergence is inherently problematic in that even the algorithm-dependent application would fail – casting doubt on the rich line of post-Zhang et al. algorithm-dependent approaches. At the same time, we must add the disclaimer that our results do not preclude the fact that uniform convergence may still work if GD is run with explicit regularization (such as weight decay). Such a regularized setting however, is not the main focus of the generalization puzzle .

Prior works have also focused on understanding uniform convergence for learnability of learning problems. Roughly speaking, learnability is a strict notion that does not have to hold even though an algorithm may generalize well for simple distributions in a learning problem. While we defer the details of these works in Appendix I, we emphasize here that these results are orthogonal to (i.e., neither imply nor contradict) our results.

Existing bounds vs. training set size

As we stated in criterion (e) in the introduction, a fundamental requirement from a generalization bound, however numerically large the bound may be, is that it should vary inversely with the size of the training dataset size (m)(m) like the observed generalization error. Such a requirement is satisfied even by standard parameter-count-based VC-dimension bounds, like O(dh/m)\mathcal{O}(dh/\sqrt{m}) for depth dd, width hh ReLU networks . Recent works have “tightened” the parameter-count-dependent terms in these bounds by replacing them with seemingly innocuous norm-based quantities; however, we show below that this has also inadvertently introduced training-set-size-count dependencies in the numerator, contributing to the vacuity of bounds. With these dependencies, the generalization bounds even increase with training dataset size for small batch sizes.

Setup and notations. We focus on fully connected networks of depth d=5d=5, width h=1024h=1024 trained on MNIST, although we consider other settings in Appendix B. We use SGD with learning rate 0.10.1 and batch size 11 to minimize cross-entropy loss until 99%99\% of the training data are classified correctly by a margin of at least γ=10\gamma^{\star}=10 i.e., if we denote by f(x)[y]f(\boldsymbol{\mathbf{x}})[y] the real-valued logit output (i.e., pre-softmax) on class yy for an input x\boldsymbol{\mathbf{x}}, we ensure that for 99%99\% of the data (x,y)(\boldsymbol{\mathbf{x}},y), the margin Γ(f(x),y):=f(x)[y]maxyyf(x)[y]\Gamma(f(\boldsymbol{\mathbf{x}}),y):=f(\boldsymbol{\mathbf{x}})[y]-\max_{y^{\prime}\neq y}f(\boldsymbol{\mathbf{x}})[y^{\prime}] is at least γ\gamma^{\star}. We emphasize that, from the perspective of generalization guarantees, this stopping criterion helps standardize training across different hyperparameter values, including different values of mm . Now, observe that for this particular stopping criterion, the test error empirically decreases with size mm as 1/m0.431/m^{0.43} as seen in Figure 1 (third plot). However, we will see that the story is starkly different for the generalization bounds.

The bounds grow with training set size m𝑚m.

We now turn to evaluating existing guarantees from Neyshabur et al. and Bartlett et al. . As we note later, our observations apply to many other bounds too. Let W1,,WdW_{1},\ldots,W_{d} be the weights of the learned network (with W1W_{1} being the weights adjacent to the inputs), Z1,,ZdZ_{1},\ldots,Z_{d} the random initialization, D\mathcal{D} the true data distribution and SS the training dataset. For all inputs x\boldsymbol{\mathbf{x}}, let x2B\|\boldsymbol{\mathbf{x}}\|_{2}\leq B. Let 2,F,2,1\|\cdot\|_{2},\|\cdot\|_{F},\|\cdot\|_{2,1} denote the spectral norm, the Frobenius norm and the matrix (2,1)(2,1)-norm respectively; let 1[]\mathbf{1}[\cdot] be the indicator function. Recall that Γ(f(x),y):=f(x)[y]maxyyf(x)[y]\Gamma(f(\boldsymbol{\mathbf{x}}),y):=f(\boldsymbol{\mathbf{x}})[y]-\max_{y^{\prime}\neq y}f(\boldsymbol{\mathbf{x}})[y^{\prime}] denotes the margin of the network on a datapoint. Then, for any constant γ\gamma, these generalization guarantees are written as follows, ignoring log factors:

Here the generalization error bound is of the form O(Bdhγmk=1dWk2×dist)\mathcal{O}\left(\frac{Bd\sqrt{h}}{\gamma\sqrt{m}}\prod_{k=1}^{d}\|W_{k}\|_{2}\times\texttt{dist}\right) where dist equals k=1dWkZkF2Wk22\sqrt{\sum_{k=1}^{d}\frac{\|W_{k}-Z_{k}\|_{F}^{2}}{\|W_{k}\|^{2}_{2}}} in and 1dh(k=1d(WkZk2,1Wk2)2/3)3/2\frac{1}{d\sqrt{h}}\left({\sum_{k=1}^{d}\left(\frac{\|W_{k}-Z_{k}\|_{2,1}}{\|W_{k}\|_{2}}\right)^{2/3}}\right)^{3/2} in .

In our experiments, since we train the networks to fit at least 99%99\% of the datapoints with a margin of 1010, in the above bounds, we set γ=10\gamma=10 so that the first train error term in the right hand side of Equation 1 becomes a small value of at most 0.010.01. We then plot in Figure 1 (fourth plot), the second term above, namely the generalization error bounds, and observe that all these bounds grow with the sample size mm as Ω(m0.68)\Omega(m^{0.68}), thanks to the fact that the terms in the numerator of these bounds grow with mm. Since we are free to plug in γ\gamma in Equation 1, one may wonder whether there exists a better choice of γ\gamma for which we can observe a smaller increase on mm (since the plotted terms inversely depend on γ\gamma). However, in Appendix Figure 6 we establish that even for larger values of γ\gamma, this mm-dependence remains. Also note that, although we do not plot the bounds from , these have nearly identical norms in their numerator, and so one would not expect these bounds to show radically better behavior with respect to mm. Finally, we defer experiments conducted for other varied settings, and the neural network bound from to Appendix B.

While the bounds might show better mm-dependence for other settings – indeed, for larger batches, we show in Appendix B that the bounds behave better – we believe that the egregious break down of these bounds in this setting (and many other hyperparameter settings as presented in Appendix B) must imply fundamental issues with the bounds themselves. While this may be addressed to some extent with a better understanding of implicit regularization in deep learning, we regard our observations as a call for taking a step back and clearly understanding any inherent limitations in the theoretical tool underlying all these bounds namely, uniform convergence. Side note: Before we proceed to the next section, where we blame uniform convergence for the above problems, we briefly note that we considered another simpler possibility. Specifically, we hypothesized that, for some (not all) existing bounds, the above problems could arise from an issue that does not involve uniform convergence, which we term as pseudo-overfitting. Roughly speaking, a classifier pseudo-overfits when its decision boundary is simple but its real-valued output has large “bumps” around some or all of its training datapoint. As discussed in Appendix C, deep networks pseudo-overfit only to a limited extent, and hence psuedo-overfitting does not provide a complete explanation for the issues faced by these bounds.

Provable failure of uniform convergence

For a given δ(0,1)\delta\in(0,1), the generalization error of the algorithm is essentially a bound on the difference between the error of the hypothesis hSh_{S} learned on a training set SS and the expected error over D\mathcal{D}, that holds with high probability of at least 1δ1-\delta over the draws of SS. More formally:

The generalization error of A\mathcal{A} with respect to loss L\mathcal{L} is the smallest value ϵgen(m,δ)\epsilon_{\text{gen}}(m,\delta) such that: PrSDm[LD(hS)L^S(hS)ϵgen(m,δ)]1δ\textrm{Pr}_{S\sim\mathcal{D}^{m}}\left[\mathcal{L}_{\mathcal{D}}(h_{S})-\hat{\mathcal{L}}_{S}(h_{S})\leq\epsilon_{\text{gen}}(m,\delta)\right]\geq 1-\delta.

To theoretically bound the generalization error of the algorithm, the most common approach is to provide a two-sided uniform convergence bound on the hypothesis class used by the algorithm, where, for a given draw of SS, we look at convergence for all the hypotheses in H\mathcal{H} instead of just hSh_{S}:

The uniform convergence bound with respect to loss L\mathcal{L} is the smallest value ϵunif(m,δ)\epsilon_{\text{unif}}(m,\delta) such that: PrSDm[suphHLD(h)L^S(h)ϵunif(m,δ)]1δ\textrm{Pr}_{S\sim\mathcal{D}^{m}}\left[\sup_{h\in\mathcal{H}}\left|\mathcal{L}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)\right|\leq\epsilon_{\text{unif}}(m,\delta)\right]\geq 1-\delta.

Tightest algorithm-dependent uniform convergence.

The bound given by ϵunif\epsilon_{\text{unif}} can be tightened by ignoring many extraneous hypotheses in H\mathcal{H} never picked by A\mathcal{A} for a given simple distribution D\mathcal{D}. This is typically done by focusing on a norm-bounded class of hypotheses that the algorithm A\mathcal{A} implicitly restricts itself to. Let us take this to the extreme by applying uniform convergence on “the smallest possible class” of hypotheses, namely, only those hypotheses that are picked by A\mathcal{A} under D\mathcal{D}, excluding everything else. Observe that pruning the hypothesis class any further would not imply a bound on the generalization error, and hence applying uniform convergence on this aggressively pruned hypothesis class would yield the tightest possible uniform convergence bound. Recall that we care about this formulation because our goal is to rigorously and thoroughly rule out the possibility that no kind of uniform convergence bound, however cleverly applied, can explain generalization in our settings of interest (which we will describe later).

To formally capture this bound, it is helpful to first rephrase the above definition of ϵunif\epsilon_{\text{unif}}: we can say that ϵunif(m,δ)\epsilon_{\text{unif}}(m,\delta) is the smallest value for which there exists a set of sample sets Sδ(X×{1,1})m\mathcal{S}_{\delta}\subseteq(\mathcal{X}\times\{-1,1\})^{m} for which PrSDm[SSδ]1δPr_{S\sim\mathcal{D}^{m}}[S\in\mathcal{S}_{\delta}]\geq 1-\delta and furthermore, supSSδsuphHLD(h)L^S(h)ϵunif(m,δ)\sup_{S\in\mathcal{S}_{\delta}}\sup_{h\in\mathcal{H}}|\mathcal{L}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)|\leq\epsilon_{\text{unif}}(m,\delta). Observe that this definition is equivalent to Definition 3.2. Extending this rephrased definition, we can define the tightest uniform convergence bound by replacing H\mathcal{H} here with only those hypotheses that are explored by the algorithm A\mathcal{A} under the datasets belonging to Sδ\mathcal{S}_{\delta}:

The tightest algorithm-dependent uniform convergence bound with respect to loss L\mathcal{L} is the smallest value ϵunif-alg(m,δ)\epsilon_{\text{unif-alg}}(m,\delta) for which there exists a set of sample sets Sδ\mathcal{S}_{\delta} such that PrSDm[SSδ]1δPr_{S\sim\mathcal{D}^{m}}[S\in\mathcal{S}_{\delta}]\geq 1-\delta and if we define the space of hypotheses explored by A\mathcal{A} on Sδ\mathcal{S}_{\delta} as Hδ:=SSδ{hS}H\mathcal{H}_{\delta}:=\bigcup_{S\in\mathcal{S}_{\delta}}\{h_{S}\}\subseteq\mathcal{H}, the following holds: supSSδsuphHδLD(h)L^S(h)ϵunif-alg(m,δ)\sup_{S\in\mathcal{S}_{\delta}}\sup_{h\in\mathcal{H}_{\delta}}\left|\mathcal{L}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)\right|\leq\epsilon_{\text{unif-alg}}(m,\delta).

In the following sections, through examples of overparameterized models trained by GD (or SGD), we argue how even the above tightest algorithm-dependent uniform convergence can fail to explain generalization. i.e., in these settings, even though ϵgen\epsilon_{\text{gen}} is smaller than a negligible value ϵ\epsilon, we show that ϵunif-alg\epsilon_{\text{unif-alg}} is large (specifically, at least 1ϵ1-\epsilon). Before we delve into these examples, below we quickly outline the key mathematical idea by which uniform convergence is made to fail.

1 High-dimensional linear classifier

Although we present a neural network example in the next section, we first emphasize why it is also important to understand how uniform convergence could fail for linear classifiers trained using GD. First, it is more natural to expect uniform convergence to yield poorer bounds in more complicated classifiers; linear models are arguably the simplest of classifiers, and hence showing failure of uniform convergence in these models is, in a sense, the most interesting. Secondly, recent works (e.g., ) have shown that as the width of a deep network goes to infinity, under some conditions, the network converges to a high-dimensional linear model (trained on a high-dimensional transformation of the data) – thus making the study of high-dimensional linear models relevant to us. Note that our example is not aimed at modeling the setup of such linearized neural networks. However, it does provide valuable intuition about the mechanism by which uniform convergence fails, and we show how this extends to neural networks in the later sections.

For any ϵ,δ>0,δ1/4\epsilon,\delta>0,\delta\leq 1/4, when D=Ω(max(mlnmδ,mln1ϵ))D=\Omega\left(\max\left(m\ln\frac{m}{\delta},m\ln\frac{1}{\epsilon}\right)\right), γ\gamma\in, the L(γ)\mathcal{L}^{(\gamma)} loss satisfies ϵgen(m,δ)ϵ\epsilon_{\text{gen}}(m,\delta)\leq\epsilon, while ϵunif-alg(m,δ)1ϵ\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon. Furthermore, for all γ0\gamma\geq 0, for the L(γ)\mathcal{L}^{(\gamma)} loss, ϵunif-alg(m,δ)1ϵgen(m,δ)\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon_{\text{gen}}(m,\delta).

Proof outline.

We now provide an outline of our argument for Theorem 3.1, deferring the proof to the appendix. First, the small generalization (and test) error arises from the fact that w1\boldsymbol{\mathbf{w}}_{1} is aligned correctly along the true boundary; at the same time, the noisy part of the classifier w2\boldsymbol{\mathbf{w}}_{2} is poorly aligned with at least 1ϵ1-\epsilon mass of the test inputs, and hence does not dominate the output of the classifier on test data – preserving the good fit of w1\boldsymbol{\mathbf{w}}_{1} on the test data. On the other hand, at a very high level, under the purview of uniform convergence, we can argue that the noise vector w2\boldsymbol{\mathbf{w}}_{2} is effectively stripped of its randomness. This misleads uniform convergence into believing that the DD noisy dimensions (where D>mD>m) contribute meaningfully to the representational complexity of the classifier, thereby giving nearly vacuous bounds. We describe this more concretely below.

As a key step in our argument, we show that w.h.p over draws of SS, even though the learned classifier hSh_{S} correctly classifies most of the randomly picked test data, it completely misclassifies a “bad” dataset, namely S={((x1,x2),y)    (x,y)S}S^{\prime}=\{((\boldsymbol{\mathbf{x}}_{1},-\boldsymbol{\mathbf{x}}_{2}),y)\;|\;(\boldsymbol{\mathbf{x}},y)\in S\} which is the noise-negated version of SS. Now recall that to compute ϵunif-alg\epsilon_{\text{unif-alg}} one has to begin by picking a sample set space Sδ\mathcal{S}_{\delta} of mass 1δ1-\delta. We first argue that for any choice of Sδ\mathcal{S}_{\delta}, there must exist SS_{\star} such that all the following four events hold: (i) SSδS_{\star}\in\mathcal{S}_{\delta}, (ii) the noise-negated SSδS_{\star}^{\prime}\in\mathcal{S}_{\delta}, (iii) hSh_{S_{\star}} has test error less than ϵ\epsilon and (iv) hSh_{S_{\star}} completely misclassifies SS_{\star}^{\prime}. We prove the existence of such an SS_{\star} by arguing that over draws from Dm\mathcal{D}^{m}, there is non-zero probability of picking a dataset that satisfies these four conditions. Note that our argument for this crucially makes use of the fact that we have designed the “bad” dataset in a way that it has the same distribution as the training set, namely Dm\mathcal{D}^{m}. Finally, for a given Sδ\mathcal{S}_{\delta}, if we have an SS_{\star} satisfying (i) to (iv), we can prove our claim as ϵunif-alg(m,δ)=supSSδsuphHδLD(h)L^S(h)LD(hS)L^S(hS)=ϵ1=1ϵ\epsilon_{\text{unif-alg}}(m,\delta)=\sup_{S\in\mathcal{S}_{\delta}}\sup_{h\in\mathcal{H}_{\delta}}|{\mathcal{L}}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)|\geq|{\mathcal{L}}_{\mathcal{D}}(h_{S_{\star}})-\hat{\mathcal{L}}_{S_{\star}^{\prime}}(h_{S_{\star}})|=|\epsilon-1|=1-\epsilon.

Our analysis depends on the fact that ϵunif-alg\epsilon_{\text{unif-alg}} is a two-sided convergence bound – which is what existing techniques bound – and our result would not apply for hypothetical one-sided uniform convergence bounds. While PAC-Bayes based bounds are typically presented as one-sided bounds, we show in Appendix J that even these are lower-bounded by the two-sided ϵunif-alg\epsilon_{\text{unif-alg}}. To the best of our knowledge, it is non-trivial to make any of these tools purely one-sided.

The classifier modified by setting w20\boldsymbol{\mathbf{w}}_{2}\leftarrow 0, has small test error and also enjoys non-vacuous bounds as it has very few parameters. However, such a bound would not fully explain why the original classifier generalizes well. One might then wonder if such a bound could be extended to the original classifier, like it was explored in Nagarajan and Kolter for deep networks. Our result implies that no such extension is possible in this particular example.

2 ReLU neural network

We now design a non-linearly separable task (with no “noisy” dimensions) where a sufficiently wide ReLU network trained in the standard manner, like in the experiments of Section 2 leads to failure of uniform convergence. For our argument, we will rely on a classifier trained empirically, in contrast to our linear examples where we rely on an analytically derived expression for the learned classifier. Thus, this section illustrates that the effects we modeled theoretically in the linear classifier are indeed reflected in typical training settings, even though here it is difficult to precisely analyze the learning process. We also refer the reader to Appendix F, where we present an example of a neural network with exponential activation functions for which we do derive a closed form expression.

Setup. We consider a distribution that was originally proposed in as the “adversarial spheres” dataset (although with slightly different hyperparameters) and was used to study the independent phenomenon of adversarial examples. Specifically, we consider 1000-dimensional data, where two classes are distributed uniformly over two origin-centered hyperspheres with radius 11 and 1.11.1 respectively. We vary the number of training examples from 4k4k to 65k65k (thus ranging through typical dataset sizes like that of MNIST). Observe that compared to the linear example, this data distribution is more realistic in two ways. First, we do not have specific dimensions in the data that are noisy and second, the data dimensionality here as such is a constant less than mm. Given samples from this distribution, we train a two-layer ReLU network with h=100kh=100k to minimize cross entropy loss using SGD with learning rate 0.10.1 and batch size 6464. We train the network until 99%99\% of the data is classified by a margin of 1010.

As shown in Figure 2 (blue line), in this setup, the 0-1 error (i.e., L(0)\mathcal{L}^{(0)}) as approximated by the test set, decreases with m[212,216]m\in[2^{12},2^{16}] at the rate of O(m0.5)O(m^{-0.5}). Now, to prove failure of uniform convergence, we empirically show that a completely misclassified “bad” dataset SS^{\prime} can be constructed in a manner similar to that of the previous example. In this setting, we pick SS^{\prime} by simply projecting every training datapoint on the inner hypersphere onto the outer and vice versa, and then flipping the labels. Then, as shown in Figure 2 (orange line), SS^{\prime} is completely misclassified by the learned network. Furthermore, like in the previous example, we have SDmS^{\prime}\sim\mathcal{D}^{m} because the distributions are uniform over the hyperspheres. Having established these facts, the rest of the argument follows like in the previous setting, implying failure of uniform convergence as in Theorem 3.1 here too.

In Figure 2 (right), we visualize how the learned boundaries are skewed around the training data in a way that SS^{\prime} is misclassified. Note that SS^{\prime} is misclassified even when it has as many as 60k60k points, and even though the network was not explicitly trained to misclassify those points. Intuitively, this demonstrates that the boundary learned by the ReLU network has sufficient complexity that hurts uniform convergence while not affecting the generalization error, at least in this setting. We discuss the applicability of this observation to other hyperparameter settings in Appendix G.2.

While we use the same adversarial spheres distribution as and similarly show the existence a certain kind of an adversarial dataset, it is important to note that neither of our observations implies the other. Indeed, the observations in are insufficient to prove failure of uniform convergence. Specifically, show that in the adversarial spheres setting, it is possible to slightly perturb random test examples in some arbitrary direction to discover a misclassified example. However, to show failure of uniform convergence, we need to find a set of misclassified examples SS^{\prime} corresponding to the training examples SS, and furthermore, we do not want SS^{\prime} to be arbitrary. We want SS^{\prime} to have the same underlying distribution, Dm\mathcal{D}^{m}.

Deep learning conjecture.

Extending the above insights more generally, we conjecture that in overparameterized deep networks, SGD finds a fit that is simple at a macroscopic level (leading to good generalization) but also has many microscopic fluctuations (hurting uniform convergence). To make this more concrete, for illustration, consider the high-dimensional linear model that sufficiently wide networks have been shown to converge to . That is, roughly, these networks can be written as h(x)=wTϕ(x)h(\boldsymbol{\mathbf{x}})=\boldsymbol{\mathbf{w}}^{T}\boldsymbol{\mathbf{\phi}}(\boldsymbol{\mathbf{x}}) where ϕ(x)\boldsymbol{\mathbf{\phi}}(\boldsymbol{\mathbf{x}}) is a rich high-dimensional representation of x\boldsymbol{\mathbf{x}} computed from many random features (chosen independent of training data). Inspired by our linear model in Section 3.1, we conjecture that the weights w\boldsymbol{\mathbf{w}} learned on a dataset SS can be expressed as w1+w2\boldsymbol{\mathbf{w}}_{1}+\boldsymbol{\mathbf{w}}_{2}, where w1Tϕ(x)\boldsymbol{\mathbf{w}}_{1}^{T}\boldsymbol{\mathbf{\phi}}(\boldsymbol{\mathbf{x}}) dominates the output on most test inputs and induces a simple decision boundary. That is, it may be possible to apply uniform convergence on the function w1Tϕ(x)\boldsymbol{\mathbf{w}}_{1}^{T}\boldsymbol{\mathbf{\phi}}(\boldsymbol{\mathbf{x}}) to obtain a small generalization bound. On the other hand, w2\boldsymbol{\mathbf{w}}_{2} corresponds to meaningless signals that gradient descent gathered from the high-dimensional representation of the training set SS. Crucially, these signals would be specific to SS, and hence not likely to correlate with most of the test data i.e., w2ϕ(x)\boldsymbol{\mathbf{w}}_{2}\phi(\boldsymbol{\mathbf{x}}) would be negligible on most test data, thereby not affecting the generalization error significantly. However, w2ϕ(x)\boldsymbol{\mathbf{w}}_{2}\phi(\boldsymbol{\mathbf{x}}) can still create complex fluctuations on the boundary, in low-probability regions of the input space (whose locations would depend on SS, like in our examples). As we argued, this can lead to failure of uniform convergence. Perhaps, existing works that have achieved strong uniform convergence bounds on modified networks, may have done so by implicitly suppressing w2\boldsymbol{\mathbf{w}}_{2}, either by compression, optimization or stochasticization. Revisiting these works may help verify our conjecture.

Conclusion and Future Work

A growing variety of uniform convergence based bounds have sought to explain generalization in deep learning. While these may provide partial intuition about the puzzle, we ask a critical, high level question: by pursuing this broad direction, is it possible to achieve the grand goal of a small generalization bound that shows appropriate dependence on the sample size, width, depth, label noise, and batch size? We cast doubt on this by first, empirically showing that existing bounds can surprisingly increase with training set size for small batch sizes. We then presented example setups, including that of a ReLU neural network, for which uniform convergence provably fails to explain generalization, even after taking implicit bias into account.

Future work in understanding implicit regularization in deep learning may be better guided with our knowledge of the sample-size-dependence in the weight norms. To understand generalization, it may also be promising to explore other learning-theoretic techniques like, say, algorithmic stability ; our linear setup might also inspire new tools. Overall, through our work, we call for going beyond uniform convergence to fully explain generalization in deep learning.

Vaishnavh Nagarajan is supported by a grant from the Bosch Center for AI.

References

Appendix A Summary of existing generalization bounds.

In this section, we provide an informal summary of the properties of (some of the) existing generalization bounds for ReLU networks in Table 1.

Appendix B More Experiments

In this section, we present more experiments along the lines of what we presented in Section 2.

Recall that in the main paper, we show how the distance from initialization and the product of spectral norms vary with mm for network with six layers. In Figure 3, we show how the terms grow with sample size mm for each layer individually. Our main observation is that the first layer suffers from the largest dependence on mm.

Distance between trajectories of shuffled datasets grows with m𝑚m.

In the main paper, we saw that the distance between the solutions learned on different draws of the dataset grow substantially with mm. In Figure 5 (left), we show that even the distance between the solutions learned on the same draw, but a different shuffling of the dataset grows substantially with mm.

Flat minima

We note a similar kind of paradox concerning noise in training. Specifically, it is intriguing that on one hand, generalization is aided by larger learning rates and smaller batch sizes due to increased noise in SGD. On the other, theoretical analyses benefit from the opposite; Allen-Zhu et al. even explicitly regularize SGD for their three-layer-network result to help “forget false information” gathered by SGD. In other words, it seems that noise aids generalization, yet hinders attempts at explaining generalization. The intuition from our examples (such as the linear example) is that such “false information” could provably impair uniform convergence without affecting generalization.

Frobenius norms grow with m𝑚m when m≫hmuch-greater-than𝑚ℎm\gg h.

Some bounds like depend on the Frobenius norms of the weight matrices (or the distance from origin), which as noted in are in fact width-dependent, and grow as Ω(h)\Omega(\sqrt{h}). However, even these terms do grow with the number of samples in the regime where mm is larger than hh. In Figure 5, we report the total distance from origin of the learned parameters for a network with h=256h=256 (we choose a smaller width to better emphasize the growth of this term with mm); here, we see that for m>8192m>8192, the distance from origin grows at a rate of Ω(m0.42)\Omega(m^{0.42}) that is quite similar to what we observed for distance from initialization.

Even a relaxed notion of margin does not address the m𝑚m-dependency.

Recall that in the main paper, we computed the generalization error bound in Equation 1 by setting γ\gamma to be γ\gamma^{\star}, the margin achieved by the network on at least 99%99\% of the data. One may hope that by choosing a larger value of γ\gamma, this bound would become smaller, and that the mm-dependence may improve. We consider this possibility by computing the median margin of the network over the training set (instead of the 1%1\%-percentile’th margin) and substituting this in the second term in the right hand side of the guarantee in Equation 1. By doing this, the first margin-based train error term in the right hand side of Equation 1 would simplify to 0.50.5 (as half the training data are misclassified by this large margin). Thereby we already forgo an explanation of half of the generalization behavior. At least we could hope that the second term no longer grows with mm. Unfortunately, we observe in Figure 6 (left) that the bounds still grow with mm. This is because, as shown in Figure 6 (right), the median margin value does not grow as fast with mm as the numerators of these bounds grow.

Effect of depth.

We observed that as the network gets shallower the bounds show better dependence with mm. As an extreme case, we consider a network with only one hidden layer, and with h=50000h=50000. Here we also present a third bound, namely that of Neyshabur et al. , besides the two bounds discussed in the main paper. Specifically, if Z1,Z2Z_{1},Z_{2} are the random initializations of the weight matrices in the network, the generalization error bound (the last term in Equation 1) here is of the following form, ignoring log factors:

The first term here is meant to be width-independent, while the second term clearly depends on the width and does decrease with mm at the rate of m0.5m^{-0.5}. Hence, in our plots in Figure 7, we only focus on the first term. We see that these bounds are almost constant and decrease at a minute rate of Ω(m0.066)\Omega(m^{-0.066}) while the test errors decrease much faster, at the rate of O(m0.35)\mathcal{O}(m^{-0.35}).

Effect of width.

In Figure 8, we demonstrate that our observation that the bounds increase with mm extends to widths h=128h=128 and h=2000h=2000 too.

B.1 Effect of batch size

In Figure 9, we show how the bounds vary with the batch size for a fixed sample size of 1638416384. It turns out that even though the test error decreases with decreasing batch size (for our fixed stopping criterion), all these bounds increase (by a couple of orders of magnitude) with decreasing batch size. Again, this is because the terms like distance from initialization increase for smaller batch sizes (perhaps because of greater levels of noise in the updates). Overall, existing bounds do not reflect the same behavior as the actual generalization error in terms of their dependence on the batch size.

Bounds vs. m𝑚m for batch size of 323232.

In the main paper, we only dealt with a small batch size of 11. In Figure 10, we show bounds vs. sample size plots for a batch size of 3232. We observe that in this case, the bounds do decrease with sample size, although only at a rate of O(m0.23)\mathcal{O}(m^{-0.23}) which is not as fast as the observed decrease in test error which is Ω(m0.44)\Omega(m^{-0.44}). Our intuition as to why the bounds behave better (in terms of mm-dependence) in the larger batch size regime is that here the amount of noise in the parameter updates is much less compared to smaller batch sizes (and as we discussed earlier, uniform convergence finds it challenging to explain away such noise).

Squared error loss.

All the experiments presented so far deal with the cross-entropy loss, for which the optimization procedure ideally diverges to infinity; thus, one might suspect that our results are sensitive to the stopping criterion. It would therefore be useful to consider the squared error loss where the optimum on the training loss can be found in a finite distance away from the random initialization. Specifically, we consider the case where the squared error loss between the outputs of the network and the one-hot encoding of the true labels is minimized to a value of 0.050.05 on average over the training data.

We observe in Figure 11 that even for this case, the distance from initialization and the spectral norms grow with the sample size at a rate of at least m0.3m^{0.3}. On the other hand, the test error decreases with sample size as 1/m0.381/m^{0.38}, indicating that even for the squared error loss, these terms hurt would hurt the generalization bound with respect to its dependence on mm.

Appendix C Pseudo-overfitting

Recall that in the main paper, we briefly discussed a new notion that we call as pseudo-overfitting and noted that it is not the reason behind why some techniques lead to vacuous generalization bounds. We describe this in more detail here. We emphasize this discussion because i) it brings up a fundamental and so far unknown issue that might potentially exist in current approaches to explaining generalization and ii) rules it out before making more profound claims about uniform convergence.

Our argument specifically applies to margin-based Rademacher complexity approaches (such as Bartlett et al. , Neyshabur et al. ). These result in a bound like in Equation 1 that we recall here:

These methods upper bound the uniform convergence bound on the L(γ)\mathcal{L}^{(\gamma)} error on the network in terms of a uniform convergence bound on the margins of the network (see for more details about margin theory of Rademacher complexity). The resulting generalization error bound in Equation 1 would take the following form, as per our notation from Definition 3.3:

This particular upper bound on the generalization gap in the L(γ)\mathcal{L}^{(\gamma)} loss is also an upper bound on the generalization gap on the margins. That is, with high probability 1δ1-\delta over the draws of SS, the above bound is larger than the following term that corresponds to the difference in test/train margins:

We first argue that it is possible for the generalization error of the algorithm to decrease with mm (as roughly m0.5m^{-0.5}), but for the above quantity to be independent of mm. As a result, the margin-based bound in Equation 2 (which is larger than Equation 3) will be non-decreasing in mm, and even vacuous. Below we describe such a scenario.

Consider a network that first learns a simple hypothesis to fit the data, say, by learning a simple linear input-output mapping on linearly separable data. But subsequently, the classifier proceeds to pseudo-overfit to the samples by skewing up (down) the real-valued output of the network by some large constant Δ\Delta in a tiny neighborhood around the positive (negative) training inputs. Note that this would be possible if and only if the network is overparameterized. Now, even though the classifier’s real-valued output is skewed around the training data, the decision boundary is still linear as the sign of the classifier’s output has not changed on any input. Thus, the boundary is still simple and linear and the generalization error small.

However, the training margins are at least a constant Δ\Delta larger than the test margins (which are not affected by the bumps created in tiny regions around the training data). Then, the term in Equation 3 would be larger than Δ/γ\Delta/\gamma and as a result, so would the term in Equation 2. Now in the generalization guarantee of Equation 1, recall that we must pick a value of γ\gamma such that the first term is low i.e., most of the training datapoints must be classified by at least γ\gamma margin. In this case, we can at best let γΔ\gamma\approx\Delta as any larger value of γ\gamma would make the margin-based training error non-negligible; as a result of this choice of γ\gamma, the bound in Equation 3 would be an mm-independent constant close to 11. The same would also hold for its upper bound in Equation 2, which is the generalization bound provided by the margin-based techniques.

Clearly, this is a potential fundamental limitation in existing approaches, and if deep networks were indeed pseudo-overfitting this way, we would have identified the reason why at least some existing bounds are vacuous. However, (un)fortunately, we rule this out by observing that the difference in the train and test margins in Equation 3 does decrease with training dataset size mm (see Figure 12) as O(m0.33)\mathcal{O}(m^{-0.33}). Additionally, this difference is numerically much less than γ=10\gamma^{\star}=10 (which is the least margin by which 99%99\% of the training data is classified) as long as mm is large, implying that Equation 3 is non-vacuous.

It is worth noting that the generalization error decreases at a faster rate of O(m0.43)\mathcal{O}(m^{-0.43}) implying that the upper bound in Equation 3 which decreases only as m0.33m^{-0.33}, is loose. This already indicates a partial weakness in this specific approach to deriving generalization guarantees. Nevertheless, even this upper bound decreases at a significant rate with mm which the subsequent uniform convergence-based upper bound in Equation 2 is unable to capture, thus hinting at more fundamental weaknesses specific to uniform convergence.

Before we wrap up this section, we discuss a question brought up by an anonymous reviewer, which we believe is worth addressing. Recall that in Section 3.1 and Section 3.2, we presented a linear and hypersphere classification task where we showed that uniform convergence provably fails. In light of the above discussion, one may be tempted to ask: do these two models fail to obey uniform convergence because of pseudo-overfitting?

The answer to this is that our proof for failure of uniform convergence in both these examples did not rely on any kind of pseudo-overfitting – had our proof relied on it, then we would have been able to show failure of only specific kinds of uniform convergence bounds (as discussed above). More formally, pseudo-overfitting in itself does not imply the lower bounds on ϵunif-alg\epsilon_{{\textrm{\tiny unif-alg}}} that we have shown in these settings.

One may still be curious to understand the level of pseudo-overfitting in these examples, to get a sense of the similarity of this scenario with that of the MNIST setup. To this end, we note that our linear setup does indeed suffer from significant pseudo-overfitting – the classifier’s output does indeed have bumps around each training point (which can be concluded from our proof).

In the case of the hypersphere example, we present Figure 13, where we plot of the average margins in this setup like in Figure 12. Here, we observe that, the mean margins on the test data (orange line) and on training data (blue line) do converge to each other with more training data size mm i.e., the gap in the mean test and training margins (green line) does decrease with mm. Thus our setup exhibits a behavior similar to deep networks on MNIST in Figure 12. As noted in our earlier discussion, since the rate of decrease of the mean margin gap in MNIST is not as large as the decrease in test error itself, there should be “a small amount” of psuedo-overfitting in MNIST. The same holds in this setting, although, here we observe an even milder decrease, implying a larger amount of pseudo-overfitting. Nevertheless, we emphasize that, our proof shows that uniform convergence cannot capture even this decrease with mm.

To conclude, pseudo-overfitting is certainly a phenomenon worth exploring better; however, our examples elucidate that there is a phenomenon beyond pseudo-overfitting that is at play in deep learning.

Appendix D Useful Lemmas

In this section, we state some standard results we will use in our proofs. We first define some constants: c1=1/2048c_{1}=1/2048, c2=15/16c_{2}=\sqrt{15/16} and c3=17/16c_{3}=\sqrt{17/16} and c4=2c_{4}=\sqrt{2}.

First, we state a tail bound for sub-exponential random variables .

For a sub-exponential random variable XX with parameters (ν,b)(\nu,b) and mean μ\mu, for all t>0t>0,

As a corollary, we have the following bound on the sum of squared normal variables:

For z1,z2,,zDN(0,1)z_{1},z_{2},\ldots,z_{D}\sim\mathcal{N}(0,1), we have that

We now state the Hoeffding bound for sub-Gaussian random variable.

Let z1,z2,,zDz_{1},z_{2},\ldots,z_{D} be independently drawn sub-Gaussian variables with mean and sub-gaussian parameter σi\sigma_{i}. Then,

Appendix E Proof for Theorem 3.1

In this section, we prove the failure of uniform convergence for our linear model. We first recall the setup:

Learning algorithm A\mathcal{A}: We consider a linear classifier with weights w=(w1,w2)\boldsymbol{\mathbf{w}}=(\boldsymbol{\mathbf{w}}_{1},\boldsymbol{\mathbf{w}}_{2}). The output is computed as h(x)=w1x1+w2x2h(\boldsymbol{\mathbf{x}})=\boldsymbol{\mathbf{w}}_{1}\boldsymbol{\mathbf{x}}_{1}+\boldsymbol{\mathbf{w}}_{2}\boldsymbol{\mathbf{x}}_{2}. Assume the weights are initialized to origin. Given S={(x(1),y(1)),,(x(m),y(m))}S=\{(\boldsymbol{\mathbf{x}}^{(1)},y^{(1)}),\ldots,(\boldsymbol{\mathbf{x}}^{(m)},y^{(m)})\}, A\mathcal{A} takes a gradient step of learning rate 11 to maximize yh(x)y\cdot h(\boldsymbol{\mathbf{x}}) for each (x,y)S(\boldsymbol{\mathbf{x}},y)\in S. Regardless of the batch size, the learned weights would satisfy, w1=2mu\boldsymbol{\mathbf{w}}_{1}=2m\boldsymbol{\mathbf{u}} and w2=iy(i)x2(i)\boldsymbol{\mathbf{w}}_{2}=\sum_{i}y^{(i)}\boldsymbol{\mathbf{x}}_{2}^{(i)}.

Below, we state the precise theorem statement (where we’ve used the constants c1=1/32c_{1}=1/32, c2=1/2c_{2}=1/2 and c3=3/2c_{3}=3/2 and c4=2c_{4}=\sqrt{2}):

Theorem 3.1 In the setup above, for any ϵ,δ>0\epsilon,\delta>0 and δ<1/4\delta<1/4, let DD be sufficiently large that it satisfies

then we have that for all γ0\gamma\geq 0, for the L(γ)\mathcal{L}^{(\gamma)} loss, ϵunif-alg(m,δ)1ϵgen(m,δ)\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon_{\text{gen}}(m,\delta).

Specifically, for γ\gamma\in, ϵgen(m,δ)ϵ\epsilon_{\text{gen}}(m,\delta)\leq\epsilon, and so ϵunif-alg(m,δ)1ϵ\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon.

The above follows from Lemma E.1, where we upper bound the generalization error, and from Lemma E.2 where we lower bound uniform convergence. ∎

We first prove that the above algorithm generalizes well with respect to the losses corresponding to γ\gamma\in. First for the training data, we argue that both w1\boldsymbol{\mathbf{w}}_{1} and a small part of the noise vector w2\boldsymbol{\mathbf{w}}_{2} align along the correct direction, while the remaining part of the high-dimensional noise vector are orthogonal to the input; this leads to correct classification of the training set. Then, on the test data, we argue that w1\boldsymbol{\mathbf{w}}_{1} aligns well, while w2\boldsymbol{\mathbf{w}}_{2} contributes very little to the output of the classifier because it is high-dimensional noise. As a result, for most test data, the classification is correct, and hence the test and generalization error are both small.

In the setup of Section 3, when γ\gamma\in, for L(γ)\mathcal{L}^{(\gamma)}, ϵgen(m,δ)ϵ\epsilon_{\text{gen}}(m,\delta)\leq\epsilon.

The parameters learned by our algorithm satisfies w1=2mu\boldsymbol{\mathbf{w}}_{1}=2m\cdot\boldsymbol{\mathbf{u}} and w2=y(i)x2(i)N(0,8mc22D)\boldsymbol{\mathbf{w}}_{2}=\sum y^{(i)}\boldsymbol{\mathbf{x}}_{2}^{(i)}\sim\mathcal{N}(0,\frac{8m}{c_{2}^{2}D}).

First, we have from Corollary D.1.1 that with probability 1δ3m1-\frac{\delta}{3m} over the draws of x2(i)\boldsymbol{\mathbf{x}}_{2}^{(i)}, as long as δ3m2ec1D\frac{\delta}{3m}\geq 2e^{-c_{1}D} (which is given to hold by Equation 4),

Next, for a given x(i)\boldsymbol{\mathbf{x}}^{(i)}, we have from Corollary D.2.1, with probability 1δ3m1-\frac{\delta}{3m} over the draws of jiy(j)x2(j)\sum_{j\neq i}y^{(j)}\boldsymbol{\mathbf{x}}_{2}^{(j)},

Then, with probability 123δ1-\frac{2}{3}\delta over the draws of the training dataset we have for all ii,

Thus, for all γ\gamma\in, the L(γ)\mathcal{L}^{(\gamma)} loss of this classifier on the training dataset SS is zero.

Now, from Corollary D.1.1, with probability 1δ31-\frac{\delta}{3} over the draws of the training data, we also have that, as long as δ3m2ec1D\frac{\delta}{3m}\geq 2e^{-c_{1}D} (which is given to hold by Equation 4),

Next, conditioned on the draw of SS and the learned classifier, for any ϵ>0\epsilon^{\prime}>0, with probability 1ϵ1-\epsilon^{\prime} over the draws of a test data point, (z,y)(\boldsymbol{\mathbf{z}},y), we have from Corollary D.2.1 that

Using this, we have that with probability 12exp(12(c224c4c3Dm)2)1-2\exp\left(-\frac{1}{2}\left({\frac{c_{2}^{2}}{4c_{4}c_{3}}\sqrt{\frac{D}{m}}}\right)^{2}\right) over the draws of a test data point, (z,y)(\boldsymbol{\mathbf{z}},y),

Thus, we have that for γ\gamma\in, the L(γ)\mathcal{L}^{(\gamma)} loss of the classifier on the distribution D\mathcal{D} is 2exp(12(c224c4c3Dm)2)2\exp\left(-\frac{1}{2}\left({\frac{c_{2}^{2}}{4c_{4}c_{3}}\sqrt{\frac{D}{m}}}\right)^{2}\right) which is at most ϵ\epsilon as assumed in Equation 6. In other words, the absolute difference between the distribution loss and the train loss is at most ϵ\epsilon and this holds for at least 1δ1-\delta draws of the samples SS. Then, by the definition of ϵgen\epsilon_{\text{gen}} we have the result.

We next prove our uniform convergence lower bound. The main idea is that when the noise vectors in the training samples are negated, with high probability, the classifier misclassifies the training data. We can then show that for any choice of Sδ\mathcal{S}_{\delta} as required by the definition of ϵunif-alg\epsilon_{\text{unif-alg}}, we can always find an SS_{\star} and its noise-negated version SS_{\star}^{\prime} both of which belong to Sδ\mathcal{S}_{\delta}. Furthermore, we can show that hSh_{S_{\star}} has small test error but high empirical error on SS_{\star}^{\prime}, and that this leads to a nearly vacuous uniform convergence bound.

In the setup of Section 3, for any ϵ>0\epsilon>0 and for any δ1/4\delta\leq 1/4, and for the same lower bounds on DD, and for any γ0\gamma\geq 0, we have that

For any SS, let SS^{\prime} denote the set of noise-negated samples S={((x1,x2),u)    ((x1,x2),y)S}S^{\prime}=\{((\boldsymbol{\mathbf{x}}_{1},-\boldsymbol{\mathbf{x}}_{2}),u)\;|\;((\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2}),y)\in S\}. We first show with high probability 12δ/31-2\delta/3 over the draws of SS, that the classifier learned on SS, misclassifies SS^{\prime} completely. The proof for this is nearly identical to our proof for why the training loss is zero, except for certain sign changes. For any xneg(i)=(x1(i),x2(i))\boldsymbol{\mathbf{x}}_{\text{neg}}^{(i)}=(\boldsymbol{\mathbf{x}}_{1}^{(i)},-\boldsymbol{\mathbf{x}}_{2}^{(i)}), we have

Since the learned hypothesis misclassifies all of SS^{\prime}, it has loss of 11 on SS^{\prime}.

Now recall that, by definition, to compute ϵunif-alg\epsilon_{\text{unif-alg}}, one has to pick a sample set space Sδ\mathcal{S}_{\delta} of mass 1δ1-\delta i.e., PrSSm[SSδ]1δPr_{S\sim\mathcal{S}^{m}}[S\in\mathcal{S}_{\delta}]\geq 1-\delta. We first argue that for any choice of Sδ\mathcal{S}_{\delta}, there must exist a ‘bad’ SS_{\star} such that (i) SSδS_{\star}\in\mathcal{S}_{\delta}, (ii) SSδS_{\star}^{\prime}\in\mathcal{S}_{\delta}, (iii) hSh_{S_{\star}} has test error less than ϵgen(m,δ)\epsilon_{\text{gen}}(m,\delta) and (iv) hSh_{S_{\star}} completely misclassifies SS_{\star}^{\prime}.

We show the existence of such an SS_{\star}, by arguing that over the draws of SS, there is non-zero probability of picking an SS that satisfies all the above conditions. Specifically, we have by the union bound that

By definition of Sδ\mathcal{S}_{\delta}, we know PrSDm[SSδ]δPr_{S\sim\mathcal{D}^{m}}\left[S\notin\mathcal{S}_{\delta}\right]\leq\delta. Similarly, by definition of the generalization error, we know that PrSDm[LD(hS)>ϵgen(m,δ)]δPr_{S\sim\mathcal{D}^{m}}\left[\mathcal{L}_{\mathcal{D}}(h_{S})>\epsilon_{\text{gen}}(m,\delta)\right]\leq\delta. We have also established above that PrSDm[L^S(hS)1]2δ/3Pr_{S\sim\mathcal{D}^{m}}\left[\hat{\mathcal{L}}_{S^{\prime}}(h_{S})\neq 1\right]\leq 2\delta/3. As for the term PrSDm[SSδ]Pr_{S\sim\mathcal{D}^{m}}\left[S^{\prime}\notin\mathcal{S}_{\delta}\right], observe that under the draws of SS, the distribution of the noise-negated dataset SS^{\prime} is identical to Dm\mathcal{D}^{m}. This is because the isotropic Gaussian noise vectors have the same distribution under negation. Hence, again by definition of Sδ\mathcal{S}_{\delta}, even this probability is at most δ\delta. Thus, we have that the probability in the left hand side of Equation 13 is at least 14δ1-4\delta, which is positive as long as δ<1/4\delta<1/4.

This implies that for any given choice of Sδ\mathcal{S}_{\delta}, there exists SS_{\star} that satisfies our requirement. Then, from the definition of ϵunif-alg(m,δ)\epsilon_{\text{unif-alg}}(m,\delta), we essentially have that,

Appendix F Neural Network with Exponential Activations

In this section, we prove the failure of uniform convergence for a neural network model with exponential activations. We first define the setup.

Let u\boldsymbol{\mathbf{u}} be an arbitrary vector in DD dimensional space such that u=D/2\|\boldsymbol{\mathbf{u}}\|=\sqrt{D}/2. Consider an input distribution in 2D2D dimensional space such that, conditioned on the label yy drawn from uniform distribution over {1,+1}\{-1,+1\}, the first DD dimensions x1\boldsymbol{\mathbf{x}}_{1} of a random point is given by yuy\boldsymbol{\mathbf{u}} and the remaining DD dimensions are drawn from N(0,1)\mathcal{N}(0,1). Note that in this section, we require DD to be only as large as lnm\ln m, and not as large as mm.

Architecture.

Algorithm

where p(w)p(\boldsymbol{\mathbf{w}}) equals the p.d.f of w\boldsymbol{\mathbf{w}} under the distribution it is drawn from. In this case p(w)=1(2π)Dexp(w22)p(\boldsymbol{\mathbf{w}})=\frac{1}{(2\pi)^{D}}\exp\left(-\frac{\|\boldsymbol{\mathbf{w}}\|^{2}}{2}\right).

In order to simplify our calculations we will set η=(4π)D\eta=(4\pi)^{D}, although our analysis would extend to other values of the learning rate too. Similarly, our results would only differ by constants if we consider the alternative update rule, awaw+ηyexp(wx)a_{\boldsymbol{\mathbf{w}}}\leftarrow a_{\boldsymbol{\mathbf{w}}}+\eta y\cdot\exp(\boldsymbol{\mathbf{w}}\cdot\boldsymbol{\mathbf{x}}).

We now state our main theorem (in terms of constants c1,c2,c3,c4c_{1},c_{2},c_{3},c_{4} defined in Section D).

In the set up above, for any ϵ,δ>0\epsilon,\delta>0 and δ<1/4\delta<1/4, let DD and mm be sufficiently large that it satisfies

then we have that for all γ0\gamma\geq 0, for the L(γ)\mathcal{L}^{(\gamma)} loss, ϵunif-alg(m,δ)1ϵgen(m,δ)\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon_{\text{gen}}(m,\delta).

Specifically, for γ\gamma\in, ϵgen(m,δ)ϵ\epsilon_{\text{gen}}(m,\delta)\leq\epsilon, and so ϵunif-alg(m,δ)1ϵ\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon.

The result follows from the following lemmas. First in Lemma F.2, we derive the closed form expression for the function computed by the learned network. In Lemma F.3, we upper bound the generalization error and in Lemma F.4, we lower bound uniform convergence. ∎

We first derive a closed form expression for how the output of the network changes under a gradient descent step on a particular datapoint.

Let h(0)()h^{(0)}(\cdot) denote the function computed by the network before updating the weights. After updating the weights on a particular input (x,y)(\boldsymbol{\mathbf{x}},y) according to Equation 14, the learned network corresponds to:

Next, we argue that the generalization error of the algorithm is small. From Lemma F.2, we have that the output of the network is essentially determined by a summation of contributions from every training point. To show that the training error is zero, we argue that on any training point, the contribution from that training point dominates all other contributions, thus leading to correct classification. On any test point, we similarly show that the contribution of training points of the same class as that test point dominates the output of the network. Note that our result requires DD to scale only logarithmically with training samples mm.

In the setup of Section 3, when γ\gamma\in, for L(γ)\mathcal{L}^{(\gamma)}, ϵgen(m,δ)ϵ\epsilon_{\text{gen}}(m,\delta)\leq\epsilon.

We first establish a few facts that hold with high probability over the draws of the training set SS. First, from Corollary D.1.1 we have that, since D1c2ln3mδD\geq\frac{1}{c_{2}}\ln\frac{3m}{\delta} (from Equation 15), with probability at least 1δ/31-\delta/3 over the draws of SS, for all ii, the noisy part of each training input can be bounded as

Next, from Corollary D.2.1, we have that with probability at least 1δ3m21-\frac{\delta}{3m^{2}} over the draws of x2(i)\boldsymbol{\mathbf{x}}_{2}^{(i)} and x2(j)\boldsymbol{\mathbf{x}}_{2}^{(j)} for iji\neq j,

Then, by a union bound, the above two equations hold for all iji\neq j with probability at least 1δ/21-\delta/2.

Next, since each y(i)y^{(i)} is essentially an independent sub-Gaussian with mean and sub-Gaussian parameter σ=1\sigma=1, we can apply Hoeffding’s bound (Lemma D.2) to conclude that with probability at least 1δ/31-\delta/3 over the draws of SS,

Note that this means that there must exist at least one training data in each class.

Given these facts, we first show that the training error is zero by showing that for all ii, y(i)h(x(i))y^{(i)}h(\boldsymbol{\mathbf{x}}^{(i)}) is sufficiently large. On any training input (x(i),y(i))(\boldsymbol{\mathbf{x}}^{(i)},y^{(i)}), using Lemma F.2, we can write

Now, for any jj such that y(j)y(i)y^{(j)}\neq y^{(i)}, we have that

Plugging this back in the previous equation we have that

Hence, x(i)\boldsymbol{\mathbf{x}}^{(i)} is correctly classified by a margin of 11 for every ii.

Now consider any test data point (z,y)(\boldsymbol{\mathbf{z}},y). Since D1c2ln2ϵD\geq\frac{1}{c_{2}}\ln\frac{2}{\epsilon} (Equation 16), we have that with probability at least 1ϵ/21-\epsilon/2 over the draws of z2\boldsymbol{\mathbf{z}}_{2}, by Corollary D.1.1

Similarly, for each ii, we have that with probability at least 1ϵ/2m1-\epsilon/2m over the draws of z\boldsymbol{\mathbf{z}}, the following holds good by Corollary D.2.1

Hence, the above holds over at least 1ϵ/21-\epsilon/2 draws of z\boldsymbol{\mathbf{z}}, and by extension, both the above equations hold over at least 1ϵ1-\epsilon draws of z\boldsymbol{\mathbf{z}}.

Now, for any ii such that y(i)=yy^{(i)}=y, we have that

Similarly, for any ii such that y(i)yy^{(i)}\neq y, we have that

Since from Equation 21 we know there exists at least one training sample with a given label, we have that

Thus, at least 1ϵ1-\epsilon of the test datapoints are classified correctly.

We next show that the uniform convergence bound is nearly vacuous. In order to do this, we create a set SS^{\prime} from SS by negating all values but the noise vector. We then show that for every point in SS^{\prime}, the contribution from the corresponding point in SS dominates over the contribution from all other points. (This is because of how the non-negated noise vector in the point from SS^{\prime} aligns adversarially with the noise vector from the corresponding point in SS). As a result, the points in SS^{\prime} are all labeled like in SS, implying that SS^{\prime} is completely misclassified. Then, similar to our previous arguments, we can show that uniform convergence is nearly vacuous.

In the setup of Section F, for any ϵ>0\epsilon>0 and for any δ1/4\delta\leq 1/4, and for the same lower bounds on DD and mm as in Theorem F.1, and for any γ0\gamma\geq 0, we have that

Let SS^{\prime} be a modified version of the training set where all values are negated except that of the noise vectors i.e., S={((x1,x2),y)    ((x1,x2),y)S}S^{\prime}=\{((-\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2}),-y)\;|\;((\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2}),y)\in S\}. First we show that with probability at least 12δ/31-2\delta/3 over the draws of SS, SS^{\prime} is completely misclassified. First, we have that with probability 12δ/31-2\delta/3, Equations 19 and 20 hold good. Let (xneg(i),yneg(i))(\boldsymbol{\mathbf{x}}_{\textrm{neg}}^{(i)},y_{\textrm{neg}}^{(i)}) denote the iith sample from SS^{\prime}. Then, we have that

Now, consider jj such that y(j)=yneg(i)y^{(j)}=y_{\textrm{neg}}^{(i)}. we have that

Plugging the above back in Equation 24, we have

implying that xneg(i)\boldsymbol{\mathbf{x}}_{\textrm{neg}}^{(i)} is misclassified. This holds simultaneously for all ii, implying that SS^{\prime} is misclassified with high probability 12δ/31-2\delta/3 over the draws of SS. Furthermore, SS^{\prime} has the same distribution as Dm\mathcal{D}^{m}. Then, by the same argument as that of Lemma E.2, we can prove our final claim.

Appendix G Further Remarks.

In this section, we make some clarifying remarks about our theoretical results.

Typically, like in Mohri et al. , Bartlett et al. , the 0-1 test error is upper bounded in terms of the L(γ)\mathcal{L}^{(\gamma)} test error for some optimal choice of γ>0\gamma>0 (as it is easier to apply uniform convergence for γ>0\gamma>0). From the result in the main paper, it is obvious that for γ1\gamma\leq 1, this approach would yield vacuous bounds. We now establish that this is the case even for γ>1\gamma>1.

To help state this more clearly, for the scope of this particular section, let ϵunif-alg(γ),ϵgen(γ)\epsilon^{(\gamma)}_{\textrm{unif-alg}},\epsilon^{(\gamma)}_{\textrm{gen}} denote the uniform convergence and generalization error for L(γ)\mathcal{L}^{(\gamma)} loss. Then, the following inequality is used to derive a bound on the 0-1 error:

where the second inequality above holds with probability at least 1δ1-\delta over the draws of SS, while the first holds for all SS (which follows by definition of L(γ)\mathcal{L}^{(\gamma)} and L(0)\mathcal{L}^{(0)}).

To establish that uniform convergence is nearly vacuous in any setting of γ\gamma, we must show that the right hand side of the above bound is nearly vacuous for any choice of γ0\gamma\geq 0 (despite the fact that LD(0)(S)ϵ\mathcal{L}_{\mathcal{D}}^{(0)}(S)\leq\epsilon). In our results, we explicitly showed this to be true for only small values of γ\gamma, by arguing that the second term in the R.H.S, namely ϵunif-alg(γ)(m,δ)\epsilon^{(\gamma)}_{\textrm{unif-alg}}(m,\delta), is nearly vacuous.

Below, we show that the above bound is indeed nearly vacuous for any value of γ\gamma, when we have that ϵunif-alg(γ)(m,δ)1ϵgen(γ)(m,δ)\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1-\epsilon_{\text{gen}}^{(\gamma)}(m,\delta). Note that we established the relation ϵunif-alg(γ)(m,δ)1ϵgen(γ)(m,δ)\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1-\epsilon_{\text{gen}}^{(\gamma)}(m,\delta) to be true in all of our setups.

Given that for all γ0\gamma\geq 0, ϵunif-alg(γ)(m,δ)1ϵgen(γ)(m,δ)\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1-\epsilon_{\text{gen}}^{(\gamma)}(m,\delta) then, we then have that for all γ0\gamma\geq 0,

or in other words, the guarantee from the right hand side of Equation 25 is nearly vacuous.

Assume on the contrary that for some choice of γ\gamma, we are able to show that with probability at least 1δ1-\delta over the draws of SS, the right hand side of Equation 25 is less than 1/21/2. This means that ϵunif-alg(γ)(m,δ)<1/2\epsilon^{(\gamma)}_{\textrm{unif-alg}}(m,\delta)<1/2. Furthermore, this also means that with probability at least 1δ1-\delta over the draws of SS, L^S(γ)(hS)<1/2\hat{\mathcal{L}}^{(\gamma)}_{S}(h_{S})<1/2 and LD(γ)(hS)<1/2\mathcal{L}^{(\gamma)}_{\mathcal{D}}(h_{S})<1/2 (which follows from the second inequality in Equation 25).

As a result, we have that with probability at least 1δ1-\delta, LD(γ)(hS)L^S(γ)(hS)<1/2\mathcal{L}^{(\gamma)}_{\mathcal{D}}(h_{S})-\hat{\mathcal{L}}^{(\gamma)}_{S}(h_{S})<1/2. In other words, ϵgen(γ)(m,δ)<1/2\epsilon_{\textrm{gen}}^{(\gamma)}(m,\delta)<1/2. Since we are given that ϵunif-alg(γ)(m,δ)1ϵgen(γ)(m,δ)\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1-\epsilon_{\text{gen}}^{(\gamma)}(m,\delta), by our upper bound on the generalization error, we have ϵunif-alg(γ)(m,δ)1/2\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1/2, which is a contradiction to our earlier inference that ϵunif-alg(γ)(m,δ)<1/2\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)<1/2. Hence, our assumption is wrong.

G.2 Applicability of the observation in Section 3.2 to other settings

Recall that in the main paper, we discussed a setup where two hyperspheres of radius 11 and 1.11.1 respectively are classified by a sufficiently overparameterized ReLU network. We saw that even when the number of training examples was as large as 6553665536, we could project all of these examples on to the other corresponding hypersphere, to create a completely misclassified set SS^{\prime}. How well does this observation extend to other hyperparameter settings?

First, we note that in order to achieve full misclassification of SS^{\prime}, the network would have to be sufficiently overparameterized i.e., either the width or the input dimension must be larger. When the training set size mm is too large, one would observe that SS^{\prime} is not as significantly misclassified as observed. (Note that on the other hand, increasing the parameter count would not hurt the generalization error. In fact it would improve it.)

Second, we note that our observation is sensitive to the choice of the difference in the radii between the hyperspheres (and potentially to other hyperparameters too). For example, when the outer sphere has radius 22, SGD learns to classify these spheres perfectly, resulting in zero error on both test data and on SS^{\prime}. As a result, our lower bound on ϵunif-alg\epsilon_{\text{unif-alg}} would not hold in this setting.

However, here we sketch a (very) informal argument as to why there is reason to believe that our lower bound can still hold on a weaker notion of uniform convergence, a notion that is always applied in practice (in the main paper we focus on a strong notion of uniform convergence as a negative result about it is more powerful). More concretely, in reality, uniform convergence is computed without much knowledge about the data distribution, save a few weakly informative assumptions such as those bounding its support. Such a uniform convergence bound is effectively computed uniformly in supremum over a class of distributions.

Going back to the hypersphere example, the intuition is that even when the radii of the spheres are far apart, and hence, the classification perfect, the decision boundary learned by the network could still be microscopically complex – however these complexities are not exaggerated enough to misclassify SS^{\prime}. Now, for this given decision boundary, one would be able to construct an SS^{\prime\prime} which corresponds to projecting SS on two concentric hyperspheres that fall within these skews. Such an SS^{\prime\prime} would have a distribution that comes from some D\mathcal{D}^{\prime} which, although not equal to D\mathcal{D}, still obeys our assumptions about the underlying distribution. The uniform convergence bound which also holds for D\mathcal{D^{\prime}} would thus have to be vacuous.

As seen in the proof of Lemma E.1, the generalization error ϵ\epsilon depends on mm and DD as O(eD/m)\mathcal{O}(e^{-D/m}) ignoring some constants in the exponent. Clearly, this error decreases with the parameter count DD.

On the other hand, one may also observe that this generalization error grows with the number of samples mm, which might at first make this model seem inconsistent with our real world observations. However, we emphasize that this is a minor artefact of the simplifications in our setup, rather than a conceptual issue. With a small modification to our setup, we can make the generalization error decrease with mm, mirroring our empirical observations. Specifically, in the current setup, we learn the true boundary along the first KK dimensions exactly. We can however modify it to a more standard learning setup where the boundary is not exactly recoverable and needs to be estimated from the examples. This would lead to an additional generalization error that scales as O(Km)\mathcal{O}(\sqrt{\frac{K}{m}}) that is non-vacuous as long as KmK\ll m. Thus, the overall generalization error would be O(eD/m+Km)\mathcal{O}(e^{-D/m}+\sqrt{\frac{K}{m}}).

What about the overall dependence on mm? Now, assume we have an overparameterization level of Dmln(m/K)D\gg m\ln(m/K), so that eD/mK/me^{-D/m}\ll\sqrt{K/m}. Hence, in the sufficiently overparameterized regime, the generalization error O(eD/m)\mathcal{O}(e^{-D/m}) that comes from the noise we have modeled, pales in comparison with the generalization error that would stem from estimating the low-complexity boundary. Overall, as a function of mm, the resulting error would behave like O(Km)\mathcal{O}(\sqrt{\frac{K}{m}}) and hence show a decrease with increasing mm (as long the increase in mm is within the overparameterized regime).

G.4 Failure of hypothesis-dependent uniform convergence bounds.

Appendix H An abstract setup

That is, the learner misclassifies inputs that correspond to the negations of the samples in the training data – this would be possible if and only if the classifier is overparameterized with Ω(mD)\Omega(mD) parameters to store SS^{\prime}. We will show that uniform convergence fails to explain generalization for this learner.

First we establish that this learner generalizes well. Note that a given SS has zero probability mass under D\mathcal{D}, and so does SS^{\prime}. Then, the training and test error are zero – except for pathological draws of SS that intersect with SS^{\prime}, which are almost surely never drawn from Dm\mathcal{D}^{m} – and hence, the generalization error of A\mathcal{A} is zero too.

It might thus seem reasonable to expect that one could explain this generalization using implicit-regularization-based uniform convergence by showing ϵunif-alg(m,δ)=0\epsilon_{\text{unif-alg}}(m,\delta)=0. Surprisingly, this is not the case as ϵunif-alg(m,δ)\epsilon_{\text{unif-alg}}(m,\delta) is in fact 11!

First it is easy to see why the looser bound ϵunif(m,δ)\epsilon_{\text{unif}}(m,\delta) equals 1, if we let H\mathcal{H} be the space of all hypotheses the algorithm could output: there must exist a non-pathological SSδS\in\mathcal{S}_{\delta}, and we know that hSHh_{S^{\prime}}\in\mathcal{H} misclassifies the negation of its training set, namely SS. Then, suphHLD(h)L^S(h)=LD(hS)L^S(hS)=01=1\sup_{h\in\mathcal{H}}|\mathcal{L}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)|=|\mathcal{L}_{\mathcal{D}}(h_{S^{\prime}})-\hat{\mathcal{L}}_{S}(h_{S^{\prime}})|=|0-1|=1.

One might hope that in the stronger bound of ϵunif-alg(m,δ)\epsilon_{\text{unif-alg}}(m,\delta) since we truncate the hypothesis space, it is possible that the above adversarial situation would fall apart. However, with a more nuanced argument, we can similarly show that ϵunif-alg(m,δ)=1\epsilon_{\text{unif-alg}}(m,\delta)=1. First, recall that any bound on ϵunif-alg(m,δ)\epsilon_{\text{unif-alg}}(m,\delta), would have to pick a truncated sample set space Sδ\mathcal{S}_{\delta}. Consider any choice of Sδ\mathcal{S}_{\delta}, and the corresponding set of explored hypotheses Hδ\mathcal{H}_{\delta}. We will show that for any choice of Sδ\mathcal{S}_{\delta}, there exists SSδS_{\star}\in\mathcal{S}_{\delta} such that (i) hSh_{S_{\star}} has zero test error and (ii) the negated training set SS_{\star}^{\prime} belongs to Sδ\mathcal{S}_{\delta} and (iii) hSh_{S_{\star}} has error 11 on SS_{\star}. Then, it follows that ϵunif-alg(m,δ)=supSSδsuphHδLD(h)L^S(h)LD(hS)L^S(h)=01=1\epsilon_{\text{unif-alg}}(m,\delta)=\sup_{S\in\mathcal{S}_{\delta}}\sup_{h\in\mathcal{H}_{\delta}}|{\mathcal{L}}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)|\geq|{\mathcal{L}}_{\mathcal{D}}(h_{S_{\star}})-\hat{\mathcal{L}}_{S_{\star}^{\prime}}(h)|=|0-1|=1.

We can prove the existence of such an SS_{\star} by showing that the probability of picking one such set under Dm\mathcal{D}^{m} is non-zero for δ<1/2\delta<1/2. Specifically, under SDmS\sim\mathcal{D}^{m}, we have by the union bound that

Since the pathological draws have probability zero, the first probability term on the right hand side is zero. The second term is at most δ\delta by definition of Sδ\mathcal{S}_{\delta}. Crucially, the last term too is at most δ\delta because SS^{\prime} (which is the negated version of SS) obeys the same distribution as SS (since the isotropic Gaussian is invariant to a negation). Thus, the above probability is at least 12δ>01-2\delta>0, implying that there exist (many) SS_{\star}, proving our main claim.

While our particular learner might seem artificial, much of this artificiality is only required to make the argument simple. The crucial trait of the learner that we require is that the misclassified region in the input space (i) covers low probability and yet (ii) is complex and highly dependent on the training set draw. Our intuition is that SGD-trained deep networks possess these traits.

Appendix I Learnability and Uniform Convergence

Below, we provide a detailed discussion on learnability, uniform convergence and generalization. Specifically, we argue why the fact that uniform convergence is necessary for learnability does not preclude the fact that uniform convergence maybe unable to explain generalization of a particular algorithm for a particular distribution.

We first recall the notion of learnability. First, formally, a binary classification problem consists of a hypothesis class H\mathcal{H} and an instance space X×{1,1}\mathcal{X}\times\{-1,1\}. The problem is said to be learnable if there exists a learning rule A:m=1ZmH\mathcal{A}^{\prime}:\bigcup\limits_{m=1}^{\infty}\mathcal{Z}^{m}\to\mathcal{H} and a monotonically decreasing sequence ϵlnblty(m)\epsilon_{\text{lnblty}}(m) such that ϵlnblty(m)m0\epsilon_{\text{lnblty}}(m)\xrightarrow{m\to\infty}0 and

Vapnik and Chervonenkis showed that finite VC dimension of the hypothesis class is necessary and sufficient for learnability in binary classification problems. As Shalev-Shwartz et al. note, since finite VC dimension is equivalent to uniform convergence, it can thus be concluded that uniform convergence is necessary and sufficient for learnability binary classification problems.

However, learnability is a strong notion that does not necessarily have to hold for a particular learning algorithm to generalize well for a particular underlying distribution. Roughly speaking, this is because learnability evaluates the algorithm under all possible distributions, including many complex distributions; while a learning algorithm may generalize well for a particular distribution under a given hypothesis class, it may fail to do so on more complex distributions under the same hypothesis class.

For more intuition, we present a more concrete but informal argument below. However, this argument is technically redundant because learnability is equivalent to uniform convergence for binary classification, and since we established the lack of necessity of uniform convergence, we effectively established the same for learnability too. However, we still provide the following informal argument as it provides a different insight into why learnability and uniform convergence are not necessary to explain generalization.

Our goal is to establish that in the set up of Section 3, even if we considered the binary classification problem corresponding to Hδ\mathcal{H}_{\delta} (the class consisting of only those hypotheses explored by the algorithm A\mathcal{A} under a distribution D\mathcal{D}), the corresponding binary classification problem is not learnable i.e., Equation 26 does not hold when we plug in Hδ\mathcal{H}_{\delta} in place of H\mathcal{H}.

First consider distributions of the following form that is more complex than the linearly separable D\mathcal{D}: for any dataset SS^{\prime}, let DS\mathcal{D}_{S^{\prime}} be the distribution that has half its mass on the part of the linearly separable distribution D\mathcal{D} excluding SS^{\prime}, and half its mass on the distribution that is uniformly distributed over SS^{\prime}. Now let SS^{\prime} be a random dataset drawn from D\mathcal{D} but with all its labels flipped; consider the corresponding complex distribution DS\mathcal{D}_{S^{\prime}}.

We first show that there exists hHδh\in\mathcal{H}_{\delta} that fits this distribution well. Now, for most draws of the “wrongly” labeled SS^{\prime}, we can show that the hypothesis hh for which w1=2u\boldsymbol{\mathbf{w}}_{1}=2\cdot\boldsymbol{\mathbf{u}} and w2=(x,y)Syx2\boldsymbol{\mathbf{w}}_{2}=\sum_{(x,y)\in S^{\prime}}y\cdot\boldsymbol{\mathbf{x}}_{2} fits the “wrong” labels of SS^{\prime} perfectly; this is because, just as argued in Lemma E.2, w2\boldsymbol{\mathbf{w}}_{2} dominates the output on all these inputs, although w1\boldsymbol{\mathbf{w}}_{1} would be aligned incorrectly with these inputs. Furthermore, since w2\boldsymbol{\mathbf{w}}_{2} does not align with most inputs from D\mathcal{D}, by an argument similar to Lemma E.1, we can also show that this hypothesis has at most ϵ\epsilon error on D\mathcal{D}, and that this hypothesis belongs to Hδ\mathcal{H}_{\delta}. Overall this means that, w.h.p over the choice of SS^{\prime}, there exists a hypothesis hHδh\in\mathcal{H}_{\delta} for which the error on the complex distribution DS\mathcal{D}_{S^{\prime}} is at most ϵ/2\epsilon/2 i.e.,

On the other hand, let A\mathcal{A}^{\prime} be any learning rule which outputs a hypothesis given SDSS\sim\mathcal{D}_{S^{\prime}}. With high probability over the draws of SDSS\sim\mathcal{D}_{S^{\prime}}, only at most, say 3/43/4th of SS (i.e., 0.75m0.75m examples) will be sampled from SS^{\prime} (and the rest from D\mathcal{D}). Since the learning rule which has access only to SS, has not seen at least a quarter of SS^{\prime}, with high probability over the random draws of SS^{\prime}, the learning rule will fail to classify roughly half of the unseen examples from SS^{\prime} correctly (which would be about (m/4)1/2=m/8(m/4)\cdot 1/2=m/8). Then, the error on DS\mathcal{D}_{S^{\prime}} will be at least 1/161/16. From the above arguments, we have that ϵlearnability(m)1/16ϵ/2\epsilon_{\text{learnability}}(m)\geq 1/16-\epsilon/2, which is a non-negligible constant that is independent of mm.

Appendix J Deterministic PAC-Bayes bounds are two-sided uniform convergence bounds

By definition, VC-dimension, Rademacher complexity and other covering number based bounds are known to upper bound the term ϵunif-alg\epsilon_{\text{unif-alg}} and therefore our negative result immediately applies to all these bounds. However, it may not be immediately clear if bounds derived through the PAC-Bayesian approach fall under this category too. In this discussion, we show that existing deterministic PAC-Bayes based bounds are in fact two-sided in that they are lower bounded by ϵunif-alg\epsilon_{\text{unif-alg}} too.

For a given prior distribution PP over the parameters, a PAC-Bayesian bound is of the following form: with high probability 1δ1-\delta over the draws of the data SS, we have that for all distributions QQ over the hypotheses space:

Note that here for any a,ba,b\in, KL(ab)=alnab+(1a)ln1a1bKL(a\|b)=a\ln\frac{a}{b}+(1-a)\ln\frac{1-a}{1-b}. Since the precise form of the PAC-Bayesian bound on the right hand side is not relevant for the rest of the discussion, we will concisely refer to it as ϵpb(P,Q,m,δ)\epsilon_{\textrm{pb}}(P,Q,m,\delta). What is of interest to us is the fact that the above bound holds for all QQ for most draws of SS and that the KL-divergence on the right-hand side is in itself two-sided, in some sense.

Typically, the above bound is simplified to derive the following one-sided bound on the difference between the expected and empirical errors of a stochastic network (see for example):

This bound is then manipulated in different ways to obtain bounds on the deterministic network. In the rest of this discussion, we focus on the two major such derandomizing techniques and argue that both these techniques boil down to two-sided convergence. While, we do not formally establish that there may exist other techniques which ensure that the resulting deterministic bound is strictly one-sided, we suspect that no such techniques may exist. This is because the KL-divergence bound in Equation 27 is in itself two-sided in the sense that for the right hand side bound to be small, both the stochastic test and train errors must be close to each other; it is not sufficient if the stochastic test error is smaller than the stochastic train error.

To derive a deterministic generalization bound, one approach is to add extra terms that account for the perturbation in the loss of the network . That is, define:

Then, one can get a deterministic upper bound as:

Note that while applying this technique, for any hypothesis hh, one picks a posterior QhQ_{h} specific to that hypothesis (typically, centered at that hypothesis).

We formally define the deterministic bound resulting from this technique below. We consider the algorithm-dependent version and furthermore, we consider a bound that results from the best possible choice of QhQ_{h} for all hh. We define this deterministic bound in the format of ϵunif-alg\epsilon_{\text{unif-alg}} as follows:

The distribution-dependent, algorithm-dependent, deterministic PAC-Bayesian bound of (the hypothesis class H\mathcal{H}, algorithm A\mathcal{A})-pair with respect to L\mathcal{L} is defined to be the smallest value ϵpb-det-A(m,δ)\epsilon_{\text{pb-det-A}}(m,\delta) such that the following holds:

there exists a set of mm-sized samples Sδ(X×{1,+1})m\mathcal{S}_{\delta}\subseteq(\mathcal{X}\times\{-1,+1\})^{m} for which:

and if we define Hδ=SSδ{hS}\mathcal{H}_{\delta}=\bigcup_{S\in\mathcal{S}_{\delta}}\{h_{S}\} to be the space of hypotheses explored only on these samples, then there must exist a prior PP and for each hHδh\in\mathcal{H}_{\delta}, a distribution QhQ_{h}, such that uniform convergence must hold as follows:

as a result of which, by Equation 28, the following one-sided uniform convergence also holds:

Now, recall that ϵunif-alg(m,δ)\epsilon_{\text{unif-alg}}(m,\delta) is a two-sided bound, and in fact our main proof crucially depended on this fact in order to lower bound ϵunif-alg(m,δ)\epsilon_{\text{unif-alg}}(m,\delta). Hence, to extend our lower bound to ϵpb-det-A(m,δ)\epsilon_{\text{pb-det-A}}(m,\delta) we need to show that it is also two-sided in that it is lower bounded by ϵunif-alg(m,δ)\epsilon_{\text{unif-alg}}(m,\delta). The following result establishes this:

Let A\mathcal{A} be an algorithm such that on at least 1δ1-\delta draws of the training dataset SS, the algorithm outputs a hypothesis hSh_{S} that has ϵ^(m,δ)\hat{\epsilon}(m,\delta) loss on the training data SS. Then

First, by the definition of the generalization error, we know that with probability at least 1δ1-\delta over the draws of SS,

Furthermore since the training loss it at most ϵ^(m,δ)\hat{\epsilon}(m,\delta) on at least 1δ1-\delta draws we have that on at least 12δ1-2\delta draws of the dataset,

Let Hδ\mathcal{H}_{\delta} and Sδ\mathcal{S}_{\delta} be the subset of hypotheses and sample sets as in the definition of ϵpb-det-A\epsilon_{\text{pb-det-A}}. Then, from the above, there exist H3δHδ\mathcal{H}_{3\delta}\subseteq\mathcal{H}_{\delta} and S3δSδ\mathcal{S}_{3\delta}\subseteq\mathcal{S}_{\delta} such that

and H3δ=SS3δ{hS}\mathcal{H}_{3\delta}=\bigcup_{S\in\mathcal{S}_{3\delta}}\{h_{S}\}, and furthermore,

Using the above, and the definition of Δ\Delta, we have for all hH3δh\in\mathcal{H}_{3\delta}, the following upper bound on its stochastic test error:

We consider two cases. First, for some hH3δh\in\mathcal{H}_{3\delta} and SS3δS\in\mathcal{S}_{3\delta}, consider the case that e3/2b>ae^{3/2}b>a. Then, we have

Now consider the case where a>e3/2ba>e^{3/2}b. This means that (1a)<(1b)(1-a)<(1-b). Then, if we consider the PAC-Bayesian bound of Equation 27,

on the second term, we can apply the inequality lnx(x1)(x+1)2x=12(x1x)\ln x\geq\frac{(x-1)(x+1)}{2x}=\frac{1}{2}\left(x-\frac{1}{x}\right) which holds for xx\in to get:

Plugging this back in Equation 33, we have,

Since, for all hH3δh\in\mathcal{H}_{3\delta} and SS3δS\in\mathcal{S}_{3\delta}, one of Equations 32 and 34 hold, we have that:

It follows from Equation 30 that the above bound holds good even after we take the absolute value of the first term in the left hand side. However, the absolute value is lower-bounded by ϵunif-alg(m,3δ)\epsilon_{\text{unif-alg}}(m,3\delta) (which follows from how ϵunif-alg(m,3δ)\epsilon_{\text{unif-alg}}(m,3\delta) is defined to be the smallest possible value over the choices of H3δ,S3δ\mathcal{H}_{3\delta},\mathcal{S}_{3\delta}).

As a result of the above theorem, we can show that ϵpb-det-A(m,δ)=Ω(1)O(ϵ)\epsilon_{\textrm{pb-det-A}}(m,\delta)={\Omega}(1)-\mathcal{O}(\epsilon), thus establishing that, for sufficiently large DD, even though the generalization error would be negligibly small, the PAC-Bayes based bound would be as large as a constant.

In the setup of Section 3, for any ϵ,δ>0,δ<1/12\epsilon,\delta>0,\delta<1/12, when D=Ω(max(mln3δ,mln1ϵ))D=\Omega\left(\max\left(m\ln\frac{3}{\delta},m\ln\frac{1}{\epsilon}\right)\right), we have,

The fact that ϵgen(m,δ)ϵ\epsilon_{\text{gen}}(m,\delta)\leq\epsilon follows from Theorem 3.1. Additionally, ϵ^(m,δ)=0\hat{\epsilon}(m,\delta)=0 follows from the proof of Theorem 3.1. Now, as long as 3δ<1/43\delta<1/4, and DD is sufficiently large (i.e., in the lower bounds on DD in Theorem 3.1, if we replace δ\delta by 3δ3\delta), we have from Theorem 3.1 that ϵunif-alg(m,3δ)>1ϵ\epsilon_{\text{unif-alg}}(m,3\delta)>1-\epsilon. Plugging these in Theorem J.1, we get the result in the above corollary. ∎

J.2 Deterministic PAC-Bayesian Bounds of Type B

In this section, we consider another standard approach to making PAC-Bayesian bounds deterministic . Here, the idea is to pick for each hh a distribution QhQ_{h} such that for all x\boldsymbol{\mathbf{x}}:

Then, by applying the PAC-Bayesian bound of Equation 28 for the loss Lγ/2\mathcal{L}^{\prime}_{\gamma/2}, one can get a deterministic upper bound as follows, without having to introduce the extra Δ\Delta terms,

The distribution-dependent, algorithm-dependent, deterministic PAC-Bayesian bound of (the hypothesis class H\mathcal{H}, algorithm A\mathcal{A})-pair is defined to be the smallest value ϵpb-det-B(m,δ)\epsilon_{\text{pb-det-B}}(m,\delta) such that the following holds:

there exists a set of mm-sized samples Sδ(X×{1,+1})m\mathcal{S}_{\delta}\subseteq(\mathcal{X}\times\{-1,+1\})^{m} for which:

and if we define Hδ=SSδ{hS}\mathcal{H}_{\delta}=\bigcup_{S\in\mathcal{S}_{\delta}}\{h_{S}\} to be the space of hypotheses explored only on these samples, then there must exist a prior PP and for each hh a distribution QhQ_{h}, such that uniform convergence must hold as follows: for all SSδS\in\mathcal{S}_{\delta} and for all hHδh\in\mathcal{H}_{\delta},

as a result of which the following one-sided uniform convergence also holds:

We can similarly show that ϵpb-det-B(m,δ)\epsilon_{\text{pb-det-B}}(m,\delta) is lower-bounded by the uniform convergence bound of ϵunif-alg\epsilon_{\text{unif-alg}} too.

Let A\mathcal{A} be an algorithm such that on at least 1δ1-\delta draws of the training dataset SS, the algorithm outputs a hypothesis hSh_{S} such that the margin-based training loss can be bounded as:

and with high probability 1δ1-\delta over the draws of SS, the generalization error can be bounded as:

Then there exists a set of samples S3δ\mathcal{S}_{3\delta} of mass at least 13δ1-3\delta, and a corresponding set of hypothesis H3δ\mathcal{H}_{3\delta} learned on these sample sets such that:

Note that the above statement is slightly different from how Theorem J.1 is stated as it is not expressed in terms of ϵunif-alg\epsilon_{\text{unif-alg}}. In the corollary that follows the proof of this statement, we will see how it can be reduced in terms of ϵunif-alg\epsilon_{\text{unif-alg}}.

Most of the proof is similar to the proof of Theorem J.1. Like in the proof of Theorem J.1, we can argue that there exists S3δ\mathcal{S}_{3\delta} and H3δ\mathcal{H}_{3\delta} for which the test error can be bounded as,

where we have used ϵgen(m,δ)\epsilon_{\text{gen}}(m,\delta) to denote the generalization error of L(γ)\mathcal{L}^{\prime(\gamma)} and not the 0-1 error (we note that this is ambiguous notation, but we keep it this way for simplicity).

Now consider the case where a>e3/2ba>e^{3/2}b. Again, by similar arithmetic manipulation in the PAC-Bayesian bound of Equation 28 applied on L(γ/2)\mathcal{L}^{\prime(\gamma/2)}, we get,

Since, for all hH3δh\in\mathcal{H}_{3\delta} and SS3δS\in\mathcal{S}_{3\delta}, one of Equations 37 and 38 hold, we have the claimed result.

Similarly, as a result of the above theorem, we can show that ϵpb-det-B(m,δ)=Ω(1)O(ϵ)\epsilon_{\textrm{pb-det-B}}(m,\delta)={\Omega}(1)-\mathcal{O}(\epsilon), thus establishing that, for sufficiently large DD, even though the generalization error would be negligibly small, the PAC-Bayes based bound would be as large as a constant and hence cannot explain generalization.

In the setup of Section 3, for any ϵ,δ>0,δ<1/12\epsilon,\delta>0,\delta<1/12, when D=Ω(max(mln3δ,mln1ϵ))D=\Omega\left(\max\left(m\ln\frac{3}{\delta},m\ln\frac{1}{\epsilon}\right)\right), we have,

It follows from the proof of Theorem 3.1 that ϵ^(m,δ)=0\hat{\epsilon}(m,\delta)=0, since all training points are classified by a margin of γ\gamma (see Equation 9). Similarly, from Equation 12 in that proof, since most test points are classified by a margin of γ\gamma, ϵgen(m,δ)ϵ\epsilon_{\text{gen}}(m,\delta)\leq\epsilon. Now, as long as 3δ<1/43\delta<1/4, and DD is sufficiently large (i.e., in the lower bounds on DD in Theorem 3.1, if we replace δ\delta by 3δ3\delta), we will get that there exists SS3δS\in\mathcal{S}_{3\delta} and hH3δh\in\mathcal{H}_{3\delta} for which the empirical loss L(0)\mathcal{L}^{(0)} loss is 11. Then, by Theorem J.2, we get the result in the above corollary. ∎