Uniform convergence may be unable to explain generalization in deep learning

Vaishnavh Nagarajan, J. Zico Kolter

Introduction

Explaining why overparameterized deep networks generalize well has become an important open question in deep learning. How is it possible that a large network can be trained to perfectly fit randomly labeled data (essentially by memorizing the labels), and yet, the same network when trained to perfectly fit real training data, generalizes well to unseen data? This called for a “rethinking” of conventional, algorithm-independent techniques to explain generalization. Specifically, it was argued that learning-theoretic approaches must be reformed by identifying and incorporating the implicit bias/regularization of stochastic gradient descent (SGD) . Subsequently, a huge variety of novel and refined, algorithm-dependent generalization bounds for deep networks have been developed, all based on uniform convergence, the most widely used tool in learning theory. The ultimate goal of this ongoing endeavor is to derive bounds on the generalization error that (a) are small, ideally non-vacuous (i.e., $<1$ ), (b) reflect the same width/depth dependence as the generalization error (e.g., become smaller with increasing width, as has been surprisingly observed in practice), (c) apply to the network learned by SGD (without any modification or explicit regularization) and (d) increase with the proportion of randomly flipped training labels (i.e., increase with memorization).

While every bound meets some of these criteria (and sheds a valuable but partial insight into generalization in deep learning), there is no known bound that meets all of them simultaneously. While most bounds apply to the original network, they are neither numerically small for realistic dataset sizes, nor exhibit the desired width/depth dependencies (in fact, these bounds grow exponentially with the depth). The remaining bounds hold either only on a compressed network or a stochastic network or a network that has been further modified via optimization or more than one of the above . Extending these bounds to the original network is understood to be highly non-trivial . While strong width-independent bounds have been derived for two-layer ReLU networks , these rely on a carefully curated, small learning rate and/or large batch size. (We refer the reader to Appendix A for a tabular summary of these bounds.)

In our paper, we bring to light another fundamental issue with existing bounds. We demonstrate that these bounds violate another natural but largely overlooked criterion for explaining generalization: (e) the bounds should decrease with the dataset size at the same rate as the generalization error. In fact, we empirically observe that these bounds can increase with dataset size, which is arguably a more concerning observation than the fact that they are large for a specific dataset size.

Motivated by the seemingly insurmountable hurdles towards developing bounds satisfying all the above five necessary criteria, we take a step back and examine how the underlying technique of uniform convergence may itself be inherently limited in the overparameterized regime. Specifically, we present examples of overparameterized linear classifiers and neural networks trained by GD (or SGD) where uniform convergence can provably fail to explain generalization. Intuitively, our examples highlight that overparameterized models trained by gradient descent can learn decision boundaries that are largely “simple” – and hence generalize well – but have “microscopic complexities” which cannot be explained away by uniform convergence. Thus our results call into question the active ongoing pursuit of using uniform convergence to fully explain generalization in deep learning.

We first show that in practice certain weight norms of deep ReLU networks, such as the distance from initialization, increase polynomially with the number of training examples (denoted by $m$ ). We then show that as a result, existing generalization bounds – all of which depend on such weight norms – fail to reflect even a dependence on $m$ even reasonably similar to the actual test error, violating criterion (e); for sufficiently small batch sizes, these bounds even grow with the number of examples. This observation uncovers a conceptual gap in our understanding of the puzzle, by pointing towards a source of vacuity unrelated to parameter count.

As our second contribution, we consider three example setups of overparameterized models trained by (stochastic) gradient descent – a linear classifier, a sufficiently wide neural network with ReLUs and an infinite width neural network with exponential activations (with the hidden layer weights frozen) – that learn some underlying data distribution with small generalization error (say, at most $\epsilon$ ). These settings also simulate our observation that norms such as distance from initialization grow with dataset size $m$ . More importantly, we prove that, in these settings, any two-sided uniform convergence bound would yield a (nearly) vacuous generalization bound.

Notably, this vacuity holds even if we “aggressively” take implicit regularization into account while applying uniform convergence – described more concretely as follows. Recall that roughly speaking a uniform convergence bound essentially evaluates the complexity of a hypothesis class (see Definition 3.2). As suggested by Zhang et al. , one can tighten uniform convergence bounds by pruning the hypothesis class to remove extraneous hypotheses never picked by the learning algorithm for the data distribution of interest. In our setups, even if we apply uniform convergence on the set of only those hypotheses picked by the learner whose test errors are all negligible (at most $\epsilon$ ), one can get no better than a nearly vacuous bound on the generalization error (that is at least $1-\epsilon$ ). In this sense, we say that uniform convergence provably cannot explain generalization in our settings. Finally, we note that while nearly all existing uniform convergence-based techniques are two-sided, we show that even PAC-Bayesian bounds, which are typically presented only as one-sided convergence, also boil down to nearly vacuous guarantees in our settings.

1 Related Work

Prior works like Neyshabur et al. and Nagarajan and Kolter have studied the behavior of weight norms in deep learning. Although these works do not explicitly study the dependence of these norms on training set size $m$ , one can infer from their plots that weight norms of deep networks show some increase with $m$ . Belkin et al. reported a similar paradox in kernel learning, observing that norms that appear in kernel generalization bounds increase with $m$ , and that this is due to noise in the labels. Kawaguchi et al. showed that there exist linear models with arbitrarily large weight norms that can generalize well, although such weights are not necessarily found by gradient descent. We crucially supplement these observations in three ways. First, we empirically and theoretically demonstrate how, even with zero label noise (unlike ) and by gradient descent (unlike ), a significant level of $m$ -dependence can arise in the weight norms – significant enough to make even the generalization bound grow with $m$ . Next, we identify uniform convergence as the root cause behind this issue, and thirdly and most importantly, we provably demonstrate this is so.

Weaknesses of Uniform Convergence.

Traditional wisdom is that uniform convergence bounds are a bad choice for complex classifiers like k-nearest neighbors because these hypotheses classes have infinite VC-dimension (which motivated the need for stability based generalization bounds in these cases ). However, this sort of an argument against uniform convergence may still leave one with the faint hope that, by aggressively pruning the hypothesis class (depending on the algorithm and the data distribution), one can achieve meaningful uniform convergence. In contrast, we seek to rigorously and thoroughly rule out uniform convergence in the settings we study. We do this by first defining the tightest form of uniform convergence in Definition 3.3 – one that lower bounds any uniform convergence bound – and then showing that even this bound is vacuous in our settings. Additionally, we note that we show this kind of failure of uniform convergence for linear classifiers, which is a much simpler model compared to k-nearest neighbors.

For deep networks, Zhang et al. showed that applying uniform convergence on the whole hypothesis class fails, and that it should instead be applied in an algorithm-dependent way. Ours is a much different claim – that uniform convergence is inherently problematic in that even the algorithm-dependent application would fail – casting doubt on the rich line of post-Zhang et al. algorithm-dependent approaches. At the same time, we must add the disclaimer that our results do not preclude the fact that uniform convergence may still work if GD is run with explicit regularization (such as weight decay). Such a regularized setting however, is not the main focus of the generalization puzzle .

Prior works have also focused on understanding uniform convergence for learnability of learning problems. Roughly speaking, learnability is a strict notion that does not have to hold even though an algorithm may generalize well for simple distributions in a learning problem. While we defer the details of these works in Appendix I, we emphasize here that these results are orthogonal to (i.e., neither imply nor contradict) our results.

Existing bounds vs. training set size

As we stated in criterion (e) in the introduction, a fundamental requirement from a generalization bound, however numerically large the bound may be, is that it should vary inversely with the size of the training dataset size $(m)$ like the observed generalization error. Such a requirement is satisfied even by standard parameter-count-based VC-dimension bounds, like $\mathcal{O}(dh/\sqrt{m})$ for depth $d$ , width $h$ ReLU networks . Recent works have “tightened” the parameter-count-dependent terms in these bounds by replacing them with seemingly innocuous norm-based quantities; however, we show below that this has also inadvertently introduced training-set-size-count dependencies in the numerator, contributing to the vacuity of bounds. With these dependencies, the generalization bounds even increase with training dataset size for small batch sizes.

Setup and notations. We focus on fully connected networks of depth $d=5$ , width $h=1024$ trained on MNIST, although we consider other settings in Appendix B. We use SGD with learning rate $0.1$ and batch size $1$ to minimize cross-entropy loss until $99\%$ of the training data are classified correctly by a margin of at least $\gamma^{\star}=10$ i.e., if we denote by $f(\boldsymbol{\mathbf{x}})[y]$ the real-valued logit output (i.e., pre-softmax) on class $y$ for an input $\boldsymbol{\mathbf{x}}$ , we ensure that for $99\%$ of the data $(\boldsymbol{\mathbf{x}},y)$ , the margin $\Gamma(f(\boldsymbol{\mathbf{x}}),y):=f(\boldsymbol{\mathbf{x}})[y]-\max_{y^{\prime}\neq y}f(\boldsymbol{\mathbf{x}})[y^{\prime}]$ is at least $\gamma^{\star}$ . We emphasize that, from the perspective of generalization guarantees, this stopping criterion helps standardize training across different hyperparameter values, including different values of $m$ . Now, observe that for this particular stopping criterion, the test error empirically decreases with size $m$ as $1/m^{0.43}$ as seen in Figure 1 (third plot). However, we will see that the story is starkly different for the generalization bounds.

The bounds grow with training set size m𝑚m.

We now turn to evaluating existing guarantees from Neyshabur et al. and Bartlett et al. . As we note later, our observations apply to many other bounds too. Let $W_{1},\ldots,W_{d}$ be the weights of the learned network (with $W_{1}$ being the weights adjacent to the inputs), $Z_{1},\ldots,Z_{d}$ the random initialization, $\mathcal{D}$ the true data distribution and $S$ the training dataset. For all inputs $\boldsymbol{\mathbf{x}}$ , let $\|\boldsymbol{\mathbf{x}}\|_{2}\leq B$ . Let $\|\cdot\|_{2},\|\cdot\|_{F},\|\cdot\|_{2,1}$ denote the spectral norm, the Frobenius norm and the matrix $(2,1)$ -norm respectively; let $\mathbf{1}[\cdot]$ be the indicator function. Recall that $\Gamma(f(\boldsymbol{\mathbf{x}}),y):=f(\boldsymbol{\mathbf{x}})[y]-\max_{y^{\prime}\neq y}f(\boldsymbol{\mathbf{x}})[y^{\prime}]$ denotes the margin of the network on a datapoint. Then, for any constant $\gamma$ , these generalization guarantees are written as follows, ignoring log factors:

Here the generalization error bound is of the form $\mathcal{O}\left(\frac{Bd\sqrt{h}}{\gamma\sqrt{m}}\prod_{k=1}^{d}\|W_{k}\|_{2}\times\texttt{dist}\right)$ where dist equals $\sqrt{\sum_{k=1}^{d}\frac{\|W_{k}-Z_{k}\|_{F}^{2}}{\|W_{k}\|^{2}_{2}}}$ in and $\frac{1}{d\sqrt{h}}\left({\sum_{k=1}^{d}\left(\frac{\|W_{k}-Z_{k}\|_{2,1}}{\|W_{k}\|_{2}}\right)^{2/3}}\right)^{3/2}$ in .

In our experiments, since we train the networks to fit at least $99\%$ of the datapoints with a margin of $10$ , in the above bounds, we set $\gamma=10$ so that the first train error term in the right hand side of Equation 1 becomes a small value of at most $0.01$ . We then plot in Figure 1 (fourth plot), the second term above, namely the generalization error bounds, and observe that all these bounds grow with the sample size $m$ as $\Omega(m^{0.68})$ , thanks to the fact that the terms in the numerator of these bounds grow with $m$ . Since we are free to plug in $\gamma$ in Equation 1, one may wonder whether there exists a better choice of $\gamma$ for which we can observe a smaller increase on $m$ (since the plotted terms inversely depend on $\gamma$ ). However, in Appendix Figure 6 we establish that even for larger values of $\gamma$ , this $m$ -dependence remains. Also note that, although we do not plot the bounds from , these have nearly identical norms in their numerator, and so one would not expect these bounds to show radically better behavior with respect to $m$ . Finally, we defer experiments conducted for other varied settings, and the neural network bound from to Appendix B.

While the bounds might show better $m$ -dependence for other settings – indeed, for larger batches, we show in Appendix B that the bounds behave better – we believe that the egregious break down of these bounds in this setting (and many other hyperparameter settings as presented in Appendix B) must imply fundamental issues with the bounds themselves. While this may be addressed to some extent with a better understanding of implicit regularization in deep learning, we regard our observations as a call for taking a step back and clearly understanding any inherent limitations in the theoretical tool underlying all these bounds namely, uniform convergence. Side note: Before we proceed to the next section, where we blame uniform convergence for the above problems, we briefly note that we considered another simpler possibility. Specifically, we hypothesized that, for some (not all) existing bounds, the above problems could arise from an issue that does not involve uniform convergence, which we term as pseudo-overfitting. Roughly speaking, a classifier pseudo-overfits when its decision boundary is simple but its real-valued output has large “bumps” around some or all of its training datapoint. As discussed in Appendix C, deep networks pseudo-overfit only to a limited extent, and hence psuedo-overfitting does not provide a complete explanation for the issues faced by these bounds.

Provable failure of uniform convergence

For a given $\delta\in(0,1)$ , the generalization error of the algorithm is essentially a bound on the difference between the error of the hypothesis $h_{S}$ learned on a training set $S$ and the expected error over $\mathcal{D}$ , that holds with high probability of at least $1-\delta$ over the draws of $S$ . More formally:

The generalization error of $\mathcal{A}$ with respect to loss $\mathcal{L}$ is the smallest value $\epsilon_{\text{gen}}(m,\delta)$ such that: $\textrm{Pr}_{S\sim\mathcal{D}^{m}}\left[\mathcal{L}_{\mathcal{D}}(h_{S})-\hat{\mathcal{L}}_{S}(h_{S})\leq\epsilon_{\text{gen}}(m,\delta)\right]\geq 1-\delta$ .

To theoretically bound the generalization error of the algorithm, the most common approach is to provide a two-sided uniform convergence bound on the hypothesis class used by the algorithm, where, for a given draw of $S$ , we look at convergence for all the hypotheses in $\mathcal{H}$ instead of just $h_{S}$ :

The uniform convergence bound with respect to loss $\mathcal{L}$ is the smallest value $\epsilon_{\text{unif}}(m,\delta)$ such that: $\textrm{Pr}_{S\sim\mathcal{D}^{m}}\left[\sup_{h\in\mathcal{H}}\left|\mathcal{L}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)\right|\leq\epsilon_{\text{unif}}(m,\delta)\right]\geq 1-\delta$ .

Tightest algorithm-dependent uniform convergence.

The bound given by $\epsilon_{\text{unif}}$ can be tightened by ignoring many extraneous hypotheses in $\mathcal{H}$ never picked by $\mathcal{A}$ for a given simple distribution $\mathcal{D}$ . This is typically done by focusing on a norm-bounded class of hypotheses that the algorithm $\mathcal{A}$ implicitly restricts itself to. Let us take this to the extreme by applying uniform convergence on “the smallest possible class” of hypotheses, namely, only those hypotheses that are picked by $\mathcal{A}$ under $\mathcal{D}$ , excluding everything else. Observe that pruning the hypothesis class any further would not imply a bound on the generalization error, and hence applying uniform convergence on this aggressively pruned hypothesis class would yield the tightest possible uniform convergence bound. Recall that we care about this formulation because our goal is to rigorously and thoroughly rule out the possibility that no kind of uniform convergence bound, however cleverly applied, can explain generalization in our settings of interest (which we will describe later).

To formally capture this bound, it is helpful to first rephrase the above definition of $\epsilon_{\text{unif}}$ : we can say that $\epsilon_{\text{unif}}(m,\delta)$ is the smallest value for which there exists a set of sample sets $\mathcal{S}_{\delta}\subseteq(\mathcal{X}\times\{-1,1\})^{m}$ for which $Pr_{S\sim\mathcal{D}^{m}}[S\in\mathcal{S}_{\delta}]\geq 1-\delta$ and furthermore, $\sup_{S\in\mathcal{S}_{\delta}}\sup_{h\in\mathcal{H}}|\mathcal{L}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)|\leq\epsilon_{\text{unif}}(m,\delta)$ . Observe that this definition is equivalent to Definition 3.2. Extending this rephrased definition, we can define the tightest uniform convergence bound by replacing $\mathcal{H}$ here with only those hypotheses that are explored by the algorithm $\mathcal{A}$ under the datasets belonging to $\mathcal{S}_{\delta}$ :

The tightest algorithm-dependent uniform convergence bound with respect to loss $\mathcal{L}$ is the smallest value $\epsilon_{\text{unif-alg}}(m,\delta)$ for which there exists a set of sample sets $\mathcal{S}_{\delta}$ such that $Pr_{S\sim\mathcal{D}^{m}}[S\in\mathcal{S}_{\delta}]\geq 1-\delta$ and if we define the space of hypotheses explored by $\mathcal{A}$ on $\mathcal{S}_{\delta}$ as $\mathcal{H}_{\delta}:=\bigcup_{S\in\mathcal{S}_{\delta}}\{h_{S}\}\subseteq\mathcal{H}$ , the following holds: $\sup_{S\in\mathcal{S}_{\delta}}\sup_{h\in\mathcal{H}_{\delta}}\left|\mathcal{L}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)\right|\leq\epsilon_{\text{unif-alg}}(m,\delta)$ .

In the following sections, through examples of overparameterized models trained by GD (or SGD), we argue how even the above tightest algorithm-dependent uniform convergence can fail to explain generalization. i.e., in these settings, even though $\epsilon_{\text{gen}}$ is smaller than a negligible value $\epsilon$ , we show that $\epsilon_{\text{unif-alg}}$ is large (specifically, at least $1-\epsilon$ ). Before we delve into these examples, below we quickly outline the key mathematical idea by which uniform convergence is made to fail.

1 High-dimensional linear classifier

Although we present a neural network example in the next section, we first emphasize why it is also important to understand how uniform convergence could fail for linear classifiers trained using GD. First, it is more natural to expect uniform convergence to yield poorer bounds in more complicated classifiers; linear models are arguably the simplest of classifiers, and hence showing failure of uniform convergence in these models is, in a sense, the most interesting. Secondly, recent works (e.g., ) have shown that as the width of a deep network goes to infinity, under some conditions, the network converges to a high-dimensional linear model (trained on a high-dimensional transformation of the data) – thus making the study of high-dimensional linear models relevant to us. Note that our example is not aimed at modeling the setup of such linearized neural networks. However, it does provide valuable intuition about the mechanism by which uniform convergence fails, and we show how this extends to neural networks in the later sections.

For any $\epsilon,\delta>0,\delta\leq 1/4$ , when $D=\Omega\left(\max\left(m\ln\frac{m}{\delta},m\ln\frac{1}{\epsilon}\right)\right)$ , $\gamma\in$ , the $\mathcal{L}^{(\gamma)}$ loss satisfies $\epsilon_{\text{gen}}(m,\delta)\leq\epsilon$ , while $\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon$ . Furthermore, for all $\gamma\geq 0$ , for the $\mathcal{L}^{(\gamma)}$ loss, $\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon_{\text{gen}}(m,\delta)$ .

Proof outline.

We now provide an outline of our argument for Theorem 3.1, deferring the proof to the appendix. First, the small generalization (and test) error arises from the fact that $\boldsymbol{\mathbf{w}}_{1}$ is aligned correctly along the true boundary; at the same time, the noisy part of the classifier $\boldsymbol{\mathbf{w}}_{2}$ is poorly aligned with at least $1-\epsilon$ mass of the test inputs, and hence does not dominate the output of the classifier on test data – preserving the good fit of $\boldsymbol{\mathbf{w}}_{1}$ on the test data. On the other hand, at a very high level, under the purview of uniform convergence, we can argue that the noise vector $\boldsymbol{\mathbf{w}}_{2}$ is effectively stripped of its randomness. This misleads uniform convergence into believing that the $D$ noisy dimensions (where $D>m$ ) contribute meaningfully to the representational complexity of the classifier, thereby giving nearly vacuous bounds. We describe this more concretely below.

As a key step in our argument, we show that w.h.p over draws of $S$ , even though the learned classifier $h_{S}$ correctly classifies most of the randomly picked test data, it completely misclassifies a “bad” dataset, namely $S^{\prime}=\{((\boldsymbol{\mathbf{x}}_{1},-\boldsymbol{\mathbf{x}}_{2}),y)\;|\;(\boldsymbol{\mathbf{x}},y)\in S\}$ which is the noise-negated version of $S$ . Now recall that to compute $\epsilon_{\text{unif-alg}}$ one has to begin by picking a sample set space $\mathcal{S}_{\delta}$ of mass $1-\delta$ . We first argue that for any choice of $\mathcal{S}_{\delta}$ , there must exist $S_{\star}$ such that all the following four events hold: (i) $S_{\star}\in\mathcal{S}_{\delta}$ , (ii) the noise-negated $S_{\star}^{\prime}\in\mathcal{S}_{\delta}$ , (iii) $h_{S_{\star}}$ has test error less than $\epsilon$ and (iv) $h_{S_{\star}}$ completely misclassifies $S_{\star}^{\prime}$ . We prove the existence of such an $S_{\star}$ by arguing that over draws from $\mathcal{D}^{m}$ , there is non-zero probability of picking a dataset that satisfies these four conditions. Note that our argument for this crucially makes use of the fact that we have designed the “bad” dataset in a way that it has the same distribution as the training set, namely $\mathcal{D}^{m}$ . Finally, for a given $\mathcal{S}_{\delta}$ , if we have an $S_{\star}$ satisfying (i) to (iv), we can prove our claim as $\epsilon_{\text{unif-alg}}(m,\delta)=\sup_{S\in\mathcal{S}_{\delta}}\sup_{h\in\mathcal{H}_{\delta}}|{\mathcal{L}}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)|\geq|{\mathcal{L}}_{\mathcal{D}}(h_{S_{\star}})-\hat{\mathcal{L}}_{S_{\star}^{\prime}}(h_{S_{\star}})|=|\epsilon-1|=1-\epsilon$ .

Our analysis depends on the fact that $\epsilon_{\text{unif-alg}}$ is a two-sided convergence bound – which is what existing techniques bound – and our result would not apply for hypothetical one-sided uniform convergence bounds. While PAC-Bayes based bounds are typically presented as one-sided bounds, we show in Appendix J that even these are lower-bounded by the two-sided $\epsilon_{\text{unif-alg}}$ . To the best of our knowledge, it is non-trivial to make any of these tools purely one-sided.

The classifier modified by setting $\boldsymbol{\mathbf{w}}_{2}\leftarrow 0$ , has small test error and also enjoys non-vacuous bounds as it has very few parameters. However, such a bound would not fully explain why the original classifier generalizes well. One might then wonder if such a bound could be extended to the original classifier, like it was explored in Nagarajan and Kolter for deep networks. Our result implies that no such extension is possible in this particular example.

2 ReLU neural network

We now design a non-linearly separable task (with no “noisy” dimensions) where a sufficiently wide ReLU network trained in the standard manner, like in the experiments of Section 2 leads to failure of uniform convergence. For our argument, we will rely on a classifier trained empirically, in contrast to our linear examples where we rely on an analytically derived expression for the learned classifier. Thus, this section illustrates that the effects we modeled theoretically in the linear classifier are indeed reflected in typical training settings, even though here it is difficult to precisely analyze the learning process. We also refer the reader to Appendix F, where we present an example of a neural network with exponential activation functions for which we do derive a closed form expression.

Setup. We consider a distribution that was originally proposed in as the “adversarial spheres” dataset (although with slightly different hyperparameters) and was used to study the independent phenomenon of adversarial examples. Specifically, we consider 1000-dimensional data, where two classes are distributed uniformly over two origin-centered hyperspheres with radius $1$ and $1.1$ respectively. We vary the number of training examples from $4k$ to $65k$ (thus ranging through typical dataset sizes like that of MNIST). Observe that compared to the linear example, this data distribution is more realistic in two ways. First, we do not have specific dimensions in the data that are noisy and second, the data dimensionality here as such is a constant less than $m$ . Given samples from this distribution, we train a two-layer ReLU network with $h=100k$ to minimize cross entropy loss using SGD with learning rate $0.1$ and batch size $64$ . We train the network until $99\%$ of the data is classified by a margin of $10$ .

As shown in Figure 2 (blue line), in this setup, the 0-1 error (i.e., $\mathcal{L}^{(0)}$ ) as approximated by the test set, decreases with $m\in[2^{12},2^{16}]$ at the rate of $O(m^{-0.5})$ . Now, to prove failure of uniform convergence, we empirically show that a completely misclassified “bad” dataset $S^{\prime}$ can be constructed in a manner similar to that of the previous example. In this setting, we pick $S^{\prime}$ by simply projecting every training datapoint on the inner hypersphere onto the outer and vice versa, and then flipping the labels. Then, as shown in Figure 2 (orange line), $S^{\prime}$ is completely misclassified by the learned network. Furthermore, like in the previous example, we have $S^{\prime}\sim\mathcal{D}^{m}$ because the distributions are uniform over the hyperspheres. Having established these facts, the rest of the argument follows like in the previous setting, implying failure of uniform convergence as in Theorem 3.1 here too.

In Figure 2 (right), we visualize how the learned boundaries are skewed around the training data in a way that $S^{\prime}$ is misclassified. Note that $S^{\prime}$ is misclassified even when it has as many as $60k$ points, and even though the network was not explicitly trained to misclassify those points. Intuitively, this demonstrates that the boundary learned by the ReLU network has sufficient complexity that hurts uniform convergence while not affecting the generalization error, at least in this setting. We discuss the applicability of this observation to other hyperparameter settings in Appendix G.2.

While we use the same adversarial spheres distribution as and similarly show the existence a certain kind of an adversarial dataset, it is important to note that neither of our observations implies the other. Indeed, the observations in are insufficient to prove failure of uniform convergence. Specifically, show that in the adversarial spheres setting, it is possible to slightly perturb random test examples in some arbitrary direction to discover a misclassified example. However, to show failure of uniform convergence, we need to find a set of misclassified examples $S^{\prime}$ corresponding to the training examples $S$ , and furthermore, we do not want $S^{\prime}$ to be arbitrary. We want $S^{\prime}$ to have the same underlying distribution, $\mathcal{D}^{m}$ .

Deep learning conjecture.

Extending the above insights more generally, we conjecture that in overparameterized deep networks, SGD finds a fit that is simple at a macroscopic level (leading to good generalization) but also has many microscopic fluctuations (hurting uniform convergence). To make this more concrete, for illustration, consider the high-dimensional linear model that sufficiently wide networks have been shown to converge to . That is, roughly, these networks can be written as $h(\boldsymbol{\mathbf{x}})=\boldsymbol{\mathbf{w}}^{T}\boldsymbol{\mathbf{\phi}}(\boldsymbol{\mathbf{x}})$ where $\boldsymbol{\mathbf{\phi}}(\boldsymbol{\mathbf{x}})$ is a rich high-dimensional representation of $\boldsymbol{\mathbf{x}}$ computed from many random features (chosen independent of training data). Inspired by our linear model in Section 3.1, we conjecture that the weights $\boldsymbol{\mathbf{w}}$ learned on a dataset $S$ can be expressed as $\boldsymbol{\mathbf{w}}_{1}+\boldsymbol{\mathbf{w}}_{2}$ , where $\boldsymbol{\mathbf{w}}_{1}^{T}\boldsymbol{\mathbf{\phi}}(\boldsymbol{\mathbf{x}})$ dominates the output on most test inputs and induces a simple decision boundary. That is, it may be possible to apply uniform convergence on the function $\boldsymbol{\mathbf{w}}_{1}^{T}\boldsymbol{\mathbf{\phi}}(\boldsymbol{\mathbf{x}})$ to obtain a small generalization bound. On the other hand, $\boldsymbol{\mathbf{w}}_{2}$ corresponds to meaningless signals that gradient descent gathered from the high-dimensional representation of the training set $S$ . Crucially, these signals would be specific to $S$ , and hence not likely to correlate with most of the test data i.e., $\boldsymbol{\mathbf{w}}_{2}\phi(\boldsymbol{\mathbf{x}})$ would be negligible on most test data, thereby not affecting the generalization error significantly. However, $\boldsymbol{\mathbf{w}}_{2}\phi(\boldsymbol{\mathbf{x}})$ can still create complex fluctuations on the boundary, in low-probability regions of the input space (whose locations would depend on $S$ , like in our examples). As we argued, this can lead to failure of uniform convergence. Perhaps, existing works that have achieved strong uniform convergence bounds on modified networks, may have done so by implicitly suppressing $\boldsymbol{\mathbf{w}}_{2}$ , either by compression, optimization or stochasticization. Revisiting these works may help verify our conjecture.

Conclusion and Future Work

A growing variety of uniform convergence based bounds have sought to explain generalization in deep learning. While these may provide partial intuition about the puzzle, we ask a critical, high level question: by pursuing this broad direction, is it possible to achieve the grand goal of a small generalization bound that shows appropriate dependence on the sample size, width, depth, label noise, and batch size? We cast doubt on this by first, empirically showing that existing bounds can surprisingly increase with training set size for small batch sizes. We then presented example setups, including that of a ReLU neural network, for which uniform convergence provably fails to explain generalization, even after taking implicit bias into account.

Future work in understanding implicit regularization in deep learning may be better guided with our knowledge of the sample-size-dependence in the weight norms. To understand generalization, it may also be promising to explore other learning-theoretic techniques like, say, algorithmic stability ; our linear setup might also inspire new tools. Overall, through our work, we call for going beyond uniform convergence to fully explain generalization in deep learning.

Vaishnavh Nagarajan is supported by a grant from the Bosch Center for AI.

References

Appendix A Summary of existing generalization bounds.

In this section, we provide an informal summary of the properties of (some of the) existing generalization bounds for ReLU networks in Table 1.

Appendix B More Experiments

In this section, we present more experiments along the lines of what we presented in Section 2.

Recall that in the main paper, we show how the distance from initialization and the product of spectral norms vary with $m$ for network with six layers. In Figure 3, we show how the terms grow with sample size $m$ for each layer individually. Our main observation is that the first layer suffers from the largest dependence on $m$ .

Distance between trajectories of shuffled datasets grows with m𝑚m.

In the main paper, we saw that the distance between the solutions learned on different draws of the dataset grow substantially with $m$ . In Figure 5 (left), we show that even the distance between the solutions learned on the same draw, but a different shuffling of the dataset grows substantially with $m$ .

Flat minima

We note a similar kind of paradox concerning noise in training. Specifically, it is intriguing that on one hand, generalization is aided by larger learning rates and smaller batch sizes due to increased noise in SGD. On the other, theoretical analyses benefit from the opposite; Allen-Zhu et al. even explicitly regularize SGD for their three-layer-network result to help “forget false information” gathered by SGD. In other words, it seems that noise aids generalization, yet hinders attempts at explaining generalization. The intuition from our examples (such as the linear example) is that such “false information” could provably impair uniform convergence without affecting generalization.

Frobenius norms grow with m𝑚m when m≫hmuch-greater-than𝑚ℎm\gg h.

Some bounds like depend on the Frobenius norms of the weight matrices (or the distance from origin), which as noted in are in fact width-dependent, and grow as $\Omega(\sqrt{h})$ . However, even these terms do grow with the number of samples in the regime where $m$ is larger than $h$ . In Figure 5, we report the total distance from origin of the learned parameters for a network with $h=256$ (we choose a smaller width to better emphasize the growth of this term with $m$ ); here, we see that for $m>8192$ , the distance from origin grows at a rate of $\Omega(m^{0.42})$ that is quite similar to what we observed for distance from initialization.

Even a relaxed notion of margin does not address the m𝑚m-dependency.

Recall that in the main paper, we computed the generalization error bound in Equation 1 by setting $\gamma$ to be $\gamma^{\star}$ , the margin achieved by the network on at least $99\%$ of the data. One may hope that by choosing a larger value of $\gamma$ , this bound would become smaller, and that the $m$ -dependence may improve. We consider this possibility by computing the median margin of the network over the training set (instead of the $1\%$ -percentile’th margin) and substituting this in the second term in the right hand side of the guarantee in Equation 1. By doing this, the first margin-based train error term in the right hand side of Equation 1 would simplify to $0.5$ (as half the training data are misclassified by this large margin). Thereby we already forgo an explanation of half of the generalization behavior. At least we could hope that the second term no longer grows with $m$ . Unfortunately, we observe in Figure 6 (left) that the bounds still grow with $m$ . This is because, as shown in Figure 6 (right), the median margin value does not grow as fast with $m$ as the numerators of these bounds grow.

Effect of depth.

We observed that as the network gets shallower the bounds show better dependence with $m$ . As an extreme case, we consider a network with only one hidden layer, and with $h=50000$ . Here we also present a third bound, namely that of Neyshabur et al. , besides the two bounds discussed in the main paper. Specifically, if $Z_{1},Z_{2}$ are the random initializations of the weight matrices in the network, the generalization error bound (the last term in Equation 1) here is of the following form, ignoring log factors:

The first term here is meant to be width-independent, while the second term clearly depends on the width and does decrease with $m$ at the rate of $m^{-0.5}$ . Hence, in our plots in Figure 7, we only focus on the first term. We see that these bounds are almost constant and decrease at a minute rate of $\Omega(m^{-0.066})$ while the test errors decrease much faster, at the rate of $\mathcal{O}(m^{-0.35})$ .

Effect of width.

In Figure 8, we demonstrate that our observation that the bounds increase with $m$ extends to widths $h=128$ and $h=2000$ too.

B.1 Effect of batch size

In Figure 9, we show how the bounds vary with the batch size for a fixed sample size of $16384$ . It turns out that even though the test error decreases with decreasing batch size (for our fixed stopping criterion), all these bounds increase (by a couple of orders of magnitude) with decreasing batch size. Again, this is because the terms like distance from initialization increase for smaller batch sizes (perhaps because of greater levels of noise in the updates). Overall, existing bounds do not reflect the same behavior as the actual generalization error in terms of their dependence on the batch size.

Bounds vs. m𝑚m for batch size of 323232.

In the main paper, we only dealt with a small batch size of $1$ . In Figure 10, we show bounds vs. sample size plots for a batch size of $32$ . We observe that in this case, the bounds do decrease with sample size, although only at a rate of $\mathcal{O}(m^{-0.23})$ which is not as fast as the observed decrease in test error which is $\Omega(m^{-0.44})$ . Our intuition as to why the bounds behave better (in terms of $m$ -dependence) in the larger batch size regime is that here the amount of noise in the parameter updates is much less compared to smaller batch sizes (and as we discussed earlier, uniform convergence finds it challenging to explain away such noise).

Squared error loss.

All the experiments presented so far deal with the cross-entropy loss, for which the optimization procedure ideally diverges to infinity; thus, one might suspect that our results are sensitive to the stopping criterion. It would therefore be useful to consider the squared error loss where the optimum on the training loss can be found in a finite distance away from the random initialization. Specifically, we consider the case where the squared error loss between the outputs of the network and the one-hot encoding of the true labels is minimized to a value of $0.05$ on average over the training data.

We observe in Figure 11 that even for this case, the distance from initialization and the spectral norms grow with the sample size at a rate of at least $m^{0.3}$ . On the other hand, the test error decreases with sample size as $1/m^{0.38}$ , indicating that even for the squared error loss, these terms hurt would hurt the generalization bound with respect to its dependence on $m$ .

Appendix C Pseudo-overfitting

Recall that in the main paper, we briefly discussed a new notion that we call as pseudo-overfitting and noted that it is not the reason behind why some techniques lead to vacuous generalization bounds. We describe this in more detail here. We emphasize this discussion because i) it brings up a fundamental and so far unknown issue that might potentially exist in current approaches to explaining generalization and ii) rules it out before making more profound claims about uniform convergence.

Our argument specifically applies to margin-based Rademacher complexity approaches (such as Bartlett et al. , Neyshabur et al. ). These result in a bound like in Equation 1 that we recall here:

These methods upper bound the uniform convergence bound on the $\mathcal{L}^{(\gamma)}$ error on the network in terms of a uniform convergence bound on the margins of the network (see for more details about margin theory of Rademacher complexity). The resulting generalization error bound in Equation 1 would take the following form, as per our notation from Definition 3.3:

This particular upper bound on the generalization gap in the $\mathcal{L}^{(\gamma)}$ loss is also an upper bound on the generalization gap on the margins. That is, with high probability $1-\delta$ over the draws of $S$ , the above bound is larger than the following term that corresponds to the difference in test/train margins:

We first argue that it is possible for the generalization error of the algorithm to decrease with $m$ (as roughly $m^{-0.5}$ ), but for the above quantity to be independent of $m$ . As a result, the margin-based bound in Equation 2 (which is larger than Equation 3) will be non-decreasing in $m$ , and even vacuous. Below we describe such a scenario.

Consider a network that first learns a simple hypothesis to fit the data, say, by learning a simple linear input-output mapping on linearly separable data. But subsequently, the classifier proceeds to pseudo-overfit to the samples by skewing up (down) the real-valued output of the network by some large constant $\Delta$ in a tiny neighborhood around the positive (negative) training inputs. Note that this would be possible if and only if the network is overparameterized. Now, even though the classifier’s real-valued output is skewed around the training data, the decision boundary is still linear as the sign of the classifier’s output has not changed on any input. Thus, the boundary is still simple and linear and the generalization error small.

However, the training margins are at least a constant $\Delta$ larger than the test margins (which are not affected by the bumps created in tiny regions around the training data). Then, the term in Equation 3 would be larger than $\Delta/\gamma$ and as a result, so would the term in Equation 2. Now in the generalization guarantee of Equation 1, recall that we must pick a value of $\gamma$ such that the first term is low i.e., most of the training datapoints must be classified by at least $\gamma$ margin. In this case, we can at best let $\gamma\approx\Delta$ as any larger value of $\gamma$ would make the margin-based training error non-negligible; as a result of this choice of $\gamma$ , the bound in Equation 3 would be an $m$ -independent constant close to $1$ . The same would also hold for its upper bound in Equation 2, which is the generalization bound provided by the margin-based techniques.

Clearly, this is a potential fundamental limitation in existing approaches, and if deep networks were indeed pseudo-overfitting this way, we would have identified the reason why at least some existing bounds are vacuous. However, (un)fortunately, we rule this out by observing that the difference in the train and test margins in Equation 3 does decrease with training dataset size $m$ (see Figure 12) as $\mathcal{O}(m^{-0.33})$ . Additionally, this difference is numerically much less than $\gamma^{\star}=10$ (which is the least margin by which $99\%$ of the training data is classified) as long as $m$ is large, implying that Equation 3 is non-vacuous.

It is worth noting that the generalization error decreases at a faster rate of $\mathcal{O}(m^{-0.43})$ implying that the upper bound in Equation 3 which decreases only as $m^{-0.33}$ , is loose. This already indicates a partial weakness in this specific approach to deriving generalization guarantees. Nevertheless, even this upper bound decreases at a significant rate with $m$ which the subsequent uniform convergence-based upper bound in Equation 2 is unable to capture, thus hinting at more fundamental weaknesses specific to uniform convergence.

Before we wrap up this section, we discuss a question brought up by an anonymous reviewer, which we believe is worth addressing. Recall that in Section 3.1 and Section 3.2, we presented a linear and hypersphere classification task where we showed that uniform convergence provably fails. In light of the above discussion, one may be tempted to ask: do these two models fail to obey uniform convergence because of pseudo-overfitting?

The answer to this is that our proof for failure of uniform convergence in both these examples did not rely on any kind of pseudo-overfitting – had our proof relied on it, then we would have been able to show failure of only specific kinds of uniform convergence bounds (as discussed above). More formally, pseudo-overfitting in itself does not imply the lower bounds on $\epsilon_{{\textrm{\tiny unif-alg}}}$ that we have shown in these settings.

One may still be curious to understand the level of pseudo-overfitting in these examples, to get a sense of the similarity of this scenario with that of the MNIST setup. To this end, we note that our linear setup does indeed suffer from significant pseudo-overfitting – the classifier’s output does indeed have bumps around each training point (which can be concluded from our proof).

In the case of the hypersphere example, we present Figure 13, where we plot of the average margins in this setup like in Figure 12. Here, we observe that, the mean margins on the test data (orange line) and on training data (blue line) do converge to each other with more training data size $m$ i.e., the gap in the mean test and training margins (green line) does decrease with $m$ . Thus our setup exhibits a behavior similar to deep networks on MNIST in Figure 12. As noted in our earlier discussion, since the rate of decrease of the mean margin gap in MNIST is not as large as the decrease in test error itself, there should be “a small amount” of psuedo-overfitting in MNIST. The same holds in this setting, although, here we observe an even milder decrease, implying a larger amount of pseudo-overfitting. Nevertheless, we emphasize that, our proof shows that uniform convergence cannot capture even this decrease with $m$ .

To conclude, pseudo-overfitting is certainly a phenomenon worth exploring better; however, our examples elucidate that there is a phenomenon beyond pseudo-overfitting that is at play in deep learning.

Appendix D Useful Lemmas

In this section, we state some standard results we will use in our proofs. We first define some constants: $c_{1}=1/2048$ , $c_{2}=\sqrt{15/16}$ and $c_{3}=\sqrt{17/16}$ and $c_{4}=\sqrt{2}$ .

First, we state a tail bound for sub-exponential random variables .

For a sub-exponential random variable $X$ with parameters $(\nu,b)$ and mean $\mu$ , for all $t>0$ ,

As a corollary, we have the following bound on the sum of squared normal variables:

For $z_{1},z_{2},\ldots,z_{D}\sim\mathcal{N}(0,1)$ , we have that

We now state the Hoeffding bound for sub-Gaussian random variable.

Let $z_{1},z_{2},\ldots,z_{D}$ be independently drawn sub-Gaussian variables with mean and sub-gaussian parameter $\sigma_{i}$ . Then,

Appendix E Proof for Theorem 3.1

In this section, we prove the failure of uniform convergence for our linear model. We first recall the setup:

Learning algorithm $\mathcal{A}$ : We consider a linear classifier with weights $\boldsymbol{\mathbf{w}}=(\boldsymbol{\mathbf{w}}_{1},\boldsymbol{\mathbf{w}}_{2})$ . The output is computed as $h(\boldsymbol{\mathbf{x}})=\boldsymbol{\mathbf{w}}_{1}\boldsymbol{\mathbf{x}}_{1}+\boldsymbol{\mathbf{w}}_{2}\boldsymbol{\mathbf{x}}_{2}$ . Assume the weights are initialized to origin. Given $S=\{(\boldsymbol{\mathbf{x}}^{(1)},y^{(1)}),\ldots,(\boldsymbol{\mathbf{x}}^{(m)},y^{(m)})\}$ , $\mathcal{A}$ takes a gradient step of learning rate $1$ to maximize $y\cdot h(\boldsymbol{\mathbf{x}})$ for each $(\boldsymbol{\mathbf{x}},y)\in S$ . Regardless of the batch size, the learned weights would satisfy, $\boldsymbol{\mathbf{w}}_{1}=2m\boldsymbol{\mathbf{u}}$ and $\boldsymbol{\mathbf{w}}_{2}=\sum_{i}y^{(i)}\boldsymbol{\mathbf{x}}_{2}^{(i)}$ .

Below, we state the precise theorem statement (where we’ve used the constants $c_{1}=1/32$ , $c_{2}=1/2$ and $c_{3}=3/2$ and $c_{4}=\sqrt{2}$ ):

Theorem 3.1 In the setup above, for any $\epsilon,\delta>0$ and $\delta<1/4$ , let $D$ be sufficiently large that it satisfies

then we have that for all $\gamma\geq 0$ , for the $\mathcal{L}^{(\gamma)}$ loss, $\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon_{\text{gen}}(m,\delta)$ .

Specifically, for $\gamma\in$ , $\epsilon_{\text{gen}}(m,\delta)\leq\epsilon$ , and so $\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon$ .

The above follows from Lemma E.1, where we upper bound the generalization error, and from Lemma E.2 where we lower bound uniform convergence. ∎

We first prove that the above algorithm generalizes well with respect to the losses corresponding to $\gamma\in$ . First for the training data, we argue that both $\boldsymbol{\mathbf{w}}_{1}$ and a small part of the noise vector $\boldsymbol{\mathbf{w}}_{2}$ align along the correct direction, while the remaining part of the high-dimensional noise vector are orthogonal to the input; this leads to correct classification of the training set. Then, on the test data, we argue that $\boldsymbol{\mathbf{w}}_{1}$ aligns well, while $\boldsymbol{\mathbf{w}}_{2}$ contributes very little to the output of the classifier because it is high-dimensional noise. As a result, for most test data, the classification is correct, and hence the test and generalization error are both small.

In the setup of Section 3, when $\gamma\in$ , for $\mathcal{L}^{(\gamma)}$ , $\epsilon_{\text{gen}}(m,\delta)\leq\epsilon$ .

The parameters learned by our algorithm satisfies $\boldsymbol{\mathbf{w}}_{1}=2m\cdot\boldsymbol{\mathbf{u}}$ and $\boldsymbol{\mathbf{w}}_{2}=\sum y^{(i)}\boldsymbol{\mathbf{x}}_{2}^{(i)}\sim\mathcal{N}(0,\frac{8m}{c_{2}^{2}D})$ .

First, we have from Corollary D.1.1 that with probability $1-\frac{\delta}{3m}$ over the draws of $\boldsymbol{\mathbf{x}}_{2}^{(i)}$ , as long as $\frac{\delta}{3m}\geq 2e^{-c_{1}D}$ (which is given to hold by Equation 4),

Next, for a given $\boldsymbol{\mathbf{x}}^{(i)}$ , we have from Corollary D.2.1, with probability $1-\frac{\delta}{3m}$ over the draws of $\sum_{j\neq i}y^{(j)}\boldsymbol{\mathbf{x}}_{2}^{(j)}$ ,

Then, with probability $1-\frac{2}{3}\delta$ over the draws of the training dataset we have for all $i$ ,

Thus, for all $\gamma\in$ , the $\mathcal{L}^{(\gamma)}$ loss of this classifier on the training dataset $S$ is zero.

Now, from Corollary D.1.1, with probability $1-\frac{\delta}{3}$ over the draws of the training data, we also have that, as long as $\frac{\delta}{3m}\geq 2e^{-c_{1}D}$ (which is given to hold by Equation 4),

Next, conditioned on the draw of $S$ and the learned classifier, for any $\epsilon^{\prime}>0$ , with probability $1-\epsilon^{\prime}$ over the draws of a test data point, $(\boldsymbol{\mathbf{z}},y)$ , we have from Corollary D.2.1 that

Using this, we have that with probability $1-2\exp\left(-\frac{1}{2}\left({\frac{c_{2}^{2}}{4c_{4}c_{3}}\sqrt{\frac{D}{m}}}\right)^{2}\right)$ over the draws of a test data point, $(\boldsymbol{\mathbf{z}},y)$ ,

Thus, we have that for $\gamma\in$ , the $\mathcal{L}^{(\gamma)}$ loss of the classifier on the distribution $\mathcal{D}$ is $2\exp\left(-\frac{1}{2}\left({\frac{c_{2}^{2}}{4c_{4}c_{3}}\sqrt{\frac{D}{m}}}\right)^{2}\right)$ which is at most $\epsilon$ as assumed in Equation 6. In other words, the absolute difference between the distribution loss and the train loss is at most $\epsilon$ and this holds for at least $1-\delta$ draws of the samples $S$ . Then, by the definition of $\epsilon_{\text{gen}}$ we have the result.

We next prove our uniform convergence lower bound. The main idea is that when the noise vectors in the training samples are negated, with high probability, the classifier misclassifies the training data. We can then show that for any choice of $\mathcal{S}_{\delta}$ as required by the definition of $\epsilon_{\text{unif-alg}}$ , we can always find an $S_{\star}$ and its noise-negated version $S_{\star}^{\prime}$ both of which belong to $\mathcal{S}_{\delta}$ . Furthermore, we can show that $h_{S_{\star}}$ has small test error but high empirical error on $S_{\star}^{\prime}$ , and that this leads to a nearly vacuous uniform convergence bound.

In the setup of Section 3, for any $\epsilon>0$ and for any $\delta\leq 1/4$ , and for the same lower bounds on $D$ , and for any $\gamma\geq 0$ , we have that

For any $S$ , let $S^{\prime}$ denote the set of noise-negated samples $S^{\prime}=\{((\boldsymbol{\mathbf{x}}_{1},-\boldsymbol{\mathbf{x}}_{2}),u)\;|\;((\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2}),y)\in S\}$ . We first show with high probability $1-2\delta/3$ over the draws of $S$ , that the classifier learned on $S$ , misclassifies $S^{\prime}$ completely. The proof for this is nearly identical to our proof for why the training loss is zero, except for certain sign changes. For any $\boldsymbol{\mathbf{x}}_{\text{neg}}^{(i)}=(\boldsymbol{\mathbf{x}}_{1}^{(i)},-\boldsymbol{\mathbf{x}}_{2}^{(i)})$ , we have

Since the learned hypothesis misclassifies all of $S^{\prime}$ , it has loss of $1$ on $S^{\prime}$ .

Now recall that, by definition, to compute $\epsilon_{\text{unif-alg}}$ , one has to pick a sample set space $\mathcal{S}_{\delta}$ of mass $1-\delta$ i.e., $Pr_{S\sim\mathcal{S}^{m}}[S\in\mathcal{S}_{\delta}]\geq 1-\delta$ . We first argue that for any choice of $\mathcal{S}_{\delta}$ , there must exist a ‘bad’ $S_{\star}$ such that (i) $S_{\star}\in\mathcal{S}_{\delta}$ , (ii) $S_{\star}^{\prime}\in\mathcal{S}_{\delta}$ , (iii) $h_{S_{\star}}$ has test error less than $\epsilon_{\text{gen}}(m,\delta)$ and (iv) $h_{S_{\star}}$ completely misclassifies $S_{\star}^{\prime}$ .

We show the existence of such an $S_{\star}$ , by arguing that over the draws of $S$ , there is non-zero probability of picking an $S$ that satisfies all the above conditions. Specifically, we have by the union bound that

By definition of $\mathcal{S}_{\delta}$ , we know $Pr_{S\sim\mathcal{D}^{m}}\left[S\notin\mathcal{S}_{\delta}\right]\leq\delta$ . Similarly, by definition of the generalization error, we know that $Pr_{S\sim\mathcal{D}^{m}}\left[\mathcal{L}_{\mathcal{D}}(h_{S})>\epsilon_{\text{gen}}(m,\delta)\right]\leq\delta$ . We have also established above that $Pr_{S\sim\mathcal{D}^{m}}\left[\hat{\mathcal{L}}_{S^{\prime}}(h_{S})\neq 1\right]\leq 2\delta/3$ . As for the term $Pr_{S\sim\mathcal{D}^{m}}\left[S^{\prime}\notin\mathcal{S}_{\delta}\right]$ , observe that under the draws of $S$ , the distribution of the noise-negated dataset $S^{\prime}$ is identical to $\mathcal{D}^{m}$ . This is because the isotropic Gaussian noise vectors have the same distribution under negation. Hence, again by definition of $\mathcal{S}_{\delta}$ , even this probability is at most $\delta$ . Thus, we have that the probability in the left hand side of Equation 13 is at least $1-4\delta$ , which is positive as long as $\delta<1/4$ .

This implies that for any given choice of $\mathcal{S}_{\delta}$ , there exists $S_{\star}$ that satisfies our requirement. Then, from the definition of $\epsilon_{\text{unif-alg}}(m,\delta)$ , we essentially have that,

Appendix F Neural Network with Exponential Activations

In this section, we prove the failure of uniform convergence for a neural network model with exponential activations. We first define the setup.

Let $\boldsymbol{\mathbf{u}}$ be an arbitrary vector in $D$ dimensional space such that $\|\boldsymbol{\mathbf{u}}\|=\sqrt{D}/2$ . Consider an input distribution in $2D$ dimensional space such that, conditioned on the label $y$ drawn from uniform distribution over $\{-1,+1\}$ , the first $D$ dimensions $\boldsymbol{\mathbf{x}}_{1}$ of a random point is given by $y\boldsymbol{\mathbf{u}}$ and the remaining $D$ dimensions are drawn from $\mathcal{N}(0,1)$ . Note that in this section, we require $D$ to be only as large as $\ln m$ , and not as large as $m$ .

Architecture.

Algorithm

where $p(\boldsymbol{\mathbf{w}})$ equals the p.d.f of $\boldsymbol{\mathbf{w}}$ under the distribution it is drawn from. In this case $p(\boldsymbol{\mathbf{w}})=\frac{1}{(2\pi)^{D}}\exp\left(-\frac{\|\boldsymbol{\mathbf{w}}\|^{2}}{2}\right)$ .

In order to simplify our calculations we will set $\eta=(4\pi)^{D}$ , although our analysis would extend to other values of the learning rate too. Similarly, our results would only differ by constants if we consider the alternative update rule, $a_{\boldsymbol{\mathbf{w}}}\leftarrow a_{\boldsymbol{\mathbf{w}}}+\eta y\cdot\exp(\boldsymbol{\mathbf{w}}\cdot\boldsymbol{\mathbf{x}})$ .

We now state our main theorem (in terms of constants $c_{1},c_{2},c_{3},c_{4}$ defined in Section D).

In the set up above, for any $\epsilon,\delta>0$ and $\delta<1/4$ , let $D$ and $m$ be sufficiently large that it satisfies

then we have that for all $\gamma\geq 0$ , for the $\mathcal{L}^{(\gamma)}$ loss, $\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon_{\text{gen}}(m,\delta)$ .

Specifically, for $\gamma\in$ , $\epsilon_{\text{gen}}(m,\delta)\leq\epsilon$ , and so $\epsilon_{\text{unif-alg}}(m,\delta)\geq 1-\epsilon$ .

The result follows from the following lemmas. First in Lemma F.2, we derive the closed form expression for the function computed by the learned network. In Lemma F.3, we upper bound the generalization error and in Lemma F.4, we lower bound uniform convergence. ∎

We first derive a closed form expression for how the output of the network changes under a gradient descent step on a particular datapoint.

Let $h^{(0)}(\cdot)$ denote the function computed by the network before updating the weights. After updating the weights on a particular input $(\boldsymbol{\mathbf{x}},y)$ according to Equation 14, the learned network corresponds to:

Next, we argue that the generalization error of the algorithm is small. From Lemma F.2, we have that the output of the network is essentially determined by a summation of contributions from every training point. To show that the training error is zero, we argue that on any training point, the contribution from that training point dominates all other contributions, thus leading to correct classification. On any test point, we similarly show that the contribution of training points of the same class as that test point dominates the output of the network. Note that our result requires $D$ to scale only logarithmically with training samples $m$ .

In the setup of Section 3, when $\gamma\in$ , for $\mathcal{L}^{(\gamma)}$ , $\epsilon_{\text{gen}}(m,\delta)\leq\epsilon$ .

We first establish a few facts that hold with high probability over the draws of the training set $S$ . First, from Corollary D.1.1 we have that, since $D\geq\frac{1}{c_{2}}\ln\frac{3m}{\delta}$ (from Equation 15), with probability at least $1-\delta/3$ over the draws of $S$ , for all $i$ , the noisy part of each training input can be bounded as

Next, from Corollary D.2.1, we have that with probability at least $1-\frac{\delta}{3m^{2}}$ over the draws of $\boldsymbol{\mathbf{x}}_{2}^{(i)}$ and $\boldsymbol{\mathbf{x}}_{2}^{(j)}$ for $i\neq j$ ,

Then, by a union bound, the above two equations hold for all $i\neq j$ with probability at least $1-\delta/2$ .

Next, since each $y^{(i)}$ is essentially an independent sub-Gaussian with mean and sub-Gaussian parameter $\sigma=1$ , we can apply Hoeffding’s bound (Lemma D.2) to conclude that with probability at least $1-\delta/3$ over the draws of $S$ ,

Note that this means that there must exist at least one training data in each class.

Given these facts, we first show that the training error is zero by showing that for all $i$ , $y^{(i)}h(\boldsymbol{\mathbf{x}}^{(i)})$ is sufficiently large. On any training input $(\boldsymbol{\mathbf{x}}^{(i)},y^{(i)})$ , using Lemma F.2, we can write

Now, for any $j$ such that $y^{(j)}\neq y^{(i)}$ , we have that

Plugging this back in the previous equation we have that

Hence, $\boldsymbol{\mathbf{x}}^{(i)}$ is correctly classified by a margin of $1$ for every $i$ .

Now consider any test data point $(\boldsymbol{\mathbf{z}},y)$ . Since $D\geq\frac{1}{c_{2}}\ln\frac{2}{\epsilon}$ (Equation 16), we have that with probability at least $1-\epsilon/2$ over the draws of $\boldsymbol{\mathbf{z}}_{2}$ , by Corollary D.1.1

Similarly, for each $i$ , we have that with probability at least $1-\epsilon/2m$ over the draws of $\boldsymbol{\mathbf{z}}$ , the following holds good by Corollary D.2.1

Hence, the above holds over at least $1-\epsilon/2$ draws of $\boldsymbol{\mathbf{z}}$ , and by extension, both the above equations hold over at least $1-\epsilon$ draws of $\boldsymbol{\mathbf{z}}$ .

Now, for any $i$ such that $y^{(i)}=y$ , we have that

Similarly, for any $i$ such that $y^{(i)}\neq y$ , we have that

Since from Equation 21 we know there exists at least one training sample with a given label, we have that

Thus, at least $1-\epsilon$ of the test datapoints are classified correctly.

We next show that the uniform convergence bound is nearly vacuous. In order to do this, we create a set $S^{\prime}$ from $S$ by negating all values but the noise vector. We then show that for every point in $S^{\prime}$ , the contribution from the corresponding point in $S$ dominates over the contribution from all other points. (This is because of how the non-negated noise vector in the point from $S^{\prime}$ aligns adversarially with the noise vector from the corresponding point in $S$ ). As a result, the points in $S^{\prime}$ are all labeled like in $S$ , implying that $S^{\prime}$ is completely misclassified. Then, similar to our previous arguments, we can show that uniform convergence is nearly vacuous.

In the setup of Section F, for any $\epsilon>0$ and for any $\delta\leq 1/4$ , and for the same lower bounds on $D$ and $m$ as in Theorem F.1, and for any $\gamma\geq 0$ , we have that

Let $S^{\prime}$ be a modified version of the training set where all values are negated except that of the noise vectors i.e., $S^{\prime}=\{((-\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2}),-y)\;|\;((\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2}),y)\in S\}$ . First we show that with probability at least $1-2\delta/3$ over the draws of $S$ , $S^{\prime}$ is completely misclassified. First, we have that with probability $1-2\delta/3$ , Equations 19 and 20 hold good. Let $(\boldsymbol{\mathbf{x}}_{\textrm{neg}}^{(i)},y_{\textrm{neg}}^{(i)})$ denote the $i$ th sample from $S^{\prime}$ . Then, we have that

Now, consider $j$ such that $y^{(j)}=y_{\textrm{neg}}^{(i)}$ . we have that

Plugging the above back in Equation 24, we have

implying that $\boldsymbol{\mathbf{x}}_{\textrm{neg}}^{(i)}$ is misclassified. This holds simultaneously for all $i$ , implying that $S^{\prime}$ is misclassified with high probability $1-2\delta/3$ over the draws of $S$ . Furthermore, $S^{\prime}$ has the same distribution as $\mathcal{D}^{m}$ . Then, by the same argument as that of Lemma E.2, we can prove our final claim.

Appendix G Further Remarks.

In this section, we make some clarifying remarks about our theoretical results.

Typically, like in Mohri et al. , Bartlett et al. , the 0-1 test error is upper bounded in terms of the $\mathcal{L}^{(\gamma)}$ test error for some optimal choice of $\gamma>0$ (as it is easier to apply uniform convergence for $\gamma>0$ ). From the result in the main paper, it is obvious that for $\gamma\leq 1$ , this approach would yield vacuous bounds. We now establish that this is the case even for $\gamma>1$ .

To help state this more clearly, for the scope of this particular section, let $\epsilon^{(\gamma)}_{\textrm{unif-alg}},\epsilon^{(\gamma)}_{\textrm{gen}}$ denote the uniform convergence and generalization error for $\mathcal{L}^{(\gamma)}$ loss. Then, the following inequality is used to derive a bound on the 0-1 error:

where the second inequality above holds with probability at least $1-\delta$ over the draws of $S$ , while the first holds for all $S$ (which follows by definition of $\mathcal{L}^{(\gamma)}$ and $\mathcal{L}^{(0)}$ ).

To establish that uniform convergence is nearly vacuous in any setting of $\gamma$ , we must show that the right hand side of the above bound is nearly vacuous for any choice of $\gamma\geq 0$ (despite the fact that $\mathcal{L}_{\mathcal{D}}^{(0)}(S)\leq\epsilon$ ). In our results, we explicitly showed this to be true for only small values of $\gamma$ , by arguing that the second term in the R.H.S, namely $\epsilon^{(\gamma)}_{\textrm{unif-alg}}(m,\delta)$ , is nearly vacuous.

Below, we show that the above bound is indeed nearly vacuous for any value of $\gamma$ , when we have that $\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1-\epsilon_{\text{gen}}^{(\gamma)}(m,\delta)$ . Note that we established the relation $\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1-\epsilon_{\text{gen}}^{(\gamma)}(m,\delta)$ to be true in all of our setups.

Given that for all $\gamma\geq 0$ , $\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1-\epsilon_{\text{gen}}^{(\gamma)}(m,\delta)$ then, we then have that for all $\gamma\geq 0$ ,

or in other words, the guarantee from the right hand side of Equation 25 is nearly vacuous.

Assume on the contrary that for some choice of $\gamma$ , we are able to show that with probability at least $1-\delta$ over the draws of $S$ , the right hand side of Equation 25 is less than $1/2$ . This means that $\epsilon^{(\gamma)}_{\textrm{unif-alg}}(m,\delta)<1/2$ . Furthermore, this also means that with probability at least $1-\delta$ over the draws of $S$ , $\hat{\mathcal{L}}^{(\gamma)}_{S}(h_{S})<1/2$ and $\mathcal{L}^{(\gamma)}_{\mathcal{D}}(h_{S})<1/2$ (which follows from the second inequality in Equation 25).

As a result, we have that with probability at least $1-\delta$ , $\mathcal{L}^{(\gamma)}_{\mathcal{D}}(h_{S})-\hat{\mathcal{L}}^{(\gamma)}_{S}(h_{S})<1/2$ . In other words, $\epsilon_{\textrm{gen}}^{(\gamma)}(m,\delta)<1/2$ . Since we are given that $\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1-\epsilon_{\text{gen}}^{(\gamma)}(m,\delta)$ , by our upper bound on the generalization error, we have $\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)\geq 1/2$ , which is a contradiction to our earlier inference that $\epsilon_{\text{unif-alg}}^{(\gamma)}(m,\delta)<1/2$ . Hence, our assumption is wrong.

G.2 Applicability of the observation in Section 3.2 to other settings

Recall that in the main paper, we discussed a setup where two hyperspheres of radius $1$ and $1.1$ respectively are classified by a sufficiently overparameterized ReLU network. We saw that even when the number of training examples was as large as $65536$ , we could project all of these examples on to the other corresponding hypersphere, to create a completely misclassified set $S^{\prime}$ . How well does this observation extend to other hyperparameter settings?

First, we note that in order to achieve full misclassification of $S^{\prime}$ , the network would have to be sufficiently overparameterized i.e., either the width or the input dimension must be larger. When the training set size $m$ is too large, one would observe that $S^{\prime}$ is not as significantly misclassified as observed. (Note that on the other hand, increasing the parameter count would not hurt the generalization error. In fact it would improve it.)

Second, we note that our observation is sensitive to the choice of the difference in the radii between the hyperspheres (and potentially to other hyperparameters too). For example, when the outer sphere has radius $2$ , SGD learns to classify these spheres perfectly, resulting in zero error on both test data and on $S^{\prime}$ . As a result, our lower bound on $\epsilon_{\text{unif-alg}}$ would not hold in this setting.

However, here we sketch a (very) informal argument as to why there is reason to believe that our lower bound can still hold on a weaker notion of uniform convergence, a notion that is always applied in practice (in the main paper we focus on a strong notion of uniform convergence as a negative result about it is more powerful). More concretely, in reality, uniform convergence is computed without much knowledge about the data distribution, save a few weakly informative assumptions such as those bounding its support. Such a uniform convergence bound is effectively computed uniformly in supremum over a class of distributions.

Going back to the hypersphere example, the intuition is that even when the radii of the spheres are far apart, and hence, the classification perfect, the decision boundary learned by the network could still be microscopically complex – however these complexities are not exaggerated enough to misclassify $S^{\prime}$ . Now, for this given decision boundary, one would be able to construct an $S^{\prime\prime}$ which corresponds to projecting $S$ on two concentric hyperspheres that fall within these skews. Such an $S^{\prime\prime}$ would have a distribution that comes from some $\mathcal{D}^{\prime}$ which, although not equal to $\mathcal{D}$ , still obeys our assumptions about the underlying distribution. The uniform convergence bound which also holds for $\mathcal{D^{\prime}}$ would thus have to be vacuous.

As seen in the proof of Lemma E.1, the generalization error $\epsilon$ depends on $m$ and $D$ as $\mathcal{O}(e^{-D/m})$ ignoring some constants in the exponent. Clearly, this error decreases with the parameter count $D$ .

On the other hand, one may also observe that this generalization error grows with the number of samples $m$ , which might at first make this model seem inconsistent with our real world observations. However, we emphasize that this is a minor artefact of the simplifications in our setup, rather than a conceptual issue. With a small modification to our setup, we can make the generalization error decrease with $m$ , mirroring our empirical observations. Specifically, in the current setup, we learn the true boundary along the first $K$ dimensions exactly. We can however modify it to a more standard learning setup where the boundary is not exactly recoverable and needs to be estimated from the examples. This would lead to an additional generalization error that scales as $\mathcal{O}(\sqrt{\frac{K}{m}})$ that is non-vacuous as long as $K\ll m$ . Thus, the overall generalization error would be $\mathcal{O}(e^{-D/m}+\sqrt{\frac{K}{m}})$ .

What about the overall dependence on $m$ ? Now, assume we have an overparameterization level of $D\gg m\ln(m/K)$ , so that $e^{-D/m}\ll\sqrt{K/m}$ . Hence, in the sufficiently overparameterized regime, the generalization error $\mathcal{O}(e^{-D/m})$ that comes from the noise we have modeled, pales in comparison with the generalization error that would stem from estimating the low-complexity boundary. Overall, as a function of $m$ , the resulting error would behave like $\mathcal{O}(\sqrt{\frac{K}{m}})$ and hence show a decrease with increasing $m$ (as long the increase in $m$ is within the overparameterized regime).

G.4 Failure of hypothesis-dependent uniform convergence bounds.

Appendix H An abstract setup

That is, the learner misclassifies inputs that correspond to the negations of the samples in the training data – this would be possible if and only if the classifier is overparameterized with $\Omega(mD)$ parameters to store $S^{\prime}$ . We will show that uniform convergence fails to explain generalization for this learner.

First we establish that this learner generalizes well. Note that a given $S$ has zero probability mass under $\mathcal{D}$ , and so does $S^{\prime}$ . Then, the training and test error are zero – except for pathological draws of $S$ that intersect with $S^{\prime}$ , which are almost surely never drawn from $\mathcal{D}^{m}$ – and hence, the generalization error of $\mathcal{A}$ is zero too.

It might thus seem reasonable to expect that one could explain this generalization using implicit-regularization-based uniform convergence by showing $\epsilon_{\text{unif-alg}}(m,\delta)=0$ . Surprisingly, this is not the case as $\epsilon_{\text{unif-alg}}(m,\delta)$ is in fact $1$ !

First it is easy to see why the looser bound $\epsilon_{\text{unif}}(m,\delta)$ equals 1, if we let $\mathcal{H}$ be the space of all hypotheses the algorithm could output: there must exist a non-pathological $S\in\mathcal{S}_{\delta}$ , and we know that $h_{S^{\prime}}\in\mathcal{H}$ misclassifies the negation of its training set, namely $S$ . Then, $\sup_{h\in\mathcal{H}}|\mathcal{L}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)|=|\mathcal{L}_{\mathcal{D}}(h_{S^{\prime}})-\hat{\mathcal{L}}_{S}(h_{S^{\prime}})|=|0-1|=1$ .

One might hope that in the stronger bound of $\epsilon_{\text{unif-alg}}(m,\delta)$ since we truncate the hypothesis space, it is possible that the above adversarial situation would fall apart. However, with a more nuanced argument, we can similarly show that $\epsilon_{\text{unif-alg}}(m,\delta)=1$ . First, recall that any bound on $\epsilon_{\text{unif-alg}}(m,\delta)$ , would have to pick a truncated sample set space $\mathcal{S}_{\delta}$ . Consider any choice of $\mathcal{S}_{\delta}$ , and the corresponding set of explored hypotheses $\mathcal{H}_{\delta}$ . We will show that for any choice of $\mathcal{S}_{\delta}$ , there exists $S_{\star}\in\mathcal{S}_{\delta}$ such that (i) $h_{S_{\star}}$ has zero test error and (ii) the negated training set $S_{\star}^{\prime}$ belongs to $\mathcal{S}_{\delta}$ and (iii) $h_{S_{\star}}$ has error $1$ on $S_{\star}$ . Then, it follows that $\epsilon_{\text{unif-alg}}(m,\delta)=\sup_{S\in\mathcal{S}_{\delta}}\sup_{h\in\mathcal{H}_{\delta}}|{\mathcal{L}}_{\mathcal{D}}(h)-\hat{\mathcal{L}}_{S}(h)|\geq|{\mathcal{L}}_{\mathcal{D}}(h_{S_{\star}})-\hat{\mathcal{L}}_{S_{\star}^{\prime}}(h)|=|0-1|=1$ .

We can prove the existence of such an $S_{\star}$ by showing that the probability of picking one such set under $\mathcal{D}^{m}$ is non-zero for $\delta<1/2$ . Specifically, under $S\sim\mathcal{D}^{m}$ , we have by the union bound that

Since the pathological draws have probability zero, the first probability term on the right hand side is zero. The second term is at most $\delta$ by definition of $\mathcal{S}_{\delta}$ . Crucially, the last term too is at most $\delta$ because $S^{\prime}$ (which is the negated version of $S$ ) obeys the same distribution as $S$ (since the isotropic Gaussian is invariant to a negation). Thus, the above probability is at least $1-2\delta>0$ , implying that there exist (many) $S_{\star}$ , proving our main claim.

While our particular learner might seem artificial, much of this artificiality is only required to make the argument simple. The crucial trait of the learner that we require is that the misclassified region in the input space (i) covers low probability and yet (ii) is complex and highly dependent on the training set draw. Our intuition is that SGD-trained deep networks possess these traits.

Appendix I Learnability and Uniform Convergence

Below, we provide a detailed discussion on learnability, uniform convergence and generalization. Specifically, we argue why the fact that uniform convergence is necessary for learnability does not preclude the fact that uniform convergence maybe unable to explain generalization of a particular algorithm for a particular distribution.

We first recall the notion of learnability. First, formally, a binary classification problem consists of a hypothesis class $\mathcal{H}$ and an instance space $\mathcal{X}\times\{-1,1\}$ . The problem is said to be learnable if there exists a learning rule $\mathcal{A}^{\prime}:\bigcup\limits_{m=1}^{\infty}\mathcal{Z}^{m}\to\mathcal{H}$ and a monotonically decreasing sequence $\epsilon_{\text{lnblty}}(m)$ such that $\epsilon_{\text{lnblty}}(m)\xrightarrow{m\to\infty}0$ and

Vapnik and Chervonenkis showed that finite VC dimension of the hypothesis class is necessary and sufficient for learnability in binary classification problems. As Shalev-Shwartz et al. note, since finite VC dimension is equivalent to uniform convergence, it can thus be concluded that uniform convergence is necessary and sufficient for learnability binary classification problems.

However, learnability is a strong notion that does not necessarily have to hold for a particular learning algorithm to generalize well for a particular underlying distribution. Roughly speaking, this is because learnability evaluates the algorithm under all possible distributions, including many complex distributions; while a learning algorithm may generalize well for a particular distribution under a given hypothesis class, it may fail to do so on more complex distributions under the same hypothesis class.

For more intuition, we present a more concrete but informal argument below. However, this argument is technically redundant because learnability is equivalent to uniform convergence for binary classification, and since we established the lack of necessity of uniform convergence, we effectively established the same for learnability too. However, we still provide the following informal argument as it provides a different insight into why learnability and uniform convergence are not necessary to explain generalization.

Our goal is to establish that in the set up of Section 3, even if we considered the binary classification problem corresponding to $\mathcal{H}_{\delta}$ (the class consisting of only those hypotheses explored by the algorithm $\mathcal{A}$ under a distribution $\mathcal{D}$ ), the corresponding binary classification problem is not learnable i.e., Equation 26 does not hold when we plug in $\mathcal{H}_{\delta}$ in place of $\mathcal{H}$ .

First consider distributions of the following form that is more complex than the linearly separable $\mathcal{D}$ : for any dataset $S^{\prime}$ , let $\mathcal{D}_{S^{\prime}}$ be the distribution that has half its mass on the part of the linearly separable distribution $\mathcal{D}$ excluding $S^{\prime}$ , and half its mass on the distribution that is uniformly distributed over $S^{\prime}$ . Now let $S^{\prime}$ be a random dataset drawn from $\mathcal{D}$ but with all its labels flipped; consider the corresponding complex distribution $\mathcal{D}_{S^{\prime}}$ .

We first show that there exists $h\in\mathcal{H}_{\delta}$ that fits this distribution well. Now, for most draws of the “wrongly” labeled $S^{\prime}$ , we can show that the hypothesis $h$ for which $\boldsymbol{\mathbf{w}}_{1}=2\cdot\boldsymbol{\mathbf{u}}$ and $\boldsymbol{\mathbf{w}}_{2}=\sum_{(x,y)\in S^{\prime}}y\cdot\boldsymbol{\mathbf{x}}_{2}$ fits the “wrong” labels of $S^{\prime}$ perfectly; this is because, just as argued in Lemma E.2, $\boldsymbol{\mathbf{w}}_{2}$ dominates the output on all these inputs, although $\boldsymbol{\mathbf{w}}_{1}$ would be aligned incorrectly with these inputs. Furthermore, since $\boldsymbol{\mathbf{w}}_{2}$ does not align with most inputs from $\mathcal{D}$ , by an argument similar to Lemma E.1, we can also show that this hypothesis has at most $\epsilon$ error on $\mathcal{D}$ , and that this hypothesis belongs to $\mathcal{H}_{\delta}$ . Overall this means that, w.h.p over the choice of $S^{\prime}$ , there exists a hypothesis $h\in\mathcal{H}_{\delta}$ for which the error on the complex distribution $\mathcal{D}_{S^{\prime}}$ is at most $\epsilon/2$ i.e.,

On the other hand, let $\mathcal{A}^{\prime}$ be any learning rule which outputs a hypothesis given $S\sim\mathcal{D}_{S^{\prime}}$ . With high probability over the draws of $S\sim\mathcal{D}_{S^{\prime}}$ , only at most, say $3/4$ th of $S$ (i.e., $0.75m$ examples) will be sampled from $S^{\prime}$ (and the rest from $\mathcal{D}$ ). Since the learning rule which has access only to $S$ , has not seen at least a quarter of $S^{\prime}$ , with high probability over the random draws of $S^{\prime}$ , the learning rule will fail to classify roughly half of the unseen examples from $S^{\prime}$ correctly (which would be about $(m/4)\cdot 1/2=m/8$ ). Then, the error on $\mathcal{D}_{S^{\prime}}$ will be at least $1/16$ . From the above arguments, we have that $\epsilon_{\text{learnability}}(m)\geq 1/16-\epsilon/2$ , which is a non-negligible constant that is independent of $m$ .

Appendix J Deterministic PAC-Bayes bounds are two-sided uniform convergence bounds

By definition, VC-dimension, Rademacher complexity and other covering number based bounds are known to upper bound the term $\epsilon_{\text{unif-alg}}$ and therefore our negative result immediately applies to all these bounds. However, it may not be immediately clear if bounds derived through the PAC-Bayesian approach fall under this category too. In this discussion, we show that existing deterministic PAC-Bayes based bounds are in fact two-sided in that they are lower bounded by $\epsilon_{\text{unif-alg}}$ too.

For a given prior distribution $P$ over the parameters, a PAC-Bayesian bound is of the following form: with high probability $1-\delta$ over the draws of the data $S$ , we have that for all distributions $Q$ over the hypotheses space:

Note that here for any $a,b\in$ , $KL(a\|b)=a\ln\frac{a}{b}+(1-a)\ln\frac{1-a}{1-b}$ . Since the precise form of the PAC-Bayesian bound on the right hand side is not relevant for the rest of the discussion, we will concisely refer to it as $\epsilon_{\textrm{pb}}(P,Q,m,\delta)$ . What is of interest to us is the fact that the above bound holds for all $Q$ for most draws of $S$ and that the KL-divergence on the right-hand side is in itself two-sided, in some sense.

Typically, the above bound is simplified to derive the following one-sided bound on the difference between the expected and empirical errors of a stochastic network (see for example):

This bound is then manipulated in different ways to obtain bounds on the deterministic network. In the rest of this discussion, we focus on the two major such derandomizing techniques and argue that both these techniques boil down to two-sided convergence. While, we do not formally establish that there may exist other techniques which ensure that the resulting deterministic bound is strictly one-sided, we suspect that no such techniques may exist. This is because the KL-divergence bound in Equation 27 is in itself two-sided in the sense that for the right hand side bound to be small, both the stochastic test and train errors must be close to each other; it is not sufficient if the stochastic test error is smaller than the stochastic train error.

To derive a deterministic generalization bound, one approach is to add extra terms that account for the perturbation in the loss of the network . That is, define:

Then, one can get a deterministic upper bound as:

Note that while applying this technique, for any hypothesis $h$ , one picks a posterior $Q_{h}$ specific to that hypothesis (typically, centered at that hypothesis).

We formally define the deterministic bound resulting from this technique below. We consider the algorithm-dependent version and furthermore, we consider a bound that results from the best possible choice of $Q_{h}$ for all $h$ . We define this deterministic bound in the format of $\epsilon_{\text{unif-alg}}$ as follows:

The distribution-dependent, algorithm-dependent, deterministic PAC-Bayesian bound of (the hypothesis class $\mathcal{H}$ , algorithm $\mathcal{A}$ )-pair with respect to $\mathcal{L}$ is defined to be the smallest value $\epsilon_{\text{pb-det-A}}(m,\delta)$ such that the following holds:

there exists a set of $m$ -sized samples $\mathcal{S}_{\delta}\subseteq(\mathcal{X}\times\{-1,+1\})^{m}$ for which:

and if we define $\mathcal{H}_{\delta}=\bigcup_{S\in\mathcal{S}_{\delta}}\{h_{S}\}$ to be the space of hypotheses explored only on these samples, then there must exist a prior $P$ and for each $h\in\mathcal{H}_{\delta}$ , a distribution $Q_{h}$ , such that uniform convergence must hold as follows:

as a result of which, by Equation 28, the following one-sided uniform convergence also holds:

Now, recall that $\epsilon_{\text{unif-alg}}(m,\delta)$ is a two-sided bound, and in fact our main proof crucially depended on this fact in order to lower bound $\epsilon_{\text{unif-alg}}(m,\delta)$ . Hence, to extend our lower bound to $\epsilon_{\text{pb-det-A}}(m,\delta)$ we need to show that it is also two-sided in that it is lower bounded by $\epsilon_{\text{unif-alg}}(m,\delta)$ . The following result establishes this:

Let $\mathcal{A}$ be an algorithm such that on at least $1-\delta$ draws of the training dataset $S$ , the algorithm outputs a hypothesis $h_{S}$ that has $\hat{\epsilon}(m,\delta)$ loss on the training data $S$ . Then

First, by the definition of the generalization error, we know that with probability at least $1-\delta$ over the draws of $S$ ,

Furthermore since the training loss it at most $\hat{\epsilon}(m,\delta)$ on at least $1-\delta$ draws we have that on at least $1-2\delta$ draws of the dataset,

Let $\mathcal{H}_{\delta}$ and $\mathcal{S}_{\delta}$ be the subset of hypotheses and sample sets as in the definition of $\epsilon_{\text{pb-det-A}}$ . Then, from the above, there exist $\mathcal{H}_{3\delta}\subseteq\mathcal{H}_{\delta}$ and $\mathcal{S}_{3\delta}\subseteq\mathcal{S}_{\delta}$ such that

and $\mathcal{H}_{3\delta}=\bigcup_{S\in\mathcal{S}_{3\delta}}\{h_{S}\}$ , and furthermore,

Using the above, and the definition of $\Delta$ , we have for all $h\in\mathcal{H}_{3\delta}$ , the following upper bound on its stochastic test error:

We consider two cases. First, for some $h\in\mathcal{H}_{3\delta}$ and $S\in\mathcal{S}_{3\delta}$ , consider the case that $e^{3/2}b>a$ . Then, we have

Now consider the case where $a>e^{3/2}b$ . This means that $(1-a)<(1-b)$ . Then, if we consider the PAC-Bayesian bound of Equation 27,

on the second term, we can apply the inequality $\ln x\geq\frac{(x-1)(x+1)}{2x}=\frac{1}{2}\left(x-\frac{1}{x}\right)$ which holds for $x\in$ to get:

Plugging this back in Equation 33, we have,

Since, for all $h\in\mathcal{H}_{3\delta}$ and $S\in\mathcal{S}_{3\delta}$ , one of Equations 32 and 34 hold, we have that:

It follows from Equation 30 that the above bound holds good even after we take the absolute value of the first term in the left hand side. However, the absolute value is lower-bounded by $\epsilon_{\text{unif-alg}}(m,3\delta)$ (which follows from how $\epsilon_{\text{unif-alg}}(m,3\delta)$ is defined to be the smallest possible value over the choices of $\mathcal{H}_{3\delta},\mathcal{S}_{3\delta}$ ).

As a result of the above theorem, we can show that $\epsilon_{\textrm{pb-det-A}}(m,\delta)={\Omega}(1)-\mathcal{O}(\epsilon)$ , thus establishing that, for sufficiently large $D$ , even though the generalization error would be negligibly small, the PAC-Bayes based bound would be as large as a constant.

In the setup of Section 3, for any $\epsilon,\delta>0,\delta<1/12$ , when $D=\Omega\left(\max\left(m\ln\frac{3}{\delta},m\ln\frac{1}{\epsilon}\right)\right)$ , we have,

The fact that $\epsilon_{\text{gen}}(m,\delta)\leq\epsilon$ follows from Theorem 3.1. Additionally, $\hat{\epsilon}(m,\delta)=0$ follows from the proof of Theorem 3.1. Now, as long as $3\delta<1/4$ , and $D$ is sufficiently large (i.e., in the lower bounds on $D$ in Theorem 3.1, if we replace $\delta$ by $3\delta$ ), we have from Theorem 3.1 that $\epsilon_{\text{unif-alg}}(m,3\delta)>1-\epsilon$ . Plugging these in Theorem J.1, we get the result in the above corollary. ∎

J.2 Deterministic PAC-Bayesian Bounds of Type B

In this section, we consider another standard approach to making PAC-Bayesian bounds deterministic . Here, the idea is to pick for each $h$ a distribution $Q_{h}$ such that for all $\boldsymbol{\mathbf{x}}$ :

Then, by applying the PAC-Bayesian bound of Equation 28 for the loss $\mathcal{L}^{\prime}_{\gamma/2}$ , one can get a deterministic upper bound as follows, without having to introduce the extra $\Delta$ terms,

The distribution-dependent, algorithm-dependent, deterministic PAC-Bayesian bound of (the hypothesis class $\mathcal{H}$ , algorithm $\mathcal{A}$ )-pair is defined to be the smallest value $\epsilon_{\text{pb-det-B}}(m,\delta)$ such that the following holds:

there exists a set of $m$ -sized samples $\mathcal{S}_{\delta}\subseteq(\mathcal{X}\times\{-1,+1\})^{m}$ for which:

and if we define $\mathcal{H}_{\delta}=\bigcup_{S\in\mathcal{S}_{\delta}}\{h_{S}\}$ to be the space of hypotheses explored only on these samples, then there must exist a prior $P$ and for each $h$ a distribution $Q_{h}$ , such that uniform convergence must hold as follows: for all $S\in\mathcal{S}_{\delta}$ and for all $h\in\mathcal{H}_{\delta}$ ,

as a result of which the following one-sided uniform convergence also holds:

We can similarly show that $\epsilon_{\text{pb-det-B}}(m,\delta)$ is lower-bounded by the uniform convergence bound of $\epsilon_{\text{unif-alg}}$ too.

Let $\mathcal{A}$ be an algorithm such that on at least $1-\delta$ draws of the training dataset $S$ , the algorithm outputs a hypothesis $h_{S}$ such that the margin-based training loss can be bounded as:

and with high probability $1-\delta$ over the draws of $S$ , the generalization error can be bounded as:

Then there exists a set of samples $\mathcal{S}_{3\delta}$ of mass at least $1-3\delta$ , and a corresponding set of hypothesis $\mathcal{H}_{3\delta}$ learned on these sample sets such that:

Note that the above statement is slightly different from how Theorem J.1 is stated as it is not expressed in terms of $\epsilon_{\text{unif-alg}}$ . In the corollary that follows the proof of this statement, we will see how it can be reduced in terms of $\epsilon_{\text{unif-alg}}$ .

Most of the proof is similar to the proof of Theorem J.1. Like in the proof of Theorem J.1, we can argue that there exists $\mathcal{S}_{3\delta}$ and $\mathcal{H}_{3\delta}$ for which the test error can be bounded as,

where we have used $\epsilon_{\text{gen}}(m,\delta)$ to denote the generalization error of $\mathcal{L}^{\prime(\gamma)}$ and not the 0-1 error (we note that this is ambiguous notation, but we keep it this way for simplicity).

Now consider the case where $a>e^{3/2}b$ . Again, by similar arithmetic manipulation in the PAC-Bayesian bound of Equation 28 applied on $\mathcal{L}^{\prime(\gamma/2)}$ , we get,

Since, for all $h\in\mathcal{H}_{3\delta}$ and $S\in\mathcal{S}_{3\delta}$ , one of Equations 37 and 38 hold, we have the claimed result.

Similarly, as a result of the above theorem, we can show that $\epsilon_{\textrm{pb-det-B}}(m,\delta)={\Omega}(1)-\mathcal{O}(\epsilon)$ , thus establishing that, for sufficiently large $D$ , even though the generalization error would be negligibly small, the PAC-Bayes based bound would be as large as a constant and hence cannot explain generalization.

In the setup of Section 3, for any $\epsilon,\delta>0,\delta<1/12$ , when $D=\Omega\left(\max\left(m\ln\frac{3}{\delta},m\ln\frac{1}{\epsilon}\right)\right)$ , we have,

It follows from the proof of Theorem 3.1 that $\hat{\epsilon}(m,\delta)=0$ , since all training points are classified by a margin of $\gamma$ (see Equation 9). Similarly, from Equation 12 in that proof, since most test points are classified by a margin of $\gamma$ , $\epsilon_{\text{gen}}(m,\delta)\leq\epsilon$ . Now, as long as $3\delta<1/4$ , and $D$ is sufficiently large (i.e., in the lower bounds on $D$ in Theorem 3.1, if we replace $\delta$ by $3\delta$ ), we will get that there exists $S\in\mathcal{S}_{3\delta}$ and $h\in\mathcal{H}_{3\delta}$ for which the empirical loss $\mathcal{L}^{(0)}$ loss is $1$ . Then, by Theorem J.2, we get the result in the above corollary. ∎