Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
Gintare Karolina Dziugaite, Daniel M. Roy
Introduction
By optimizing a PAC-Bayes bound, we show that it is possible to compute nonvacuous numerical bounds on the generalization error of deep stochastic neural networks with millions of parameters, despite the training data sets being one or more orders of magnitude smaller than the number of parameters. To our knowledge, these are the first explicit and nonvacuous numerical bounds computed for trained neural networks in the modern deep learning regime where the number of network parameters eclipses the number of training examples.
The bounds we compute are data dependent, incorporating millions of components optimized numerically to identify a large region in weight space with low average empirical error around the solution obtained by stochastic gradient descent (SGD). The data dependence is essential: indeed, the VC dimension of neural networks is typically bounded below by the number of parameters, and so one needs as many training data as parameters before (uniform) PAC bounds are nonvacuous, i.e., before the generalization error falls below 1. To put this in concrete terms, on MNIST, having even 72 hidden units in a fully connected first layer yields vacuous PAC bounds.
Evidently, we are operating far from the worst case: observed generalization cannot be explained in terms the regularizing effect of the size of the neural network alone. This is an old observation, and one that attracted considerable theoretical attention two decades ago: Bartlett [Bar97, Bar98] showed that, in large (sigmoidal) neural networks, when the learned weights are small in magnitude, the fat-shattering dimension is more important than the VC dimension for characterizing generalization. In particular, Bartlett established classification error bounds in terms of the empirical margin and the fat-shattering dimension, and then gave fat-shattering bounds for neural networks in terms of the magnitudes of the weights and the depth of the network alone. Improved norm-based bounds were obtained using Rademacher and Gaussian complexity by [BM02, KP02].
These norm-based bounds are the foundation of our current understanding of neural network generalization. It is widely accepted that these bounds explain observed generalization, at least “qualitatively” and/or when the weights are explicitly regularized. Indeed, recent work by [NTS14] puts forth the idea that SGD performs implicit norm-based regularization. Somewhat surprisingly, when we investigated state-of-the-art Rademacher bounds for ReLU networks, the bounds were vacuous when applied to solutions obtained by SGD on real networks/datasets. We discuss the details of this analysis in Appendix D. While most theoreticians would assume these bounds were numerically loose to some extent, they might be surprised to learn that the bounds do not logically establish generalization on their own. It is worth highlighting that this observation does not necessarily rule out the existence of nonvacuous bounds under the same or similar hypotheses. This is an important avenue to investigate.
Our investigation was instigated by recent empirical work by \citefullauthorRethinking17 [Zha+17], who show that stochastic gradient descent (SGD), applied to deep networks with millions of parameters, is:
able to achieve training error on CIFAR10 and IMAGENET and still generalize (i.e., test error remains small, despite the potential for overfitting);
still able to achieve training error even after the labels are randomized, and does so with only a small factor of additional computational time.
Taken together, these two observations demonstrate that these networks have a tremendous capacity to overfit and yet SGD does not abuse this capacity as it optimizes the surrogate loss, despite the lack of explicit regularization.
It is a major open problem to explain this phenomenon. A natural approach would be to show that, under realistic hypotheses, SGD performs implicit regularization or tends to find solutions that possess some particular structural property that we already know to be connected to generalization. However, in order to complete the logical connection, we need an associated error bound to be nonvacuous in the regime of model size / data size where we hope to explain the phenomenon.
This work establishes a potential candidate, building off ideas by [Lan02] and [LC02]: On a binary class variant of MNIST, we find that SGD solutions are nearby to relatively large regions in weight space with low average empirical error. We find this structure by optimizing a PAC-Bayes bound, starting at the SGD solution, obtaining a nonvacuous generalization bound for a stochastic neural network. Across a variety of network architectures, our PAC-Bayes bounds on the test error are in the range 16–22%. These are far from nonvacuous but loose: Chernoff bounds on the test error based on held-out data are consistently around 3%. Despite the gap, theoreticians aware of the numerical performance of generalization bounds will likely be surprised that it is possible at all to obtain nonvacuous numerical bounds for models with such large capacity trained on so few training examples. While we cannot entirely explain the magnitude of generalization, we can demonstrate nontrivial generalization.
Our approach was inspired by a line of work in physics by \citefullauthorPhysRevLett.115.128101 [Bal+15] and the same authors with Borgs and Chayes [Bal+16]. Based on theoretical results for discrete optimization linking computational efficiency to the existence of nonisolated solutions, the authors propose a number of new algorithms for learning discrete neural networks by explicitly driving a local search towards nonisolated solutions. On the basis of Bayesian ideas, they posit that these solutions have good generalization properties. In a recent work with Chaudhari, Choromanska, Soatto, and LeCun [Cha+17], they introduce local-entropy loss and EntropySGD, extending these algorithmic ideas to modern deep learning architectures with continuous parametrizations, and obtaining impressive empirical results.
In the continuous setting, nonisolated solutions correspond to “flat minima”. The existence and regularizing effects of flat minima in the empirical error surface was recognized early on by researchers, going back at work by [HC93] and [HS97]. [HS97] discuss sharp versus flat minima using the language of minimum description length (MDL; [Ris83, Grü07]). In short, describing weights in sharp minima requires high precision in order to not incur nontrivial excess error, whereas flat minimum can be described with lower precision. A similar coding argument appears in [HC93].
Hochreiter97 propose an algorithm to find flat minima by minimizing the training error while maximizing the log volume of a connected region of the parameter space that yields similar classifiers with similarly good training error. There are very close connections—at both the level of analysis and algorithms—with the work of [Cha+17] and close connections with the approach we take to compute nonvacuous error bounds by exploiting the local error surface. (We discuss more related work in Section 6.)
Despite the promising underpinnings, the generalization theorems given by [Cha+17] have admittedly unrealistic assumptions, and fall short of connecting local-entropy minimization to observed generalization.
The goal of this work is to identify structure in the solutions obtained by SGD that provably implies small generalization error. Computationally, it is much easier to demonstrate that a randomized classifier will generalize, and so our results actually pertain to the generalization error of a stochastic neural network, i.e., one whose weights/biases are drawn at random from some distribution on every forward evaluation of the network. Under bounded loss, Fubini’s theorem implies that we also obtain a bound on the expected error of a neural network whose weights have been randomly perturbed. It would be interesting to achieve tighter control on the distribution of error or on the error of the mean neural network.
Returning to the goal of explaining SGD and generalization in deep learning more generally, one could study whether the type of structure we exploit to obtain bounds necessarily arises from performing SGD under natural conditions. (We suspect one condition may be that the Bayes error rate is close to zero.) More ambitiously, perhaps the existence of the same structure can explain the success of SGD in practice.
2 Approach
Our working hypothesis is that SGD finds good solutions only if they are surrounded by a relatively large volume of solutions that are nearly as good. This hypothesis suggests that PAC-Bayes bounds may be fruitful: if SGD finds a solution contained in a large volume of equally good solutions, then the expected error rate of a classifier drawn at random from this volume should match that of the SGD solution. The PAC-Bayes theorem [McA99] bounds the expected error rate of a classifier chosen from a distribution in terms of the Kullback–Liebler divergence from some a priori fixed distribution , and so if the volume of equally good solutions is large, and not too far from the mass of , we will obtain a nonvacuous bound.
Our approach will be to use optimization to find a broad distribution over neural network parameters that minimizes the PAC-Bayes bound, in effect mapping out the volume of equally good solutions surrounding the SGD solution. This idea is actually a modern take on an old idea by \citefullauthorLC02 [LC02], who apply PAC-Bayes bounds to small two-layer stochastic neural networks (with only hidden units) that were trained on (relatively large, in comparison) data sets of several hundred labeled examples.
The basic idea can be traced back even further to work by \citefullauthorHinton93 [HC93], who propose an algorithm for controlling overfitting in neural networks via the minimum description length principle. In particular, they minimize the sum of the empirical squared error and the KL divergence between a prior and posterior distribution on the weights. Their algorithm is applied to networks with 100’s of inputs and 4 hidden units, trained on several hundred labeled examples. \citefullauthorHinton93 do not compute numerical generalization bounds to verify that MDL principles alone suffice to explain the observed generalization.
Our algorithm more directly extends the work by [LC02], who propose to construct a distribution over neural networks by performing a sensitivity analysis on each parameter after training, searching for the largest deviation that does not increase the training error by more than, e.g., . For , [LC02] choose a multivariate normal distribution over the network parameters, centered at the parameters of the trained neural network. The covariance matrix is diagonal, with the variance of each parameter chosen to be the estimated sensitivity, scaled by a global constant. (The global scale is chosen so that the training error of is within, e.g., of that of the original trained network.) Their prior is also a multivariate normal, but with zero mean and covariance given by some scalar multiple of the identity matrix. By employing a union bound, they allow themselves to choose the scalar multiple in a data-dependent fashion to optimize the PAC-Bayes bound.
The algorithm sketched by [LC02] does not scale to modern neural networks for several reasons, but one dominates: in massively overparametrized networks, individual parameters often have negligible effect on the training classification error, and so it is not possible to estimate the relative sensitivity of large populations of neurons by studying the sensitivity of neurons in isolation.
Instead, we use stochastic gradient descent to directly optimize the PAC-Bayes bound on the error rate of a stochastic neural network. At each step, we update the network weights and their variances by taking a step along an unbiased estimate of the gradient of (an upper bound on) the PAC-Bayes bound. In effect, the objective function is the sum of i) the empirical surrogate loss averaged over a random perturbation of the SGD solution, and ii) a generalization error bound that acts like a regularizer.
Having demonstrated that this simple approach can construct a witness to generalization, it is worthwhile asking whether these ideas can be extended to the setting of local-entropic loss [Cha+17]. If we view the distribution that defines the local-entropic loss as defining a stochastic neural network, can we use PAC-Bayes bounds to establish nonvacuous bounds on its generalization error?
Preliminaries
which will serve as a convex surrogate (i.e., upper bound) to the 0–1 loss.
We define the following notions of error:
For , we will abuse notation and define
where denotes the Bernoulli distribution on with mean .
2 Inverting KL bounds
In the following sections, we will encounter bounds on a quantity of the form
for some and . Thus, we are interested in
3 Bounds
We will employ three probabilistic bounds to control generalization error: the union bound, a sample convergence bound derived from the Chernoff bound, and the PAC-Bayes bound due to [McA99]. We state the union bound for completeness.
Recall that denotes the Bernoulli distribution on with mean . The following bound is derived from the KL formulation of the Chernoff bound:
Finally, we present a variant of McAllester’s PAC-Bayes bound due to [LS01]. (See also [Lan02].)
The PAC-Bayes bound leads to the following learning algorithm [McA99]:
Fix a probability and a distribution on .
Collect an i.i.d. dataset of size .
Compute the optimal distribution on that minimizes the error bound
Return the randomized classifier given by .
In all but the simplest scenarios, making predictions according to the optimal is intractable. However, we can attempt to approximate it.
PAC-Bayes bound optimization
In order to obtain a KL divergence in closed form, we choose to be multivariate normal. Symmetry considerations would suggest that we choose for some , however there is no single good choice of . (We will also see that there are good reasons not to choose a zero mean, and so we will let denote the mean to be chosen a priori.)
and, using Eq. 1, the KL term simplifies to
We fix , , and .
2 Stochastic Gradient Descent
We cannot optimize Eq. 4 directly because we cannot compute or its gradients efficiently. We can, however, compute the gradient of the unbiased estimate , where . We will use an i.i.d. copy of at each iteration. We did not experiment using mini-batches during bound optimization.
3 Final PAC-Bayes bound
We cannot compute this bound exactly because computing is intractable. However, we can obtain unbiased estimates and apply the sample convergence bound (Theorem 2.2). In particular, given i.i.d. samples from , we produce the Monte Carlo approximation , for which is exactly computable, and obtain the bound
which holds with probability . By another application of the union bound,
with probability . We use this bound in our reported results.
Experiments
Starting from neural networks whose weights have been trained by SGD (with momentum) to achieve near-perfect accuracy on a binary class variant of MNIST, we then optimize a PAC-Bayes bound on the error rate of stochastic neural network whose weights are random perturbations of the weights learned by SGD. We consider several different network architectures, varying both the depth and the width of the network.
We use the MNIST handwritten digits data set [LCB10] as provided in Tensorflow [TF], where the dataset is split into the training set (55000 images) and test set (10000 images). (We do not use the validation set.) Each MNIST image is black and white and 28-pixels square, resulting in a network input dimension of . MNIST is usually treated as a multiclass classification problem. In order to use standard PAC-Bayes bounds, we produce a binary classification problem by mapping numbers to label and to label . In some experiments, we train on random labels, i.e., binary labels drawn independently and uniformly at random.
2 Initial network training by SGD
All experiments are performed on multilayer perceptrons, i.e., feed-forward neural networks with 2 or more layers, each layer fully connected to the previous and next layer. We choose a standard initialization scheme for the weights and biases: Weights are initialized randomly from a normal distribution (with mean zero and standard deviation ) that is truncated to . Biases are initialized to a constant value of 0.1 for the first layer and 0 for the remaining layers. We let denote this random initialization of the weights (and biases).
We use REctified Linear Unit (RELU) activations at every hidden node. The last layer is linear. In order to train the weights, we minimize the logistic loss by SGD with momentum (learning rate 0.01; momentum 0.9). SGD is run in mini-batches of size 100. These settings are similar to those in [Zha+17].
On our binary class variant of MNIST, we train several neural network architectures of varying depth and width (see Table 1). In each case, we train for a total of 20 epochs. We also train a small network (with one 600-unit hidden layer) on random labels, in order to demonstrate the large capacity of the network. Obtaining 0 training error requires 120 epochs. See the first two rows of Table 1 for the train/test error rates.
3 PAC-Bayes bound optimization
We optimize the objective by gradient descent with the RMSprop optimizer (with decay 0.9, as is typical). We use the unbiased estimate of the gradient of the empirical surrogate error of the randomized classifier . We set the learning rate to 0.001 for the first 150000 iterations, and then lower it to 0.0001 for the final 50000 iterations. For the random-label experiment, we optimize the bound with a smaller 0.0001 learning rate for 500000 iterations. In both cases, the learning rate is tuned so that the objective decreases smoothly during learning.
Algorithm 1 is pseudo code for optimizing the PAC-Bayes bound. The code implements vanilla SGD, although it can be easily modified to use an optimizer like RMSprop.
4 Reported values
Reported error rates correspond to classification error. Train and test error rates are empirical averages for networks learned by SGD. In light of 10000 test data points and the observed error rates, upper bounds via Theorem 2.2 are only 0.005 higher.
Reported train and test error rates for the stochastic neural networks (abbreviated SNN) are upper bounds computed by an application of Theorem 2.2 as described in Section 3.3 with and . These numbers produce estimates within 0.001–0.002.
Our VC-dimension bounds for ReLU networks are computed from a formula communicated to us by [Bar17]. These bounds are in , where is the number of layers and is the total number of tunable parameters across layers.
Results
See Table 1. All SGD trained networks achieve perfect or near-perfect accuracy on the training data. On true labels, the SNN mean training error increases slightly as the weight distribution broadens to minimize the KL divergence. The SGD solution is close to mean of the SNN as measured with respect to the SNN covariance. (See Appendix C for a discussion.) For the random-label experiment, the SNN mean training error rises above 10%. Ideally, it might have risen to nearly 50%, while driving down the KL term to near zero.
The empirical test error of the SGD classifiers does not change much across the different architectures, despite the potential for overfitting. This phenomenon is well known, though still remarkable. For the random-label experiment, the empirical test classification error of 0.508 represents lack of generalization, as expected. The same two patterns hold for the SNN test error too, with slightly higher error rates.
Remarkably, the PAC-Bayes bounds do not grow much despite the networks becoming several times larger, and all true label experiments have classification error bounded by 0.23. (This observation is consistent with [NTS14].) Since larger networks possess many more symmetries, the true PAC-Bayes bounds for our learned stochastic neural network classifier might be substantially smaller. (See Appendix B for a discussion.) While these bounds are several times larger than the test error estimated on held-out data (approximately, 0.03), they demonstrate nontrivial generalization. The PAC-Bayes bound for the random-label experiment is vacuous.
The VC-dimension upper bounds indicate that data independent bounds will be vacuous by several orders of magnitude. Because the number of parameters exceeds the available training data, lower bounds imply that generalization cannot be explained in a data independent way.
Related work
As we mention in the introduction, our approach scales the ideas in [HC93] and [LC02] to the modern deep learning regime where the networks have millions of parameters, but are trained on one or two orders of magnitude fewer training examples. The objective we optimize is an upper bound on the PAC-Bayes bound, which we know from the discussion in Section 2.2 will be very loose when the empirical classification error is approximately zero. Indeed, in that case, the PAC-Bayes bound is approximately
The objective optimized by \citefullauthorHinton93 is of the same essential form as this one, except for the choice of squared error and different prior and posterior distributions. We explored using Eq. 7 as our objective with a surrogate loss, but it did not produce different results.
In the introduction we discuss the close connection of our work to several recent papers [Bal+15, Bal+16, Cha+17] that study “flat” or nonisolated minima on the account of their generalization and/or algorithmic properties.
Based on theoretical results for k-SAT that efficient algorithms find nonisolated solutions, [Bal+16] model efficient neural network learning algorithms as minimizers of a replicated version of the empirical loss surface, which emphasizes nonisolated minima and deemphasizes isolated minima. They then propose several algorithms for learning discrete neural networks using these ideas.
In follow-up work with Chaudhari, Choromanska, Soatto, and LeCun [Cha+17], they translate these ideas into the setting of continuously parametrized neural networks. They introduce an algorithm, called Entropy-SGD, which seeks out large regions of dense local minima: it maximizes the depth and flatness of the energy landscape. Their objective integrates both the energy of nearby parameters and the weighted distance to the parameters. In particular, rather than directly minimizing an error surface , they propose the following minimization problem over the so-called local-entropic loss:
where is a parameter and a constant. In comparison, our algorithm can be interpreted as an optimization of the form
[Cha+17] give an informal characterization of the generalization properties of local-entropic loss in Bayesian terms by comparing the marginal likelihood of two Bayesian priors centered at a solution with small and large local-entropic loss. Informally, a Bayesian prior centered on an isolated solution will lead to small marginal likelihood in contrast to one centered in a wide valley. They give a formal result relying on the uniform stability of SGD [HRS15] to show under some strong (and admittedly unrealistic) conditions that Entropy-SGD generalizes better than SGD. The key property is that the local-entropic loss surface is smoother than the original error surface.
Other authors have found evidence of the importance of “flat” minima: Recent work by \citefullauthorKMNST16 [Kes+17] finds that large-batch methods tend to converge to sharp / isolated minima and have worse generalization performance compared to mini-batch algorithms, which tend to converge to flat minima and have good generalization performance. The bulk of their paper is devoted to the problem of restoring good generalization behavior to batch algorithms.
Conclusion and Future work
We obtain nonvacuous generalization bounds for deep neural networks with millions of parameters trained on 55000 MNIST examples. These bounds are obtained by optimizing an objective derived from the PAC-Bayes bound, starting from the solution produced by SGD. Despite the weights changing, the SGD solution remains well within the 1% ellipsoidal quantile, i.e., the volume spanned by the stochastic neural network contains the original SGD solution. (When labels are randomized, however, optimizing the PAC-Bayes bound causes the solution to shift considerably.)
Our experiments look only at fully connected feed forward networks trained on a binary class variant of MNIST. It would be interesting to see if the results extend to multiclass classification, to other data sets, and to other types of architectures, especially convolutional ones.
Our PAC-Bayes bound can be tightened in several ways. Highly dependent weights constrain the size of the axis-aligned ellipsoid representing the stochastic neural network. We can potentially recognize small populations of highly dependent weights, and optimize their covariance parameters, rather than enforcing independence in the posterior.
One might also consider replacing the multivariate normal posterior with a distribution that is more tuned to the loss surface. One promising avenue is to follow the lines of [Cha+17] and consider (local) Gibbs distributions. If the solutions obtained by minimizing the local-entropic loss are flatter than those obtained by SGD, than we may be able to demonstrate quantitatively tighter bounds.
Finally, there is the hard work of understanding the generalization properties of SGD. In light of our work, it may be useful to start by asking whether SGD finds solutions in flat minima. Such solutions could then be lifted to stochastic neural networks with good generalization properties. Going from stochastic networks back to deterministic ones may require additional structure.
This research was carried out while the authors were visiting the Simons Institute for the Theory of Computing at UC Berkeley. The authors would like to thank Peter Bartlett, Shai Ben-David, Dylan Foster, Matus Telgarsky, and Ruth Urner for helpful discussions. GKD is supported by an EPSRC studentship. DMR is supported by an NSERC Discovery Grant, Connaught Award, and U.S. Air Force Office of Scientific Research grant #FA9550-15-1-0074.
References
Our reported results use five steps of Newton’s method.
Appendix B Network symmetries
In an ideal world, we would account for all the network symmetries when computing the KL divergence in the PAC-Bayes bound. However, it does not seem to be computationally feasible to account for the symmetries, as we discuss below. Given this, it makes sense to try to break the symmetries somehow. Indeed, one consequence of randomly initializing a neural network’s weights is that some symmetries are broken. If we do not expect SGD to reverse (many of) these symmetries, then the initial weight configuration, , will be a better mean for the PAC-Bayes prior than the origin. In fact, breaking symmetries in this way lead to much better bounds than setting the means to zero.
The above lemma can be generalized to distributions over (potentially infinite) sets of network symmetries.
In this work, we therefore take a different approach to dealing with symmetries. Neural networks are randomly initialized in order to break symmetries. Combined with the idea that the learned parameters will reflect these broken symmetries, we choose our prior to be located at the random initialization, rather than at zero.
Appendix C Comparing weights before and after PAC-Bayes optimization
In the course of optimizing the PAC-Bayes bound, we allow the mean to deviate from the SGD solution that serves as the starting point. This is necessary to obtain bounds as tight as those that we computed. Do the weights change much during optimization of the bound? How would we measure this change?
To answer these questions, we calculated the p-value of the SGD solution under the distribution of the stochastic neural network.
The estimate was for all true label experiments, i.e., is less extreme of a perturbation of than a typical perturbation. For the random-label experiments, and differ significantly, which is consistent with the bound being optimized in the face of random labels.
Appendix D Evaluating Rademacher error bounds
For every , with probability at least over the choice of , for all ,
In order to compute these bounds, we must compute (bounds on) the Rademacher complexity of appropriate function classes. To that end, we will use results by [NTS15] for ReLU networks (i.e., multilayer perceptrons with ReLU activations).
stated here in the special case of a 2-layer network with 1 output neuron. For any number of layers, the path norm can be computed easily in a forward pass, requiring only a matrix–vector product at each layer.
[NTS15] also provide the follow Rademacher bound in terms of the path norm:
If the upper bound appearing in the bound of Theorem D.2 is instead taken to be a bound on , then one essentially obtains the Gaussian complexity bounds for neural networks established by [BM02, KP02]. However, their bounds apply only to networks with bounded activation functions, ruling out ReLU networks.
In our experiments, we will compute the bound obtained by combining Theorems D.2 and D.1.
Note that the constant in Theorem D.1 must be chosen independently of the data . As in the original result [KP02, Thm. 2], one can use a union bound to allow oneself to choose based on the data in order to minimize the bound. Even though the effect of this change is usually (relatively) small, its magnitude depends on the particular weight function employed in the union bound. Instead, we will apply the bound with an optimized , yielding an optimistic bound (formally, a lower bound on any upper bound obtained from a union bound). We optimize over a grid of values, and handle the vacuous edge cases analytically. Nevertheless, we will see even the resulting (optimistic) bound is vacuous.
We use SGD to train a two-layer 600-hidden-unit ReLU network on the same binary class variant of MNIST used to evaluate our PAC-Bayes bounds. We set the global learning rate to 0.005. As in our PAC-Bayes experiments, we optimize the average logistic loss during training. The random initializations commonly used for ReLU networks lead to initial path norms that produce vacuous error bounds. In order to visualize the behavior of the path-norm bound under SGD, we reduce the standard deviation of the truncated-normal initialization from 0.04 to 0.0001. As before, we use mini-batches of 100 training examples, yielding 550 iterations per epoch.
For comparison, we also train the same network architecture while explicitly regularizing the path norm. ([NSS15] propose training neural networks via steepest descent with respect to the path norm. We leave this comparison to future work.)
D.2 Results
When the network is trained by optimizing the logistic cost function without regularization, the error bound becomes vacuous within a fraction of a single epoch. This occurs before the training error dips appreciable below chance. The bound’s behavior is due to the path norm diverging. While the level sets of the empirical margin distribution are growing, they are not growing fast enough to counteract the growth of the path norm. (See the left column of Fig. 1.)