Fantastic Generalization Measures and Where to Find Them

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio

Introduction

Deep neural networks have seen tremendous success in a number of applications, but why (and how well) these models generalize is still a mystery (Neyshabur et al.,, 2014; Zhang et al.,, 2016; Recht et al.,, 2019). It is crucial to better understand the reason behind the generalization of modern deep learning models; such an understanding has multiple benefits, including providing guarantees for safety-critical scenarios and the design of better models.

A number of papers have attempted to understand the generalization phenomenon in deep learning models from a theoretical perspective e.g. (Neyshabur et al., 2015b, ; Bartlett et al.,, 2017; Neyshabur et al., 2018a, ; Golowich et al.,, 2017; Arora et al.,, 2018; Nagarajan and Kolter, 2019a, ; Wei and Ma, 2019a, ; Long and Sedghi,, 2019). The most direct and principled approach for studying generalization in deep learning is to prove a generalization bound which is typically an upper bound on the test error based on some quantity that can be calculated on the training set. Unfortunately, finding tight bounds has proven to be an arduous undertaking. While encouragingly Dziugaite and Roy, (2017) showed that PAC-Bayesian bounds can be optimized to achieve a reasonably tight generalization bound, current bounds are still not tight enough to accurately capture the generalization behavior. Others have proposed more direct empirical ways to characterize generalization of deep networks without attempting to deriving bounds (Keskar et al.,, 2016; Liang et al.,, 2017). However, as pointed by Dziugaite and Roy, (2017), empirical correlation does not necessarily translate to a casual relationship between a measure and generalization.

A core component in (theoretical or empirical) analysis of generalization is the notion of complexity measure; a quantity that monotonically relates to some aspect of generalization. More specifically, lower complexity should often imply smaller generalization gap. A complexity measure may depend on the properties of the trained model, optimizer, and possibly training data, but should not have access to a validation set.

Theoretically motivated complexity measures such as VC-dimension, norm of parameters, etc., are often featured as the major components of generalization bounds, where the monotonic relationship between the measures and generalization is mathematically established. In contrast, empirically motivated complexity measures such as sharpness (Keskar et al.,, 2016) are justified by experimentation and observation. In this work, we do not need to distinguish between theoretically vs empirically motivated measures, and simply refer to both as complexity measures.

Despite the prominent role of complexity measures in studying generalization, the empirical evaluation of these measures is usually limited to a few models, often on toy problems. A measure can only be considered reliable as a predictor of generalization gap if it is tested extensively on many models at a realistic problem size. To this end, we carefully selected a wide range of complexity measures from the literature. Some of the measures are motivated by generalization bounds such as those related to VC-dimension, norm or margin based bounds, and PAC-Bayesian bounds. We further selected a variety of empirical measures such as sharpness (Keskar et al.,, 2016), Fisher-Rao norm (Liang et al.,, 2017) and path norms (Neyshabur et al.,, 2017).

In this study, we trained more than 10,000 models over two image classification datasets, namely, CIFAR-10 (Krizhevsky et al.,, 2014) and Street View House Numbers (SVHN) Netzer et al., (2011). In order to create a wide range of generalization behaviors, we carefully varied hyperparameters that are believed to influence generalization. We also selected multiple optimization algorithms and looked at different stopping criteria for training convergence. Details of all our measures and hyperparameter selections are provided in Appendix C. Training under all combination of hyperparameters and optimization resulted in a large pool of models. For any such model, we considered 40 complexity measures. The key findings that arise from our large scale study are summarized below:

It is easy for some complexity measures to capture spurious correlations that do not reflect more causal insights about generalization; to mitigate this problem, we propose a more rigorous approach for studying them.

Many norm-based measures not only perform poorly, but negatively correlate with generalization specifically when the optimization procedure injects some stochasticity. In particular, the generalization bound based on the product of spectral norms of the layers (similar to that of Bartlett et al., (2017)) has very strong negative correlation with generalization.

Sharpness-based measures such as PAC-Bayesian bounds (McAllester,, 1999) bounds and sharpness measure proposed by Keskar et al., (2016) perform the best overall and seem to be promising candidates for further research.

Measures related to the optimization procedures such as the gradient noise and the speed of the optimization can be predictive of generalization.

Our findings on the relative success of sharpness-based and optimization-based complexity measures for predicting the generalization gap can provoke further study of these measures.

The theoretically motivated measures that we consider in this work belong to a few different families: PAC-Bayes (McAllester,, 1999; Dziugaite and Roy,, 2017; Neyshabur et al.,, 2017); VC-dimension (Vapnik and Chervonenkis,, 1971); and norm-based bounds (Neyshabur et al., 2015b, ; Bartlett et al.,, 2017; Neyshabur et al., 2018a, ). The empirically motivated measures from prior literature that we consider are based on sharpness measure (Keskar et al.,, 2016); Fisher-Rao measure (Liang et al.,, 2017); distance of trained weights from initialization (Nagarajan and Kolter, 2019b, ) and path norm (Neyshabur et al., 2015a, ). Finally, we consider some optimization based measures based on the speed of the optimization algorithm as motivated by the work of (Hardt et al.,, 2015) and (Wilson et al., 2017a, ), and the magnitude of the gradient noise as motivated by the work of (Chaudhari and Soatto,, 2018) and (Smith and Le,, 2017).

A few papers have explored a large scale study of generalization in deep networks. Neyshabur et al., (2017) perform a small scale study of the generalization of PAC-Bayes, sharpness and a few different norms, and the generalization analysis is restricted to correlation. Jiang et al., (2018) studied the role of margin as a predictor of the generalization gap. However, they used a significantly more restricted set of models (e.g. no depth variations), the experiments were not controlled for potential undesired correlation (e.g. the models can have vastly different training error) and some measures contained parameters that must be learned from the set of models. Novak et al., (2018) conducted large scale study of neural networks but they only looked at correlation of a few measures to generalization. In contrast, we study thousands of models, and perform controlled experiments to avoid undesired artificial correlations. Some of our analysis techniques are inspired by Neal, (2019) who proposed the idea of studying generalization in deep models via causal graphs, but did not provide any details or empirical results connected to that idea. Our work focuses on measures that can be computed on a single model and compares a large number of bounds and measures across a much wider range of models in a carefully controlled fashion.

2 Notation

Generalization: What is the goal and how to evaluate?

Generalization is arguably the most fundamental and yet mysterious aspect of machine learning. The core question in generalization is what causes the triplet of a model, optimization algorithm, and data propertiesFor example, it is expected that images share certain structures that allows some models (which leverage these biases) to generalize., to generalize well beyond the training set. There are many hypotheses concerning this question, but what is the right way to compare these hypotheses? The core component of each hypothesis is complexity measure that monotonically relates to some aspect of generalization. Here we briefly discuss some potential approaches to compare different complexity measures:

Tightness of Generalization Bounds. Proving generalization bounds is very useful to establish the causal relationship between a complexity measure and the generalization error. However, almost all existing bounds are vacuous on current deep learning tasks (combination of models and datasets), and therefore, one cannot rely on their proof as an evidence on the causal relationship between a complexity measure and generalization currentlySee Dziugaite and Roy, (2017) for an example of non-vacuous generalization bound and related discussions..

Regularizing the Complexity Measure. One may evaluate a complexity measure by adding it as a regularizer and directly optimizing it, but this could fail due to two reasons. The complexity measure could change the loss landscape in non-trivial ways and make the optimization more difficult. In such cases, if the optimization fails to optimize the measure, no conclusion can be made about the causality. Another, and perhaps more critical, problem is the existence of implicit regularization of the optimization algorithm. This makes it hard to run a controlled experiment since one cannot simply turn off the implicit regularization; therefore, if optimizing a measure does not improve generalization it could be simply due to the fact that it is regularizing the model in the same way as the optimization is regularizing it implicitly.

Correlation with Generalization Evaluating measures based on correlation with generalization is very useful but it can also provide a misleading picture. To check the correlation, we should vary architectures and optimization algorithms to produce a set of models. If the set is generated in an artificial way and is not representative of the typical setting, the conclusions might be deceiving and might not generalize to typical cases. One such example is training with different portions of random labels which artificially changes the dataset. Another pitfall is drawing conclusion from changing one or two hyper-parameters (e.g changing the width or batch-size and checking if a measure would correlate with generalization). In these cases, the hyper-parameter could be the true cause of both change in the measure and change in the generalization, but the measure itself has no causal relationship with generalization. Therefore, one needs to be very careful with experimental design to avoid unwanted correlations.

In this work we focus on the third approach. While acknowledging all limitations of a correlation analysis, we try to improve the procedure and capture some of the causal effects as much as possible through careful design of controlled experiments. Further, to evaluate the effectiveness of complexity measures as accurately as possible, we analyze them over sufficiently trained models (if not to completion) with a wide range of variations in hyperparameters. For practical reasons, these models must reach convergence within a reasonable time budget.

In order to create models with different generalization behavior, we consider various hyperparameter types, which are known or believed to influence generalization (e.g. batch size, dropout rate, etc.). Formally, denote each hyperparameter by $\theta_{i}$ taking values from the set $\Theta_{i}$ , for $i=1,\dots,n$ and $n$ denoting the total number of hyperparameter typesIn our analysis we use $n=7$ hyperparameters: batch size, dropout probability, learning rate, network depth, weight decay coefficient, network width, optimizer.. For each value of hyperparameters $\bm{\theta}\triangleq(\theta_{1},\theta_{2},\dots,\theta_{n})\in\Theta$ , where $\Theta\triangleq\Theta_{1}\times\Theta_{2}\times\dots\times\Theta_{n}$ , we train the architecture until the training loss (cross-entropy value) reaches a given threshold $\epsilon$ . See the Appendix A.2 for a discussion on the choice of the stopping criterion. Doing this for each hyper-parameter configuration $\bm{\theta}\in\Theta$ , we obtain a total of $|\Theta|$ models. The space $\Theta$ reflects our prior knowledge about a reasonable hyperparameter space, both in terms of their types and values. Regarding the latter, one could, for example, create $\Theta_{i}$ by grid sampling of a reasonable number of points within a reasonable range of values for $\theta_{i}$ .

2 Evaluation Criteria

An ideal complexity measure must be such that, for any pair of trained models, if $\mu(\bm{\theta}_{1})>\mu(\bm{\theta}_{2})$ , then so is $g(\bm{\theta}_{1})>g(\bm{\theta}_{2})$ . We use Kendall’s rank coefficient $\tau$ (Kendall,, 1938) to capture to what degree such consistency holds among the elements of $\mathcal{T}$ .

Note that $\tau$ can vary between $1$ and $-1$ and attains these extreme values at perfect agreement (two rankings are the same) and perfect disagreement (one ranking is the reverse of the other) respectively. If complexity and generalization are independent, the coefficient becomes zero.

2.2 Granulated Kendall’s Coefficient

While Kendall’s correlation coefficient is an effective tool widely used to capture relationship between 2 rankings of a set of objects, we found that certain measures can achieve high $\tau$ values in a trivial manner – i.e. the measure may strongly correlate with the generalization performance without necessarily capturing the cause of generalization. We will analyze this phenomenon in greater details in subsequent sections. To mitigate the effect of spurious correlations, we propose a new quantity for reflecting the correlation between measures and generalization based on a more controlled setting.

None of the existing complexity measures is perfect. However, they might have different sensitivity and accuracy w.r.t. different hyperparameters. For example, sharpness may do better than other measures when only a certain hyperparameter (say batch size) changes. To understand such details, in addition to $\tau(\mathcal{T})$ , we compute $\tau$ for consistency within each hyperparameter axis $\Theta_{i}$ , and then average the coefficient across the remaining hyperparameter space. Formally, we define:

The inner $\tau$ reflects the ranking correlation between the generalization and the complexity measure for a small group of models where the only difference among them is the variation along a single hyperparameter $\theta_{i}$ . We then average the value across all combinations of the other hyperparameter axis. Intuitively, if a measure is good at predicting the effect of hyperparameter $\theta_{i}$ over the model distribution, then its corresponding $\psi_{i}$ should be high. Finally, we compute the average $\psi_{i}$ of average across all hyperparamter axes, and name it $\Psi$ :

If a measure achieves a high $\Psi$ on a given hyperparameter distribution $\Theta$ , then it should achieve high individual $\psi$ across all hyperparameters. A complexity measure that excels at predicting changes in a single hyperparameter (high $\psi_{i}$ ) but fails at the other hyperparameters (low $\psi_{j}$ for all $j\neq i$ ) will not do well on $\Psi$ . On the other hand, if the measure performs well on $\Psi$ , it means that the measure can reliably rank the generalization for each of the hyper-parameter changes.

A thought experiment to illustrate why $\Psi$ captures a better causal nature of the generalization than Kendall’s $\tau$ is as follows. Suppose there exists a measure that perfectly captures the depth of the network while producing random prediction if 2 networks have the same depth, this measure would do reasonably well in terms of $\tau$ but much worse in terms of $\Psi$ . In the experiments we consider in the following sections, we found that such a measure would achieve overall $\tau=0.362$ but $\Psi=0.11$ .

We acknowledge that this measure is only a small step towards the difficult problem of capturing the causal relationship between complexity measures and generalization in empirical settings, and we hope this encourages future work in this direction.

2.3 Conditional Independence Test: Towards Capturing the Causal Relationships

Relying on correlation is intuitive but perhaps unsatisfactory. In our experiments, we change several hyper-parameters and assess the correlation between a complexity measure and generalization. When we observe correlation between a complexity measure and generalization, we want to differentiate the following two scenarios:

Changing a hyper-parameter causes the complexity measure to be low and lower value of the measure causes the generalization gap to be low.

Changing a hyper-parameter causes the complexity measure to be low and changing the same hyper-parameter also causes the generalization to be low but the lower value of the complexity measure by itself has no effect on generalization.

The above two scenarios are demonstrated in Figure 1-Middle and Figure 1-Right respectively. In attempt to truly understand these relationships, we will rely on the tools from probabilistic causality. Our approach is inspired by the seminal work on Inductive Causation (IC) Algorithm by Verma and Pearl, (1991), which provides a framework for learning a graphical model through conditional independence test. While the IC algorithm traditionally initiates the graph to be fully connected, we will take advantage of our knowledge about generalization and prune edges of the initialized graph to expedite the computations. Namely, we assume that the choice of hyperparameter does not directly explain generalization, but rather it induces changes in some measure $\mu$ which can be used to explain generalization.

The above removes the unwanted correlation between generalization and complexity measure that is caused by hyperparameter types in set $\mathcal{S}$ . Since in our case the conditional mutual information between a complexity measure and generalization is at most equal to the conditional entropy of generalization, we normalize it with the conditional entropy to arrive at a criterion ranging between 0 and 1:

According to the IC algorithm, an edge is kept between two nodes if there exists no subset ${\mathcal{S}}$ of hyperparameter types such that the two nodes are independent, i.e. $\hat{{\mathcal{I}}}(V_{\mu},V_{g}\,|\,U_{\mathcal{S}})=0$ . In our setup, setting ${\mathcal{S}}$ to the set of all hyperparameter types is not possible as both the conditional entropy and conditional mutual information would become zero. Moreover, due to computational reasons, we only look at $|{\mathcal{S}}|\leq 2$ :

At a high level, the larger ${\mathcal{K}}$ is for a measure $\mu$ , the more likely an edge exists between $\mu$ and $g$ , and therefore the more likely $\mu$ can explain generalization. For details on the set-up, please refer to Appendix A.5 on how these quantities are estimated.

Generating a Family of Trained Models

We chose 7 common hyperparameter types related to optimization and architecture design, with 3 choices for each hyperparameter. We generated $3^{7}=2187$ models that are trained on the CIFAR-10 dataset. We analyze these $2187$ models in the subsequent sections; however, additional results including repeating the experiments 5 times as well as training the models using SVHN dataset are presentedAll the experiments reported in the main text have been repeated for 5 times. The mean (Table 9) is consistent with those presented in the main text and standard deviation (Table 10) is very small compared to the magnitude of the mean for all measures. Further, we also repeat the experiments once on the SVHN dataset (Table 7), whose results are also consistent with the observations made on CIFAR-10. in Appendix Section A.6. These additional experiments, which add up to more than 10,000 trained models, suggest that the observations we make here are robust to randomness, and, more importantly, captures general behaviors of image classification tasks.

We trained these models to convergence. Convergence criterion is chosen as when cross-entropy loss reaches the value 0.01. Any model that was not able to achieve this value of cross-entropyIn our analysis, less than 5 percent of the models do not reach this threshold. was discarded from further analysis. The latter is different from the DEMOGEN dataset (Jiang et al.,, 2018) where the models are not trained to the same cross-entropy. Putting the stopping criterion on the training loss rather than the number of epochs is crucial since otherwise one can simply use cross-entropy loss value to predict generalization. Please see Appendix Section A.2 for a discussion on the choice of stopping criterion.

To construct a pool of trained models with vastly different generalization behaviors while being able to fit the training set, we covered a wide range of hyperparameters for training. Our base model is inspired by the Network-in-Network (Gao et al.,, 2011). The hyperparameter categories we test on are: weight decay coefficient (weight decay), width of the layer (width), mini-batch size (batch size), learning rate (learning rate), dropout probability (dropout), depth of the architecture (depth) and the choice of the optimization algorithms (optimizer). We select 3 choices for each hyperparameter (i.e. $|\Theta_{i}|=3$ ). Please refer to Appendix A.3 for the details on the models, and Appendix A.1 for the reasoning behind the design choices.

Figure 2 shows some summarizing statistics of the models in this study. On the left we show the number of models that achieve above 99% training accuracy for every individual hyperparameter choice. Since we have $3^{7}=2187$ models in total, the maximum number of model for each hyperparameter type is $3^{7-1}=718$ ; the majority of the models in our pool were able to reach this threshold. In the middle we show the distribution of the cross-entropy value over the entire training set. While we want the models to be at exactly 0.01 cross-entropy, in practice it is computationally prohibitive to constantly evaluate the loss over the entire training set; further, to enable reasonable temporal granularity, we estimate the training loss with 100 randomly sampled minibatch. These computational compromises result in long-tailed distribution of training loss centered at 0.01. As shown in Table 1, even such minuscule range of cross-entropy difference could lead to positive correlation with generalization, highlighting the importance of training loss as a stopping criterion. On the right, we show the distribution of the generalization gap. We see that while all the models’ training accuracy is above 0.99, there is a wide range of generalization gap, which is ideal for evaluating complexity measures.

Performance of Complexity Measures

The first baseline we consider is performance of a measure against an oracle who observes the noisy generalization gap. Concretely, we rank the models based on the true generalization gap with some additive noise. The resulting ranking correlation indicates how close the performances of all models are. As the scale of the noise approaches 0, the oracle’s prediction tends towards perfect (i.e. 1). This baseline accounts for the potential noise in the training procedure and gives an anchor for gauging the difficulty of each hyperparameter type. Formally, given an arbitrary set of hyper-parameters $\Theta^{\prime}$ , we define $\epsilon$ -oracle to be the expectation of $\tau$ or $\Psi$ where the measure is $\{g(\bm{\theta})+{\mathscr{N}}(0,\epsilon^{2})\,|\,\bm{\theta}\in\Theta^{\prime}\}$ . We report the performance of the noisy oracle in Table 1 for ${\epsilon}\in\{0.02,0.05\}$ . For additional choices of $\epsilon$ please refer to Appendix A.6.

Second, to understand how our hyperparameter choices affect the optimization, we give each hyperparameter type a canonical order which is believed to have correlation with generalization (e.g. larger learning rate generalizes better) and measure their $\tau$ . The exact canonical ordering can be found in Appendix A.4. Note that unlike other measures, each canonical ordering can only predict generalization for its own hyperparameter type, since its corresponding hyperparameter remains fixed in any other hyperparameter type; consequently, each column actually represents different measure for the canonical measure row. Assuming that each canonical measure is uninformative of any other canonical measures, the $\Psi$ criterion for each canonical measure is $\frac{1}{7}$ of its performance on the corresponding hyperparameter type.

We next look at one of the most well-known complexity measures in machine learning; the VC-Dimension. Bartlett et al., (2019) proves bounds on the VC dimension of piece-wise linear networks with potential weight sharing. In Appendix C.1, we extend their result to include pooling layers and multi-class classification. We report two complexity measures based on VC-dimension bounds and parameter counting. These measures could be predictive merely when the architecture changes, which happens only in depth and width hyperparameter types. We observe that, with both types, VC-dimension as well as the number of parameters are negatively correlated with generalization gap which confirms the widely known empirical observation that overparametrization improves generalization in deep learning.

Finally, we report the measures that only look at the output of the network. In particular, we look at the cross-entropy loss, margin $\gamma$ , and the entropy of the output. These three measures are closely related to each other. In fact, the outcomes in Table 1 reflects this similarity. These results confirm the general understanding that larger margin, lower cross-entropy and higher entropy would lead to better generalization. Please see Appendix C.1.1 for definitions and more discussions on these measures.

2 Surprising Failure of Some (Norm & Margin)-Based Measures

Spectral bound: The most surprising observation here is that the spectral complexity is strongly negatively correlated with generalization, and negatively correlated with changes within every hyperparameter type. Most notably, it has strong negative correlation with the depth of the network, which may suggest that the largest singular values are not sufficient to capture the capacity of the model. To better understand the reason behind this observation, we investigate using different components of the spectral complexity as the measure. An interesting observation is that the Frobenius distance to initialization is negatively correlated, but the Frobenius norm of the parameters is slightly positively correlated with generalization, which contradicts some theories suggesting solutions closer to initialization should generalize better. A tempting hypothesis is that weight decay favors solution closer to the origin, but we did an ablation study on only models with 0 weight decay and found that the distance from initialization still correlates negatively with generalization.

These observations correspond to choosing different reference matrices ${\mathbf{W}}_{i}^{0}$ for the bound: the distance corresponds to using the initialization as the reference matrices while the Frobenius norm of the parameters corresponds to using the origin as the reference. Since the Frobenius norm of the parameters shows better correlation, we use zero reference matrices in the spectral bound. This improved both $\tau$ and $\Psi$ , albeit still negative. In addition, we extensively investigate the effect of different terms of the Spectral bound to isolate the effect; however, the results do not improve. These experiments can be found in the Appendix C.2.

Path norm: While path-norm is a proper norm in the function space but not in parameter space, we observe that it is positively correlated with generalization in all hyper-parameter types and achieves comparable $\tau$ (0.373) and $\Psi$ (0.311).

Fisher-Rao metric: The Fisher-Rao metric is a lower bound (Liang et al.,, 2017) on the path norm that has been recently shown to capture generalization. We observed that it overall shows worse correlation than the path norm; in particular, it is negatively correlated ( $\tau=-0.553$ ) with the depth of the network, which contrasts with path norm that properly captures the effect of depth on generalization. A more interesting observation is that the Fisher-Rao metric achieves a positive $\Psi=0.154$ but its $\tau=0.078$ is essentially at chance. This may suggest that the metric can capture a single hyper-parameter change but is not able to capture the interactions between different hyperparameter types.

Effect of Randomness: dropout and batch size (first 2 columns of Table 2) directly introduce randomness into the training dynamic. For batch size, we observed that the Frobenius displacement and spectral complexity both correlate negatively with the changes in batch size while the Frobenius norm of the parameters correlates positively with generalization. On the other hand, when changes happen to the magnitude dropout probability, we observed that all of the proper norms are negatively correlated with the generalization changes. Since increasing dropout usually reduces the generalization gap, this implies that increasing the dropout probability may be at least partially responsible for the growth in these norms. This is unexpected since increasing norm in principle implies higher model capacity which is usually more prone to overfitting.

The overall picture does not change much going from the ranking correlation to mutual information, with a notable exception where spectral complexity has the highest conditional mutual information compared to all the other measures. This is due to the fact that the conditional mutual information is agnostic to the direction of correlation, and in the ranking correlation, spectral complexity has the highest absolute correlation. While this view might seem contradictory to classical view as the spectral complexity is a complexity measure which should be small to guarantee good generalization, it is nonetheless informative about the generalization of the model. Further, by inspecting the conditional mutual information for each hyperparameter, we find that the majority of spectral complexity’s predictive power is due to its ability to capture the depth of the network, as the mutual information is significantly lower if depth is already observed.

3 Success of Sharpness-Based Measures

A natural category of generalization measures is centered around the concept of “sharpness” of the local minima, capturing the sensitivity of the empirical risk (i.e. the loss over the entire training set) to perturbations in model parameters. Such notion of stability under perturbation is captured elegantly by the PAC-Bayesian framework (McAllester,, 1999) which has provided promising insights for studying generalization of deep neural networks (Dziugaite and Roy,, 2017; Neyshabur et al.,, 2017; Neyshabur et al., 2018a, ). In this sections, we investigate PAC-Bayesian generalization bounds and several of their variants which rely on different priors and different notions of sharpness (Table 3).

PAC-Bayesian framework captures sharpness in the expected sense since we add randomly generated perturbations to the parameters. Another possible notion of sharpness is the worst-case sharpness where we search for the direction that changes the loss the most. This is motivated by (Keskar et al.,, 2016) where they observe that this notion would correlate with generalization in the case of different batch sizes. We can use PAC-Bayesian framework to construct generalization bounds for this worst-case perturbations as well. We refer to this worst case bound as the sharpness bound in Eq (50). The main component in both PAC-Bayes and worst-case sharpness bounds is the ratio of norm of parameters to the magnitude of the perturbation, where the magnitude is chosen to be the largest number such that the training error of the perturbed model is at most $0.1$ . While mathematically, the sharpness bound should always yield higher complexity than the PAC-Bayes bound, we observed that the former has higher correlation both in terms of $\tau$ and $\Psi$ . In addition, we studied inverse of perturbation magnitude as a measure by removing the norm in the numerator to compare it with the bound. However, we did not observe a significant difference.

Perturbing the parameters without taking their magnitude into account can cause many of them to switch signs. Therefore, one cannot apply large perturbations to the model without changing the loss significantly. One possible modification to improve the perturbations is to choose the perturbation magnitude based on the magnitude of the parameter. In that case, it is guaranteed that if the magnitude of perturbation is less than the magnitude of the parameter, then the sign of the parameter does not change. Following Keskar et al., (2016), we pick the magnitude of the perturbation with respect to the magnitude of parameters. We formalize this notion of importance based magnitude. Specifically, we derive two alternative generalization bounds for expected sharpness in Eq ( 55) and worst case sharpness in Eq (58) that include the magnitude of the parameters into the prior. Formally, we design $\alpha^{\prime}$ and $\sigma^{\prime}$ , respectively for sharpness and PAC-Bayes bounds, to be the ratio of parameter magnitude to the perturbation magnitude. While this change did not improve upon the original PAC-Bayesian measures, we observed that simply looking at $1/\alpha^{\prime}$ has surprising predictive power in terms of the generalization which surpasses the performance of oracle 0.02. This measure is very close to what was originally suggested in Keskar et al., (2016).

The effectiveness of this measure is further corroborated by the conditional mutual information based metric, where we observed that $1/\alpha^{\prime}$ has the highest mutual information with generalization among all hyperparameters and also overall.

3.2 Finding σ𝜎\sigma

In case of models with extremely small loss, the perturbed loss should roughly increase monotonically with respect to the perturbation scale. Leveraging this observation, we design algorithms for computing the perturbation scale $\sigma$ such that the first term on the RHS is as close to a fixed value as possible for all models. In our experiments, we choose the deviation to be 0.1 which translates to 10% training error. These search algorithms are paramount to compare measures between different models. We provide the detailed algorithms in the Appendix D. To improve upon our algorithms, one could try a computational approach similar to Dziugaite and Roy, (2017) to obtain a numerically better bound which may result in stronger correlation. However, due to practical computational constraints, we could not do so for the large number of models we consider.

4 Potential of Optimization-based Measures

Optimization is an indispensable component of deep learning. Numerous optimizers have been proposed for more stable training and faster convergence. How the optimization scheme and speed of optimization influence generalization of a model has been a topic of contention among the deep learning community (Merity et al.,, 2017; Hardt et al.,, 2015). We study 3 representative optimizers Momentum SGD, Adam, and RMSProp with different initial learning rates in our experiments to thoroughly evaluate this phenomenon. We also consider other optimization related measures that are believed to correlate with generalization. These include (Table 4):

Number of iterations required to reach cross-entropy equals 0.1

Number of iterations required going from cross-entropy equals 0.1 to cross-entropy equals 0.01

Variance of the gradients after only seeing the entire dataset once (1 epoch)

Variance of the gradients when the cross-entropy is approximately 0.01

Number of Iterations: The number of iterations roughly characterizes the speed of optimization, which has been argued to correlate with generalization. For the models considered here, we observed that the initial phase (to reach cross-entropy value of 0.1) of the optimization is negatively correlated with the speed of optimization for both $\tau$ and $\Psi$ . This would suggest that the difficulty of optimization during the initial phase of the optimization benefits the final generalization. On the other hand, the speed of optimization going from cross-entropy 0.1 to cross-entropy 0.01 does not seem to be correlated with the generalization of the final solution. Importantly, the speed of optimization is not an explicit capacity measure so either positive or negative correlation could potentially be informative.

Variance of Gradients: Towards the end of the training, the variance of the gradients also captures a particular type of “flatness” of the local minima. This measure is surprisingly predictive of the generalization both in terms of $\tau$ and $\Psi$ , and more importantly, is positively correlated across every type of hyperparameter. To the best of our knowledge, this is the first time this phenomenon has been observed. The connection between variance of the gradient and generalization is perhaps natural since much of the recent advancement in deep learning such as residual networks (He et al.,, 2016) or batch normalization have enabled using larger learning rates to train neural networks. Stability with higher learning rates implies smaller noises in the minibatch gradient. With the mutual information metric, the overall observation is consistent with that of ranking correlation, but the final gradient noise also outperforms gradient noise at 1 epoch of training conditioned on the dropout probability. We hope that our work encourages future works in other possible measures based on optimization and during training.

Conclusion

We conducted large scale experiments to test the correlation of different measures with the generalization of deep models and propose a framework to better disentangle the cause of correlation from spurious correlation. We confirmed the effectiveness of the PAC-Bayesian bounds through our experiments and corroborate it as a promising direction for cracking the generalization puzzle. Further, we provide an extension to existing PAC-Bayesian bounds that consider the importance of each parameter. We also found that several measures related to optimization are surprisingly predictive of generalization and worthy of further investigation. On the other hand, several surprising failures about the norm-based measures were uncovered. In particular, we found that regularization that introduces randomness into the optimization can increase various norm of the models and spectral complexity related norm-based measures are unable to capture generalization – in fact, most of them are negatively correlated. Our experiments demonstrate that the study of generalization measure can be misleading when the number of models studied is small and the metric of quantifying the relationship is not carefully chosen. We hope this work will incentivize more rigorous treatment of generalization measures in future work.

To the best of our knowledge, this work is one of the most comprehensive study of generalization to date, but there are a few short-comings. Due to computational constraints, we were only able to study 7 most common hyperparameter types and relatively small architectures, which do not reflect the models used in production. Indeed, if more hyperparameters are considered, one could expect to better capture the causal relationship. We also only studied models trained on two image datasets (CIFAR-10 and SVHN), only classification models and only convolutional networks. We hope that future work would address these limitations.

Acknowledgement

We thank our colleagues at Google: Guy Gur-Ari for many insightful discussions that helped with the experiment design, Ethan Dyer, Pierre Foret, Sergey Ioffe for their feedback, and Scott Yak for help with implementation. We are grateful for insightful discussions with Brady Neal of University of Montreal about limitation of correlation analysis. We also thank Daniel Roy of University of Toronto for insightful comments.

References

Appendix A Experiments

During our experiments, we found that Batch Normalization (Ioffe and Szegedy,, 2015) is crucial to reliably reach a low cross-entropy value for all models; since normalization is a indispensable components of modern neural networks, we decide to use batch normalization in all of our models. We remove batch normalization before computing any measure by fusing the $\gamma$ , $\beta$ and moving statistics with the convolution operator that precedes the normalization. This is important as Dinh et al., (2017) showed that common generalization measures such as sharpness can be easily manipulated with re-parameterization. We also discovered that the models trained with data augmentation often cannot fit the data (i.e. reach cross-entropy 0.01) completely. Since a model with data augmentation tends to consistently generalize better than the models without data augmentation, measure that reflects the training error (i.e. value of cross-entropy) will easily predict the ranking between two models even though it has only learned that one model uses data augmentation (see the thought experiments from the previous section). While certain hyperparameter configuration can reach cross-entropy of 0.01 even with data augmentation, it greatly limits the space of models that we can study. Hence, we make the design choice to not include data augmentation in the models of this study. Note that from a theoretical perspective, data augmentation is also challenging to analyze since the training samples generated from the procedure are no longer identical and independently distributed. All values for all the measures we computed over these models can be found in Table 5 in Appendix A.6.

A.2 The choice of stopping criterion

The choice of stopping criterion is very essential and could completely change the evaluation and the resulting conclusions. In our experiments we noticed that if we pick the stopping criterion based on number of iterations or number of epochs, then since some models optimize faster than others, they end up fitting the training data more and in that case the cross-entropy itself can be very predictive of generalization. To make it harder to distinguish models based on their training performance, it makes more sense to choose the stopping criterion based on the training error or training loss. We noticed that as expected, models with the same cross-entropy usually have very similar training error so that suggests that this choice is not very important. However, during the optimization the training error behavior is noisier than cross-entropy and moreover, after the training error reaches zero, it cannot distinguish models while the cross-entropy is still meaningful after fitting the data. Therefore, we decided to use cross-entropy as the stopping criterion.

A.3 All Model Specification

As mentioned in the main text, the models we use resemble Network-in-Network (Gao et al.,, 2011) which is a class of more parameter efficient convolution neural networks that achieve reasonably competitive performance on modern image classification benchmarks. The model consists blocks of modules that have 1 $3\times 3$ convolution with stride 2 followed by 2 $1\times 1$ convolution with stride 1. We refer to this single module as a NiN-block and construct models of different size by stacking NiN-block. For simplicity, all NiN-block have the same number of output channels $c_{out}$ . Dropout is applied at the end of every NiN-block. At the end of the model, there is a $1\times 1$ convolution reducing the channel number to the class number (i.e. 10 for CIFAR-10) followed by a global average pooling to produce the output logits.

For width, we choose from $c_{out}$ from 3 options: $\{2\times 96,4\times 96,8\times 96\}$ .

For dropout, we choose from 3 options: $\{0.0,0.25,0.5\}$

For batch size, we choose from: $\{32,64,128\}$

Since each optimizer may require different learning rate and in some cases, different regularization, we fine-tuned the hyper-parameters for each optimizer while keeping 3 options for every hyper-parameter choicesWhile methods with adaptive methods generally require less tuning, in practice researchers have observed performance gains from tuning the initial learning rate and learning rate decay..

Momentum SGD: We choose momentum of 0.9 and choose the initial learning rate $\eta$ from $\{0.1,0.032,0.01\}$ and regularization coefficient $\lambda$ from $\{0.0,0.0001,0.0005\}$ . The learning rate decay schedule is $\times 0.1$ at iterations $$.

Adam: We choose initial learning rate $\eta$ from $\{0.001,3.2e-4,1e-4\}$ , $\epsilon=1e-3$ and regularization coefficient $\lambda$ from $\{0.0,0.0001,0.0005\}$ . The learning rate decay schedule is $\times 0.1$ at iterations $$.

RMSProp: We choose initial learning rate $\eta$ from $\{0.001,3.2e-4,1e-4\}$ and regularization coefficient $\lambda$ from $\{0.0,0.0001,0.0003\}$ . The learning rate decay schedule is $\times 0.1$ at iterations $$.

A.4 Canonical Measures

Based on empirical observations made by the community as a whole, the canonical ordering we give to each of the hyper-parameter categories are as follows:

Batchsize: smaller batchsize leads to smaller generalization gap

Depth: deeper network leads to smaller generalization gap

Width: wider network leads to smaller generalization gap

Dropout: The higher the dropout ( $\leq 0.5$ ) the smaller the generalization gap

Weight decay: The higher the weight decay (smaller than the maximum for each optimizer) the smaller the generalization gap

Learning rate: The higher the learning rate (smaller than the maximum for each optimizer) the smaller the generalization gap

Optimizer: Generalization gap of Momentum SGD $<$ Generalization gap of Adam $<$ Generalization gap of RMSProp

A.5 Definition of Random Variables

Since the measures are results of complicated interactions between the data, the model, and the training procedures, we cannot manipulate it to be any values that we want. Instead, we use the following definition of random variables: suppose ${\mathcal{S}}$ is a subset of all the components of $\bm{\theta}$ (e.g. ${\mathcal{S}}=\{\varnothing\}$ for $|{\mathcal{S}}|=0$ , $|{\mathcal{S}}|=\{\textrm{learning rate}\}$ for $|{\mathcal{S}}|=1$ or $|{\mathcal{S}}|=\{\textrm{learning rate, dropout}\}$ for $|{\mathcal{S}}|=2$ ). Specifically we denote ${\mathcal{S}}_{ab}$ as the collective condition $\{\theta_{1}^{(a)}=v_{1},\theta_{1}^{(b)}=v_{2},\,\dots,\,\theta_{|{\mathcal{S}}|}^{(a)}=v_{2|{\mathcal{S}}|-1},\theta_{|{\mathcal{S}}|}^{(b)}=v_{2|{\mathcal{S}}|}\}$ . We can then define and empirical measure four probability $\textrm{Pr}(\mu^{(a)}>\mu^{(b)},g^{(a)}>g^{(b)}\,|\,{\mathcal{S}}_{ab})$ , $\textrm{Pr}(\mu^{(a)}>\mu^{(b)},g^{(a)}<g^{(b)}\,|\,{\mathcal{S}}_{ab})$ , $\textrm{Pr}(\mu^{(a)}<\mu^{(b)},g^{(a)}>g^{(b)}\,|\,{\mathcal{S}}_{ab})$ and $\textrm{Pr}(\mu^{(a)}<\mu^{(b)},g^{(a)}<g^{(b)}\,|\,{\mathcal{S}}_{ab})$ .

Together forms a 2 by 2 table that defines the joint distribution of the Bernoulli random variables $\textrm{Pr}(g^{(a)}>g^{(b)}\,|\,{\mathcal{S}}_{ab})$ and $\textrm{Pr}(\mu^{(a)}>\mu^{(b)}\,|\,{\mathcal{S}}_{ab})$ . For notation convenience, we use $\textrm{Pr}(\mu,g\,|\,{\mathcal{S}}_{ab})$ , $\textrm{Pr}(g\,|\,{\mathcal{S}}_{ab})$ and $\textrm{Pr}(\mu\,|\,{\mathcal{S}}_{ab})$ to denote the joint and marginal. If there are $N=3$ choices for each hyperparameter in ${\mathcal{S}}$ then there will be $N^{|{\mathcal{S}}|}$ such tables for each hyperparameter combination. Since each configuration occurs with equal probability, for that arbitrary $\bm{\theta}^{(a)}$ and $\bm{\theta}^{(b)}$ drawn from $\Theta$ conditioned on that the components of ${\mathcal{S}}$ are observed for both models, the joint distribution can be defined as $\textrm{Pr}(\mu,g\,|\,{\mathcal{S}})=\frac{1}{N^{|{\mathcal{S}}|}}\sum_{{\mathcal{S}}_{ab}}\textrm{Pr}(\mu,g\,|\,{\mathcal{S}}_{ab})$ and likely the marginals can be defined as $\textrm{Pr}(\mu\,|\,{\mathcal{S}})=\frac{1}{N^{|{\mathcal{S}}|}}\sum_{{\mathcal{S}}_{ab}}\textrm{Pr}(\mu\,|\,{\mathcal{S}}_{ab})$ and $\textrm{Pr}(g\,|\,{\mathcal{S}})=\frac{1}{N^{|{\mathcal{S}}|}}\sum_{{\mathcal{S}}_{ab}}\textrm{Pr}(g\,|\,{\mathcal{S}}_{ab})$ . With these notations established, all the relevant quantities can be computed by iterating over all pairs of models.

A.6 All Results

Below we present all of the measures we computed and their respective $\tau$ and $\Psi$ on more than 10,000 models we trained and additional plots. Unless stated otherwise, convergence is considered when the loss reaches the value of $0.1$ .

Appendix B Extended Notation

Given any margin value $\gamma\geq 0$ , we define the margin loss $L_{\gamma}$ as follows:

We denote a tensor as ${\mathbf{A}}$ , vector as $\bm{a}$ , and scalar as $A$ or $a$ . For any $1\leq j\leq k$ , consider a $k$ -th order tensor ${\mathbf{A}}$ and a $j$ -th order tensor ${\mathbf{B}}$ where dimensions of ${\mathbf{B}}$ match the last $j$ dimensions of ${\mathbf{A}}$ . We then define the product operator $\otimes_{j}$ :

where $i_{1},\dots,i_{k-j}$ are indices. We also assume that the input images have dimension $n\times n$ and there are $\kappa$ classes. Given the number of input channels $c_{\text{in}}$ , number of output channels $c_{\text{out}}$ , 2D square kernel with side length $k$ , stride $s$ , and padding $p$ , we define the convolutional layer $\text{conv}_{{\mathbf{W}},s,p}$ as follows:

We also define the max-pooling operator $\text{pool}_{k,s,p}$ as follows:

Appendix C Complexity Measures

In this section, we look at different complexity measures. When a measure $\mu$ is based on a generalization bound, we chose it so that the following is true with probability $0.99$ (we choose the failure probability $\delta$ to be 0.01):

We also consider measures which do not provably bound the generalization error and evaluate those.

Note that in almost all cases, the canonical ordering given based on some “common" assumptions are positively correlated with the generalization in terms of both $\tau$ and $\Psi$ ; however, for optimizer, the correlation $\tau$ is close to 0. This implies that the choice of optimizer is only essentially uncorrelated with the generalization gap in the range of models we consider. This ordering helps validate many techniques used by the practioners.

We start by restating the theorem in (Bartlett et al.,, 2019) which provides an upper bound on the VC-dimension of any piece-wise linear network.

Let $\mathcal{F}$ be the class of feed-forward networks with a fixed computation graph of depth $d$ and ReLU activations. Let $a_{i}$ and $q_{i}$ be the number of activations and parameters in layer $i$ . Then VC-dimension of ${\mathcal{F}}$ can be bounded as follows:

Given a convolutional network $f$ , for any $\delta>0$ , with probability $1-\delta$ over the the training set:

We simplify the bound in Theorem 1 using a $d^{\prime}$ to refer to the depth instead of $d$ :

In order to extend the above bound to a convolutional network, we need to present a pooling layer with ReLU activations. First note that maximum of two inputs can be calculated using two layers with ReLU and linear activations as $\max(x_{1},x_{2})=x_{1}+ReLU(x_{2}-x_{1})$ . Now, since max-pooling at layer $i$ has kernel sizes $k^{\prime}_{i}$ , we need $\lceil 4\log_{2}(k^{\prime}_{i})\rceil$ layers to present that but given that the kernel size of the max-pooling layer is at most size of the image, we have

Therefore, we have $d^{\prime}\leq 9d\log_{2}(n)$ . The number of activations in any of these layers is at most $n^{2}c_{i}$ since there are at most $n^{2}$ pairs of neighbor pixels in an $n\times n$ image with $c_{i}$ channels. We ignore strides when calculating the upper bound since it only reduces number of activations at a few layers and does not change the bound significantly. Using these bounds on $d^{\prime}$ , $a_{i}$ and $q_{i}$ the equivalent network, we can bound the VC dimension as follows:

For binary classifiers, generalization error can be in terms of Rademacher complexity (Mohri et al.,, 2012) which in turn can be bounded by $72\sqrt{\text{VC}/m}$ (Kontorovich,, 2016). Therefore, we can get the followingThe generalization gap is bounded by two times Rademacher Complexity, hence the constant 144. generalization bound:

For multi-class classification, the generalization error can be similarly bounded by Graph dimension which is an extension of VC-dimension. A simple approach get a bound on Graph dimension is to consider all pairs of classes as binary classification problem which bounds the graph dimension by $\kappa^{2}\,\,VC({\mathcal{F}})$ . There, putting everything together, we get the following generalization bound:

Inspired by Theorem 2, we define the following $VC$ -based measure for generalization:

Since some of the dependencies in the above measure are probably proof artifacts, we also define another measure that is nothing but the number of parameters of the model:

While measures that can be calculated only based on the output of the network cannot reveal complexity of the network, they can still be very informative for predicting generalization. Therefore, we define a few measures that can be calculated solely based on the output of the network.

We start by looking at the cross-entropy over the output. Even though we used a cross-entropy based stopping criterion, the cross-entropy of the final models is not exactly the same as the stopping criterion and it could be informative. Hence we define the following measure:

Another useful and intuitive notion that appears in generalization bounds is margin. In all measures that involve margin $\gamma$ , we set the margin $\gamma$ to be the $10$ -th percentile of the margin values on the training set and therefore ensuring $\hat{L}_{\gamma}\leq 0.1$ . Even though margin alone is not a sensible generalization measure and can be artificially increased by scaling up the magnitude of the weights, it could still reveal information about training dynamics and therefore be informative. We report the following measure based on the margin:

Finally, entropy of the output is another interesting measure and it has been shown that regularizing it can improve generalization in deep learning (Pereyra et al.,, 2017). With a fixed cross-entropy, increasing the entropy corresponds to distribute the uncertainty of the predictions equally among the wrong labels which is connected to label smoothing and increasing the margin. We define the following measure which is the negative entropy of the output of the network:

where $p_{i}[j]$ is the predicted probability of the class $j$ for the input data ${\mathbf{X}}_{i}$ .

C.2 (Norm & Margin)-Based Measures

Unfortunately, none of the above founds are directly applicable to convolutional networks. Pitas et al., (2017) built on Neyshabur et al., 2018a and extended the bound on the spectral norm to convolutional networks. The bound is very similar to the one for fully connected networks by Bartlett et al., (2017). We next restate their generalization bound for convolutional networks including the constants.

Inspired by the above theorem, we define the following spectral measure:

The generalization bound in Theorem 3 depends on reference tensors ${\mathbf{W}}^{0}_{i}$ . We chose the initial tensor as the reference in the above measure but another reasonable choice is the origin which gives the following measures:

Since some of the terms in the generalization bounds might be proof artifacts, we also measure the main terms in the generalization bound:

We further look at the main two terms in the bound separately to be able to differentiate their contributions.

Finally, since product of spectral norms almost certainly increases with depth, we look at the following measure which is equal to the sum over squared spectral norms after rebalancing the layers to have the same spectral norms:

The generalization bound given in Neyshabur et al., 2015b is not directly applicable to convolutional networks. However, Since for each layer $i$ , we have $\left\|{\mathbf{W}}_{i}\right\|_{2}\leq k_{i}^{2}\left\|{\mathbf{W}}_{i}\right\|_{F}$ and therefore by Theorem 3, we can get an upper bound on the test error based on product of Frobenius norms. Therefore, we define the following measure based on the product of Frobenius norms:

We also look at the following measure with correspond to sum of squared Frobenius norms of the layers after rebalancing them to have the same norm:

Finally, given recent evidence on the importance of distance to initialization (Dziugaite and Roy,, 2017; Nagarajan and Kolter, 2019b, ; Neyshabur et al., 2018b, ), we calculate the following measures:

In case when the reference matrix ${\mathbf{W}}_{i}^{0}=0$ for all weights, Eq (40) the Frobenius norm of the parameters which also correspond to distance from the origin:

Path-norm was introduced in Neyshabur et al., 2015b as an scale invariant complexity measure for generalization and is shown to be a useful geometry for optimization Neyshabur et al., 2015a . To calculate path-norm, we square the parameters of the network, do a forward pass on an all-ones input and then take square root of sum of the network outputs. We define the following measures based on the path-norm:

where ${\mathbf{w}}^{2}={\mathbf{w}}\circ{\mathbf{w}}$ is the element-wise square operation on the parameters.

Fisher-Rao metric was introduced in Liang et al., (2017) as a complexity measure for neural networks. Liang et al., (2017) showed that Fisher-Rao norm is a lower bound on the path-norm and it correlates in some cases. We define a measure based on the Fisher-Rao matric of the network:

C.3 Flatness-based Measures

For any $\delta>0$ , distribution $D$ , prior $P$ , with probability $1-\delta$ over the training set, for any posterior $Q$ the following bound holds:

Based on the above bound, we define the following measures using the origin and initialization as reference tensors:

The above framework captures flatness in the expected sense since we add Gaussian perturbations to the parameters. Another notion of flatness is the worst-case flatness where we search for the direction that changes the loss the most. This is motivated by (Keskar et al.,, 2016) where they observe that this notion would correlate to generalization in the case of different batch sizes. We can use PAC-Bayesian framework to give generalization bounds for worst-case perturbations as well. The magnitude of a Gaussian variable with with variance $\sigma^{2}$ is at most $\sigma\sqrt{2\log(2/\delta)}$ with probability $1-\delta/2$ . Applying a union bound on all parameters, we get that with probability $1-\delta/2$ the magnitude of the Gaussian noise is at most $\alpha=\sigma\sqrt{2\log(2\omega/\delta)}$ where $\omega$ is the number of parameters of the model. Therefore, we can get the following generalization bound:

Inspired by the above bound, we define the following measures:

where $\alpha$ is chosen to be the largest number such that $\max_{|u_{i}|\leq\alpha}\hat{L}(f_{{\mathbf{w}}+{\mathbf{u}}})\leq 0.1$ .

To understand the importance of the flatness parameters $\sigma$ and $\alpha$ , we also define the following measures:

where $\alpha$ and $\sigma$ are computed as explained above.

Therefore, the generalization bound can be written as follows

We also follow similar arguments are before to get a similar bound on the worst-case sharpness:

We look at the following measures based on the above bound:

Finally, we look at measures that are only based the sharpness values computed above:

where $\alpha$ and $\sigma$ are computed as explained above.

C.4 Optimization-based Measures

There are mixed results about how the optimization speed is relevant to generalization. On one hand we know that adding Batch Normalization or using shortcuts in residual architectures help both optimization and generalization and Hardt et al., (2015) suggests that faster optimization results in better generalization. On the other hand, there are empirical results showing that adaptive optimization methods that are faster, usually generalize worse (Wilson et al., 2017b, ). Here, we put these hypothesis into test by looking at the number of steps to achieve cross-entropy 0.1 and the number of steps needed to go from cross-entropy 0.1 to 0.01:

The above measures tell us if the speed of optimization at early or late stages can be informative about generalization. We also define measures that look at the SGD gradient noise after the first epoch and at the end of training at cross-entropy 0.01 to test the gradient noise can be predictive of generalization:

where ${\mathbf{w}}^{1}$ is the weight vector after the first epoch.

Appendix D Algorithms

We first lay out some common notations used in the pseudocode:

$f$ : the architecture that takes parameter $\theta$ and input $x$ and map to $f(x;\theta)$ which is the predicted label of $x$

$M$ : Some kind of iteration; $M_{1}$ : binary search depth; $M_{2}$ : Monte Carlo Estimation steps; $M_{3}$ : Iteration for estimating the loss

$\mathscr{D}=\{(x_{i},y_{i})\}_{i=0}^{n}$ the dataset the model is trained on; $\mathcal{B}$ as a uniformly sampled minibatch from the dataset.

Both search algorithm relies on the assumption that the loss increases monotonically with the perturbation magnitude $\sigma$ around the final weight. This assumption is quite mild and in reality holds across almost all the models in this study.

Note that for finding the sharpness $\sigma$ , we use the cross-entropy as the differentiable surrogate object instead of the 1-0 loss which is in general not differentiable. Using gradient ascent brings another additional challenge that is for a converged model, the local gradient signal is usually weak, making gradient ascent extremely inefficient. To speed up thie process, we add a uniform noise with range being $[-\sigma_{new}/N_{{\mathbf{w}}},\sigma_{new}/N_{{\mathbf{w}}}]$ to lift the weight off the flat minima where $N_{{\mathbf{w}}}$ is the number of parameters. This empirical greatly accelerates the search.

Further, for magnitude aware version of the bounds, the overall algorithm stays the same with the exception that now covariance matrices at line 7 of Algorithm 2 become as diagonal matrix containing $w_{i}^{2}$ on the diagonal; similarly, for line 12 of Algorithm 3, the weight clipping of each $w_{i}$ is conditioned on $\sigma_{new}|w_{i}|$ , i.e. clipped to $[-\sigma_{new}|w_{i}|,\sigma_{new}|w_{i}|]$ . Here $w_{i}$ denotes the $i^{th}$ parameter of flattened ${\mathbf{w}}$ .