Exploring Generalization in Deep Learning

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro

Introduction

Learning with deep neural networks has enjoyed huge empirical success in recent years across a wide variety of tasks. Despite being a complex, non-convex optimization problem, simple methods such as stochastic gradient descent (SGD) are able to recover good solutions that minimize the training error. More surprisingly, the networks learned this way exhibit good generalization behavior, even when the number of parameters is significantly larger than the amount of training data .

In such an over parametrized setting, the objective has multiple global minima, all minimize the training error, but many of them do not generalize well. Hence, just minimizing the training error is not sufficient for learning: picking the wrong global minima can lead to bad generalization behavior. In such situations, generalization behavior depends implicitly on the algorithm used to minimize the training error. Different algorithmic choices for optimization such as the initialization, update rules, learning rate, and stopping condition, will lead to different global minima with different generalization behavior . For example, Neyshabur et al. introduced Path-SGD, an optimization algorithm that is invariant to rescaling of weights, and showed better generalization behavior over SGD for both feedforward and recurrent neural networks . Keskar et al. noticed that the solutions found by stochastic gradient descent with large batch sizes generalizes worse than the one found with smaller batch sizes, and Hardt et al. discuss how stochastic gradient descent ensures uniform stability, thereby helping generalization for convex objectives.

What is the bias introduced by these algorithmic choices for neural networks? What ensures generalization in neural networks? What is the relevant notion of complexity or capacity control?

As mentioned above, simply accounting for complexity in terms of the number of parameters, or any measure which is uniform across all functions representable by a given architecture, is not sufficient to explain the generalization ability of neural networks trained in practice. For linear models, norms and margin-based measures, and not the number of parameters, are commonly used for capacity control . Also norms such as the trace norm and max norm are considered as sensible inductive biases in matrix factorization and are often more appropriate than parameter-counting measures such as the rank . In a similar spirit, Bartlett and later Neyshabur et al. suggested different norms of network parameters to measure the capacity of neural networks. In a different line of work, Keskar et al. suggested “sharpness” (robustness of the training error to perturbations in the parameters) as a complexity measure for neural networks. Others, including Langford and Caruana and more recently Dziugaite and Roy , propose a PAC-Bayes analysis.

What makes a complexity measure appropriate for explaining generalization in deep learning? First, an appropriate complexity measure must be sufficient in ensuring generalization. Second, networks learned in practice should be of low complexity under this measure. This can happen if our optimization algorithms bias us toward lower complexity models under this measure and it is possible to capture real data using networks of low complexity. In particular, the complexity measure should help explain several recently observed empirical phenomena that are not explained by a uniform notion of complexity:

It is possible to obtain zero training error on random labels using the same architecture for which training with real labels leads to good generalization . We would expect the networks learned using real labels (and which generalizes well) to have much lower complexity, under the suggested measure, than those learned using random labels (and which obviously do not generalize well).

Increasing the number of hidden units, thereby increasing the number of parameters, can lead to a decrease in generalization error even when the training error does not decrease . We would expect to see the complexity measure decrease as we increase the number of hidden units.

When training the same architecture, with the same training set, using two different optimization methods (or different algorithmic or parameter choices), one method results in better generalization even though both lead to zero training error . We would expect to see a correlation between the complexity measure and generalization ability among zero-training error models.

In this paper we examine different complexity measures that have recently been suggested, or could be considered, in explaining generalization in deep learning. In light of the above, we evaluate the measures based on their ability to theoretically guarantee generalization, and their empirical ability to explain the above phenomena. Studying how each measure can guarantee generalization also let us better understand how it should be computed and compared when trying to explain the empirical phenomena.

We investigate complexity measures including norms, robustness and sharpness of the network. We emphasize in our theoretical and empirical study the importance of relating the scale of the parameters and the scale of the output of the network, e.g. by relating norm and margin. In this light, we discuss how sharpness by itself is not sufficient for ensuring generalization, but can be combined, through PAC-Bayes analysis, with the norm of the weights to obtain an appropriate complexity measure. The role of sharpness in PAC-Bayesian analysis of neural networks was also recently noted by Dziugaite and Roy , who used numerical techniques to numerically optimize the overall PAC-Bayes bound—here we emphasize the distinct role of sharpness as a balance for norm.

In order to further understand the significance of sharpness in deep learning, and how its relationship to margin deviates for that found in linear models, we also establish, in Section 4, sufficient conditions on the network that provably ensures small sharpness.

Generalization and Capacity Control in Deep Learning

In this section, we discuss complexity measures that have been suggested, or could be used for capacity control in neural networks. We discuss advantages and weaknesses of each of these complexity measures and examine their abilities to explain the observed generalization phenomena in deep learning.

We consider the statistical capacity of a model class in terms of the number of examples required to ensure generalization, i.e. that the population (or test error) is close to the training error, even when minimizing the training error. This also roughly corresponds to the maximum number of examples on which one can obtain small training error even with random labels.

For some of the measures discussed, we allow $\mathcal{M}$ to depend also on the training set. If this is done carefully, we can still ensure generalization for the restricted class $\mathcal{H}_{\mathcal{M},\alpha}$ .

We will consider several possible complexity measures. For each candidate measure, we first investigate whether it is sufficient for generalization, and analyze the capacity of $\mathcal{H}_{\mathcal{M},\alpha}$ . Understanding the capacity corresponding to different complexity measures also allows us to relate between different measures and provides guidance as to what and how we should measure: From the above discussion, it is clear that any monotone transformation of a complexity measures leads to an equivalent notion of complexity. Furthermore, complexity is meaningful only in the context of a specific hypothesis class $\mathcal{H}$ , e.g. specific architecture or network size. The capacity, as we consider it (in units of sample complexity), provides a yardstick by which to measure complexity (we should be clear though, that we are vague regarding the scaling of the generalization error itself, and only consider the scaling in terms of complexity and model class, thus we obtain only a very crude yardstick sufficient for investigating trends and relative phenomena, not a quantitative yardstick).

For any model, if its parameters have finite precision, its capacity is linear in the total number of parameters. Even without making an assumption on the precision of parameters, the VC dimension of feedforward networks can be bounded in terms of the number of parameters $\text{dim}(\mathbf{w})$ . In particular, Bartlett and Harvey et al. , following Bartlett et al. , give the following tight (up to logarithmic factors) bound on the VC dimension and hence capacity of feedforward networks with ReLU activations:

In the over-parametrized settings, where the number of parameters is more than the number of samples, complexity measures that depend on the total number of parameters are too weak and cannot explain the generalization behavior. Neural networks used in practice often have significantly more parameters than samples, and indeed can perfectly fit even random labels, obviously without generalizing . Moreover, measuring complexity in terms of number of parameters cannot explain the reduction in generalization error as the number of hidden units increase (see also Figure Figure 4).

2 Norms and Margins

Capacity control in terms of norm, when using a zero/one loss (i.e. counting errors) requires us in addition to account for scaling of the output of the neural networks, as the loss is insensitive to this scaling but the norm only makes sense in the context of such scaling. For example, dividing all the weights by the same number will scale down the output of the network but does not change the $0/1$ loss, and hence it is possible to get a network with arbitrary small norm and the same $0/1$ loss. Using a scale sensitive losses, such as the cross entropy loss, does address this issue (if the outputs are scaled down toward zero, the loss becomes trivially bad), and one can obtain generalization guarantees in terms of norm and the cross entropy loss.

However, we should be careful when comparing the norms of different models learned by minimizing the cross entropy loss, in particular when the training error goes to zero. When the training error goes to zero, in order to push the cross entropy loss (or any other positive loss that diminish at infinity) to zero, the outputs of the network must go to infinity, and thus the norm of the weights (under any norm) should also go to infinity. This means that minimizing the cross entropy loss will drive the norm toward infinity. In practice, the search is terminated at some finite time, resulting in large, but finite norm. But the value of this norm is mostly an indication of how far the optimization is allowed to progress—using a stricter stopping criteria (or higher allowed number of iterations) would yield higher norm. In particular, comparing the norms of models found using different optimization approaches is meaningless, as they would all go toward infinity.

Instead, to meaningfully compare norms of the network, we should explicitly take into account the scaling of the outputs of the network. One way this can be done, when the training error is indeed zero, is to consider the “margin” of the predictions in addition to the norms of the parameters. We refer to the margin for a single data point $x$ as the difference between the score of the correct label and the maximum score of other labels, i.e.

In order to measure scale over an entire training set, one simple approach is to consider the “hard margin”, which is the minimum margin among all training points. However, this definition is very sensitive to extreme points as well as to the size of the training set. We consider instead a more robust notion that allows a small portion of data points to violate the margin. For a given training set and small value $\epsilon>0$ , we define the margin $\gamma_{\text{margin}}$ as the lowest value of $\gamma$ such that $\lceil\epsilon m\rceil$ data point have margin lower than $\gamma$ where $m$ is the size of the training set. We found empirically that the qualitative and relative nature of our empirical results is almost unaffected by reasonable choices of $\epsilon$ (e.g. between $0.001$ and $0.1$ ).

spectral norm with capacity proportional to $\frac{1}{\gamma_{\text{margin}}^{2}}\prod_{i=1}^{d}h_{i}\left\lVert{W_{i}}\right\rVert^{2}_{2}$ .

In Section 3 we present further empirical investigations of the appropriateness of these complexity measures to explaining other phenomena.

3 Lipschitz Continuity and Robustness

The measures/norms we discussed so far also control the Lipschitz constant of the network with respect to its input. Is the capacity control achieved through the bound on the Lipschitz constant? Is bounding the Lipschitz constant alone enough for generalization? To answer these questions, and in order to understand capacity control in terms of Lipschitz continuity more broadly, we review here the relevant guarantees.

Another related approach is through algorithmic robustness as suggested by Xu and Mannor . Given $\epsilon>0$ , the model $f_{\mathbf{w}}$ found by a learning algorithm is $K$ robust if $\mathcal{X}$ can be partitioned into $K$ disjoint sets, denoted as $\{C_{i}\}_{i=1}^{K}$ , such that for any pair $(\mathbf{x},y)$ in the training set $\mathbf{s}$ ,Xu and Mannor have defined the robustness as a property of learning algorithm given the model class and the training set. Here since we are focused on the learned model, we introduce it as a property of the model.

Xu and Mannor showed the capacity of a model class whose models are $K$ -robust scales as $K$ . For the model class of functions with bounded Lipschitz $C_{\left\lVert{.}\right\rVert}$ , $K$ is proportional to ${\frac{C_{\left\lVert{.}\right\rVert}}{\gamma_{\text{margin}}}}$ -covering number of the input domain $\mathcal{X}$ under norm $\left\lVert{.}\right\rVert$ . However, the covering number of the input domain can be exponential in the input dimension and the capacity can still grow as $\left(\frac{C_{\left\lVert{.}\right\rVert}}{\gamma_{\text{margin}}}\right)^{n}$ Similar to margin-based bounds, we drop the term that depends on the diameter of the input space..

4 Sharpness

The notion of sharpness as a generalization measure was recently suggested by Keskar et al. and corresponds to robustness to adversarial perturbations on the parameter space:

where the training error $\widehat{L}(f_{\mathbf{w}})$ is generally very small in the case of neural networks in practice, so we can simply drop it from the denominator without a significant change in the sharpness value.

As we will explain below, sharpness defined this way does not capture the generalization behavior. To see this, we first examine whether sharpness can predict the generalization behavior for networks trained on true vs random labels. In the left plot of Figure 2, we plot the sharpness for networks trained on true vs random labels. While sharpness correctly predicts the generalization behavior for bigger networks, for networks of smaller size, those trained on random labels have less sharpness than the ones trained on true labels. Furthermore sharpness defined above depends on the scale of $\mathbf{w}$ and can be artificially increased or decreased by changing the scale of the parameters. Therefore, sharpness alone is not sufficient to control the capacity of the network.

Instead, we advocate viewing a related notion of expected sharpness in the context of the PAC-Bayesian framework. Viewed this way, it becomes clear that sharpness controls only one of two relevant terms, and must be balanced with some other measure such as norm. Together, sharpness and norm do provide capacity control and can explain many of the observed phenomena. This connection between sharpness and the PAC-Bayes framework was also recently noted by Dziugaite and Roy .

The PAC-Bayesian framework provides guarantees on the expected error of a randomized predictor (hypothesis), drawn form a distribution denoted $\mathcal{Q}$ and sometimes referred to as a “posterior” (although it need not be the Bayesian posterior), that depends on the training data. Let $f_{\mathbf{w}}$ be any predictor (not necessarily a neural network) learned from training data. We consider a distribution $\mathcal{Q}$ over predictors with weights of the form $\mathbf{w}+\bm{\nu}$ , where $\mathbf{w}$ is a single predictor learned from the training set, and $\bm{\nu}$ is a random variable. Then, given a “prior” distribution $P$ over the hypothesis that is independent of the training data, with probability at least $1-\delta$ over the draw of the training data, the expected error of $f_{\mathbf{w}+\bm{\nu}}$ can be bounded as follows :

The above inequality clearly holds for $\mathcal{K}\geq 1$ and for $\mathcal{K}<1$ it can be derived from Equation (5) by upper bounding the loss in the second term by $1$ . We can rewrite the above bound as follows:

As we can see, the PAC-Bayes bound depends on two quantities - i) the expected sharpness and ii) the Kullback Leibler (KL) divergence to the “prior” $P$ . The bound is valid for any distribution measure $P$ , any perturbation distribution $\bm{\nu}$ and any method of choosing $\mathbf{w}$ dependent on the training set. A simple way to instantiate the bound is to set $P$ to be a zero mean, $\sigma^{2}$ variance Gaussian distribution. Choosing the perturbation $\bm{\nu}$ to also be a zero mean spherical Gaussian with variance $\sigma^{2}$ in every direction, yields the following guarantee (w.p. $1-\delta$ over the training set):

Another interesting approach is to set the variance of the perturbation to each parameter with respect to the magnitude of the parameter. For example if $\sigma_{i}=\alpha\left\lvert{w_{i}}\right\rvert+\beta$ , then the KL term in the above expression changes to $\sum_{i}\frac{w_{i}^{2}}{2\sigma_{i}^{2}}$ .

The above generalization guarantees give a clear way to think about capacity control jointly in terms of both the expected sharpness and the norm, and as we discussed earlier indicates that sharpness by itself cannot control the capacity without considering the scaling. In the above generalization bound, norms and sharpness interact in a direct way depending on $\sigma$ , as increasing the norm by decreasing $\sigma$ causes decrease in sharpness and vice versa. It is therefore important to find the right balance between the norm and sharpness by choosing $\sigma$ appropriately in order to get a reasonable bound on the capacity.

In our experiments we observe that looking at both these measures jointly indeed makes a better predictor for the generalization error. As discussed earlier, Dziugaite and Roy numerically optimize the overall PAC-Bayes generalization bound over a family of multivariate Gaussian distributions (different choices of perturbations and priors). Since the precise way the sharpness and KL-divergence are combined is not tight, certainly not in (8), nor in the more refined bound used by Dziugaite and Roy , we prefer shying away from numerically optimizing the balance between sharpness and the KL-divergence. Instead, we propose using bi-criteria plots, where sharpness and KL-divergence are plotted against each other, as we vary the perturbation variance. For example, in the center and right panels of Figure 2 we show such plots for networks trained on true and random labels respectively. We see that although sharpness by itself is not sufficient for explaining generalization in this setting (as we saw in the left panel), the bi-criteria plots are significantly lower for the true labels. Even more so, the change in the bi-criteria plot as we increase the number of samples is significantly larger with random labels, correctly capturing the required increase in capacity. For example, to get a fixed value of expected sharpness such as $\epsilon=0.05$ , networks trained with random labels require higher norm compared to those trained with true labels. This behavior is in agreement with our earlier discussion, that sharpness is sensitive to scaling of the parameters and is not a capacity control measure as it can be artificially changed by scaling the network. However, combined with the norm, sharpness does seem to provide a capacity measure.

Empirical Investigation

In this section we investigate the ability of the discussed measures to explain the the generalization phenomenon discussed in the Introduction. We already saw in Figures 1 and 2 that these measures capture the difference in generalization behavior of models trained on true or random labels, including the increase in capacity as the sample size increases, and the difference in this increase between true and random labels.

Given different global minima of the training loss on the same training set and with the same model class, can these measures indicate which model is going to generalize better? In order to verify this property, we can calculate each measure on several different global minima and see if lower values of the measure imply lower generalization error. In order to find different global minima for the training loss, we design an experiment where we force the optimization methods to converge to different global minima with varying generalization abilities by forming a confusion set that includes samples with random labels. The optimization is done on the loss that includes examples from both the confusion set and the training set. Since deep learning models have very high capacity, the optimization over the union of confusion set and training set generally leads to a point with zero error over both confusion and training sets which thus is a global minima for the training set.

We randomly select a subset of CIFAR10 dataset with 10000 data points as the training set and our goal is to find networks that have zero error on this set but different generalization abilities on the test set. In order to do that, we train networks on the union of the training set with fixed size 10000 and confusion sets with varying sizes that consists of CIFAR10 samples with random labels; and we evaluate the learned model on an independent test set. The trained network achieves zero training error but as shown in Figure 3, the test error of the model increases with increasing size of the confusion set. The middle panel of this Figure suggests that the norm of the learned networks can indeed be predictive of their generalization behavior. However, we again observe that sharpness has a poor behavior in these experiments. The right panel of this figure also suggests that PAC-Bayes measure of joint sharpness and KL divergence, has better behavior - for a fixed expected sharpness, networks that have higher generalization error, have higher norms.

Increasing Network Size

Bounding Sharpness

So far we have discussed margin based and sharpness based complexity measures to understand capacity. We have also discussed how sharpness based complexity measures in combination with norms characterize the generalization behavior under the PAC-Bayes framework. In this section we study the question of what affects the sharpness of neural networks? For the case of linear predictors, sharpness only depends on the norm of the predictor. In contrast, for multilayered networks, interaction between the layers plays a major role and consequently two different networks with the same norm can have drastically different sharpness values. For example, consider a network where some subset of the layers despite having non-zero norm interact weakly with their neighbors, or are almost orthogonal to each other. Such a network will have very high sharpness value compared to a network where the neighboring layers interact strongly.

In this section we establish sufficient conditions to bound the expected sharpness of a feedforward network with ReLU activations. Such conditions serve as a useful guideline in studying what helps an optimization method to converge to less sharp optima. Unlike existing generalization bounds , our sharpness based bound does not suffer from exponential dependence on depth.

Now we discuss the conditions that affect the sharpness of a network. As discussed earlier, weak interactions between layers can cause the network to have high sharpness value. Condition $C1$ below prevents such weak interactions (cancellations). A network can also have high sharpness if the changes in the number of activations is exponential in the perturbations to its weights, even for small perturbations. Condition $C2$ avoids such extreme situations on activations. Finally, if a non-active node with large weights becomes active because of the perturbations in lower layers, that can lead to huge changes to the output of the network. Condition $C3$ prevents having such spiky (in magnitude) hidden units. This leads us to the following three conditions, that help in avoiding such pathological cases.

Given $x$ , let $x=W_{0}$ and $D_{0}=I$ . Then, for all $0\leq a<c<b\leq d,\|\left(\Pi_{i=a}^{b}D_{i}W_{i}\right)\|_{F}\geq\frac{\mu}{\sqrt{h_{c}}}\|\Pi_{i=c+1}^{b}D_{i}W_{i}\|_{F}\|\left(\Pi_{i=a}^{c}D_{i}W_{i}\right)\|_{F}$ .

Given $x$ , for any level $k$ , $\frac{1}{h_{k}}\sum_{i\in[h_{k}]}1_{W_{k,i}\Pi_{j=1}^{k-1}D_{j}W_{j}x\leq\delta}\leq C_{2}\delta$ .

For all $i$ , $\|W_{i}\|_{2,\infty}^{2}h_{i}\leq C_{3}^{2}\|D_{i}W_{i}\|_{F}^{2}$ .

Here, $W_{k,i}$ denotes the weights of the $i^{th}$ output node in layer $k$ . $\|W_{i}\|_{2,\infty}$ denotes the maximum $L2$ norm of a hidden unit in layer $i$ . Now we state our result on the generalization error of a ReLU network, in terms of average sharpness and its norm. Let $\|x\|=1$ and $h=\max_{i=1}^{d}h_{i}$ .

Let $\bm{\nu}_{i}$ be a random $h_{i}\times h_{i-1}$ matrix with each entry distributed according to $\mathcal{N}(0,\sigma_{i}^{2})$ . Then, under the conditions $C1,C2,C3$ , with probability $\geq 1-\delta$ ,

where $\gamma_{i}=\frac{\sigma_{i}\sqrt{h_{i}}\sqrt{h_{i-1}}}{\mu^{2}\|W_{i}\|_{F}}$ and $C_{\delta}=2\sqrt{\ln(dh/\delta)}$ .

To understand the above generalization error bound, consider choosing $\gamma_{i}=\frac{\sigma}{C_{\delta}d}$ , and we get a bound that simplifies as follows:

If we choose large $\sigma$ , then the network will have higher expected sharpness but smaller ’norm’ and vice versa. Now one can optimize over the choice of $\sigma$ to balance between the terms on the right hand side and get a better capacity bound. For any reasonable choice of $\sigma$ , the generalization error above, depends only linearly on depth and does not have any exponential dependence, unlike other notions of generalization. Also the error gets worse with decreasing $\mu$ and increasing $C_{2},C_{3}$ as the sharpness of the network increases which is in accordance with our discussion of the conditions above.

Additionally the conditions $C1-C3$ actually hold for networks trained in practice as we verify in Figure 5, and our experiments suggest that, $\mu\geq 1/4,C2\leq 5$ and $C3\leq 3$ . More details on the verification and comparing the conditions on learned network with those of random weights, are presented in the appendix.

where $Err_{d}=\|(W+\bm{\nu})_{d}\left(\Pi_{i=1}^{d-1}\widehat{D}_{i}(W+\bm{\nu})_{i}\right)*x-(W+\bm{\nu})_{d}\left(\Pi_{i=1}^{d-1}D_{i}(W+\bm{\nu})_{i}\right)*x\|_{F}$ . $(i)$ $\widehat{D}_{i}$ is the diagonal matrix with 0’s and 1’s corresponding to the activation pattern of the perturbed network $f_{\mathbf{w}+\bm{\nu}}(x)$ .

The first term in the equation (9) corresponds to error due to perturbation of a network with unchanged activations (linear network). Intuitively this is small when any subset of successive layers of the network do no interact weakly with each other (not orthogonal to each other). Condition $C1$ captures this intuition and we bound this error in Lemma 10.

Let $\bm{\nu}_{i}$ be a random $h_{i}\times h_{i-1}$ matrix with each entry distributed according to $\mathcal{N}(0,\sigma_{i}^{2})$ . Then, under the condition $C1$ ,

The second term in the equation (9) captures the perturbation error due to change in activations. If a tiny perturbation can cause exponentially many changes in number of active nodes, then that network will have huge sharpness. Condition $C2$ and $C3$ essentially characterize the behavior of sensitivity of activation patterns to perturbations, leading to a bound on this term in Lemma 2.

Let $\bm{\nu}_{i}$ be a random $h_{i}\times h_{i-1}$ matrix with each entry distributed according to $\mathcal{N}(0,\sigma_{i}^{2})$ . Then, under the conditions $C1$ , $C2$ and $C3$ , with probability $\geq 1-\delta$ , for all $1\leq k\leq d$ ,

where $\gamma_{i}=\frac{\sigma_{i}\sqrt{h_{i}}\sqrt{h_{i-1}}}{\mu^{2}\|D_{i}W_{i}\|_{F}}$ and $C_{\delta}=2\sqrt{\ln(dh/\delta)}$ .

Here $\gamma_{i}=\frac{\sigma_{i}\sqrt{h_{i}}\sqrt{h_{i-1}}}{\mu^{2}\|D_{i}W_{i}\|_{F}}$ . Substituting the above bound on expected sharpness in the PAC-Bayes result (equation (7)), gives the result.

Conclusion

Learning with deep neural networks displays good generalization behavior in practice, a phenomenon that remains largely unexplained. In this paper we discussed different candidate complexity measures that might explain generalization in neural networks. We outline a concrete methodology for investigating such measures, and report on experiments studying how well the measures explain different phenomena. While there is no clear choice yet, some combination of expected sharpness and norms do seem to capture much of the generalization behavior of neural networks. A major issue still left unresolved is how the choice of optimization algorithm biases such complexity to be low, and what is the precise relationship between optimization and implicit regularization.

References

Appendix A Experiments Settings

In experiment with different network sizes, we train a two layer perceptron with ReLU activation and varying number of hidden units without Batch Normalization or dropout. In the rest of the experiments, we train a modified version of the VGG architecture with the configuration $2\times$ , $2\times$ , $2\times$ , $2\times$ where we add Batch Normalization before ReLU activations and apply $2\times 2$ max-pooling with window size 2 and dropout after each stack. Convolutional layers are followed by $4\times 4$ average pooling, a fully connected layer with 512 hidden units and finally a linear layer is added for prediction.

In all experiments we train the networks using stochastic gradient descent (SGD) with mini-batch size 64, fixed learning rate 0.01 and momentum 0.9 without weight decay. In all experiments where achieving zero training error is possible, we continue training until the cross-entropy loss is less than $10^{-4}$ .

When calculating norms on a network with a Batch Normalization layer, we reparametrize the network to one that represents the exact same function without Batch Normalization as suggested in . In all our figures we plot norm divided by margin to avoid scaling issues (see Section 2), where we set the margin over training set $S$ to be $5^{th}$ -percentile of the margins of the data points in $S$ , i.e. $\text{Prc}_{5}\left\{f_{\mathbf{w}}(x_{i})[y_{i}]-\max_{y\neq y_{i}}f_{\mathbf{w}}(x)[y]|(x_{i},y_{i})\in S\right\}.$ We have also investigated other versions of the margin and observed similar behavior to this notion.

We calculate the sharpness, as suggested in - for each parameter $w_{i}$ we bound the magnitude of perturbation by $\alpha(\left\lvert{w_{i}}\right\rvert+1)$ for $\alpha=5.10^{-4}$ . In order to compute the maximum perturbation (maximize the loss), we perform 2000 updates of stochastic gradient ascent starting from the minimum, with mini-batch size 64, fixed step size 0.01 and momentum 0.9.

To compute the expected sharpness, we perturb each parameter $w_{i}$ of the model with noise generated from Gaussian distribution with zero mean and standard deviation, $\alpha(10\left\lvert{w_{i}}\right\rvert+1)$ . The expected sharpness is average over 1000 random perturbations each of which are averaged over a mini-batch of size 64. We compute the expected sharpness for different choices of $\alpha$ . For each value of $\alpha$ the KL divergence can be calculated as $\frac{1}{\alpha^{2}}\sum_{i}\left(\frac{w_{i}}{(10\left\lvert{w_{i}}\right\rvert+1)}\right)^{2}$ .

Appendix B Proofs

Define $g_{\mathbf{w},\bm{\nu},s}(x)$ as the network $f_{\mathbf{w}}$ with weight $W_{i}$ in every layer $i\in s$ replaced by $\bm{\nu}_{i}$ . Hence,

Base case: First we show the bound for terms with one noisy layer. Let $g_{\mathbf{w},\bm{\nu},\{k\}}(x)$ denote $f_{\mathbf{w}}(x)$ with weights in layer $k$ , $W_{k}$ replaced by $\bm{\nu}_{k}$ . Now notice that,

$(i)$ follows from Lemma 3. $(ii)$ follows from condition $C1$ .

Induction step: Let for any set $s\subset[d],|s|=k$ , the following holds:

We will prove this now for terms with $k+1$ noisy layers.

Substituting the above expression in equation (10) gives,

B.2 Proof of Lemma 2

We prove this lemma by induction on $k$ . Recall that $\widehat{D}_{i}$ is the diagonal matrix with 0’s and 1’s corresponding to the activation pattern of the perturbed network $f_{\mathbf{w}+\bm{\nu}}(x)$ . Let $C_{\delta}=2\sqrt{\ln(dh/\delta)}$ and $1_{E}$ denote the indicator function, that is $1$ if the event $E$ is true, else. We also use $f_{\mathbf{w}}^{k}(x)$ to denote the network truncated to level $k$ , in particular $f_{\mathbf{w}}^{k}(x)=\Pi_{i=1}^{k}D_{k}W_{k}x$ .

Since $\bm{\nu}_{1}$ is a random Gaussian matrix, and $\|x\|\leq 1$ , for any $i$ , $\left\lvert{\left\langle(\bm{\nu})_{1,i},x\right\rangle}\right\rvert\leq 2\sigma_{1}\sqrt{\ln(dh/\delta)}=\sigma_{1}C_{\delta}$ with probability greater than $1-\frac{\delta}{d}$ . Hence, with probability $\geq 1-\frac{\delta}{d}$ ,

This completes the base case for $k=1$ . $\widehat{D}_{1}$ is a random variable that depends on $\bm{\nu}_{1}$ . Hence, in the remainder of the proof, to avoid this dependence, we separately bound $\widehat{D}_{1}-D$ using the expression above and compute expectation only with respect to $\bm{\nu}_{1}$ . With probability $\geq 1-\frac{\delta}{d}$ ,

$k=1$ case does not capture all the intricacies and dependencies of higher layer networks. Hence we also evaluate the bounds for $k=2$ .

Now, with probability $\geq 1-\frac{2\delta}{d}$ we get:

where, $\beta_{i}=2\sqrt{\frac{\hat{\sigma_{1}}}{\sqrt{h_{i}+h_{i-1}}}}$ . $(i)$ follows from condition $C1$ , which results in $\Pi_{i=2}^{d}\frac{\mu\|D_{i}W_{i}\|_{F}}{\sqrt{h_{i}}}\frac{\mu\|D_{1}W_{1}x\|_{F}}{\sqrt{h_{1}}}\leq\|f_{\mathbf{w}}(x)\|_{F}$ . Hence, if we consider the rebalanced networkThe parameters of ReLu networks can be scaled between layers without changing the function where all layers have same values for $\frac{\mu\|D_{i}W_{i}\|_{F}}{\sqrt{h_{i}}}$ , we get, $\frac{\mu\|D_{i}W_{i}\|_{F}}{\sqrt{h_{i}}}\leq\|f_{\mathbf{w}}(x)\|_{F}^{\nicefrac{{1}}{{d}}}$ . Also the above equations follow from setting, $\sigma_{i}=\frac{\hat{\sigma}_{i}}{C_{2}C_{\delta}\sqrt{h_{i}+h_{i-1}}}$ .

Hence, with probability $\geq 1-\frac{2\delta}{d}$ ,

Since, we choose $\sigma_{i}$ to scale as some small number $O(\sigma)$ , in the above expression the first term scales as $O(\sigma)$ and the last two terms decay at least as $O(\sigma^{\nicefrac{{3}}{{2}}})$ . Hence we do not include them in the computation of $Err$ .

We will bound now the first term in the above expression. With probability $\geq 1-\frac{2\delta}{d}$ ,

Hence, with probability $\geq 1-\frac{k\delta}{d}$ ,

Now we will show that the last two terms in the above expression scale as $O(\sigma^{2})$ . For that, first notice that $\|f^{k}_{\mathbf{w}+\bm{\nu}}-f^{k}_{\mathbf{w}}\|_{F}\leq\left(\Pi_{i=1}^{k}\left(1+\frac{\sigma_{i}\sqrt{h_{i}h_{i-1}}}{\mu^{2}\|D_{i}W_{i}\|_{F}}\right)-1\right)\|f_{\mathbf{w}}(x)\|_{F}+Err_{k}$ , from lemma 1. Note that the second term in the above expression clearly scale as $O(\sigma^{2})$ .

Substituting the bounds for $\widehat{D}_{k+1}-D_{k+1}$ and $Err_{k}$ gives us, with probability $\geq 1-\frac{k\delta}{d}$ .

Now we bound the above terms following the same approach as in proof of Lemma 1, by considering all possible replacements of $W_{i}$ with $\bm{\nu}_{i}$ . That gives us the result.

Appendix C Supporting results

Let $A$ , $B$ be $n_{1}\times n_{2}$ and $n_{3}\times n_{4}$ matrices and $\bm{\nu}$ be a $n_{2}\times n_{3}$ entrywise random Gaussian matrix with $\bm{\nu}_{ij}\sim\mathcal{N}(0,\sigma)$ . Then,

Appendix D Conditions in Theorem 1

In this section, we compare the conditions in Theorem 1 of a learned network with that of its random initialization. We trained a 10-layer feedforward network with 1000 hidden units in each layer on MNIST dataset. Figures 6, 7 and 8 compare condition $C1$ , $C2$ and $C3$ on learned weights to that of random initialization respectively. Interestingly, we observe that the network with learned weights is very similar to its random initialization in terms of these conditions.