A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

Umut Simsekli, Levent Sagun, Mert Gurbuzbalaban

Introduction

Context and motivation: Deep neural networks have revolutionized machine learning and have ubiquitous use in many application domains . In full generality, many key tasks in deep learning reduces to solving the following optimization problem:

Here, $\Omega_{k}\subset\{1,\dots,n\}$ is a random subset that is drawn with or without replacement at iteration $k$ , and $b=|\Omega_{k}|$ denotes the number of elements in $\Omega_{k}$ .

SGD is widely used in deep learning with a great success in its computational efficiency . Beyond efficiency, understanding how SGD performs better than its full batch counterpart in terms of test accuracy remains a major challenge. Even though SGD seems to find zero loss solutions on the training landscape (at least in certain regimes ), it appears that the algorithm finds solutions with different properties depending on how it is tuned . Despite the fact that the impact of SGD on generalization has been studied , a satisfactory theory that can explain its success in a way that encompasses such peculiar empirical properties is still lacking.

where ${\cal N}$ denotes the multivariate (Gaussian) normal distribution and $\mathbf{I}$ denotes the identity matrix of appropriate size. The rationale behind this assumption is that, if the size of the minibatch $b$ is large enough, then we can invoke the Central Limit Theorem (CLT) and assume that the distribution of $U_{k}$ is approximately Gaussian. Then, under this assumption, (2) can be written as follows:

Based on this observation, focused on the relation between this invariant measure and the algorithm parameters, namely the step-size $\eta$ and mini-batch size, as a function of $\sigma^{2}$ . They concluded that the ratio of learning rate divided by the batch size is the control parameter that determines the width of the minima found by SGD. Furthermore, they revisit the famous wide minima folklore : Among the minima found by SGD, the wider it is, the better it performs on the test set. However, there are several fundamental issues with this approach, which we will explain below.

We first illustrate a typical mismatch between the Gaussianity assumption and the empirical behavior of the stochastic gradient noise. In Figure 1, we plot the histogram of the norms of the stochastic gradient noise that is computed using a convolutional neural network in a real classification problem and compare it to the histogram of the norms of Gaussian random variables. It can be clearly observed that the shape of the real histogram is very different than the Gaussian and shows a heavy-tailed behavior.

In addition to the empirical observations, the Gaussianity assumption also yields some theoretical issues. The first issue with this assumption is that the current SDE analyses of SGD are based on the invariant measure of the SDE, which implicitly assumes that sufficiently many iterations have been taken to converge to that measure. Recent results on ULA have shown that, the required number of iterations to achieve the invariant measure often grows exponentially with the dimension $p$ . This result contradicts with the current practice: considering the large size of the neural networks and limited computational budget, only a limited number of iterations – which is much smaller than $\exp(\mathcal{O}(p))$ – can be taken. This conflict becomes clearer in the light of the recent works that studied the local behavior of ULA . These studies showed that ULA will get close to the nearest local optimum in polynomial time; however, the required amount of time for escaping from that local optimum increases exponentially with the dimension. Therefore, the phenomenon that SGD prefers wide minima within a considerably small number of iterations cannot be explained using the asymptotic distribution of the SDE given in (6).

The second issue is related to the local behavior of the process and becomes clear when we consider the metastability analysis of Brownian motion-driven SDEs. These studies consider the case where $\mathbf{w}_{0}$ is initialized in a quadratic basin and then analyze the minimum time $t$ such that $\mathbf{w}_{t}$ is outside that basin. They show that this so-called first exit time depends exponentially on the height of the basin; however, this dependency is only polynomial with the width of the basin. These theoretical results directly contradict with the the wide minima phenomenon: even if the height of a basin is slightly larger, the exit-time from this basin will be dominated by its height, which implies that the process would stay longer in (or in other words, ‘prefer’) deeper minima as opposed to wider minima. The reason why the exit-time is dominated by the height is due to the continuity of the Brownian motion, which is in fact a direct consequence of the Gaussian noise assumption.

A final remark on the issues of this approach is the observation that landscape is flat at the bottom regardless of the batch size used in SGD . In particular, the spectrum of the Hessian at a near critical point with close to zero loss value has many near zero eigenvalues. Therefore, local curvature measures that are used as a proxy for measuring the width of a basin correlates with the magnitudes of large eigenvalues of the Hessian which are few. Besides, during the dynamics of SGD it has been observed that the algorithm does not cross barriers except perhaps at the very initial phase . Such dependence of width on an essentially-flat landscape combined with the lack of explicit barrier crossing during the SGD descent forces us to rethink the analysis of basin hopping under a noisy dynamics.

Proposed framework: In this study, we aim at addressing these contradictions and come up with an arguably better-suited hypothesis for the stochastic gradient noise that has more pertinent theoretical implications for the phenomena associated with SGD. In particular, we go back to (3) and (4) and reconsider the application of CLT. This classical CLT assumes that $U_{k}$ is a sum of many independent and identically distributed (i.i.d.) random variables, whose variance is finite, and then it states that the law of $U_{k}$ converges to a Gaussian distribution, which then paves the way for (5). Even though the finite-variance assumption seems natural and intuitive at the first sight, it turns out that in many domains, such as turbulent motions , oceanic fluid flows , finance , biological evolution , audio signals , the assumption might fail to hold (see for more examples). In such cases, the classical CLT along with the Gaussian approximation will no longer hold. While this might seem daunting, fortunately, one can prove an extended CLT and show that the law of the sum of these i.i.d. variables with infinite variance still converges to a family of heavy-tailed distributions that is called the $\alpha$ -stable distribution . As we will detail in Section 2, these distributions are parametrized by their tail-index $\alpha\in(0,2]$ and they coincide with the Gaussian distribution when $\alpha=2$ .

In this study, we relax the finite-variance assumption on the stochastic gradient noise and by invoking the extended CLT, we assume that $U_{k}$ follows an $\alpha$ -stable distribution, as hinted in Figure 1(c). By following a similar rationale to (5) and (6), we reformulate SGD with this new assumption and consider its continuous-time limit for small step-sizes. Since the noise might not be Gaussian anymore (i.e. when $\alpha\neq 2$ ), the use of the Brownian motion would not be appropriate in this case and we need to replace it with the $\alpha$ -stable Lévy motion, whose increments have an $\alpha$ -stable distribution . Due to the heavy-tailed nature of $\alpha$ -stable distribution, the Lévy motion might incur large discontinuous jumps and therefore exhibits a fundamentally different behavior than the Brownian motion, whose paths are on the contrary almost surely continuous. As we will describe in detail in Section 2, the discontinuities also reflect in the metastability properties of Lévy-driven SDEs, which indicate that, as soon as $\alpha<2$ , the first exit time from a basin does not depend on its height; on the contrary, it directly depends on its width and the tail-index $\alpha$ . Informally, this implies that the process will escape from narrow minima – no matter how deep they are – and stay longer in wide minima. Besides, as $\alpha$ get smaller, the probability for the dynamics to jump in a wide basin will increase. Therefore, if the $\alpha$ -stable assumption on the stochastic gradient noise holds, then the existing metastability results automatically provide strong theoretical insights for illuminating the behavior of SGD.

Contributions: The main contributions of this paper are twofold: (i) we perform an extensive empirical analysis of the tail-index of the stochastic gradient noise in deep neural networks and (ii) based on these empirical results, we bring an alternative perspective to the existing approaches for analyzing SGD and shed more light on the folklore that SGD prefers wide minima by establishing a bridge between SGD and the related theoretical results from statistical physics and stochastic analysis.

We conduct experiments on the most common deep learning architectures. In particular, we investigate the tail behavior under fully-connected and convolutional models using negative log likelihood and linear hinge loss functions on MNIST, CIFAR10, and CIFAR100 datasets. For each configuration, we scale the size of the network and batch size used in SGD and monitor the effect of each of these settings on the tail index $\alpha$ .

Our experiments reveal several remarkable results:

In all our configurations, the stochastic gradient noise turns out to be highly non-Gaussian and possesses a heavy-tailed behavior.

Increasing the size of the minibatch has a very little impact on the tail-index, and as opposed to the common belief that larger minibatches result in Gaussian gradient noise, the noise is still far from being Gaussian.

There is a strong interaction between the network architecture, network size, dataset, and the tail-index, which ultimately determine the dynamics of SGD on the training surface. This observation supports the view that, the geometry of the problem and the dynamics induced by the algorithm cannot be separated from each other.

In almost all configurations, we observe two distinct phases of SGD throughout iterations. During the first phase, the tail-index rapidly decreases and SGD possesses a clear jump when the tail-index is at its lowest value and causes a sudden jump in the accuracy. This behavior strengthens the view that SGD crosses barriers at the very initial phase.

Our methodology also opens up several interesting future directions and open questions, as we discuss in Section 5.

Stable distributions and SGD as a Lévy-Driven SDE

The CLT states that the sum of i.i.d. random variables with a finite second moment converges to a normal distribution if the number of summands grow. However, if the variables have heavy-tail, the second moment may not exist. For instance, if their density $p(x)$ has a power-law tail decreasing as $1/|x|^{\alpha+1}$ where $0<\alpha<2$ ; only $\alpha$ -th moment exist with $\alpha<2$ . In this case, generalized central limit theorem (GCLT) says that the sum of such variables will converge to a distribution called the $\alpha$ -stable distribution instead as the number of summands grows (see e.g. . In this work, we focus on the centered symmetric $\alpha$ -stable ( ${\cal S}\alpha{\cal S}$ ) distribution, which is a special case of $\alpha$ -stable distributions that are symmetric around the origin.

In this study, we make the following assumption on the stochastic gradient noise:

where $[v]_{i}$ denotes the $i$ ’th component of a vector $v$ . Informally, we assume that each coordinate of $U_{k}$ is ${\cal S}\alpha{\cal S}$ distributed with the same $\alpha$ and the scale parameter $\sigma$ depends on the state $\mathbf{w}$ . Here, this dependency is not crucial since we are mainly interested in the tail-index $\alpha$ , which can be estimated independently from the scale parameter. Therefore, we will simply denote $\sigma(\mathbf{w})$ as $\sigma$ for clarity.

By using the assumption (7), we can rewrite the SGD recursion as follows:

Under some assumptions on the objective $f$ , it is known that the process (10) admits a stationary density . For a general $f$ , an explicit formula for the equilibrium distribution is not known, however when the noise level $\varepsilon$ is small enough, finer characterizations of the structure of the equilibrium density in dimension one is known. We next summarize known results in this area, which show that Lévy-driven dynamics spends more time in ‘wide valleys’ in the sense of when $\varepsilon$ goes to zero.

Assume that $f$ is smooth with $r$ local minima $\{m_{i}\}_{i=1}^{r}$ separated by $r-1$ local maxima $\{s_{i}\}_{i=1}^{r-1}$ , i.e.

Furthermore, assume that the local minima and maxima are not degenerate, i.e. $f^{\prime\prime}(m_{i})>0$ and $f^{\prime\prime}(s_{i})<0$ for every $i$ . We also assume the objective gradient has a growth condition $f^{\prime}(w)>|w|^{1+c}$ for some constant $c>0$ and when $|w|$ is large enough. Each local minima $m_{i}$ lies in the (interval) valley $S_{i}=(s_{i-1},s_{i})$ of (width) length $L_{i}=|s_{i}-s_{i-1}|$ . Consider also a $\delta$ -neighborhood $B_{i}:=\{|x-m_{i}|\leq\delta\}$ around the local minimum with $\delta>0$ small enough so that the neighborhood is contained in the valley $S_{i}$ for every $i$ . We are interested in the first exit time from $B_{i}$ starting from a point $w_{0}\in B_{i}$ and the transition time $T_{w_{0}}^{i}(\varepsilon):=\inf\{t\geq 0:w_{t}^{\varepsilon}\not\in\cup_{j\neq i}B_{j}\}$ to a neighborhood of another local minimum, we will remove the dependency to $w_{0}$ of the transition time in our discussions as it is clear from the context. The following result shows that the transition times are asymptotically exponentially distributed in the limit of small noise and scales like $\frac{1}{\varepsilon^{\alpha}}$ with $\varepsilon$ .

For an initial point $w_{0}\in B_{i}$ , in the limit $\varepsilon\to 0$ , the following statements hold regarding the transition time:

If the SDE (10) would be driven by the Brownian motion instead, then an analogous theorem to Theorem 2 holds saying that the transition times are still exponentially distributed but the scaling $\varepsilon^{\alpha}$ needs to be replaced by $e^{2H/\varepsilon^{2}}$ where $H$ is the maximal depth of the basins to be traversed between the two local minima . This means that in the small noise limit, Brownian-motion driven gradient descent dynamics need exponential time to transit to another minimum whereas Levy-driven gradient descent dynamics need only polynomial time. We also note from Theorem 2 that the mean transition time between valleys for Lévy SDE does not depend on the depth $H$ of the valleys they reside in which is an advantage over Brownian motion driven SDE in the existence of deep valleys. Informally, this difference is due to the fact that Brownian motion driven SDE has to typically climb up a valley to exit it, whereas Lévy-driven SDE could jump out.

The following theorem says that as $\varepsilon\to 0$ , up to a normalization in time, the process $w_{t}^{\varepsilon}$ behaves like a finite state-space Markov process that has support over the set of local minima $\{m_{i}\}_{i=1}^{r}$ admitting a stationary density $\pi=(\pi_{i})_{i=1}^{r}$ with an infinitesimal generator $Q$ . The process jumps between the valleys $S_{i}$ , spending time proportional to probability $p_{i}$ amount of time in each valley in the equilibrium where the probabilities $\pi=(\pi_{i})_{i=1}^{r}$ are given by the solution to the linear system $Q\pi=0$ .

Let $w_{0}\in S_{i}$ , for some $1\leq i\leq r$ . For $t\geq 0$ , $w_{t\varepsilon^{-\alpha}}^{\varepsilon}\to Y_{m_{i}}(t)$ , as $\varepsilon\to 0$ , in the sense of finite-dimensional distributions, where $Y=(Y_{y}(t))_{t\geq 0}$ is a continuous-time Markov chain on a state space $\{m_{1},m_{2},\dots,m_{r}\}$ with the infinitesimal generator $Q=(q_{ij})_{i,j=1}^{r}$ with

This process admits a density $\pi$ satisfying $Q^{T}\pi=0$ .

A consequence of this theorem is that equilibrium probabilities $p_{i}$ are typically larger for “wide valleys". To see this consider the special case illustrated in Figure 3 with $r=2$ local minima $m_{1}<s_{1}=0<m_{2}$ separated by a local maximum at $s_{1}=0$ . For this example, $m_{2}>|m_{1}|$ , and the second local minimum lies in a wider valley. A simple computation reveals

We see that $\pi_{2}>\pi_{1}$ , that is in the equilibrium the process spends more time on the wider walley. In particular, the ratio $\frac{\pi_{2}}{\pi_{1}}=\left(\frac{m_{2}}{|m_{1}|}\right)^{\alpha}$ grows with an exponent $\alpha$ when the ratio $\frac{m_{2}}{|m_{1}|}$ of the width of the valleys grows. Consequently, if the gradient noise is indeed $\alpha$ -stable distributed, these results directly provide theoretical evidence for the wide-minima behavior of SGD.

Experimental Setup and Methodology

Experimental setup: We investigate the tail behavior of the stochastic gradient noise in a variety of scenarios. We first consider a fully-connected network (FCN) on the MNIST and CIFAR10 datasets. For this model, we vary the depth (i.e. the number of layers) in the set $\{2,3,\dots,10\}$ , the width (i.e. the number of neurons per layer) in the set $\{2,4,8,\dots,1024\}$ , and the minibatch size ranging from $1$ to full batch. We then consider a convolutional neural network (CNN) architecture (AlexNet) on the CIFAR10 and CIFAR100 datasets. We scale the number of filters in each convolutional layer in range $\{2,4,\dots,512\}$ . We randomly split the MNIST dataset into train and test parts of sizes $60$ K and $10$ K, and CIFAR10 and CIFAR100 datasets into train and test parts of sizes $50$ K and $10$ K, respectively. The order of the total number of parameters $p$ range from several thousands to tens of millions.

For both fully connected and convolutional settings, we run each configuration with the negative-log-likelihood (i.e. cross entropy) and with the linear hinge loss, and we repeat each experiment with three different random seeds. The training algorithm is SGD with no explicit modification such as momentum or weight decay. The training runs until 100% training accuracy is achieved or until maximum number of iterations limit is reached (the latter limit is effective in the under-parametrized models). At every $100$ th iteration, we log the full training and test accuracies, and the tail estimate of the gradients that are sampled using the corresponding mini-batch size. The codebase is implemented in python using pytorch and provided it in the supplementary material. Total runtime is $\sim$ 3 weeks on 8 relatively modern GPUs.

Method for tail-index estimation: Estimating the tail-index of an extreme-value distribution is a long-standing topic. Some of the well-known estimators for this task are . Despite their popularity, these methods are not specifically developed for $\alpha$ -stable distributions and it has been shown that they might fail for estimating the tail-index for $\alpha$ -stable distributions .

In this study, we use a relatively recent estimator proposed in for $\alpha$ -stable distributions. It is given in the following theorem.

Let $\{X_{i}\}_{i=1}^{K}$ be a collection of random variables with $X_{i}\sim{\cal S}\alpha{\cal S}(\sigma)$ and $K=K_{1}\times K_{2}$ . Define $Y_{i}\triangleq\sum_{j=1}^{K_{1}}X_{j+(i-1)K_{1}}\>$ for $i\in\llbracket 1,K_{2}\rrbracket$ . Then, the estimator

converges to $1/{\alpha}$ almost surely, as $K_{2}\rightarrow\infty$ .

As shown in Theorem 2.3 of , this estimator admits a provably faster convergence rate and smaller asymptotic variance than all the aforementioned methods.

In order to verify the accuracy of this estimator, we conduct a preliminary experiment, where we first generate $K=K_{1}\times K_{2}$ many ${\cal S}\alpha{\cal S}(1)$ distributed random variables with $K_{1}=100$ , $K_{2}=1000$ for $100$ different values of $\alpha$ . Then, we estimate $\alpha$ by using $\hat{\alpha}\triangleq(\widehat{\phantom{a}\frac{1}{\alpha}\phantom{a}})^{-1}$ . We repeat this experiment $100$ times for each $\alpha$ . As shown in Figure 3, the estimator is very accurate for a large range of $\alpha$ . Due to its favorable theoretical properties such as independence of the scale parameter $\sigma$ , combined with its empirical stability, we choose this estimator in our experiments.

Results

In this section we present the most important and representative results. We have observed that, in all configurations, the choice of the two loss functions and the three different initializations yield no significant difference. Therefore, throughout this section, we will focus on the negative-log-likelihood loss. Unless stated otherwise, we set the minibatch size $b=500$ and the step-size $\eta=0.1$ .

Effect of varying network size: In our first set of experiments, we measure the tail-index for varying the widths and depths for the FCN, and varying widths (i.e. the number of filters) for the CNN. For very small sizes, the networks perform poorly, therefore, we only illustrate sufficiently large network sizes, which yield similar accuracies. For these experiments, we compute the average of the tail-index measurements for the last $10$ K iterations (i.e. when $\hat{\alpha}$ becomes stationary) to focus on the late stage dynamics.

Figure 4 shows the results for the FCN. The first striking observation is that in all the cases, the estimated tail-index is far from $2$ with a very high confidence (the variance of the estimates were around $0.001$ ), meaning that the distribution of the gradient noise is highly non-Gaussian. For the MNIST dataset, we observe that $\alpha$ systematically decreases for increasing network size, where this behavior becomes more prominent with the depth. This result shows that, for MNIST, increasing the dimension of the network results in a gradient noise with heavier tails and therefore increases the probability to end up in a wider basin.

For the CIFAR10 dataset, we still observe that $\alpha$ is far from $2$ ; however, in this case, increasing the network size does not have a clear effect on $\alpha$ : in all cases, we observe that $\alpha$ is in the range $1.1$ – $1.2$ .

Figure 5 shows the results for the CNN. In this figure, we also depict the train and test accuracy, as well as the tail-index that is estimated on the test set. These results show that, for both CIFAR10 and CIFAR100, the tail-index is extremely low for the under-parametrized regime (e.g. the case when the width is $2$ , $4$ , or $8$ for CIFAR10). As we increase the size of the network the value of $\alpha$ increases until the network performs reasonably well and stabilizes in the range $1.0$ – $1.1$ . We also observe that $\alpha$ behaves similarly for both train and test setsWe observed a similar behavior in under-parametrized FCN; however, did not plot those results to avoid clutter..

These results show that there is strong interplay between the network architecture, dataset, and the algorithm dynamics: (i) we see that the size of the network can strongly influence $\alpha$ , (ii) for the exact same network architecture, the choice of the dataset has a significant impact on not only the landscape of the problem, but also the noise characteristics, hence on the algorithm dynamics.

Effect of the minibatch size: In our second set of experiments, we investigate the effect of the size of the minibatch on $\alpha$ . We focus on the FCN and monitor the behavior of $\alpha$ for different network and minibatch sizes $b$ . Figure 6 illustrates the results. These rather remarkable results show that, as opposed to the common belief that the gradient noise behaves similar to a Gaussian for large $b$ , the tail-index does not increase at all with the increasing $b$ . We observe that $\alpha$ stays almost the same when the depth is $2$ and it moves in a small interval when the depth is set to $4$ . We note that we obtained the same the train and test accuracies for different minibatch sizes.

Tail behavior throughout iterations: So far, we have focused on the last iterations of SGD, where $\alpha$ is in a stationary regime. In our last set of experiments, we shift our focus on the first iterations and report an interesting behavior that we observed in almost all our experiments. As a representative, in Figure 7, we show the temporal evolution of SGD for the FCN with $9$ layers and $512$ neurons/layer.

The results clearly show that there are two distinct phases of SGD (in this configuration before and after iteration $1000$ ). In the first phase, the loss decreases very slowly, the accuracy slightly increases, and more interestingly $\alpha$ rapidly decreases. When $\alpha$ reaches its lowest level, the process possesses a jump, which causes a sudden decrease in the accuracy. After this point the process recovers again and we see a stationary behavior in $\alpha$ and an increasing behavior in the accuracy.

The fact that the process has a jump when $\alpha$ is at its smallest value provides a strong support to our assumptions and the metastability theory that we discussed in the previous section. Furthermore, these results further strengthen the view that SGD crosses barriers at the very initial phase. On the other hand, our current analysis is not able to determine whether the process jumps in a different basin or a ‘better’ part of the same basin and we leave it as a future work.

Conclusion and Open Problems

We investigated the tail behavior of the gradient noise in deep neural networks and empirically showed that the gradient noise is highly non-Gaussian. This outcome enabled us to analyze SGD as an SDE driven by a Lévy motion and establish a bridge between SGD and existing theoretical results, which provides more illumination on the behavior of SGD, especially in terms of choosing wide minima.

This study also brings up interesting open questions and future directions: (i) While the current metastability theory applies for the continuous-time processes, the behavior of the discretized process and its dependence on the algorithm parameters (e.g., the step-size, minibatch size) are not clear and yet to be investigated. (ii) We observe that, especially during the first iterations, the tail-index depends on the current state $\mathbf{w}_{k}$ , which suggests analyzing SGD as a stable-like process where the tail-index can depend on time. However, the metastability behavior of these processes are not clear at the moment and its theory is still in an early phase . (iii) Furthermore, an extension of the current metastability theory that includes minima with zero modes is also missing and appears to be challenging yet crucial direction of future research.

Acknowledgments

This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project. Mert Gürbüzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.