Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, Riccardo Zecchina

cs.LG stat.ML

Introduction

This paper presents a new optimization tool for deep learning designed to exploit the local geometric properties of the objective function. Consider the histogram we obtained in Fig. 1 showing the spectrum of the Hessian at an extremum discovered by Adam (Kingma & Ba, 2014) for a convolutional neural network on MNIST (LeCun et al., 1998) ( $\approx 47,000$ weights, cf. Sec. 5.1). It is evident that:

a large number of directions ( $\approx 94\%$ ) have near-zero eigenvalues (magnitude less than $10^{-4}$ ),

positive eigenvalues (right inset) have a long tail with the largest one being almost $40$ ,

negative eigenvalues (left inset), which are directions of descent that the optimizer missed, have a much faster decay (the largest negative eigenvalue is only $-0.46$ ).

Interestingly, this trend is not unique to this particular network. Rather, its qualitative properties are shared across a variety of network architectures, network sizes, datasets or optimization algorithms (refer to Sec. 5 for more experiments). Local minima that generalize well and are discovered by gradient descent lie in “wide valleys” of the energy landscape, rather than in sharp, isolated minima. For an intuitive understanding of this phenomenon, imagine a Bayesian prior concentrated at the minimizer of the expected loss, the marginal likelihood of wide valleys under this prior is much higher than narrow, sharp valleys even if the latter are close to the global minimum in training loss. Almost-flat regions of the energy landscape are robust to data perturbations, noise in the activations, as well as perturbations of the parameters, all of which are widely-used techniques to achieve good generalization. This suggests that wide valleys should result in better generalization and, indeed, standard optimization algorithms in deep learning seem to discover exactly that — without being explicitly tailored to do so. For another recent analysis of the Hessian, see the parallel work of Sagun et al. (2016).

Based on this understanding of how the local geometry looks at the end of optimization, can we modify SGD to actively seek such regions? Motivated by the work of Baldassi et al. (2015) on shallow networks, instead of minimizing the original loss $f(x)$ , we propose to maximize

Related work

Our above observation about the spectrum of Hessian (further discussed in Sec. 5) is similar to results on a perceptron model in Dauphin et al. (2014) where the authors connect the loss function of a deep network to a high-dimensional Gaussian random field. They also relate to earlier studies such as Baldi & Hornik (1989); Fyodorov & Williams (2007); Bray & Dean (2007) which show that critical points with high training error are exponentially likely to be saddle points with many negative directions and all local minima are likely to have error that is very close to that of the global minimum. The authors also argue that convergence of gradient descent is affected by the proliferation of saddle points surrounded by high error plateaus — as opposed to multiple local minima. One can also see this via an application of Kramer’s law: the time spent by diffusion is inversely proportional to the smallest negative eigenvalue of the Hessian at a saddle point (Bovier & den Hollander, 2006).

The existence of multiple — almost equivalent — local minima in deep networks has been predicted using a wide variety of theoretical analyses and empirical observations, e.g., papers such as Choromanska et al. (2015a; b); Chaudhari & Soatto (2015) that build upon results from statistical physics as also others such as Haeffele & Vidal (2015) and Janzamin et al. (2015) that obtain similar results for matrix and tensor factorization problems. Although assumptions in these works are somewhat unrealistic in the context of deep networks used in practice, similar results are also true for linear networks which afford a more thorough analytical treatment (Saxe et al., 2014). For instance, Soudry & Carmon (2016) show that with mild over-parameterization and dropout-like noise, training error for a neural network with one hidden layer and piece-wise linear activation is zero at every local minimum. All these results suggest that the energy landscape of deep neural networks should be easy to optimize and they more or less hold in practice — it is easy to optimize a prototypical deep network to near-zero loss on the training set (Hardt et al., 2015; Goodfellow & Vinyals, 2015).

Obtaining good generalization error, however, is challenging: complex architectures are sensitive to initial conditions and learning rates (Sutskever et al., 2013) and even linear networks (Kawaguchi, 2016) may have degenerate and hard to escape saddle points (Ge et al., 2015; Anandkumar & Ge, 2016). Techniques such as adaptive (Duchi et al., 2011) and annealed learning rates, momentum (Tieleman & Hinton, 2012), as well as architectural modifications like dropout (Srivastava et al., 2014), batch-normalization (Ioffe & Szegedy, 2015; Cooijmans et al., 2016), weight scaling (Salimans & Kingma, 2016) etc. are different ways of tackling this issue by making the underlying landscape more amenable to first-order algorithms. However, the training process often requires a combination of such techniques and it is unclear beforehand to what extent each one of them helps.

Closer to the subject of this paper are results by Baldassi et al. (2015; 2016a; 2016b) who show that the energy landscape of shallow networks with discrete weights is characterized by an exponential number of isolated minima and few very dense regions with lots of local minima close to each other. These dense local minima can be shown to generalize well for random input data; more importantly, they are also accessible by efficient algorithms using a novel measure called “robust ensemble” that amplifies the weight of such dense regions. The authors use belief propagation to estimate local entropy for simpler models such as committee machines considered there. A related work in this context is EASGD (Zhang et al., 2015) which trains multiple deep networks in parallel and modulates the distance of each worker from the ensemble average. Such an ensemble training procedure enables improved generalization by ensuring that different workers land in the same wide valley and indeed, it turns out to be closely related to the replica theoretic analysis of Baldassi et al. (2016a).

Our work generalizes the local entropy approach above to modern deep networks with continuous weights. It exploits the same phenomenon of wide valleys in the energy landscape but does so without incurring the hardware and communication complexity of replicated training or being limited to models where one can estimate local entropy using belief propagation. The enabling technique in our case is using Langevin dynamics for estimating the gradient of local entropy, which can be done efficiently even for large deep networks using mini-batch updates.

Motivated by the same final goal, viz. flat local minima, the authors in Hochreiter & Schmidhuber (1997b) introduce hard constraints on the training loss and the width of local minima and show using the Gibbs formalism (Haussler & Opper, 1997) that this leads to improved generalization. As the authors discuss, the effect of hyper-parameters for the constraints is intricately tied together and they are difficult to choose even for small problems. Our local entropy based objective instead naturally balances the energetic term (training loss) and the entropic term (width of the valley). The role of $\gamma$ is clear as a focusing parameter (cf. Sec. 4.3) and effectively exploiting this provides significant computational advantages. Among other conceptual similarities with our work, let us note that local entropy in a flat valley is a direct measure of the width of the valley which is similar to their usage of Hessian, while the Gibbs variant to averaging in weight space (Eqn. 33 of Hochreiter & Schmidhuber (1997b)) is similar to our Eqn. (7). Indeed, Gibbs formalism used in their analysis is a very promising direction to understanding generalization in deep networks (Zhang et al., 2016).

Homotopy continuation methods convolve the loss function to solve sequentially refined optimization problems (Allgower & Georg, 2012; Mobahi & Fisher III, 2015), similarly, methods that perturb the weights or activations to average the gradient (Gulcehre et al., 2016) do so with an aim to smooth the rugged energy landscape. Such smoothing is however very different from local entropy. For instance, the latter places more weight on wide local minima even if they are much shallower than the global minimum (cf. Fig. 2); this effect cannot be obtained by smoothing. In fact, smoothing can introduce an artificial minimum between two nearby sharp valleys which is detrimental to generalization. In order to be effective, continuation techniques also require that minimizers of successively smaller convolutions of the loss function lie close to each other (Hazan et al., 2016); it is not clear whether this is true for deep networks. Local entropy, on the other hand, exploits wide minima which have been shown to exist in a variety of learning problems (Monasson & Zecchina, 1995; Cocco et al., 1996). Please refer to Appendix C for a more elaborate discussion as well as possible connections to stochastic variational inference (Blei et al., 2016).

Local entropy

where $\beta$ is known as the inverse temperature and $Z_{\beta}$ is a normalizing constant, also known as the partition function. As $\beta\to\infty$ , the probability distribution above concentrates on the global minimum of $f(x)$ (assuming it is unique) given as:

which establishes the link between the Gibbs distribution and a generic optimization problem (2). We would instead like the probability distribution — and therefore the underlying optimization problem — to focus on flat regions such as $x_{\textrm{robust}}$ in Fig. 2. With this in mind, let us construct a modified Gibbs distribution:

The distribution in (3) is a function of a dummy variable $x^{\prime}$ and is parameterized by the original location $x$ . The parameter $\gamma$ biases the modified distribution (3) towards $x$ ; a large $\gamma$ results in a $P(x^{\prime};\ x,\beta,\gamma)$ with all its mass near $x$ irrespective of the energy term $f(x^{\prime})$ . For small values of $\gamma$ , the term $f(x^{\prime})$ in the exponent dominates and the modified distribution is similar to the original Gibbs distribution in (1). We will set the inverse temperature $\beta$ to $1$ because $\gamma$ affords us similar control on the Gibbs distribution.

The local free entropy of the Gibbs distribution in (1), colloquially called “local entropy” in the sequel and denoted by $F(x,\gamma)$ , is defined as the log-partition function of modified Gibbs distribution (3), i.e.,

The parameter $\gamma$ is used to focus the modified Gibbs distribution upon a local neighborhood of $x$ and we call it a “scope” henceforth.

Fig. 2 shows the negative local entropy $-F(x,\gamma)$ for two different values of $\gamma$ . We expect $x_{\textrm{robust}}$ to be more robust than $x_{\textrm{non-robust}}$ to perturbations of data or parameters and thus generalize well and indeed, the negative local entropy in Fig. 2 has a global minimum near $x_{\textrm{robust}}$ . For low values of $\gamma$ , the energy landscape is significantly smoother than the original landscape and still maintains our desired characteristic, viz. global minimum at a wide valley. As $\gamma$ increases, the local entropy energy landscape gets closer to the original energy landscape and they become equivalent at $\gamma\to\infty$ . On the other hand, for very small values of $\gamma$ , the local entropy energy landscape is almost uniform.

The quantity we have defined as local entropy in Def. 1 is different from classical entropy which counts the number of likely configurations under a given distribution. For a continuous parameter space, this is given by

for the Gibbs distribution in (3). Minimizing classical entropy however does not differentiate between flat regions that have very high loss versus flat regions that lie deeper in the energy landscape. For instance in Fig. 2, classical entropy is smallest in the neighborhood of $x_{\textrm{candidate}}$ which is a large region with very high loss on the training dataset and is unlikely to generalize well.

Entropy-guided SGD

where we have made the dependence of local entropy $F(x,\gamma)$ on the dataset $\Xi$ explicit.

where the notation $\left\langle\cdot\right\rangle$ denotes an expectation of its arguments (we have again made the dependence on the data explicit) over a Gibbs distribution of the original optimization problem modified to focus on the neighborhood of $x$ ; this is given by

2 Algorithm and Implementation details

Let us now discuss a few implementation details. Although we have written Alg. 1 in the classical SGD setup, we can easily modify it to include techniques such as momentum and gradient pre-conditioning (Duchi et al., 2011) by changing lines $5$ and $7$ . In our experiments, we have used SGD with Nesterov’s momentum (Sutskever et al., 2013) and Adam for outer and inner loops with similar qualitative results. We use exponential averaging to estimate $\mu$ in the SGLD loop (line 6) with $\alpha=0.75$ so as to put more weight on the later samples, this is akin to a burn-in in standard SGLD.

3 Scoping of γ𝛾\gamma

The scope $\gamma$ is fixed in Alg. 1. For large values of $\gamma$ , the SGLD updates happen in a small neighborhood of the current parameters $x$ while low values of $\gamma$ allow the inner SGLD to explore further away from $x$ . In the context of the discussion in Sec. 3, a “reverse-annealing” of the scope $\gamma$ , i.e. increasing $\gamma$ as training progresses has the effect of exploring the energy landscape on progressively finer scales. We call this process “scoping” which is similar to that of Baldassi et al. (2016a) and use a simple exponential schedule given by

Scoping of $\gamma$ unfortunately interferes with the learning rate annealing that is popular in deep learning, this is a direct consequence of the update step on line 7 of Alg. 1. In practice, we therefore scale down the local entropy gradient by $\gamma$ before the weight update and modify the line to read

Our general strategy during hyper-parameter tuning is to set the initial scope $\gamma_{0}$ to be very small, pick a large value of $\eta$ and set $\gamma_{1}$ to be such that the magnitude of the local entropy gradient is comparable to that of SGD. We can use much larger learning rates than SGD in our experiments because the local entropy gradient is less noisy than the original back-propagated gradient. This also enables very fast progress in the beginning with a smooth landscape of a small $\gamma$ .

4 Theoretical Properties

The objective $F(x,\gamma;\ \Xi)$ in (6) is $\frac{\alpha}{1+\gamma^{-1}\ c}$ -Lipschitz and $\frac{\beta}{1+\gamma^{-1}\ c}$ -smooth.

Hardt et al. (2015) connect uniform stability to generalization error and show that an $\epsilon$ -stable algorithm $A(\Xi)$ has generalization error bounded by $\epsilon$ , i.e., if $A(\Xi)$ terminates with parameters $x^{*}$ ,

For an $\alpha$ -Lipschitz and $\beta$ -smooth loss function, if SGD converges in $T$ iterations on $N$ samples with decreasing learning rate $\eta_{t}\leq 1/t$ the stability is bounded by

The above analysis hinges upon an assumption that the Hessian $\nabla^{2}f(x)$ does not have eigenvalues in the set $[-2\gamma-c,c]$ for a constant $c>0$ . This is admittedly unrealistic, for instance, the eigenspectrum of the Hessian at a local minimum in Fig. 1 has a large fraction of its eigenvalues almost zero. Let us however remark that the result in Thm. 3 by Hardt et al. (2015) assumes global conditions on the smoothness of the loss function; one imagines that Eqn. 9 remains qualitatively the same (with respect to $T$ in particular) even if this assumption is violated to an extent before convergence happens. Obtaining a rigorous generalization bound without this assumption would require a dynamical analysis of SGD and seems out of reach currently.

Experiments

In Sec. 5.1, we discuss experiments that suggest that the characteristics of the energy landscape around local minimal accessible by SGD are universal to deep architectures. We then present experimental results on two standard image classification datasets, viz. MNIST and CIFAR-10 and two datasets for text prediction, viz. PTB and the text of War and Peace. Table 1 summarizes the results of these experiments on deep networks.

We use automatic differentiationhttps://github.com/HIPS/autograd to compute the Hessian at a local minimum obtained at the end of training for the following networks:

small-LeNet on MNIST: This network has $47,658$ parameters and is similar to LeNet but with $10$ and $20$ channels respectively in the first two convolutional layers and $128$ hidden units in the fully-connected layer. We train this with Adam to obtain a test error of $2.4\%$ .

small-mnistfc on MNIST: A fully-connected network ( $50,890$ parameters) with one layer of $32$ hidden units, ReLU non-linearities and cross-entropy loss; it converges to a test error of $2.5\%$ with momentum-based SGD.

char-lstm for text generation: This is a recurrent network with $48$ hidden units and Long Short-Term Memory (LSTM) architecture (Hochreiter & Schmidhuber, 1997a). It has $32,640$ parameters and we train it with Adam to re-generate a small piece of text consisting of $256$ lines of length $32$ each and $96$ -bit one-hot encoded characters.

We choose to compute the exact Hessian and to keep the computational and memory requirements manageable, the first three networks considered above are smaller than standard deep networks used in practice. For the last network, we sacrifice the exact computation and instead approximate the Hessian of a large deep network. We note that recovering an approximate Hessian from Hessian-vector products (Pearlmutter, 1994) could be a viable strategy for large networks.

Fig. 1 in the introductory Sec. 1 shows the eigenspectrum of the Hessian for small-LeNet while Fig. 3 shows the eigenspectra for the other three networks. A large proportion of eigenvalues of the Hessian are very close to zero or positive with a very small (relative) magnitude. This suggests that the local geometry of the energy landscape is almost flat at local minima discovered by gradient descent. This agrees with theoretical results such as Baldassi et al. (2016c) where the authors predict that flat regions of the landscape generalize better. Standard regularizers in deep learning such as convolutions, max-pooling and dropout seem to bias SGD towards flatter regions in the energy landscape. Away from the origin, the right tails of the eigenspectra are much longer than the left tails. Indeed, as discussed in numerous places in literature (Bray & Dean, 2007; Dauphin et al., 2014; Choromanska et al., 2015a), SGD finds low-index critical points, i.e., optimizers with few negative eigenvalues of the Hessian. What is interesting and novel is that the directions of descent that SGD misses do not have a large curvature.

2 MNIST

We consider two prototypical networks: the first, called mnistfc, is a fully-connected network with two hidden layers of $1024$ units each and the second is a convolutional neural network with the same size as LeNet but with batch-normalization (Ioffe & Szegedy, 2015); both use a dropout of probability $0.5$ after each layer. As a baseline, we train for $100$ epochs with Adam and a learning rate of $10^{-3}$ that drops by a factor of $5$ after every $30$ epochs to obtain an average error of $1.39\pm 0.03\%$ and $0.51\pm 0.01\%$ for mnistfc and LeNet respectively, over $5$ independent runs.

3 CIFAR-10

We train on CIFAR-10 without data augmentation after performing global contrast normalization and ZCA whitening (Goodfellow et al., 2013). We consider the All-CNN-C network of Springenberg et al. (2014) with the only difference that a batch normalization layer is added after each convolutional layer; all other architecture and hyper-parameters are kept the same. We train for $200$ epochs with SGD and Nesterov’s momentum during which the initial learning rate of $0.1$ decreases by a factor of $5$ after every $60$ epochs. We obtain an average error of $7.71\pm 0.19\%$ in $200$ epochs vs. $9.08\%$ error in $350$ epochs that the authors in Springenberg et al. (2014) report and this is thus a very competitive baseline for this network. Let us note that the best result in the literature on non-augmented CIFAR-10 is the ELU-network by Clevert et al. (2015) with $6.55\%$ test error.

4 Recurrent neural networks

We train an LSTM network on the Penn Tree Bank (PTB) dataset for word-level text prediction. This dataset contains about one million words divided into a training set of about $930,000$ words, a validation set of $74,000$ words and $82,000$ words with a vocabulary of size $10,000$ . Our network called PTB-LSTM consists of two layers with $1500$ hidden units, each unrolled for $35$ time steps; note that this is a large network with about $66$ million weights. We recreated the training pipeline of Zaremba et al. (2014) for this network (SGD without momentum) and obtained a word perplexity of $81.43\pm 0.2$ on the validation set and $78.6\pm 0.26$ on the test set with this setup; these numbers closely match the results of the original authors.

4.2 char-LSTM on War and Peace

Next, we train an LSTM to perform character-level text-prediction. As a dataset, following the experiments of Karpathy et al. (2015), we use the text of War and Peace by Leo Tolstoy which contains about $3$ million characters divided into training ( $80\%$ ), validation ( $10\%$ ) and test ( $10\%$ ) sets. We use an LSTM consisting of two layers of $128$ hidden units unrolled for $50$ time steps and a vocabulary of size $87$ . We train the baseline with Adam for $50$ epochs with an initial learning rate of $0.002$ that decays by a factor of $2$ after every $5$ epochs to obtain a validation perplexity of $1.224\pm 0.008$ and a test perplexity of $1.226\pm 0.01$ .

Discussion

Conclusions

Acknowledgments

This work was supported by ONR N00014-13-1-034, AFOSR F9550-15-1-0229 and ARO W911NF-15-1-0564/66731-CS.

References

Appendix A Stochastic gradient Langevin dynamics (SGLD)

Langevin dynamics (Neal, 2011) injects Gaussian noise into maximum-a-posteriori (MAP) updates to prevent over-fitting the solution $x^{*}$ of the above equation. The updates can be written as

Let us note that there is a variety of increasingly sophisticated MCMC algorithms applicable to our problem, e.g., Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) by Chen et al. (2014) based on volume preserving flows in the “parameter-momentum” space, stochastic annealing thermostats (Santa) by Chen et al. (2015) etc. We can also employ these techniques, although we use SGLD for ease of implementation; the authors in Ma et al. (2015) provide an elaborate overview.

Appendix B Proofs

The above expression is the mean of a distribution $\propto e^{-\nabla f(x)^{\top}s-\frac{1}{2}\ s^{\top}\left(\gamma+\nabla^{2}f(x)\right)s}$ . We can approximate it using the saddle point method as the value of $s$ that minimizes the exponent to get

Let us denote $A(x):=\left(I+\gamma^{-1}\ \nabla^{2}f(x)\right)^{-1}$ . Plugging this into the condition for smoothness, we have

Unfortunately, we can only get a uniform bound if we assume that for a small constant $c>0$ , no eigenvalue of $\nabla^{2}f(x)$ lies in the set $[-2\gamma-c,c]$ . This gives

This shows that a smaller value of $\gamma$ results in a smoother energy landscape, except at places with very flat directions. The Lipschitz constant also decreases by the same factor.

Appendix C Connection to variational inference

The fundamental motivations of (stochastic) variational inference (SVI) and local entropy are similar: they both aim to generalize well by constructing a distribution on the weight space. In this section, we explore whether they are related and how one might reconcile the theoretical and algorithmic implications of the local entropy objective with that of SVI.

Let $\Xi$ denote the entire dataset, $z$ denote the weights of a deep neural network and $x$ be the parameters of a variational distribution $q_{x}(z)$ . The Evidence Lower Bound (ELBO) can be then be written as

On the other hand, if we define the loss as the log-likelihood of data, viz. $f(z):=-\log p(\Xi|z)$ , we can write the logarithm of the local entropy in Eqn. (4) as

by an application of Jensen’s inequality. It is thus clear that Eqn. (13) and (14) are very different in general and one cannot choose a prior, or a variational family, that makes them equivalent and interpret local entropy as ELBO.

Eschewing rigor, formally, if we modify Eqn. (13) to allow the prior $p(z)$ to depend upon $x$ , we can see that the two lower bounds above are equivalent iff $q_{x}(z)$ belongs to a “flat variational family”, i.e. uniform distributions with $x$ as the mean and $p_{x}(z)\propto\exp\left(-\frac{\gamma}{2}\ \lVert x-z\rVert^{2}\right)$ . We emphasize that the distribution $p_{x}(z)$ depends on the parameters $x$ themselves and is thus, not really a prior, or one that can be derived using the ELBO.

This “moving prior” is absent in variational inference and indeed, a crucial feature of the local entropy objective. The gradient of local entropy in Eqn. (7) clarifies this point:

where the distribution $r(z;x)$ is given by

it thus contains a data likelihood term along with a prior that “moves” along with the current iterate $x$ .

Let us remark that methods in the deep learning literature that average the gradient through perturbations in the neighborhood of $x$ (Mobahi, 2016) or noisy activation functions (Gulcehre et al., 2016) can be interpreted as computing the data likelihood in ELBO (without the KL-term); such an averaging is thus different from local entropy.

We use stochastic gradient Langevin dynamics (cf. Appendix A) to estimate the gradient of local entropy in Alg. 1. It is natural then, to ask the question whether vanilla SGLD performs as well as local entropy. To this end, we compare the performance of SGLD on two prototypical networks: LeNet on MNIST and All-CNN-BN on CIFAR-10. We follow the experiments in Welling & Teh (2011) and Chen et al. (2015) and set the learning rate schedule to be $\eta/(1+t)^{b}$ where the initial learning rate $\eta$ and $b$ are hyper-parameters. We make sure that other architectural aspects (dropout, batch-normalization) and regularization (weight decay) are consistent with the experiments in Sec. 5.

Training deep networks with SGLD, or other more sophisticated MCMC algorithms such as SGHMC, SGNHT etc. (Chen et al., 2014; Neal, 2011) to errors similar to those of SGD is difficult, and the lack of such results in the literature corroborates our experimental experience. Roughly speaking, local entropy is so effective because it operates on a transformation of the energy landscape that exploits entropic effects. Conventional MCMC techniques such as SGLD or Nose’-Hoover thermostats (Ding et al., 2014) can only trade energy for entropy via the temperature parameter which does not allow the direct use of the geometric information of the energy landscape and does not help with narrow minima.