Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis
Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky
Introduction and informal summary of results
Consider a stochastic optimization problem
When the functions are not convex, theoretical analysis of global convergence becomes largely intractable. On the other hand, non-convex optimization is currently witnessing an impressive string of empirical successes, most notably in the realm of deep neural networks. Towards the aim of bridging this gap between theory and practice, this paper provides a theoretical justification for Stochastic Gradient Langevin Dynamics (SGLD), a popular variant of stochastic gradient descent, in which properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration (Gelfand and Mitter, 1991; Borkar and Mitter, 1999; Welling and Teh, 2011).
Since the population distribution is unknown, we attempt to (approximately) minimize
It is well-recognized, however, that minimization of the empirical risk does not immediately translate into minimization of the population risk . A standard approach for addressing the issue is to decompose the excess risk into a sum of two terms, (the generalization error of ) and (the gap between the empirical risk of and the minimum of the population risk), and then show that both of these terms are small (either in expectation or with high probability). Taking and letting be the output of the Gibbs algorithm under which the conditional distribution of given is equal to , we decompose the excess risk (1.1) as follows:
where the first term is the difference of expected population risks of SGLD and the Gibbs algorithm, the second term is the generalization error of the Gibbs algorithm, and the third term is easily upper-bounded in terms of expected suboptimality \mathbf{E}\big{(}F_{\mathbf{Z}}(\widehat{W}^{*})-\min_{w}F_{\mathbf{Z}}(w)\big{)} of the Gibbs algorithm for the empirical risk. Observe that only the first term pertains to SGLD, whereas the other two involve solely the Gibbs distribution. The main contribution of this work is in showing finite-time convergence of SGLD for a non-convex objective function. Informally, we can state our main result as follows:
For any , the first term in (1.5) scales as
where is a certain spectral gap parameter that governs the exponential rate of convergence of the Langevin diffusion to its stationary distribution. This spectral gap parameter itself might depend on and , but is independent of .
The second and third terms in (1.5) scale, respectively, as
Our analysis draws heavily on the theory of optimal transportation (Villani, 2003) and on the analysis of Markov diffusion operators (Bakry et al., 2014) (the necessary background on Markov semigroups and functional inequalities is given in Appendix A). In particular, we control the convergence of SGLD to the Gibbs distribution in terms of -Wasserstein distance
To control the first term on the right-hand side of (1.5), we first upper-bound the -Wasserstein distance between the distributions of (the th iterate of SGLD) and (the point reached by the Langevin diffusion at time ). This requires some heavy lifting: Existing bounds on the -Wasserstein distance between a diffusion process and its time-discretized version due to Alfonsi et al. (2015) scale like , which is far too crude for our purposes. By contrast, we take an indirect route via a Girsanov-type change of measure and a weighted transportation-cost inequality of Bolley and Villani (2005) to obtain a bound that scales like . This step relies crucially on a certain exponential integrability property of the Langevin diffusion. Next, we show that the Gibbs distribution satisfies a logarithmic Sobolev inequality, which allows us to conclude that the -Wasserstein distance between the distribution of and the Gibbs distribution decays exponentially as . Since satisfies the triangle inequality, we can produce an upper bound on the first term in (1.5) that scales as . This immediately suggests that we can make this term as small as we wish by first choosing a large enough horizon and then a small enough step size . Overall, this leads to the bounds stated in (1.6).
To control the second term in (1.5), we show that the Gibbs algorithm is stable in -Wasserstein distance with respect to local perturbations of the training dataset. This step, again, relies on the logarithmic Sobolev inequality for the Gibbs distribution. To control the third term, we use a nonasymptotic Laplace integral approximation to show that a single draw from the Gibbs distribution is an approximate minimizer of the empirical risk. We use a Wasserstein continuity result due to Polyanskiy and Wu (2016) and a well-known equivalence between stability of empirical minimization and generalization (Mukherjee et al., 2006; Rakhlin et al., 2005) to show that, in fact, the Gibbs algorithm samples from near-minimizers of the population risk.
We remark that our result readily extends to the case when the stochastic gradients in (1.3) are formed with respect to independent draws from the data-generating distribution – e.g., when taking a single pass through the dataset. In this case, the target of optimization is itself rather than , and we simply omit the second term in (1.5). If the main concern is not consistency (as in (1.1)) but rather the generalization performance of SGLD itself, then the same analysis applied to the decomposition
gives an upper bound of (1.6) plus the first term of (1.7). In other words, while the rate of (1.1) may be hampered by the slow convergence of , the rate of generalization is not. Finally, if each data point is used only once, the generalization performance is controlled by (1.6) alone.
2 Related work
The asymptotic study of convergence of discretized Langevin diffusions for non-convex objectives has a long history, starting with the work of Gelfand and Mitter (1991). Most of the work has focused on annealing-type schemes, where both the step size and the temperature are decreased with time. Márquez (1997) and Pelletier (1998) studied the rates of weak convergence for both the Langevin diffusion and the discrete-time updates. However, when and are kept fixed, the updates do not converge to a global minimizer, but one can still aim for convergence to a stationary distribution. An asymptotic study of this convergence, in the sense of relative entropy, was initiated by Borkar and Mitter (1999).
Dalalyan and Tsybakov (2012) and Dalalyan (2016) analyzed rates of convergence of discrete-time Langevin updates (with exact gradients) in the case of convex functions, and provided nonasymptotic rates of convergence in the total variation distance for sampling from log-concave densities. Durmus and Moulines (2015) refined these results by establishing geometric convergence in total variation distance for convex and strongly convex objective functions, and provided some results for non-convex objectives that can be represented as a bounded perturbation of a convex or a strongly convex function. Bubeck et al. (2015) studied projected Langevin updates in the convex case.
Our work is motivated in part by recent papers on non-convex optimization and, in particular, on optimization problems related to neural networks. A heuristic analysis of SGLD was given by Welling and Teh (2011), and a modification of SGLD to improve generalization performance was recently proposed by Chaudhari et al. (2016). Deliberate addition of noise was also proposed by Ge et al. (2015) as a strategy for escaping from saddle points, and Belloni et al. (2015) analyzed a simulated anealing method based on Hit-and-Run for sampling from nearly log-concave distributions. While these methods aim at avoiding local minima through randomn perturbations, the line of work on continuation methods and graduated optimization (Hazan et al., 2016) attempts to create sequences of smoothed approximations that can successively localize the optimum.
Hardt et al. (2015) studied uniform stability and generalization properties of stochastic gradient descent with both convex and non-convex objectives. For the non-convex case, their upper bound on stability degrades with the number of steps of the optimization procedure, which was taken by the authors as a prescription for early stopping. In contrast, we show that, under our assumptions, non-convexity does not imply loss of stability when the latter is measured in terms of -Wasserstein distance to the stationary distribution. In addition, we use the fact that Gibbs distribution concentrates on approximate empirical minimizers, implying convergence for the population risk via stability (Rakhlin et al., 2005; Mukherjee et al., 2006).
The main result
where is a random element of with probability law . Conditionally on , the SGLD update takes the form
The function takes nonnegative real values, and there exist constants , such that
For each , the function is -smooth: for some ,
For each , the function is -dissipative (Hale, 1988): for some and ,
There exists a constant , such that, for each ,We are reusing the constants and from (A.1) and (A.2) in (2.4) mainly out of considerations of technical convenience; any other constants can be substituted in their place without affecting the results.
We are now ready to state our main result. A crucial role will be played by the uniform spectral gap
Proof of Theorem 2.1
Let , , and . In a nutshell, our proof of Theorem 2.1 consists of the following steps:
We first show that, for all sufficiently small , the SGLD recursion (2.2) tracks the continuous-time Langevin diffusion process (1.4) in -Wasserstein distance:
(the precise statement with explicit constants is given in Proposition 3.1).
Next, we show that the Langevin diffusion (1.4) converges exponentially fast to the Gibbs distribution :
This, together with (3.1) and the triangle inequality, yields the estimate
(see Proposition 3.3 for explicit constants). Observe that there are two terms on the right-hand side of (3.2), one of which grows linearly with , while the other one decays exponentially with . Thus, we can first choose large enough and then small enough, so that
The resulting choices of and translate into the expressions for and given in (A.14).
The upshot of Eq. (3.3) is that, for large enough , the conditional probability law of given is close, in -Wasserstein, to the Gibbs distribution . Thus, we are led to consider the Gibbs algorithm that generates a random draw from . We show that the resulting hypothesis is an almost-minimizer of the empirical risk, i.e.,
(see Proposition 3.4 for the exact statement), and also that the Gibbs algorithm is stable in the -Wasserstein distance: for any two datasets that differ in a single coordinate,
This estimate, together with Lemma 3.5 below, implies that the Gibbs algorithm is uniformly stable (Bousquet and Elisseeff, 2002):
(see Proposition 3.5). The almost-ERM property (3.4) and the uniform stability propery (3.5), together with (3.3), yield the statement of the theorem.
2 Technical lemmas
We first collect a few lemmas that will be used in the sequel; see Appendix C for the proofs.
For all and all ,
for some constants and . Then
3 The diffusion approximation
Recall that and , and we take . In this section, we derive an upper bound on the 2-Wasserstein distance . The analysis consists of two steps. We first upper-bound the relative entropy via a change-of-measure argument following Dalalyan and Tsybakov (2012) (see also Dalalyan (2016)), except that we also have to deal with the error introduced by the stochastic gradient oracle. We then use a weighted transportation-cost inequality of Bolley and Villani (2005) to control the Wasserstein distance in terms of .
The proof of the following lemma is somewhat lengthy, and is given in Appendix D:
Let , , and take . Suppose . Since , we can use Lemma 3.3 to write
Moreover, for all and satisfying the conditions of Lemma 3.6, plus the additional requirement , we can write
Putting everything together, we obtain the following result:
4 Wasserstein distance to the Gibbs distribution
We now fix a time and examine the 2-Wasserstein distance . At this point, we need to use a number of concepts from the analysis of Markov diffusion operators; see Appendix A for the requisite background. We start by showing the following:
For , all of the the Gibbs measures satisfy a logarithmic Sobolev inequality with constant
by the Otto–Villani theorem, and, since by Lemma 3.4, we also have
by the theorem on exponential decay of entropy. Combining Eqs. (3.16) (with ) and (3.17) and using Lemma 3.4, we get
Letting and invoking Proposition 3.1, we obtain the following:
For all and satisfying the conditions of Proposition 3.1, we have
5 Almost-ERM property of the Gibbs algorithm
is the differential entropy of (Cover and Thomas, 2006). To upper-bound , we estimate the second moment of . From (3.17), it follows that . Since convergence of probability measures in 2-Wasserstein distance is equivalent to weak convergence plus convergence of second moments (Villani, 2003, Theorem 7.12), we have by Lemma 3.2
The differential entropy of a probability density with a finite second moment is upper-bounded by that of a Gaussian density with the same second moment, so we immediately get
Using Eqs. (3.20) and (3.21) in (3.18) and simplifying, we obtain the following result:
6 Stability of the Gibbs algorithm
where is the index of the coordinate where and differ. In particular,
Therefore, since satisfies a logarithmic Sobolev inequality with constant given in Proposition 3.2, we can write
where the last line follows from the quadratic growth estimate (3.6). Taking in (3.16) and using the above bound and the second-moment estimate (3.19), we obtain
Finally, observe that, for each , the function satisfies the conditions of Lemma 3.5 with and , while and satisfy the conditions of Lemma 3.5 with . Thus, we obtain the following uniform stability estimate for the Gibbs algorithm:
For any two that differ only in a single coordinate,
7 Completing the proof
Now consider the random hypotheses and with and . Then
The function satisfies the conditions of Lemma 3.5 with and , while the probability measures satisfy the conditions of Lemma 3.5 with
by Lemma 3.2. Therefore, for all ,
It remains to analyze the expected excess risk of the Gibbs algorithm. To that end, we will use stability-based arguments along the lines of Bousquet and Elisseeff (2002) and Rakhlin et al. (2005). We begin by decomposing the excess risk as
The term is the generalization error of the Gibbs algorithm. To upper-bound it, let be independent of and . Then
Using the fact that are i.i.d., as well as the fact that is indepdenent of , the th term in the summation in (3.7) can be written out explicitly as follows:
where . Noting that and differ only in the th coordinate, we can use Proposition 3.5 to upper-bound the integral in (3.24). Since the resulting estimate is uniform in , from (3.7) we obtain
where the last step is by Proposition 3.4. From (3.25) and (3.26), we get
Combining Eqs. (3.22) and (3.27), we obtain the claimed excess risk bound (2.6).
Discussion and directions for future research
The first two assumptions are fairly standard in the literature on non-convex optimization. The dissipativity assumption (A.3) merits some discussion. The term “dissipative” comes from the theory of dynamical systems (Hale, 1988; Stuart and Humphries, 1996), where it has the following interpretation: Consider the gradient flow described by the ordinary differential equation
Then it is not hard to show that, if the function is -Lipschitz, then satisfies (A.2) with and . Thus, a byproduct of our analysis is a fine-grained characterization of the impact of weight decay on learning.
Finally, the exponential integrability assumption (A.5) is satisfied, for example, by the Gaussian initialization with .
Effect of gradient noise and minibatch size selection.
Observe that the excess risk bound (2.6) contains a term that goes to zero as , as well as a term that grows as , but goes to zero as the gradient noise level . This suggests selecting the minibatch size
to offset the term.
Uniform spectral gap.
As shown in Appendix B, Assumptions (A.1)–(A.3) are enough to guarantee that the spectral gap is strictly positive. In particular, we give a very conservative estimate
Therefore, in order to apply Theorem 2.1, one needs to fully exploit the structural properties of the problem at hand and produce an upper bound on which is polynomial in or even dimension-free. (By contrast, exponential dependence of on is unavoidable in the presence of multiple local minima and saddle points; this is a consequence of sharp upper and lower bounds on the spectral gap due to Bovier et al. (2005).) For example, consider replacing the empirical risk (1.2) with a smoothed objective
Note that here, in contrast with (4.3), this bound is completely dimension-free. A tantalizing line of future work is, therefore, to find other settings where is indeed small.
Acknowledgments
The authors would like to thank Arnak Dalalyan and Ramon van Handel for enlightening discussions.
References
Appendix A Background on Markov semigroups and functional inequalities
Our analysis relies on the theory of Markov diffusion operators and associated functional inequalities. In this Appendix, we only summarize the key ideas and results; the book by Bakry et al. (2014) provides an in-depth exposition.
It can be shown that , i.e., is a positive operator (since , zero is an eigenvalue).
Let be a Markov semigroup with the unique invariant distribution and the Dirichlet form . We say that satisfies a Poincaré (or spectral gap) inequality with constant if, for all probability measures ,
Hence, if , then the spectrum of is contained in the set , so is the gap between the zero eigenvalue and the rest of the spectrum. We say that satisfies a logarithmic Sobolev inequality with constant if, for all ,
Exponential decay of entropy (Bakry et al., 2014, Th. 5.2.1): Let . Then
where is a function and is the standard -dimensional Brownian motion. (Replacing the factor by is equivalent to the time rescaling .) The generator of this semigroup is the second-order differential operator
Thus, the Gibbs measure satisfies a Poincaré inequality with constant if, for any ,
and a logarithmic Sobolev inequality with constant if
If is and strongly convex, i.e., for some , then satisfies a logarithmic Sobolev inequality with constant . In the absence of convexity, it is in general difficult to obtain upper bounds on Poincaré or log-Sobolev constants. The following two propositions give sufficient conditions based on so-called Lyapunov function criteria:
Then satisfies a Poincaré inequality with constant
satisfies a Poincaré inequality with constant .
There exists some constant , such that .
Let and be defined, for some , by
Then satisfies a logarithmic Sobolev inequality with constant .
In particular, if , we can take , in which case
Appendix B A lower bound on the uniform spectral gap
Our goal here is to prove the crude lower bound on given in Section 4. To that end, we will use the Lyapunov function criterion due to Bakry et al. (2008), which is reproduced as Proposition A.1 in Appendix A.
We will apply this criterion to the Gibbs distribution for some . Thus, we have and
Consider the candidate Lyapunov function . From the fact that and from the dissipativity assumption (A.3), it follows that
Thus, evidently satisfies (A.11) with , and , where
Moreover, from Lemma 3.1 and from the fact that , it follows that
Thus, by Proposition A.1, satisfies a Poincaré inequality with constant
Observe that this bound holds for all . Using this fact and the relation between the spectral gap and the Poincaré constant, we see that
Appendix C Proofs for Section 3.2
where (i) follows from (A.1) and from Cauchy–Schwarz, while (ii) follows from (3.6). This proves the upper bound on . Now take for some to be chosen later. With this choice, we proceed from Eq. (C.1) as follows:
where (i) uses the fact that , while (ii) uses the dissipativity property (2.3). Taking , we get the lower bound in (3.7). ∎
where the second step uses independence of and and the unbiasedness property (2.1) of the gradient oracle. We can further expand the first term in (C.2):
where we have used (2.1) once again. By (2.4), the second term in (C.3) can be upper-bounded by
whereas the first term can be estimated as
where the inequality follows from the dissipativity condition (2.3) and the bound (3.6) in Lemma 3.1. Combining all of the above, we arrive at the recursion
Fix some . There are two cases to consider:
If , then from (C.4) it follows that
If , then iterating (C.4) gives
The bound (3.8) follows from Eqs. (C.5) and (C.6) and from the estimate
which easily follows from the independence of and and from Jensen’s inequality.
We now analyze the diffusion (1.4). Let . Then Itô’s lemma gives
Recognizing the left-hand side of (C.8) as the total Itô derivative of , we arrive at
which, upon integrating and rearranging, becomes
Now, using the dissipativity condition (2.3), we can write
Substituting this into (C.10), we end up with
Taking expectations and using the martingale property of the Itô integral together with (C.7), we get (3.9). Eq. (3.10) follows from maximizing the right-hand side of (3.9) over all . ∎
For , Itô’s lemma gives
From the dissipativity condition (2.3) and from the assumption that , it follows that
Eq. (3.11) then follows by an application of the Gronwall lemma. ∎
Since everywhere, we can write
We first upper-bound the partition function:
where the inequality follows from Lemma 3.1. Thus,
Moreover, invoking Lemma 3.1 once again, we have
Substituting (C.12), (C.13), and (C.14) into (C.11), we get (3.12). ∎
where we have used Cauchy–Schwarz and the growth condition (3.13). Now let be the coupling of and that achieves . That is, with , , and
Interchanging the roles of and , we complete the proof. ∎
Appendix D Proof of Lemma 3.6
Conditioned on , is a time-homogeneous Markov process. Consider the following continuous-time interpolation of this process:
where for . Note that, for each , and have the same probability law . Moreover, by a result of Gyöngy (1986), the process has the same one-time marginals as the Itô process
Crucially, is a Markov process, while is not. Let \mathbf{P}^{t}_{V}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}\big{(}V(s)\mathrel{\mathop{\mathchar 58\relax}}0\leq s\leq t\big{|}\mathbf{Z}=\mathbf{z}\big{)} and \mathbf{P}^{t}_{W}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}\big{(}W(s)\mathrel{\mathop{\mathchar 58\relax}}0\leq s\leq t\big{|}\mathbf{Z}=\mathbf{z}\big{)}. The Radon–Nikodym derivative of w.r.t. is given by the Girsanov formula
(see, e.g., Sec. 7.6.4 in Liptser and Shiryaev (2001)). Using (D.3) and the martingale property of the Itô integral, we have
where the last line follows from the fact that for each .
We first estimate the first summation in (D.4). Consider some . From (D.1), we have
Therefore, using Lemmas 3.1 and 3.2 and the gradient noise assumption (A.4), we arrive at
Similarly, the second summation on the right-hand side of (D.4) can be estimated as follows:
Substituting Eqs. (D.5) and (D.6) into (D.4), we obtain
Now, since and , the data-processing inequality for the KL divergence gives
Appendix E Proof of Proposition 3.2
To establish the log-Sobolev inequality, we will use the Lyapunov function criterion of Cattiaux et al. (2010), reproduced as Proposition A.2 in Appendix A.
We will apply this proposition to the Gibbs distribution for some , so that and
We consider the same Lyapunov function as in Appendix B. From Eq. (B.1), evidently satisfies (A.13) with and given in (B.2), i.e., the first condition of Proposition A.2 is satisfied. Moreover, satisfies a Poincaré inequality with constant . Thus, the second condition is also satisfied. Finally, by the -smoothness assumption (A.2), , so the third condition of Proposition A.2 is satisfied with . Consequently, the constants and in (A.14) are given by
where we have also used the estimate (3.19) to upper-bound . Therefore, from Proposition A.2 and from (E.1) it follows that satisfies a logarithmic Sobolev inequality with