Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains
Aymeric Dieuleveut, Alain Durmus, Francis Bach
Introduction
We consider the minimization of an objective function given access to unbiased estimates of the function gradients. This key methodological problem has raised interest in different communities: in large-scale machine learning , optimization , and stochastic approximation . The most widely used algorithms are stochastic gradient descent (SGD), a.k.a. Robbins-Monro algorithm , and some of its modifications based on averaging of the iterates .
While the choice of the step-size may be done robustly in the deterministic case (see e.g. ), this remains a traditional theoretical and practical issue in the stochastic case. Indeed, early work suggested to use step-size decaying with the number of iterations as , but it appeared to be non-robust to ill-conditioning and slower decays such as together with averaging lead to both good practical and theoretical performance .
We consider in this paper constant step-size SGD, which is often used in practice. Although the algorithm is not converging in general to the global optimum of the objective function, constant step-sizes come with benefits: (a) there is a single parameter value to set as opposed to the several choices of parameters to deal with decaying step-sizes, e.g. as ; the initial conditions are forgotten exponentially fast for well-conditioned (e.g. strongly convex) problems , and the performance, although not optimal, is sufficient in practice (in a machine learning set-up, being only 0.1% away from the optimal prediction often does not matter).
where is the objective function to minimize (in machine learning the generalization performance), the zero-mean statistically independent noise (in machine learning, obtained from a single observation). Following , we leverage the property that the sequence of iterates is an homogeneous Markov chain.
This interpretation allows us to capture the general behavior of the algorithm. In the strongly convex case, this Markov chain converges exponentially fast to a unique stationary distribution (see Section 3) highlighting the facts that (a) initial conditions of the algorithms are forgotten quickly and (b) the algorithm does not converge to a point but oscillates around the mean of . See an illustration in Figure 1 (left). It is known that the oscillations of the non-averaged iterates have an average magnitude of .
Consider the process given for all by
Then under appropriate conditions on the Markov chain , a central limit theorem on holds which implies that converges at rate to
The deviation between and the global optimum is thus composed of a stochastic part and a deterministic part .
For quadratic functions, it turns out that the deterministic part vanishes , that is, and thus averaged SGD with a constant step-size does converge. However, it is not true for general objective functions where we can only show that , and this deviation is the reason why constant step-size SGD is not convergent.
In summary, we make the following contributions:
We show in Section 2 that Richardson-Romberg extrapolation may be used to get closer to the global optimum.
We bring and adapt in Section 3 tools from analysis of discretization of diffusion processes into the one of SGD and create new ones. We believe that this analogy and the associated ideas are interesting in their own right.
We show in Section 4 empirical improvements of the extrapolation schemes.
Main results
In this section, we describe the assumptions underlying our analysis, describe our main results and their implications.
We also consider the following conditions on the noise, for :
In other words, we assume that the covariance matrix is a regular enough function, which is satisfied in natural settings.
For all , does not depend on . This two parts in the noise will appear in Section 3.2. Finally assume that there exists such that
then 4 is satisfied. This assumption is satisfied, e.g., for a.s. bounded data, or for data with bounded kurtosis, see for details.
2 Summary and discussion of main results
This expansion in the step-size shows that a Richardson-Romberg extrapolation can be used to have better estimates of . Consider the average iterates and associated with SGD with step size and respectively. Then (9) shows that satisfies
and therefore is closer to the optimum . This very simple trick improves the convergence by a factor of (at the expense of a slight increase of the variance). In practice, while the un-averaged gradient iterate saturates rapidly, may already perform well enough to avoid saturation on real data-sets . The Richardson-Romberg extrapolated iterate very rarely reaches saturation in practice. This appears in synthetic experiments presented in Section 4. Moreover, this procedure only requires to compute two parallel SGD recursions, either with the same inputs, or with different ones, and is naturally parallelizable.
In Section 3.2, we give a quantitative version of a central limit theorem for , for a fixed and going to : under appropriate conditions, there exist constants and such that
Combining (9) and (10) characterizes the bias/variance trade-off of SGD used to estimate .
3 Related work
The idea to study stochastic approximation algorithms using results and techniques from the Markov chain literature is not new. It goes back to , which shows under appropriate conditions that solutions of stochastic differential equations (SDE)
where is a -dimensional Brownian motion and is a one-dimensional positive function, , converge in probability to some minima of . An other example is which extends the classical Foster-Lyapunov criterion from Markov chain theory (see ) to study the stability of the LMS algorithm. In , the authors are interested in the convergence of the multidimensional Kohonen algorithm. They show that the Markov chain defined by this algorithm is uniformly ergodic and derive asymptotic properties on its limiting distribution.
To the authors knowledge, the use of the Richardson-Romberg method for stochastic approximation has only been considered in to recover the minimax rate for recursive estimation of time varying autoregressive process.
Several attempts have been made to improve convergence of SGD. proposed an online Newton algorithm which converges in practice to the optimal point with constant step-size but has no convergence guarantees. The quadratic case was studied by , for the (uniform) average iterate: the variance term is upper bounded by and the squared bias term by . This last term was improved to by , showing that asymptotically, the bias term is negligible, see also . Analysis has been extended to “tail averaging” , to improve the dependence on the initial conditions. Note that this procedure can be seen as a Richardson-Romberg trick with respect to . Other strategies were suggested to improve the speed at which initial conditions were forgotten, for example using acceleration when the noise is additive . A criterion to check when SGD with constant step size is close to its limit distribution was recently proposed in .
In the context of discretization of ergodic diffusions, weak error estimates between the stationary distribution of the discretization and the invariant distribution of the associated diffusion have been first shown by and in the case of the Euler-Maruyama scheme. Then, suggested the use of Richardson-Romberg interpolation to improve the accuracy of estimates of integrals with respect to the invariant distribution of the diffusion. Extension of these results have been obtained for other types of discretization by and . We show in Section 3.3 that a weak error expansion in the step-size also holds for SGD between and . Interestingly as to the Euler-Maruyama discretization, SGD has a weak error of order . In addition, proposed and analyzed the use of Richardson-Romberg extrapolation applied to the stochastic gradient Langevin dynamics (SGLD) algorithm. This method introduced by combines SGD and the Euler-Maruyama discretization of the Langevin diffusion associated to a target probability measure . Note that this method is however completely different from SGD, in part because Gaussian noise of order (instead of ) is injected in SGD which changes the overall dynamics.
Detailed analysis
In this Section, we describe in detail our approach. A first step is to describe the existence of a unique stationary distribution for the Markov chain and the convergence of this Markov chain to in the Wasserstein distance of order .
Note that using that are independent of , we have for using 3, that
Since for all , the distribution of belongs to , by definition of the Wasserstein distance we get
using (12) for , 4 for , and finally 1 for .
Thus by a straightforward induction, we get, setting
We show that the limit does not depend on . Assume that there exists such that . By the triangle inequality
Thus by (13) and (14), taking the limits as , we get and . The limit is thus the same for all initial distributions and is denoted by .
Using (13) and (14), we get taking , and . The fact that is the unique stationary distribution is straightforward by contradiction and using (13).
Taking , , using the invariance of and (13), we get (a).
When is a quadratic function, i.e. is affine, we have the following result.
Assume , , where is a positive definite matrix, and 2-3-4(4). Let . Then, it holds , is invertible and
where and are given by (3) and (5) respectively, and is the invariant probability measure of given by Section 3.
While the quadratic case led to particularly simple expressions, in general, we can only get a first order development of these expectations as . Note that it improved on , which shows a similar expansion but an error of order of .
Assume 1-2-3-4()-5 and let . Then is invertible and
and are given by (3) and (5) respectively, and is the invariant probability measure of given by Section 3.
This shows that is a differentiable function at . The “drift” can be understood as an additional error occurring because the function is non quadratic () and the step-sizes are not decaying to zero. The mean under the limit distribution is at distance from . In comparison, the final iterate oscillates in a sphere of radius proportional to .
2 Expansion for a given γ>0𝛾0\gamma>0 when k𝑘k tends to +∞+\infty
the Poisson solution associated to ,
the Poisson solution associated to ,
the Poisson solution associated to ,
the Poisson solution associated to .
where , are given by (2) and (3) respectively, and is the invariant probability measure of given by Section 3.
In order to give the intuition of the proof and to underline how the associated Poisson solutions are introduced, we here sketch the proof of the first result. By definition of and since satisfies , we have
Finally, we have that converges to 0 at linear speed, using Section 3 and .
The formal and complete proof of this result is postponed to Section 6.5. ∎
This result gives an exact closed form for the asymptotic bias and variance, for a fixed , as . Unfortunately, in the general case, it is neither possible to compute the Poisson solutions exactly, nor is it possible to prove a first order development of the limits as .
When is a quadratic function, it is possible, for any , to compute and explicitly; we get the following decomposition of the error, which exactly recovers the result of or .
With , and
The proof is postponed to the supplementary paper , Section S3. ∎
The bound on the second order moment is composed of a variance term , a bias term which decays as , and a non-positive residual term. Interestingly, the bias is if we start under the limit distribution.
3 Continuous interpretation of SGD and weak error expansion
Denote by , the infinitesimal generator associated with the flow defined by
where is the Markov chain starting from and defined by the recursion (1) and is given by (5). In addition for some constant independent of and , we have
4 Discussion
Classical proofs of convergence rely on another decomposition, originally proposed by and used in recent papers analyzing the averaged iterate . We here sketch the arguments of these decompositions, in order to highlight the main difference, namely the fact that the residual term is not well controlled when goes to zero in the classical proof.
As a consequence, using the definition of the SGD recursion (1),
Averaging over the first iterates yields:
The term on the right-hand part of Equation (23) is composed of a bias term (depending on the initial condition), a variance term, and a residual term. This residual term differentiates the general setting from the quadratic one (in which it does not appear, as the first order Taylor expansion of is exact). This decomposition has been used in to prove upper bound on the error, but does not allow for a tight decomposition in powers of when . Indeed, the residual simply does not go to 0 when : on the contrary, the chain becomes ill-conditioned when .
Finally, averaging over the first iterations and taking give
This expansion is the root of the proof of Theorem 4, which formalizes the expansion as powers of . The key difference between decomposition (23) and (24) is that in the latter, when , the expectation of the residual term tends to 0 and can naturally be controlled.
Experiments
Conclusion
In this paper, we have used and developed Markov chain tools to analyze the behavior of constant step-size SGD, with a complete analysis of its convergence, outlining the effect of initial conditions, noise and step-sizes. For machine learning problems, this allows us to extend known results from least-squares to all loss functions. This analysis leads naturally to using Romberg-Richardson extrapolation, that provably improves the convergence behavior of the averaged SGD iterates. Our work opens up several avenues for future work: (a) show that Richardson-Romberg trick can be applied to the decreasing step-sizes setting, (b) study the extension of our results under self-concordance condition .
Postponed proofs
Assumption 4, made in the text, can be weakened in order to apply to settings where input observations are un-bounded (typically, Gaussian inputs would not satisfy Assumption 4). Especially, in many cases, we only need Assumption 7 below. Let .
where is the same constant appearing in 2 and is defined by (4).
On the other hand, we consider also the stronger assumption that the noise is independent of (referred to as the “semi-stochastic” setting, see ), or more generally that the noise has a uniformly bounded fourth order moment.
2 Preliminary results
We preface the proofs of the main results by some technical lemmas.
Therefore by definition (26), is Lipschitz continuous. Finally, it is straightforward to verify that satisfies the stated properties.
Assume 1-2-3-4(2). Then we have for any .
where is given by (1). Moreover, if , we have
The proof and result is very close to the ones from but we extend it without a.s. Lipschitzness (4) but with 7. Using 3-1 and , we have
In addition, under 3-7() and using (4), we have:
Combining this result and (30) concludes the proof of the first inequality.
Taking and applying the monotone convergence theorem concludes the proof. ∎
Using Section 6.2, we can extend Section 6.2 to functions which are locally Lipschitz.
For any step-size , it holds:
In this proof, is a constant which can change from line to line.
By definition (26), satisfies (32). Finally, it is straightforward to verify that satisfies the stated properties.
It is worth pointing out that under Assumption 8 (the “semi-stochastic” assumption), a slightly different result holds. The following result underlines the difference between a stochastic noise and a semi-stochastic noise, especially the fact that the maximal step-size differs depending on this assumption made.
where is given by (1).
So that finally, using (29), 3, (33), 2 and rearranging terms we get
Using that concludes the proof. ∎
We give uniform bound on the moments of the chain for . For , recall that under 4(2p), the noise at optimal point has a moment of order and we denote
We give a bound on the -order moment of the chain, under the assumption that the noise has a moment of order .
For moment of order larger than , we have the following result.
Note that there is no contradiction between (35) and Theorem 4, as for any , one has for and the solution to the Poisson equation, that , so that the first term in the development (of order ) is indeed 0.
We upper bounds each term for , as follows:
For , we have .
For , we have , for which it holds by 3
Else, either or , thus . We first upper bound, by the Cauchy–Schwarz inequality:
Combining this result, (38) and (39) implies using ,
Note that using , for such that , it holds
Therefore, we have combining this inequality, (37)-(40) in (36),
Using 1, for , , we get
Note that using (41), and , . Using this inequality and for we get by (42) setting ,
A straightforward induction implies the first statement. The proof of (35) is similar to the one of (28) and is omitted. ∎
Let . Consider the two processes , defined by (11) with and . By Jensen inequality, we have
Using Section 6.2, the Cauchy Schwarz and Minkowski inequalities and (13) we get
Combining this result and (43) concludes the proof.
3 Proof of Section 3.1
Taking the expectation, using 3, is independent of and , we get
Note that in the case of the regression setting described in Example 1, we can specify Section 3.1 as follows.
where and are defined by (18) and (7) respectively.
The proof follows the same line as the proof of Section 3.1 and is omitted. ∎
4 Proof of Theorem 2
We preface the proof by a couple of preliminaries lemmas.
Assume 1-2-3-4()-5 and let . Then
where is defined by (17), and are given by (3) and (5) respectively.
It follows from Section 6.2, taking the integral with respect to ,
Using (47), Section 6.2 and Hölder inequality, we get
Moreover, we have by a second order Taylor expansion with integral remainder of around ,
Taking the second order moment of this equation, and using 3, is independent of , (49), Section 6.2 and Hölder inequality, we get
Assume 1-2-3-4()-5. It holds as ,
The proof consists in showing that the residual term in (45) of Section 6.4 is of order and not only . Note that we have already prove that . To find the next term in the development, we develop further each of the terms. By a fourth order Taylor expansion with integral remainder of around , and using 2, we have
Since is invertible by 1, To get the next term in the development, we show that
(a) Denote for , . By (46)-(47), Section 6.2 and 3-4(12), we get
Using 1 and the same reasoning as to show that in (17), is well defined, we get that is invertible. Then since and has the same distribution , we get
Combining this result and (45) implies (a).
(b) First, we have using (50), 3 and Section 6.2 that:
Since and follow the same distribution , it follows that
Then by linearity of and using (a) we get (b).
Finally the proof of (15) follows from combining the results of (a)-(b) in (51).
5 Proof of Theorem 3
Theorem 3 follows from the following more general result taking .
where , , , are solutions of the Poisson equation (26) associated with , , and respectively.
We now consider the Poisson solution associated with , . By Section 6.2, such a function exists and satisfies , . Therefore, we obtain using in addition the Markov property:
where and are the Poisson solution associated with and respectively. Combining (55)-(56) in (53), we obtain
In addition, by Section 6.2 and definition, we have for all
Combining this result and (58) in (57) concludes the proof.
6 Proof of Section 3.2
In this section we apply Theorem 3 to the case of a quadratic function, more specifically to the LMS algorithm described in Example 1, to prove Section 3.2. Recall that the sequence of iterates can be written,
First note that with the notations of the text, and with , operator is a positive operator on the set of symmetric matrices, and is thus invertible.
We consider the linear function which is , thus . First, by Section 3.1, . We have the following equalities:
Indeed, for (59) (other equations are basic linear algebra), starting from any :
Moreover, the expectation of under the stationary distribution is known according to Section 3.1,
For the term proportional to , we first need to compute the function , solution to the Poisson equation associated with .
Following the proof of Section 6.3, we have:
Formally, the simplification comes from the fact that we study an arithmetico-geometric recursion of the form , , and study . (note that here we cannot apply the recursion with because then “” would depend on .)
This term is the sum of the following three terms:
using , and . Finally,
With .
Combining (63), (64) and (65), we conclude the proof of Section 3.2.
7 Proof of Theorem 4
Before giving the proof of Theorem 4, we need several results regarding Poisson solutions associated with the gradient flow ODE (20).
where and are the eigenvectors and the eigenvalues of respectively satisfying for all , .
This is a fundamental result on the regularity of flows of autonomous differential equations, see e.g. [24, Theorem 4.1 Chapter V]
with . The proof then follows from uniqueness of the solution of (66).
By c) and since is an equilibrium point we get that satisfies
Therefore we get for all ,
This ordinary differential equation can be solved analytically which finishes the proof.
Note that is well-defined by Section 6.7.1-b) and since is assumed to be locally-Lipschitz. In addition by (20), satisfies
Note that is also well-defined by Section 6.7.1-b).
The proof then follows from Section 6.7.1-b).
The proof is a direct consequence of Section 6.7.1-c)-d) and (67).
7.2 Proof of Theorem 4
We preface the proof of the Theorem by two fundamental first estimates.
where is the Markov chain starting from , defined by the recursion (1), and
for some constant independent of and .
where and . Therefore by (68), we get
Taking the expectation and using 3, we have
The proof of (74) then follows from 2, 3, (73) and Section 6.2.
This result is a direct consequence of Section 6.2 and (a).
Acknowledgments
The authors would like to thank Éric Moulines and Arnak Dalalyan for helpful discussions. We acknowledge support from the chaire Economie des nouvelles donnees with the data science joint research initiative with the fonds AXA pour la recherche, and the Initiative de Recherche “Machine Learning for Large-Scale Insurance” from the Institut Louis Bachelier.