Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains

Aymeric Dieuleveut, Alain Durmus, Francis Bach

Introduction

We consider the minimization of an objective function given access to unbiased estimates of the function gradients. This key methodological problem has raised interest in different communities: in large-scale machine learning , optimization , and stochastic approximation . The most widely used algorithms are stochastic gradient descent (SGD), a.k.a. Robbins-Monro algorithm , and some of its modifications based on averaging of the iterates .

While the choice of the step-size may be done robustly in the deterministic case (see e.g. ), this remains a traditional theoretical and practical issue in the stochastic case. Indeed, early work suggested to use step-size decaying with the number $k$ of iterations as $O(1/k)$ , but it appeared to be non-robust to ill-conditioning and slower decays such as $O(1/\sqrt{k})$ together with averaging lead to both good practical and theoretical performance .

We consider in this paper constant step-size SGD, which is often used in practice. Although the algorithm is not converging in general to the global optimum of the objective function, constant step-sizes come with benefits: (a) there is a single parameter value to set as opposed to the several choices of parameters to deal with decaying step-sizes, e.g. as $1/(\square k+\triangle)^{\circ}$ ; the initial conditions are forgotten exponentially fast for well-conditioned (e.g. strongly convex) problems , and the performance, although not optimal, is sufficient in practice (in a machine learning set-up, being only 0.1% away from the optimal prediction often does not matter).

where $f$ is the objective function to minimize (in machine learning the generalization performance), $\varepsilon_{k+1}(\theta_{k}^{(\gamma)})$ the zero-mean statistically independent noise (in machine learning, obtained from a single observation). Following , we leverage the property that the sequence of iterates $(\theta_{k}^{(\gamma)})_{k\geq 0}$ is an homogeneous Markov chain.

This interpretation allows us to capture the general behavior of the algorithm. In the strongly convex case, this Markov chain converges exponentially fast to a unique stationary distribution $\pi_{\gamma}$ (see Section 3) highlighting the facts that (a) initial conditions of the algorithms are forgotten quickly and (b) the algorithm does not converge to a point but oscillates around the mean of $\pi_{\gamma}$ . See an illustration in Figure 1 (left). It is known that the oscillations of the non-averaged iterates have an average magnitude of $\gamma^{1/2}$ .

Consider the process $(\bar{\theta}_{k}^{(\gamma)})_{k\geq 0}$ given for all $k\geq 0$ by

Then under appropriate conditions on the Markov chain $(\theta_{k}^{(\gamma)})_{k\geq 0}$ , a central limit theorem on $(\bar{\theta}_{k}^{(\gamma)})_{k\geq 0}$ holds which implies that $\bar{\theta}_{k}^{(\gamma)}$ converges at rate $O(1/\sqrt{k})$ to

The deviation between $\bar{\theta}_{k}^{(\gamma)}$ and the global optimum $\theta^{*}$ is thus composed of a stochastic part $\bar{\theta}_{k}^{(\gamma)}-\bar{\theta}_{\gamma}$ and a deterministic part $\bar{\theta}_{\gamma}-\theta^{*}$ .

For quadratic functions, it turns out that the deterministic part vanishes , that is, $\bar{\theta}_{\gamma}=\theta^{*}$ and thus averaged SGD with a constant step-size does converge. However, it is not true for general objective functions where we can only show that $\bar{\theta}_{\gamma}-\theta^{*}=O(\gamma)$ , and this deviation is the reason why constant step-size SGD is not convergent.

In summary, we make the following contributions:

We show in Section 2 that Richardson-Romberg extrapolation may be used to get closer to the global optimum.

We bring and adapt in Section 3 tools from analysis of discretization of diffusion processes into the one of SGD and create new ones. We believe that this analogy and the associated ideas are interesting in their own right.

We show in Section 4 empirical improvements of the extrapolation schemes.

Main results

In this section, we describe the assumptions underlying our analysis, describe our main results and their implications.

We also consider the following conditions on the noise, for $p\geq 2$ :

In other words, we assume that the covariance matrix $\theta\mapsto C(\theta)$ is a regular enough function, which is satisfied in natural settings.

For all $k\geq 1$ , $\xi_{k}$ does not depend on $\theta$ . This two parts in the noise will appear in Section 3.2. Finally assume that there exists $r\geq 0$ such that

then 4 $(4)$ is satisfied. This assumption is satisfied, e.g., for a.s. bounded data, or for data with bounded kurtosis, see for details.

2 Summary and discussion of main results

This expansion in the step-size $\gamma$ shows that a Richardson-Romberg extrapolation can be used to have better estimates of $\theta^{*}$ . Consider the average iterates $(\bar{\theta}_{2\gamma}^{(k)})_{k\geq 0}$ and $(\bar{\theta}_{k}^{(\gamma)})_{k\geq 0}$ associated with SGD with step size $2\gamma$ and $\gamma$ respectively. Then (9) shows that $(2\bar{\theta}_{k}^{(\gamma)}-\bar{\theta}_{k}^{(2\gamma)})_{k\geq 0}$ satisfies

and therefore is closer to the optimum $\theta^{*}$ . This very simple trick improves the convergence by a factor of $\gamma$ (at the expense of a slight increase of the variance). In practice, while the un-averaged gradient iterate $\theta_{k}^{(\gamma)}$ saturates rapidly, $\bar{\theta}_{k}^{(\gamma)}$ may already perform well enough to avoid saturation on real data-sets . The Richardson-Romberg extrapolated iterate $2\bar{\theta}_{k}^{(\gamma)}-\bar{\theta}_{k}^{(2\gamma)}$ very rarely reaches saturation in practice. This appears in synthetic experiments presented in Section 4. Moreover, this procedure only requires to compute two parallel SGD recursions, either with the same inputs, or with different ones, and is naturally parallelizable.

In Section 3.2, we give a quantitative version of a central limit theorem for $(\bar{\theta}_{k}^{(\gamma)})_{k\geq 0}$ , for a fixed $\gamma>0$ and $k$ going to $+\infty$ : under appropriate conditions, there exist constants $B_{1}(\gamma)$ and $B_{2}(\gamma)$ such that

Combining (9) and (10) characterizes the bias/variance trade-off of SGD used to estimate $\theta^{*}$ .

3 Related work

The idea to study stochastic approximation algorithms using results and techniques from the Markov chain literature is not new. It goes back to , which shows under appropriate conditions that solutions of stochastic differential equations (SDE)

where $(B_{t})_{t\geq 0}$ is a $d$ -dimensional Brownian motion and $(\gamma_{t})_{t\geq 0}$ is a one-dimensional positive function, $\lim_{t\to+\infty}\gamma_{t}=0$ , converge in probability to some minima of $f$ . An other example is which extends the classical Foster-Lyapunov criterion from Markov chain theory (see ) to study the stability of the LMS algorithm. In , the authors are interested in the convergence of the multidimensional Kohonen algorithm. They show that the Markov chain defined by this algorithm is uniformly ergodic and derive asymptotic properties on its limiting distribution.

To the authors knowledge, the use of the Richardson-Romberg method for stochastic approximation has only been considered in to recover the minimax rate for recursive estimation of time varying autoregressive process.

Several attempts have been made to improve convergence of SGD. proposed an online Newton algorithm which converges in practice to the optimal point with constant step-size but has no convergence guarantees. The quadratic case was studied by , for the (uniform) average iterate: the variance term is upper bounded by ${\sigma^{2}d}/{n}$ and the squared bias term by $\|\theta^{*}\|^{2}/(\gamma n)$ . This last term was improved to $\|\Sigma^{-1/2}\theta^{*}\|^{2}/(\gamma n)^{2}$ by , showing that asymptotically, the bias term is negligible, see also . Analysis has been extended to “tail averaging” , to improve the dependence on the initial conditions. Note that this procedure can be seen as a Richardson-Romberg trick with respect to $k$ . Other strategies were suggested to improve the speed at which initial conditions were forgotten, for example using acceleration when the noise is additive . A criterion to check when SGD with constant step size is close to its limit distribution was recently proposed in .

In the context of discretization of ergodic diffusions, weak error estimates between the stationary distribution of the discretization and the invariant distribution of the associated diffusion have been first shown by and in the case of the Euler-Maruyama scheme. Then, suggested the use of Richardson-Romberg interpolation to improve the accuracy of estimates of integrals with respect to the invariant distribution of the diffusion. Extension of these results have been obtained for other types of discretization by and . We show in Section 3.3 that a weak error expansion in the step-size $\gamma$ also holds for SGD between $\pi_{\gamma}$ and $\updelta_{\theta^{*}}$ . Interestingly as to the Euler-Maruyama discretization, SGD has a weak error of order $\gamma$ . In addition, proposed and analyzed the use of Richardson-Romberg extrapolation applied to the stochastic gradient Langevin dynamics (SGLD) algorithm. This method introduced by combines SGD and the Euler-Maruyama discretization of the Langevin diffusion associated to a target probability measure . Note that this method is however completely different from SGD, in part because Gaussian noise of order $\gamma^{1/2}$ (instead of $\gamma$ ) is injected in SGD which changes the overall dynamics.

Detailed analysis

In this Section, we describe in detail our approach. A first step is to describe the existence of a unique stationary distribution $\pi_{\gamma}$ for the Markov chain $(\theta_{k}^{(\gamma)})_{k\geq 0}$ and the convergence of this Markov chain to $\pi_{\gamma}$ in the Wasserstein distance of order $2$ .

Note that using that $\theta^{(1)}_{0},\theta^{(2)}_{0}$ are independent of $\varepsilon_{1}$ , we have for $i,j\in\{1,2\}$ using 3, that

Since for all $k\geq 0$ , the distribution of $(\theta_{k}^{(1)},\theta_{k}^{(2)})$ belongs to $\Pi(\lambda_{1}R_{\gamma}^{k},\lambda_{2}R_{\gamma}^{k})$ , by definition of the Wasserstein distance we get

using (12) for $i)$ , 4 $(2)$ for $ii)$ , and finally 1 for $iii)$ .

Thus by a straightforward induction, we get, setting $\rho=(1-2\mu\gamma(1-\gamma L/2))$

We show that the limit $\pi_{\gamma}^{\lambda_{1}}$ does not depend on $\lambda_{1}$ . Assume that there exists $\pi_{\gamma}^{\lambda_{2}}$ such that $\lim_{k\to+\infty}W_{2}(\lambda_{2}R_{\gamma}^{k},\pi_{\gamma}^{\lambda_{2}})=0$ . By the triangle inequality

Thus by (13) and (14), taking the limits as $k\to+\infty$ , we get $W_{2}(\pi_{\gamma}^{\lambda_{1}},\pi_{\gamma}^{\lambda_{2}})=0$ and $\pi_{\gamma}^{\lambda_{1}}=\pi_{\gamma}^{\lambda_{2}}$ . The limit is thus the same for all initial distributions and is denoted by $\pi_{\gamma}$ .

Using (13) and (14), we get taking $k\to+\infty$ , $W_{2}(\pi_{\gamma}R_{\gamma},\pi_{\gamma})=0$ and $\pi_{\gamma}R_{\gamma}=\pi_{\gamma}$ . The fact that $\pi_{\gamma}$ is the unique stationary distribution is straightforward by contradiction and using (13).

Taking $\lambda_{1}=\updelta_{\theta}$ , $\lambda_{2}=\pi_{\gamma}$ , using the invariance of $\pi_{\gamma}$ and (13), we get (a).

When $f$ is a quadratic function, i.e. $f^{\prime}$ is affine, we have the following result.

Assume $f=f_{\Sigma}$ , $f_{\Sigma}:\theta\mapsto\left\|\Sigma^{1/2}(\theta-\theta^{*})\right\|^{2}/2$ , where $\Sigma$ is a positive definite matrix, and 2-3-4(4). Let $\gamma\in\left(0,2/L\right)$ . Then, it holds $\bar{\theta}_{\gamma}=\theta^{*}$ , $\Sigma\otimes I+I\otimes\Sigma-\gamma\Sigma\otimes\Sigma$ is invertible and

where $\bar{\theta}_{\gamma}$ and $\mathcal{C}$ are given by (3) and (5) respectively, and $\pi_{\gamma}$ is the invariant probability measure of $R_{\gamma}$ given by Section 3.

While the quadratic case led to particularly simple expressions, in general, we can only get a first order development of these expectations as $\gamma\to 0$ . Note that it improved on , which shows a similar expansion but an error of order of $O(\gamma^{3/2})$ .

Assume 1-2-3-4( $6\vee[2(k_{\varepsilon}+1)]$ )-5 and let $\gamma\in\left(0,2/L\right)$ . Then $f^{\prime\prime}(\theta^{*})\otimes I+I\otimes f^{\prime\prime}(\theta^{*})$ is invertible and

$\bar{\theta}_{\gamma}$ and $\mathcal{C}$ are given by (3) and (5) respectively, and $\pi_{\gamma}$ is the invariant probability measure of $R_{\gamma}$ given by Section 3.

This shows that $\gamma\mapsto\bar{\theta}_{\gamma}$ is a differentiable function at $\gamma=0$ . The “drift” $\bar{\theta}_{\gamma}-\theta^{*}$ can be understood as an additional error occurring because the function is non quadratic ( $f^{\prime\prime\prime}(\theta^{*})\not=0$ ) and the step-sizes are not decaying to zero. The mean under the limit distribution is at distance $\gamma$ from $\theta^{*}$ . In comparison, the final iterate oscillates in a sphere of radius proportional to $\sqrt{\gamma}$ .

2 Expansion for a given γ>0𝛾0\gamma>0 when k𝑘k tends to +∞+\infty

$\psi_{\gamma}$ the Poisson solution associated to $\varphi:\theta\mapsto\theta-\theta^{*}$ ,

$\varpi_{\gamma}$ the Poisson solution associated to $\theta\mapsto\psi_{\gamma}(\theta)$ ,

$\chi^{1}_{\gamma}$ the Poisson solution associated to $\theta\mapsto(\psi_{\gamma}(\theta))^{\otimes 2}$ ,

$\chi^{2}_{\gamma}$ the Poisson solution associated to $\theta\mapsto((\psi_{\gamma}-\varphi)(\theta))^{\otimes 2}$ .

where $\bar{\theta}_{k}^{(\gamma)}$ , $\bar{\theta}_{\gamma}$ are given by (2) and (3) respectively, and $\pi_{\gamma}$ is the invariant probability measure of $R_{\gamma}$ given by Section 3.

In order to give the intuition of the proof and to underline how the associated Poisson solutions are introduced, we here sketch the proof of the first result. By definition of $\varphi:\theta\mapsto\theta-\theta^{*}$ and since $\psi_{\gamma}$ satisfies $(\operatorname{Id}-R_{\gamma})\psi_{\gamma}=\varphi$ , we have

Finally, we have that $R_{\gamma}^{k}\psi_{\gamma}(\theta_{0})$ converges to 0 at linear speed, using Section 3 and $\pi_{\gamma}(\psi_{\gamma})=0$ .

The formal and complete proof of this result is postponed to Section 6.5. ∎

This result gives an exact closed form for the asymptotic bias and variance, for a fixed $\gamma$ , as $k\rightarrow\infty$ . Unfortunately, in the general case, it is neither possible to compute the Poisson solutions exactly, nor is it possible to prove a first order development of the limits as $\gamma\rightarrow 0$ .

When $f_{\Sigma}$ is a quadratic function, it is possible, for any $\gamma>0$ , to compute $\psi_{\gamma}$ and $\chi^{1,2}_{\gamma}$ explicitly; we get the following decomposition of the error, which exactly recovers the result of or .

With $\Omega=(\Sigma\otimes I+I\otimes\Sigma-\gamma\Sigma\otimes\Sigma)(\Sigma\otimes I+I\otimes\Sigma-\gamma\mathbf{T})^{-1}$ , and

The proof is postponed to the supplementary paper , Section S3. ∎

The bound on the second order moment is composed of a variance term $k^{-1}\Sigma^{-1}\pi_{\gamma}(\mathcal{C})\Sigma^{-1}$ , a bias term which decays as $k^{-2}$ , and a non-positive residual term. Interestingly, the bias is if we start under the limit distribution.

3 Continuous interpretation of SGD and weak error expansion

Denote by $(\mathcal{A},D(\mathcal{A}))$ , the infinitesimal generator associated with the flow $(\varphi_{t})_{t\geq 0}$ defined by

where $\theta_{k}^{(\gamma)}$ is the Markov chain starting from $\theta_{0}$ and defined by the recursion (1) and $\mathcal{C}$ is given by (5). In addition for some constant $C\geq 0$ independent of $\gamma$ and $k$ , we have

4 Discussion

Classical proofs of convergence rely on another decomposition, originally proposed by and used in recent papers analyzing the averaged iterate . We here sketch the arguments of these decompositions, in order to highlight the main difference, namely the fact that the residual term is not well controlled when $\gamma$ goes to zero in the classical proof.

As a consequence, using the definition of the SGD recursion (1),

Averaging over the first $k$ iterates yields:

The term on the right-hand part of Equation (23) is composed of a bias term (depending on the initial condition), a variance term, and a residual term. This residual term differentiates the general setting from the quadratic one (in which it does not appear, as the first order Taylor expansion of $f^{\prime}$ is exact). This decomposition has been used in to prove upper bound on the error, but does not allow for a tight decomposition in powers of $\gamma$ when $\gamma\to 0$ . Indeed, the residual $\theta_{i}^{(\gamma)}-\theta^{*}$ simply does not go to 0 when $\gamma\to 0$ : on the contrary, the chain becomes ill-conditioned when $\gamma=0$ .

Finally, averaging over the first $k$ iterations and taking $g=\operatorname{Id}$ give

This expansion is the root of the proof of Theorem 4, which formalizes the expansion as powers of $\gamma$ . The key difference between decomposition (23) and (24) is that in the latter, when $\gamma\to 0$ , the expectation of the residual term tends to 0 and can naturally be controlled.

Experiments

Conclusion

In this paper, we have used and developed Markov chain tools to analyze the behavior of constant step-size SGD, with a complete analysis of its convergence, outlining the effect of initial conditions, noise and step-sizes. For machine learning problems, this allows us to extend known results from least-squares to all loss functions. This analysis leads naturally to using Romberg-Richardson extrapolation, that provably improves the convergence behavior of the averaged SGD iterates. Our work opens up several avenues for future work: (a) show that Richardson-Romberg trick can be applied to the decreasing step-sizes setting, (b) study the extension of our results under self-concordance condition .

Postponed proofs

Assumption 4, made in the text, can be weakened in order to apply to settings where input observations are un-bounded (typically, Gaussian inputs would not satisfy Assumption 4). Especially, in many cases, we only need Assumption 7 below. Let $p\geq 2$ .

where $L$ is the same constant appearing in 2 and $f^{\prime}_{1}$ is defined by (4).

On the other hand, we consider also the stronger assumption that the noise is independent of $\theta$ (referred to as the “semi-stochastic” setting, see ), or more generally that the noise has a uniformly bounded fourth order moment.

2 Preliminary results

We preface the proofs of the main results by some technical lemmas.

Therefore by definition (26), $\psi_{\gamma}$ is Lipschitz continuous. Finally, it is straightforward to verify that $\psi_{\gamma}$ satisfies the stated properties.

Assume 1-2-3-4(2). Then we have for any $\gamma\in(0,2/L)$ .

where $(\theta_{k}^{(\gamma)})_{k\geq 0}$ is given by (1). Moreover, if $\gamma\in\left(0,1/L\right)$ , we have

The proof and result is very close to the ones from but we extend it without a.s. Lipschitzness (4) but with 7. Using 3-1 and $f^{\prime}(\theta^{*})=0$ , we have

In addition, under 3-7( $2$ ) and using (4), we have:

Combining this result and (30) concludes the proof of the first inequality.

Taking $M\to+\infty$ and applying the monotone convergence theorem concludes the proof. ∎

Using Section 6.2, we can extend Section 6.2 to functions $\phi$ which are locally Lipschitz.

For any step-size $\gamma\in(0,1/L)$ , it holds:

In this proof, $C\geq 0$ is a constant which can change from line to line.

By definition (26), $\psi_{\gamma}$ satisfies (32). Finally, it is straightforward to verify that $\psi_{\gamma}$ satisfies the stated properties.

It is worth pointing out that under Assumption 8 (the “semi-stochastic” assumption), a slightly different result holds. The following result underlines the difference between a stochastic noise and a semi-stochastic noise, especially the fact that the maximal step-size differs depending on this assumption made.

where $(\theta_{k}^{(\gamma)})_{k\geq 0}$ is given by (1).

So that finally, using (29), 3, (33), 2 and rearranging terms we get

Using that $\gamma\leq 2/(m+L)$ concludes the proof. ∎

We give uniform bound on the moments of the chain $(\theta_{k}^{(\gamma)})_{k\geq 0}$ for $\gamma>0$ . For $p\geq 1$ , recall that under 4(2p), the noise at optimal point has a moment of order $2p$ and we denote

We give a bound on the $p$ -order moment of the chain, under the assumption that the noise has a moment of order $2p$ .

For moment of order larger than $2$ , we have the following result.

Note that there is no contradiction between (35) and Theorem 4, as for any $p\geq 2$ , one has for $g(\theta)=\left\|\theta-\theta^{*}\right\|^{2}$ and $h_{g}$ the solution to the Poisson equation, that $h^{\prime\prime}_{g}(\theta^{*})=0$ , so that the first term in the development (of order $\gamma$ ) is indeed 0.

We upper bounds each term for $i,j,l\in\{0,\ldots,p\}$ , as follows:

For $i=p,j=l=0$ , we have $\delta_{k}^{2p}$ .

For $i=p-1,j=1,l=0$ , we have $p2\gamma\langle f_{k+1}^{\prime}(\theta_{k}),\theta_{k}-\theta^{*}\rangle\delta_{k}^{2(p-1)}$ , for which it holds by 3

Else, either $l\geq 1$ or $j\geq 2$ , thus $2l+j\geq 2$ . We first upper bound, by the Cauchy–Schwarz inequality:

Combining this result, (38) and (39) implies using $i+j+l\leq p$ ,

Note that using $j+2l\geq 2$ , for $\gamma$ such that $\gamma L<1/C_{p}$ , it holds

Therefore, we have combining this inequality, (37)-(40) in (36),

Using 1, for $j\in\{0,\ldots,p\}$ , $(\gamma\tau_{2p}\delta_{k})^{j}\leq 2(\gamma\tau_{2p})^{2j}+2(\delta_{k})^{2j}$ , we get

Note that using (41), $C_{p}\geq 2$ and $\mu\leq L$ , $(1-2\gamma\mu(1-\gamma LC_{p}/2))\geq(1-\gamma LC_{p}(1-\gamma LC_{p}/2))\geq 1/2$ . Using this inequality and $1-pt\leq(1-t)^{p}$ for $t\geq 0$ we get by (42) setting $\rho=(1-2\gamma\mu(1-\gamma LC_{p}/2))$ ,

A straightforward induction implies the first statement. The proof of (35) is similar to the one of (28) and is omitted. ∎

Let $\gamma\in(0,1/(LC_{p}))$ . Consider the two processes $(\theta_{k}^{(1)})_{\geq 0}$ , $(\theta_{k}^{(2)})_{k\geq 0}$ defined by (11) with $\lambda_{1}=\updelta_{\theta}$ and $\lambda_{2}=\updelta_{\vartheta}$ . By Jensen inequality, we have

Using Section 6.2, the Cauchy Schwarz and Minkowski inequalities and (13) we get

Combining this result and (43) concludes the proof.

3 Proof of Section 3.1

Taking the expectation, using 3, $\theta_{0}^{(\gamma)}$ is independent of $\varepsilon_{1}$ and $\pi_{\gamma}R_{\gamma}=\pi_{\gamma}$ , we get

Note that in the case of the regression setting described in Example 1, we can specify Section 3.1 as follows.

where $\mathbf{T}$ and $\xi_{1}$ are defined by (18) and (7) respectively.

The proof follows the same line as the proof of Section 3.1 and is omitted. ∎

4 Proof of Theorem 2

We preface the proof by a couple of preliminaries lemmas.

Assume 1-2-3-4( $6\vee 2k_{\varepsilon}$ )-5 and let $\gamma\in\left(0,2/L\right)$ . Then

where $\mathbf{A}$ is defined by (17), $\bar{\theta}_{\gamma}$ and $\mathcal{C}$ are given by (3) and (5) respectively.

It follows from Section 6.2, taking the integral with respect to $\pi_{\gamma}$ ,

Using (47), Section 6.2 and Hölder inequality, we get

Moreover, we have by a second order Taylor expansion with integral remainder of $f^{\prime}$ around $\theta^{*}$ ,

Taking the second order moment of this equation, and using 3, $\theta_{0}$ is independent of $\varepsilon_{1}$ , (49), Section 6.2 and Hölder inequality, we get

Assume 1-2-3-4( $6\vee[2(k_{\varepsilon}+1)]$ )-5. It holds as $\gamma\to 0$ ,

The proof consists in showing that the residual term in (45) of Section 6.4 is of order $O(\gamma^{2})$ and not only $O(\gamma^{3/2})$ . Note that we have already prove that $\bar{\theta}_{\gamma}-\theta^{*}=O(\gamma)$ . To find the next term in the development, we develop further each of the terms. By a fourth order Taylor expansion with integral remainder of $f^{\prime}$ around $\theta^{*}$ , and using 2, we have

Since $f^{\prime\prime}(\theta^{*})$ is invertible by 1, To get the next term in the development, we show that

(a) Denote for $i=0,1$ , $\eta_{i}=\theta_{i}-\theta^{*}$ . By (46)-(47), Section 6.2 and 3-4(12), we get

Using 1 and the same reasoning as to show that $\mathbf{A}$ in (17), is well defined, we get that $\mathbf{B}$ is invertible. Then since $\eta_{0}$ and $\eta_{1}$ has the same distribution $\pi_{\gamma}$ , we get

Combining this result and (45) implies (a).

(b) First, we have using (50), 3 and Section 6.2 that:

Since $\theta_{0}$ and $\theta_{1}$ follow the same distribution $\pi_{\gamma}$ , it follows that

Then by linearity of $f^{\prime\prime\prime}(\theta^{*})$ and using (a) we get (b).

Finally the proof of (15) follows from combining the results of (a)-(b) in (51).

5 Proof of Theorem 3

Theorem 3 follows from the following more general result taking $\varphi:\theta\mapsto\theta-\theta^{*}$ .

where $\psi_{\gamma}$ , $\varpi_{\gamma}$ , $\chi^{1}_{\gamma}$ , $\chi^{2}_{\gamma}$ are solutions of the Poisson equation (26) associated with $\varphi$ , $\psi_{\gamma}$ , $\psi_{\gamma}^{\otimes 2}$ and $(\psi_{\gamma}-\varphi)^{\otimes 2}$ respectively.

We now consider the Poisson solution associated with $\varphi\varphi^{\top}$ , $\chi^{3}_{\gamma}$ . By Section 6.2, such a function exists and satisfies $\pi_{\gamma}(\chi^{3}_{\gamma})=0$ , $R_{\gamma}^{k}\chi^{3}_{\gamma}(\theta_{0})=O(\rho^{k})$ . Therefore, we obtain using in addition the Markov property:

where $\chi^{4}_{\gamma}$ and $\chi^{5}_{\gamma}$ are the Poisson solution associated with $\varphi\psi_{\gamma}^{\top}$ and $\psi_{\gamma}\varphi^{\top}$ respectively. Combining (55)-(56) in (53), we obtain

In addition, by Section 6.2 and definition, we have for all $\theta_{0}$

Combining this result and (58) in (57) concludes the proof.

6 Proof of Section 3.2

In this section we apply Theorem 3 to the case of a quadratic function, more specifically to the LMS algorithm described in Example 1, to prove Section 3.2. Recall that the sequence of iterates can be written,

First note that with the notations of the text, and with $\gamma\leq 1/r^{2}$ , operator $(\Sigma\otimes\operatorname{Id}+\operatorname{Id}\otimes\Sigma-\gamma\mathbf{T})$ is a positive operator on the set of symmetric matrices, and is thus invertible.

We consider the linear function $\varphi$ which is $\varphi(\theta)=\theta-\theta^{*}$ , thus $\Phi_{k}=\bar{\theta}_{k}^{(\gamma)}-\theta^{*}$ . First, by Section 3.1, $\pi_{\gamma}(\varphi)=0$ . We have the following equalities:

Indeed, for (59) (other equations are basic linear algebra), starting from any $\theta_{0}$ :

Moreover, the expectation of $\varphi(\theta)^{\otimes 2}$ under the stationary distribution is known according to Section 3.1,

For the term proportional to $1/k^{2}$ , we first need to compute the function $\chi^{3}$ , solution to the Poisson equation associated with $\theta\mapsto\varphi(\theta)^{\otimes 2}$ .

Following the proof of Section 6.3, we have:

Formally, the simplification comes from the fact that we study an arithmetico-geometric recursion of the form $w_{k+1}=aw_{k}+b$ , $a<1$ , and study $\sum_{i=0}^{\infty}w_{k}-w_{\infty}=(1-a)^{-1}(w_{0}-w_{\infty})$ . (note that here we cannot apply the recursion with $(\Sigma\otimes\operatorname{Id}+\operatorname{Id}\otimes\Sigma-\gamma\Sigma\otimes\Sigma)$ because then “ $b$ ” would depend on $k$ .)

This term is the sum of the following three terms:

using $\psi_{\gamma}=(\gamma\Sigma)^{-1}\varphi$ , and $R_{\gamma}{}\psi_{\gamma}=\psi_{\gamma}-\varphi=-(\operatorname{Id}-(\gamma\Sigma)^{-1})\varphi$ . Finally,

With $\Omega=(\Sigma\otimes\operatorname{Id}+\operatorname{Id}\otimes\Sigma-\gamma\Sigma\otimes\Sigma)(\Sigma\otimes\operatorname{Id}+\operatorname{Id}\otimes\Sigma-\gamma T)^{-1}$ .

Combining (63), (64) and (65), we conclude the proof of Section 3.2.

7 Proof of Theorem 4

Before giving the proof of Theorem 4, we need several results regarding Poisson solutions associated with the gradient flow ODE (20).

where $\{\mathbf{f}_{1},\ldots,\mathbf{f}_{d}\}$ and $\{\lambda_{1},\ldots,\lambda_{d}\}$ are the eigenvectors and the eigenvalues of $f{{}^{\prime\prime}}(\theta^{*})$ respectively satisfying for all $i\in\{1,\ldots,d\}$ , $f{{}^{\prime\prime}}(\theta^{*})\mathbf{f}_{i}=\lambda_{i}\mathbf{f}_{i}$ .

This is a fundamental result on the regularity of flows of autonomous differential equations, see e.g. [24, Theorem 4.1 Chapter V]

with $\xi^{x}_{0}(\theta^{*})=x$ . The proof then follows from uniqueness of the solution of (66).

By c) and since $\theta^{*}$ is an equilibrium point we get that $\xi^{x_{1},x_{2}}_{t}(\theta^{*})$ satisfies

Therefore we get for all $i,j,l\in\{1,\ldots,d\}$ ,

This ordinary differential equation can be solved analytically which finishes the proof.

Note that $h_{g}$ is well-defined by Section 6.7.1-b) and since $g$ is assumed to be locally-Lipschitz. In addition by (20), $h_{g}$ satisfies

Note that $h_{\operatorname{Id}}$ is also well-defined by Section 6.7.1-b).

The proof then follows from Section 6.7.1-b).

The proof is a direct consequence of Section 6.7.1-c)-d) and (67).

7.2 Proof of Theorem 4

We preface the proof of the Theorem by two fundamental first estimates.

where $\theta_{k}^{(\gamma)}$ is the Markov chain starting from $\theta_{0}$ , defined by the recursion (1), and

for some constant $C\geq 0$ independent of $\gamma$ and $k$ .

where $s^{(\gamma)}_{i}\in\left[0,1\right]$ and $\Delta\theta_{i+1}^{(\gamma)}=\theta_{i+1}^{(\gamma)}-\theta_{i}^{(\gamma)}$ . Therefore by (68), we get

Taking the expectation and using 3, we have

The proof of (74) then follows from 2, 3, (73) and Section 6.2.

This result is a direct consequence of Section 6.2 and (a).

Acknowledgments

The authors would like to thank Éric Moulines and Arnak Dalalyan for helpful discussions. We acknowledge support from the chaire Economie des nouvelles donnees with the data science joint research initiative with the fonds AXA pour la recherche, and the Initiative de Recherche “Machine Learning for Large-Scale Insurance” from the Institut Louis Bachelier.