Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains

Aymeric Dieuleveut, Alain Durmus, Francis Bach

Introduction

We consider the minimization of an objective function given access to unbiased estimates of the function gradients. This key methodological problem has raised interest in different communities: in large-scale machine learning , optimization , and stochastic approximation . The most widely used algorithms are stochastic gradient descent (SGD), a.k.a. Robbins-Monro algorithm , and some of its modifications based on averaging of the iterates .

While the choice of the step-size may be done robustly in the deterministic case (see e.g. ), this remains a traditional theoretical and practical issue in the stochastic case. Indeed, early work suggested to use step-size decaying with the number kk of iterations as O(1/k)O(1/k) , but it appeared to be non-robust to ill-conditioning and slower decays such as O(1/k)O(1/\sqrt{k}) together with averaging lead to both good practical and theoretical performance .

We consider in this paper constant step-size SGD, which is often used in practice. Although the algorithm is not converging in general to the global optimum of the objective function, constant step-sizes come with benefits: (a) there is a single parameter value to set as opposed to the several choices of parameters to deal with decaying step-sizes, e.g. as 1/(k+)1/(\square k+\triangle)^{\circ}; the initial conditions are forgotten exponentially fast for well-conditioned (e.g. strongly convex) problems , and the performance, although not optimal, is sufficient in practice (in a machine learning set-up, being only 0.1% away from the optimal prediction often does not matter).

where ff is the objective function to minimize (in machine learning the generalization performance), εk+1(θk(γ))\varepsilon_{k+1}(\theta_{k}^{(\gamma)}) the zero-mean statistically independent noise (in machine learning, obtained from a single observation). Following , we leverage the property that the sequence of iterates (θk(γ))k0(\theta_{k}^{(\gamma)})_{k\geq 0} is an homogeneous Markov chain.

This interpretation allows us to capture the general behavior of the algorithm. In the strongly convex case, this Markov chain converges exponentially fast to a unique stationary distribution πγ\pi_{\gamma} (see Section 3) highlighting the facts that (a) initial conditions of the algorithms are forgotten quickly and (b) the algorithm does not converge to a point but oscillates around the mean of πγ\pi_{\gamma}. See an illustration in Figure 1 (left). It is known that the oscillations of the non-averaged iterates have an average magnitude of γ1/2\gamma^{1/2} .

Consider the process (θˉk(γ))k0(\bar{\theta}_{k}^{(\gamma)})_{k\geq 0} given for all k0k\geq 0 by

Then under appropriate conditions on the Markov chain (θk(γ))k0(\theta_{k}^{(\gamma)})_{k\geq 0}, a central limit theorem on (θˉk(γ))k0(\bar{\theta}_{k}^{(\gamma)})_{k\geq 0} holds which implies that θˉk(γ)\bar{\theta}_{k}^{(\gamma)} converges at rate O(1/k)O(1/\sqrt{k}) to

The deviation between θˉk(γ)\bar{\theta}_{k}^{(\gamma)} and the global optimum θ\theta^{*} is thus composed of a stochastic part θˉk(γ)θˉγ\bar{\theta}_{k}^{(\gamma)}-\bar{\theta}_{\gamma} and a deterministic part θˉγθ\bar{\theta}_{\gamma}-\theta^{*}.

For quadratic functions, it turns out that the deterministic part vanishes , that is, θˉγ=θ\bar{\theta}_{\gamma}=\theta^{*} and thus averaged SGD with a constant step-size does converge. However, it is not true for general objective functions where we can only show that θˉγθ=O(γ)\bar{\theta}_{\gamma}-\theta^{*}=O(\gamma), and this deviation is the reason why constant step-size SGD is not convergent.

In summary, we make the following contributions:

We show in Section 2 that Richardson-Romberg extrapolation may be used to get closer to the global optimum.

We bring and adapt in Section 3 tools from analysis of discretization of diffusion processes into the one of SGD and create new ones. We believe that this analogy and the associated ideas are interesting in their own right.

We show in Section 4 empirical improvements of the extrapolation schemes.

Main results

In this section, we describe the assumptions underlying our analysis, describe our main results and their implications.

We also consider the following conditions on the noise, for p2p\geq 2:

In other words, we assume that the covariance matrix θC(θ)\theta\mapsto C(\theta) is a regular enough function, which is satisfied in natural settings.

For all k1k\geq 1, ξk\xi_{k} does not depend on θ\theta. This two parts in the noise will appear in Section 3.2. Finally assume that there exists r0r\geq 0 such that

then 4(4)(4) is satisfied. This assumption is satisfied, e.g., for a.s. bounded data, or for data with bounded kurtosis, see for details.

2 Summary and discussion of main results

This expansion in the step-size γ\gamma shows that a Richardson-Romberg extrapolation can be used to have better estimates of θ\theta^{*}. Consider the average iterates (θˉ2γ(k))k0(\bar{\theta}_{2\gamma}^{(k)})_{k\geq 0} and (θˉk(γ))k0(\bar{\theta}_{k}^{(\gamma)})_{k\geq 0} associated with SGD with step size 2γ2\gamma and γ\gamma respectively. Then (9) shows that (2θˉk(γ)θˉk(2γ))k0(2\bar{\theta}_{k}^{(\gamma)}-\bar{\theta}_{k}^{(2\gamma)})_{k\geq 0} satisfies

and therefore is closer to the optimum θ\theta^{*}. This very simple trick improves the convergence by a factor of γ\gamma (at the expense of a slight increase of the variance). In practice, while the un-averaged gradient iterate θk(γ)\theta_{k}^{(\gamma)} saturates rapidly, θˉk(γ)\bar{\theta}_{k}^{(\gamma)} may already perform well enough to avoid saturation on real data-sets . The Richardson-Romberg extrapolated iterate 2θˉk(γ)θˉk(2γ)2\bar{\theta}_{k}^{(\gamma)}-\bar{\theta}_{k}^{(2\gamma)} very rarely reaches saturation in practice. This appears in synthetic experiments presented in Section 4. Moreover, this procedure only requires to compute two parallel SGD recursions, either with the same inputs, or with different ones, and is naturally parallelizable.

In Section 3.2, we give a quantitative version of a central limit theorem for (θˉk(γ))k0(\bar{\theta}_{k}^{(\gamma)})_{k\geq 0}, for a fixed γ>0\gamma>0 and kk going to ++\infty : under appropriate conditions, there exist constants B1(γ)B_{1}(\gamma) and B2(γ)B_{2}(\gamma) such that

Combining (9) and (10) characterizes the bias/variance trade-off of SGD used to estimate θ\theta^{*}.

3 Related work

The idea to study stochastic approximation algorithms using results and techniques from the Markov chain literature is not new. It goes back to , which shows under appropriate conditions that solutions of stochastic differential equations (SDE)

where (Bt)t0(B_{t})_{t\geq 0} is a dd-dimensional Brownian motion and (γt)t0(\gamma_{t})_{t\geq 0} is a one-dimensional positive function, limt+γt=0\lim_{t\to+\infty}\gamma_{t}=0, converge in probability to some minima of ff. An other example is which extends the classical Foster-Lyapunov criterion from Markov chain theory (see ) to study the stability of the LMS algorithm. In , the authors are interested in the convergence of the multidimensional Kohonen algorithm. They show that the Markov chain defined by this algorithm is uniformly ergodic and derive asymptotic properties on its limiting distribution.

To the authors knowledge, the use of the Richardson-Romberg method for stochastic approximation has only been considered in to recover the minimax rate for recursive estimation of time varying autoregressive process.

Several attempts have been made to improve convergence of SGD. proposed an online Newton algorithm which converges in practice to the optimal point with constant step-size but has no convergence guarantees. The quadratic case was studied by , for the (uniform) average iterate: the variance term is upper bounded by σ2d/n{\sigma^{2}d}/{n} and the squared bias term by θ2/(γn)\|\theta^{*}\|^{2}/(\gamma n). This last term was improved to Σ1/2θ2/(γn)2\|\Sigma^{-1/2}\theta^{*}\|^{2}/(\gamma n)^{2} by , showing that asymptotically, the bias term is negligible, see also . Analysis has been extended to “tail averaging” , to improve the dependence on the initial conditions. Note that this procedure can be seen as a Richardson-Romberg trick with respect to kk. Other strategies were suggested to improve the speed at which initial conditions were forgotten, for example using acceleration when the noise is additive . A criterion to check when SGD with constant step size is close to its limit distribution was recently proposed in .

In the context of discretization of ergodic diffusions, weak error estimates between the stationary distribution of the discretization and the invariant distribution of the associated diffusion have been first shown by and in the case of the Euler-Maruyama scheme. Then, suggested the use of Richardson-Romberg interpolation to improve the accuracy of estimates of integrals with respect to the invariant distribution of the diffusion. Extension of these results have been obtained for other types of discretization by and . We show in Section 3.3 that a weak error expansion in the step-size γ\gamma also holds for SGD between πγ\pi_{\gamma} and \updeltaθ\updelta_{\theta^{*}}. Interestingly as to the Euler-Maruyama discretization, SGD has a weak error of order γ\gamma. In addition, proposed and analyzed the use of Richardson-Romberg extrapolation applied to the stochastic gradient Langevin dynamics (SGLD) algorithm. This method introduced by combines SGD and the Euler-Maruyama discretization of the Langevin diffusion associated to a target probability measure . Note that this method is however completely different from SGD, in part because Gaussian noise of order γ1/2\gamma^{1/2} (instead of γ\gamma) is injected in SGD which changes the overall dynamics.

Detailed analysis

In this Section, we describe in detail our approach. A first step is to describe the existence of a unique stationary distribution πγ\pi_{\gamma} for the Markov chain (θk(γ))k0(\theta_{k}^{(\gamma)})_{k\geq 0} and the convergence of this Markov chain to πγ\pi_{\gamma} in the Wasserstein distance of order 22.

Note that using that θ0(1),θ0(2)\theta^{(1)}_{0},\theta^{(2)}_{0} are independent of ε1\varepsilon_{1}, we have for i,j{1,2}i,j\in\{1,2\} using 3, that

Since for all k0k\geq 0, the distribution of (θk(1),θk(2))(\theta_{k}^{(1)},\theta_{k}^{(2)}) belongs to Π(λ1Rγk,λ2Rγk)\Pi(\lambda_{1}R_{\gamma}^{k},\lambda_{2}R_{\gamma}^{k}), by definition of the Wasserstein distance we get

using (12) for i)i), 4(2)(2) for ii)ii), and finally 1 for iii)iii).

Thus by a straightforward induction, we get, setting ρ=(12μγ(1γL/2))\rho=(1-2\mu\gamma(1-\gamma L/2))

We show that the limit πγλ1\pi_{\gamma}^{\lambda_{1}} does not depend on λ1\lambda_{1}. Assume that there exists πγλ2\pi_{\gamma}^{\lambda_{2}} such that limk+W2(λ2Rγk,πγλ2)=0\lim_{k\to+\infty}W_{2}(\lambda_{2}R_{\gamma}^{k},\pi_{\gamma}^{\lambda_{2}})=0. By the triangle inequality

Thus by (13) and (14), taking the limits as k+k\to+\infty, we get W2(πγλ1,πγλ2)=0W_{2}(\pi_{\gamma}^{\lambda_{1}},\pi_{\gamma}^{\lambda_{2}})=0 and πγλ1=πγλ2\pi_{\gamma}^{\lambda_{1}}=\pi_{\gamma}^{\lambda_{2}}. The limit is thus the same for all initial distributions and is denoted by πγ\pi_{\gamma}.

Using (13) and (14), we get taking k+k\to+\infty, W2(πγRγ,πγ)=0W_{2}(\pi_{\gamma}R_{\gamma},\pi_{\gamma})=0 and πγRγ=πγ\pi_{\gamma}R_{\gamma}=\pi_{\gamma}. The fact that πγ\pi_{\gamma} is the unique stationary distribution is straightforward by contradiction and using (13).

Taking λ1=\updeltaθ\lambda_{1}=\updelta_{\theta}, λ2=πγ\lambda_{2}=\pi_{\gamma}, using the invariance of πγ\pi_{\gamma} and (13), we get (a).

When ff is a quadratic function, i.e. ff^{\prime} is affine, we have the following result.

Assume f=fΣf=f_{\Sigma}, fΣ:θΣ1/2(θθ)2/2f_{\Sigma}:\theta\mapsto\left\|\Sigma^{1/2}(\theta-\theta^{*})\right\|^{2}/2, where Σ\Sigma is a positive definite matrix, and 2-3-4(4). Let γ(0,2/L)\gamma\in\left(0,2/L\right). Then, it holds θˉγ=θ\bar{\theta}_{\gamma}=\theta^{*}, ΣI+IΣγΣΣ\Sigma\otimes I+I\otimes\Sigma-\gamma\Sigma\otimes\Sigma is invertible and

where θˉγ\bar{\theta}_{\gamma} and C\mathcal{C} are given by (3) and (5) respectively, and πγ\pi_{\gamma} is the invariant probability measure of RγR_{\gamma} given by Section 3.

While the quadratic case led to particularly simple expressions, in general, we can only get a first order development of these expectations as γ0\gamma\to 0. Note that it improved on , which shows a similar expansion but an error of order of O(γ3/2)O(\gamma^{3/2}).

Assume 1-2-3-4(6[2(kε+1)]6\vee[2(k_{\varepsilon}+1)])-5 and let γ(0,2/L)\gamma\in\left(0,2/L\right). Then f(θ)I+If(θ)f^{\prime\prime}(\theta^{*})\otimes I+I\otimes f^{\prime\prime}(\theta^{*}) is invertible and

θˉγ\bar{\theta}_{\gamma} and C\mathcal{C} are given by (3) and (5) respectively, and πγ\pi_{\gamma} is the invariant probability measure of RγR_{\gamma} given by Section 3.

This shows that γθˉγ\gamma\mapsto\bar{\theta}_{\gamma} is a differentiable function at γ=0\gamma=0. The “drift” θˉγθ\bar{\theta}_{\gamma}-\theta^{*} can be understood as an additional error occurring because the function is non quadratic (f(θ)0f^{\prime\prime\prime}(\theta^{*})\not=0) and the step-sizes are not decaying to zero. The mean under the limit distribution is at distance γ\gamma from θ\theta^{*}. In comparison, the final iterate oscillates in a sphere of radius proportional to γ\sqrt{\gamma}.

2 Expansion for a given γ>0𝛾0\gamma>0 when k𝑘k tends to +∞+\infty

ψγ\psi_{\gamma} the Poisson solution associated to φ:θθθ\varphi:\theta\mapsto\theta-\theta^{*},

ϖγ\varpi_{\gamma} the Poisson solution associated to θψγ(θ)\theta\mapsto\psi_{\gamma}(\theta),

χγ1\chi^{1}_{\gamma} the Poisson solution associated to θ(ψγ(θ))2\theta\mapsto(\psi_{\gamma}(\theta))^{\otimes 2},

χγ2\chi^{2}_{\gamma} the Poisson solution associated to θ((ψγφ)(θ))2\theta\mapsto((\psi_{\gamma}-\varphi)(\theta))^{\otimes 2}.

where θˉk(γ)\bar{\theta}_{k}^{(\gamma)}, θˉγ\bar{\theta}_{\gamma} are given by (2) and (3) respectively, and πγ\pi_{\gamma} is the invariant probability measure of RγR_{\gamma} given by Section 3.

In order to give the intuition of the proof and to underline how the associated Poisson solutions are introduced, we here sketch the proof of the first result. By definition of φ:θθθ\varphi:\theta\mapsto\theta-\theta^{*} and since ψγ\psi_{\gamma} satisfies (IdRγ)ψγ=φ(\operatorname{Id}-R_{\gamma})\psi_{\gamma}=\varphi, we have

Finally, we have that Rγkψγ(θ0)R_{\gamma}^{k}\psi_{\gamma}(\theta_{0}) converges to 0 at linear speed, using Section 3 and πγ(ψγ)=0\pi_{\gamma}(\psi_{\gamma})=0.

The formal and complete proof of this result is postponed to Section 6.5. ∎

This result gives an exact closed form for the asymptotic bias and variance, for a fixed γ\gamma, as kk\rightarrow\infty. Unfortunately, in the general case, it is neither possible to compute the Poisson solutions exactly, nor is it possible to prove a first order development of the limits as γ0\gamma\rightarrow 0.

When fΣf_{\Sigma} is a quadratic function, it is possible, for any γ>0\gamma>0, to compute ψγ\psi_{\gamma} and χγ1,2\chi^{1,2}_{\gamma} explicitly; we get the following decomposition of the error, which exactly recovers the result of or .

With Ω=(ΣI+IΣγΣΣ)(ΣI+IΣγT)1\Omega=(\Sigma\otimes I+I\otimes\Sigma-\gamma\Sigma\otimes\Sigma)(\Sigma\otimes I+I\otimes\Sigma-\gamma\mathbf{T})^{-1}, and

The proof is postponed to the supplementary paper , Section S3. ∎

The bound on the second order moment is composed of a variance term k1Σ1πγ(C)Σ1k^{-1}\Sigma^{-1}\pi_{\gamma}(\mathcal{C})\Sigma^{-1}, a bias term which decays as k2k^{-2}, and a non-positive residual term. Interestingly, the bias is if we start under the limit distribution.

3 Continuous interpretation of SGD and weak error expansion

Denote by (A,D(A))(\mathcal{A},D(\mathcal{A})), the infinitesimal generator associated with the flow (φt)t0(\varphi_{t})_{t\geq 0} defined by

where θk(γ)\theta_{k}^{(\gamma)} is the Markov chain starting from θ0\theta_{0} and defined by the recursion (1) and C\mathcal{C} is given by (5). In addition for some constant C0C\geq 0 independent of γ\gamma and kk, we have

4 Discussion

Classical proofs of convergence rely on another decomposition, originally proposed by and used in recent papers analyzing the averaged iterate . We here sketch the arguments of these decompositions, in order to highlight the main difference, namely the fact that the residual term is not well controlled when γ\gamma goes to zero in the classical proof.

As a consequence, using the definition of the SGD recursion (1),

Averaging over the first kk iterates yields:

The term on the right-hand part of Equation (23) is composed of a bias term (depending on the initial condition), a variance term, and a residual term. This residual term differentiates the general setting from the quadratic one (in which it does not appear, as the first order Taylor expansion of ff^{\prime} is exact). This decomposition has been used in to prove upper bound on the error, but does not allow for a tight decomposition in powers of γ\gamma when γ0\gamma\to 0. Indeed, the residual θi(γ)θ\theta_{i}^{(\gamma)}-\theta^{*} simply does not go to 0 when γ0\gamma\to 0: on the contrary, the chain becomes ill-conditioned when γ=0\gamma=0.

Finally, averaging over the first kk iterations and taking g=Idg=\operatorname{Id} give

This expansion is the root of the proof of Theorem 4, which formalizes the expansion as powers of γ\gamma. The key difference between decomposition (23) and (24) is that in the latter, when γ0\gamma\to 0, the expectation of the residual term tends to 0 and can naturally be controlled.

Experiments

Conclusion

In this paper, we have used and developed Markov chain tools to analyze the behavior of constant step-size SGD, with a complete analysis of its convergence, outlining the effect of initial conditions, noise and step-sizes. For machine learning problems, this allows us to extend known results from least-squares to all loss functions. This analysis leads naturally to using Romberg-Richardson extrapolation, that provably improves the convergence behavior of the averaged SGD iterates. Our work opens up several avenues for future work: (a) show that Richardson-Romberg trick can be applied to the decreasing step-sizes setting, (b) study the extension of our results under self-concordance condition .

Postponed proofs

Assumption 4, made in the text, can be weakened in order to apply to settings where input observations are un-bounded (typically, Gaussian inputs would not satisfy Assumption 4). Especially, in many cases, we only need Assumption 7 below. Let p2p\geq 2.

where LL is the same constant appearing in 2 and f1f^{\prime}_{1} is defined by (4).

On the other hand, we consider also the stronger assumption that the noise is independent of θ\theta (referred to as the “semi-stochastic” setting, see ), or more generally that the noise has a uniformly bounded fourth order moment.

2 Preliminary results

We preface the proofs of the main results by some technical lemmas.

Therefore by definition (26), ψγ\psi_{\gamma} is Lipschitz continuous. Finally, it is straightforward to verify that ψγ\psi_{\gamma} satisfies the stated properties.

Assume 1-2-3-4(2). Then we have for any γ(0,2/L)\gamma\in(0,2/L).

where (θk(γ))k0(\theta_{k}^{(\gamma)})_{k\geq 0} is given by (1). Moreover, if γ(0,1/L)\gamma\in\left(0,1/L\right), we have

The proof and result is very close to the ones from but we extend it without a.s. Lipschitzness (4) but with 7. Using 3-1 and f(θ)=0f^{\prime}(\theta^{*})=0, we have

In addition, under 3-7(22) and using (4), we have:

Combining this result and (30) concludes the proof of the first inequality.

Taking M+M\to+\infty and applying the monotone convergence theorem concludes the proof. ∎

Using Section 6.2, we can extend Section 6.2 to functions ϕ\phi which are locally Lipschitz.

For any step-size γ(0,1/L)\gamma\in(0,1/L), it holds:

In this proof, C0C\geq 0 is a constant which can change from line to line.

By definition (26), ψγ\psi_{\gamma} satisfies (32). Finally, it is straightforward to verify that ψγ\psi_{\gamma} satisfies the stated properties.

It is worth pointing out that under Assumption 8 (the “semi-stochastic” assumption), a slightly different result holds. The following result underlines the difference between a stochastic noise and a semi-stochastic noise, especially the fact that the maximal step-size differs depending on this assumption made.

where (θk(γ))k0(\theta_{k}^{(\gamma)})_{k\geq 0} is given by (1).

So that finally, using (29), 3, (33), 2 and rearranging terms we get

Using that γ2/(m+L)\gamma\leq 2/(m+L) concludes the proof. ∎

We give uniform bound on the moments of the chain (θk(γ))k0(\theta_{k}^{(\gamma)})_{k\geq 0} for γ>0\gamma>0. For p1p\geq 1, recall that under 4(2p), the noise at optimal point has a moment of order 2p2p and we denote

We give a bound on the pp-order moment of the chain, under the assumption that the noise has a moment of order 2p2p.

For moment of order larger than 22, we have the following result.

Note that there is no contradiction between (35) and Theorem 4, as for any p2p\geq 2, one has for g(θ)=θθ2g(\theta)=\left\|\theta-\theta^{*}\right\|^{2} and hgh_{g} the solution to the Poisson equation, that hg(θ)=0h^{\prime\prime}_{g}(\theta^{*})=0, so that the first term in the development (of order γ\gamma) is indeed 0.

We upper bounds each term for i,j,l{0,,p}i,j,l\in\{0,\ldots,p\}, as follows:

For i=p,j=l=0i=p,j=l=0, we have δk2p\delta_{k}^{2p}.

For i=p1,j=1,l=0i=p-1,j=1,l=0, we have p2γfk+1(θk),θkθδk2(p1)p2\gamma\langle f_{k+1}^{\prime}(\theta_{k}),\theta_{k}-\theta^{*}\rangle\delta_{k}^{2(p-1)}, for which it holds by 3

Else, either l1l\geq 1 or j2j\geq 2, thus 2l+j22l+j\geq 2. We first upper bound, by the Cauchy–Schwarz inequality:

Combining this result, (38) and (39) implies using i+j+lpi+j+l\leq p,

Note that using j+2l2j+2l\geq 2, for γ\gamma such that γL<1/Cp\gamma L<1/C_{p}, it holds

Therefore, we have combining this inequality, (37)-(40) in (36),

Using 1, for j{0,,p}j\in\{0,\ldots,p\}, (γτ2pδk)j2(γτ2p)2j+2(δk)2j(\gamma\tau_{2p}\delta_{k})^{j}\leq 2(\gamma\tau_{2p})^{2j}+2(\delta_{k})^{2j}, we get

Note that using (41), Cp2C_{p}\geq 2 and μL\mu\leq L, (12γμ(1γLCp/2))(1γLCp(1γLCp/2))1/2(1-2\gamma\mu(1-\gamma LC_{p}/2))\geq(1-\gamma LC_{p}(1-\gamma LC_{p}/2))\geq 1/2. Using this inequality and 1pt(1t)p1-pt\leq(1-t)^{p} for t0t\geq 0 we get by (42) setting ρ=(12γμ(1γLCp/2))\rho=(1-2\gamma\mu(1-\gamma LC_{p}/2)),

A straightforward induction implies the first statement. The proof of (35) is similar to the one of (28) and is omitted. ∎

Let γ(0,1/(LCp))\gamma\in(0,1/(LC_{p})). Consider the two processes (θk(1))0(\theta_{k}^{(1)})_{\geq 0},(θk(2))k0(\theta_{k}^{(2)})_{k\geq 0} defined by (11) with λ1=\updeltaθ\lambda_{1}=\updelta_{\theta} and λ2=\updeltaϑ\lambda_{2}=\updelta_{\vartheta}. By Jensen inequality, we have

Using Section 6.2, the Cauchy Schwarz and Minkowski inequalities and (13) we get

Combining this result and (43) concludes the proof.

3 Proof of Section 3.1

Taking the expectation, using 3, θ0(γ)\theta_{0}^{(\gamma)} is independent of ε1\varepsilon_{1} and πγRγ=πγ\pi_{\gamma}R_{\gamma}=\pi_{\gamma}, we get

Note that in the case of the regression setting described in Example 1, we can specify Section 3.1 as follows.

where T\mathbf{T} and ξ1\xi_{1} are defined by (18) and (7) respectively.

The proof follows the same line as the proof of Section 3.1 and is omitted. ∎

4 Proof of Theorem 2

We preface the proof by a couple of preliminaries lemmas.

Assume 1-2-3-4(62kε6\vee 2k_{\varepsilon})-5 and let γ(0,2/L)\gamma\in\left(0,2/L\right). Then

where A\mathbf{A} is defined by (17), θˉγ\bar{\theta}_{\gamma} and C\mathcal{C} are given by (3) and (5) respectively.

It follows from Section 6.2, taking the integral with respect to πγ\pi_{\gamma},

Using (47), Section 6.2 and Hölder inequality, we get

Moreover, we have by a second order Taylor expansion with integral remainder of ff^{\prime} around θ\theta^{*},

Taking the second order moment of this equation, and using 3, θ0\theta_{0} is independent of ε1\varepsilon_{1}, (49), Section 6.2 and Hölder inequality, we get

Assume 1-2-3-4(6[2(kε+1)]6\vee[2(k_{\varepsilon}+1)])-5. It holds as γ0\gamma\to 0,

The proof consists in showing that the residual term in (45) of Section 6.4 is of order O(γ2)O(\gamma^{2}) and not only O(γ3/2)O(\gamma^{3/2}). Note that we have already prove that θˉγθ=O(γ)\bar{\theta}_{\gamma}-\theta^{*}=O(\gamma). To find the next term in the development, we develop further each of the terms. By a fourth order Taylor expansion with integral remainder of ff^{\prime} around θ\theta^{*}, and using 2, we have

Since f(θ)f^{\prime\prime}(\theta^{*}) is invertible by 1, To get the next term in the development, we show that

(a) Denote for i=0,1i=0,1, ηi=θiθ\eta_{i}=\theta_{i}-\theta^{*}. By (46)-(47), Section 6.2 and 3-4(12), we get

Using 1 and the same reasoning as to show that A\mathbf{A} in (17), is well defined, we get that B\mathbf{B} is invertible. Then since η0\eta_{0} and η1\eta_{1} has the same distribution πγ\pi_{\gamma}, we get

Combining this result and (45) implies (a).

(b) First, we have using (50), 3 and Section 6.2 that:

Since θ0\theta_{0} and θ1\theta_{1} follow the same distribution πγ\pi_{\gamma}, it follows that

Then by linearity of f(θ)f^{\prime\prime\prime}(\theta^{*}) and using (a) we get (b).

Finally the proof of (15) follows from combining the results of (a)-(b) in (51).

5 Proof of Theorem 3

Theorem 3 follows from the following more general result taking φ:θθθ\varphi:\theta\mapsto\theta-\theta^{*}.

where ψγ\psi_{\gamma}, ϖγ\varpi_{\gamma}, χγ1\chi^{1}_{\gamma}, χγ2\chi^{2}_{\gamma} are solutions of the Poisson equation (26) associated with φ\varphi, ψγ\psi_{\gamma}, ψγ2\psi_{\gamma}^{\otimes 2} and (ψγφ)2(\psi_{\gamma}-\varphi)^{\otimes 2} respectively.

We now consider the Poisson solution associated with φφ\varphi\varphi^{\top}, χγ3\chi^{3}_{\gamma}. By Section 6.2, such a function exists and satisfies πγ(χγ3)=0\pi_{\gamma}(\chi^{3}_{\gamma})=0, Rγkχγ3(θ0)=O(ρk)R_{\gamma}^{k}\chi^{3}_{\gamma}(\theta_{0})=O(\rho^{k}). Therefore, we obtain using in addition the Markov property:

where χγ4\chi^{4}_{\gamma} and χγ5\chi^{5}_{\gamma} are the Poisson solution associated with φψγ\varphi\psi_{\gamma}^{\top} and ψγφ\psi_{\gamma}\varphi^{\top} respectively. Combining (55)-(56) in (53), we obtain

In addition, by Section 6.2 and definition, we have for all θ0\theta_{0}

Combining this result and (58) in (57) concludes the proof.

6 Proof of Section 3.2

In this section we apply Theorem 3 to the case of a quadratic function, more specifically to the LMS algorithm described in Example 1, to prove Section 3.2. Recall that the sequence of iterates can be written,

First note that with the notations of the text, and with γ1/r2\gamma\leq 1/r^{2}, operator (ΣId+IdΣγT)(\Sigma\otimes\operatorname{Id}+\operatorname{Id}\otimes\Sigma-\gamma\mathbf{T}) is a positive operator on the set of symmetric matrices, and is thus invertible.

We consider the linear function φ\varphi which is φ(θ)=θθ\varphi(\theta)=\theta-\theta^{*}, thus Φk=θˉk(γ)θ\Phi_{k}=\bar{\theta}_{k}^{(\gamma)}-\theta^{*}. First, by Section 3.1, πγ(φ)=0\pi_{\gamma}(\varphi)=0. We have the following equalities:

Indeed, for (59) (other equations are basic linear algebra), starting from any θ0\theta_{0}:

Moreover, the expectation of φ(θ)2\varphi(\theta)^{\otimes 2} under the stationary distribution is known according to Section 3.1,

For the term proportional to 1/k21/k^{2}, we first need to compute the function χ3\chi^{3}, solution to the Poisson equation associated with θφ(θ)2\theta\mapsto\varphi(\theta)^{\otimes 2}.

Following the proof of Section 6.3, we have:

Formally, the simplification comes from the fact that we study an arithmetico-geometric recursion of the form wk+1=awk+bw_{k+1}=aw_{k}+b, a<1a<1, and study i=0wkw=(1a)1(w0w)\sum_{i=0}^{\infty}w_{k}-w_{\infty}=(1-a)^{-1}(w_{0}-w_{\infty}). (note that here we cannot apply the recursion with (ΣId+IdΣγΣΣ)(\Sigma\otimes\operatorname{Id}+\operatorname{Id}\otimes\Sigma-\gamma\Sigma\otimes\Sigma) because then “bb” would depend on kk.)

This term is the sum of the following three terms:

using ψγ=(γΣ)1φ\psi_{\gamma}=(\gamma\Sigma)^{-1}\varphi, and Rγψγ=ψγφ=(Id(γΣ)1)φR_{\gamma}{}\psi_{\gamma}=\psi_{\gamma}-\varphi=-(\operatorname{Id}-(\gamma\Sigma)^{-1})\varphi. Finally,

With Ω=(ΣId+IdΣγΣΣ)(ΣId+IdΣγT)1\Omega=(\Sigma\otimes\operatorname{Id}+\operatorname{Id}\otimes\Sigma-\gamma\Sigma\otimes\Sigma)(\Sigma\otimes\operatorname{Id}+\operatorname{Id}\otimes\Sigma-\gamma T)^{-1}.

Combining (63), (64) and (65), we conclude the proof of Section 3.2.

7 Proof of Theorem 4

Before giving the proof of Theorem 4, we need several results regarding Poisson solutions associated with the gradient flow ODE (20).

where {f1,,fd}\{\mathbf{f}_{1},\ldots,\mathbf{f}_{d}\} and {λ1,,λd}\{\lambda_{1},\ldots,\lambda_{d}\} are the eigenvectors and the eigenvalues of f(θ)f{{}^{\prime\prime}}(\theta^{*}) respectively satisfying for all i{1,,d}i\in\{1,\ldots,d\}, f(θ)fi=λifif{{}^{\prime\prime}}(\theta^{*})\mathbf{f}_{i}=\lambda_{i}\mathbf{f}_{i}.

This is a fundamental result on the regularity of flows of autonomous differential equations, see e.g. [24, Theorem 4.1 Chapter V]

with ξ0x(θ)=x\xi^{x}_{0}(\theta^{*})=x. The proof then follows from uniqueness of the solution of (66).

By c) and since θ\theta^{*} is an equilibrium point we get that ξtx1,x2(θ)\xi^{x_{1},x_{2}}_{t}(\theta^{*}) satisfies

Therefore we get for all i,j,l{1,,d}i,j,l\in\{1,\ldots,d\},

This ordinary differential equation can be solved analytically which finishes the proof.

Note that hgh_{g} is well-defined by Section 6.7.1-b) and since gg is assumed to be locally-Lipschitz. In addition by (20), hgh_{g} satisfies

Note that hIdh_{\operatorname{Id}} is also well-defined by Section 6.7.1-b).

The proof then follows from Section 6.7.1-b).

The proof is a direct consequence of Section 6.7.1-c)-d) and (67).

7.2 Proof of Theorem 4

We preface the proof of the Theorem by two fundamental first estimates.

where θk(γ)\theta_{k}^{(\gamma)} is the Markov chain starting from θ0\theta_{0}, defined by the recursion (1), and

for some constant C0C\geq 0 independent of γ\gamma and kk.

where si(γ)[0,1]s^{(\gamma)}_{i}\in\left[0,1\right] and Δθi+1(γ)=θi+1(γ)θi(γ)\Delta\theta_{i+1}^{(\gamma)}=\theta_{i+1}^{(\gamma)}-\theta_{i}^{(\gamma)}. Therefore by (68), we get

Taking the expectation and using 3, we have

The proof of (74) then follows from 2, 3, (73) and Section 6.2.

This result is a direct consequence of Section 6.2 and (a).

Acknowledgments

The authors would like to thank Éric Moulines and Arnak Dalalyan for helpful discussions. We acknowledge support from the chaire Economie des nouvelles donnees with the data science joint research initiative with the fonds AXA pour la recherche, and the Initiative de Recherche “Machine Learning for Large-Scale Insurance” from the Institut Louis Bachelier.

References