Asymptotic study of stochastic adaptive algorithm in non-convex landscape

Sébastien Gadat, Ioana Gavra

Introduction

whose early success in the sixties has been at least rejuvenated if not resurrected with the development of massive learning problems, in the last fifteen years. We refer among other to [BB08, MB11] or to [Bac14] and the references therein for various applications in machine learning. Although being one of the state-of-the-art method to handle massive datasets, SGD suffers from several issues: difficulty to tune the step-size sequence or dependence on the gradient flow that may be lazy in flat areas, which is especially the case when looking at non-convex neural network problems.

Some popular improvements are commonly patched to the vanilla SGD, and among others we refer to the popular acceleration obtained with the Polyak-Ruppert averaging [PJ92, Rup88, MB11, CCGB17, GP20], variance reduction with mini-batch strategies (see e.g. [LRSB+12, JZ13]).

While these two last improvements do not modify the underlying gradient flow, other strategies rely on a modification of the dynamical system exploiting acceleration brought by momentum with second order terms. The first historical example is the Heavy Ball with Friction optimization based on the seminal contribution [Pol64] and then translated into a stochastic framework (see e.g. [GPS18, SGD20, LR20]). Another second example is the Nesterov Accelerated Gradient Descent (see e.g. [Nes83]) translated and studied in the noisy situation recently in a large number of works (see among other [GL16, JNJ18]).

To the best of our knowledge, there is little convergence mathematical results on adaptive algorithms: [BdSG20] studies the deterministic dynamical system behind adaptive algorithms and obtained long-time behaviour of the trajectories or the value function following ideas of [CEG09a, SBC16]. More recently, [BB20] (that is more closely related to us) obtains the almost sure convergence of their algorithms towards critical points with a parametrization that is different from our but the authors leave as an open problem the important question of the convergence towards a minimizer of ff.The same week we sent our paper on Arxiv, [BBHS20] also published some results on the trap avoidance of adaptive algorithms but not consider the mini-batch effect that is known to be a crucial ingredient for the efficiency of adaptive methods. We refer to Theorem 2 and 3 below for the conditions we obtained on the mini-batch sequence. Finally, some recent contributions in machine learning [WWB19, ZSJ+19, DBBU20] address some “convergence” questions for adaptive algorithms with constant step-size. They provide a non-asymptotic study with a step-size that is tuned according to the finite horizon of simulation. Even though these results are of major interest from a numerical point of view, they do not really answer the question of convergence from a trajectorial point of view (see Section 2.2 below). The objective of this work may be seen as modest at the moment: we aim to study the asymptotic behaviour of Adagrad and Rmsprop i.e. we aim to show the almost sure convergence towards a local minimizer of the objective function ff. However limited at first sight, we will see that the convergence of the trajectories outside local traps is already challenging, especially when a mini-batch strategy is used.

Adaptive algorithms and main results

The algorithm we consider in this paper use the vectorial division/multiplication notations introduced in Adagrad (see [DHS11]) and now widely used in machine learning. The vectorial division uv\frac{u}{v} and multiplication uvu\cdot v are the coordinate per coordinate operations introduced by:

In the meantime, the notation u2u^{\odot 2} corresponds to the coordinate per coordinate square:

whereas u\sqrt{u} denotes the coordinate per coordinate square root:

where (gn)n1(g_{n})_{n\geq 1} corresponds to a noisy stochastic evaluation of the gradient of the function ff, corrupted by an additive noise sequence (ξn+1)n1(\xi_{n+1})_{n\geq 1}:

Following the initial vectorial notations, we emphasize that (1) means that for any coordinate i{1,,d}i\in\{1,\ldots,d\}, the position (θn)n1(\theta_{n})_{n\geq 1} and scaling factor (wn)n1(w_{n})_{n\geq 1} are updated according to:

2 Link with other parametrizations

We discuss here on our choice of the Adagrad/Rmsprop parametrization (1) using the one of [BdSG20], and its link with the standard parametrization introduced in [DHS11] or [HSS12] and used in later works [DBBU20, ZSJ+19] for Adam and in [WWB19] for Adagrad. We have chosen to use this formulation, which is inspired from the limiting O.D.E. of the continuous time adaptive gradient system following previous works on accelerated or second order dynamics and among other we refer to Memory gradient diffusion [GP14], Ruppert-Polyak averaging [GP20], Heavy Ball systems [AGR00, CEG09a, CEG09b, GPS18] or more generally Nesterov acceleration [Nes04, SBC16, ACR19] and dissipative systems [Har91, AABR02].

The pioneering works [DHS11] and [HSS12] use the following parametrization:

for β2(0,1)\beta_{2}\in(0,1) and when no heavy ball momentum (see e.g. [Pol64]) is used in the algorithm (which is also the case we are considering in this work). Notice that β2\beta_{2} may depend on the current iteration in a second stage and for the sake of completeness, we consider a general sequence (β2(n))n1(\beta_{2}(n))_{n\geq 1}.

We introduce the natural normalizing sequence (Sn)n1(S_{n})_{n\geq 1}, defined by S0=1S_{0}=1 and the following recursion:

We are led to introduce w~n=vn/Sn\widetilde{w}_{n}=v_{n}/S_{n} and ε~n=ε~/Sn\widetilde{\varepsilon}_{n}=\widetilde{\varepsilon}/S_{n} and we observe that:

whereas the second coordinate evolves according to:

2.2 Two-time scale parametrization

We deduce that the popular parametrization introduced in the seminal contributions of Adagrad, Rmsprop or ADAM and the one we used in our paper are equivalent and fall into our framework described in Equation (1) with possibly two time-scales on the system (θ~n,w~n)n1(\widetilde{\theta}_{n},\widetilde{w}_{n})_{n\geq 1}:

The choice of the sequence (β2(n))n1(\beta_{2}(n))_{n\geq 1} is key to understand the stochastic algorithm we obtain in (4).

\bullet Case constant β2=1\beta_{2}=1 (Adagrad of [DHS11]) This case is certainly the easiest to understand since the natural rescaling SnS_{n} of the sequence (vn)n1(v_{n})_{n\geq 1} is Sn=nS_{n}=n. In this case, we recover a joint evolution:

which entails γn+1=αn+1n\gamma_{n+1}=\frac{\alpha_{n+1}}{\sqrt{n}} and γ~n+1=1n+1\widetilde{\gamma}_{n+1}=\frac{1}{n+1}. With the choice of [DBBU20], we then obtain a constant step-size stochastic algorithm for the coordinate (θn)n1(\theta_{n})_{n\geq 1} associated with a uniform Cesaro averaging on the sequence of past squared gradients (gn+12)n1(g_{n+1}^{\odot 2})_{n\geq 1}.

\bullet Case constant β2(0,1)\beta_{2}\in(0,1) (Adam of [KB15] with no momentum). Since SnS_{n} converges exponentially fast towards (1β2)1(1-\beta_{2})^{-1}, the system is close to:

which entails γn+1=(1β2)αn+1\gamma_{n+1}=(1-\beta_{2})\alpha_{n+1} and γ~n+1=(1β2)\widetilde{\gamma}_{n+1}=(1-\beta_{2}).

\bullet Case β2(n)=1bnβ\beta_{2}(n)=1-bn^{-\beta} with b(0,1)b\in(0,1). This last case where the sequence goes to 11 with nn corresponds to an intermediary situation between β2=1\beta_{2}=1 and β2<1\beta_{2}<1, this transition being parametrized by β[0,+]\beta\in[0,+\infty]. We shall introduce the sequence of products:

It is well known (see e.g. [BCG20] Lemma 5.2) that when β=1\beta=1,

where Λ\Lambda can be made explicit in terms of the Riemann zeta function. We then conclude the following behaviour of (Sn)n1(S_{n})_{n\geq 1} (see Appendix B of [GPS18]) that:

which implies that the joint evolution shall be written as:

which entails γn+1=αn+1nβ12\gamma_{n+1}=\alpha_{n+1}n^{-\frac{\beta\wedge 1}{2}} and γ~n+1=n(β1)\widetilde{\gamma}_{n+1}=n^{-(\beta\wedge 1)}.

In all the situations above, we point out that we obtain some standard choices for the sequences (γn+1)n1(\gamma_{n+1})_{n\geq 1} and (γ~n+1)n1(\widetilde{\gamma}_{n+1})_{n\geq 1} involved in our two-time scale system (5).

2.3 Final remark on step-size sequences

In this work, we have chosen to restrict our study to a single time-scale parametrization with decreasing sequences (γn+1)n1=(γ~n+1)n1(\gamma_{n+1})_{n\geq 1}=(\widetilde{\gamma}_{n+1})_{n\geq 1} within the standard setup of stochastic algorithms:

This single time-scale restriction implies that Sn=αn+1\sqrt{S_{n}}=\alpha_{n+1}, so that when we choose in (1) pn=qn=1p_{n}=q_{n}=1 and γn+1=γ1(n+1)β\gamma_{n+1}=\gamma_{1}(n+1)^{-\beta}, our algorithm is strictly equivalent to the one initially introduced in (2) with SnnβS_{n}\propto n^{-\beta} and αn+1=nβ/2\alpha_{n+1}=n^{-\beta/2}. In particular, if β<1\beta<1, it corresponds to β2(n)=1bnβ\beta_{2}(n)=1-bn^{-\beta} while if β=1\beta=1, it corresponds to β2(n)=1\beta_{2}(n)=1.

We leave the more sophisticated general study of the two time-scale algorithm for future investiations and refer to [Bor97], [MP06] or [BCG20, CG20] for other examples of such two time-scale stochastic algorithms in various (but simpler) situations.

3 Assumptions and convenient notations

We introduce the canonical filtration associated to our random sequence Fn=σ((θk,wk)1kn)\mathcal{F}_{n}=\sigma\left((\theta_{k},w_{k})_{1\leq k\leq n}\right) and list below the main assumptions used in our work.

We use the symbols d,d\lesssim_{d},\gtrsim_{d} to refer to inequalities up to a multiplicative constant that are independent from the dimension dd: for two positive sequences (un)n0(u_{n})_{n\geq 0} and (vn)n0(v_{n})_{n\geq 0}, we write

and the constant CC is independent from the dimension of the ambient space dd. We will also use the notation un=Od(vn)u_{n}=\mathcal{O}_{d}(v_{n}) when undvnu_{n}\lesssim_{d}v_{n}. We also use the symbol \lesssim that refers to an inequality up to a multiplicative constant that can depend on dd, for the proof of the local trap avoidance since for this result we are not interested in a quantitative effect of the dimension.

We first describe our main assumption on the sequence (gn)n1(g_{n})_{n\geq 1}.

\bullet Assumption Hσp\mathbf{H}_{\sigma}^{p}. We assume that the sequence (gn)n1(g_{n})_{n\geq 1} used in (1) provides an unbiased estimation of the true gradient of ff at position θn\theta_{n}, i.e. we assume that:

We furthermore assume that the noise sequence (ξn+1)n1(\xi_{n+1})_{n\geq 1} satisfies:

where cc is a positive constant independent from dd. Assumption Hσp\mathbf{H}_{\sigma}^{p} stands for a classical framework in stochastic optimization methods: (σn)n1(\sigma_{n})_{n\geq 1} is an auxiliary sequence that translastes a possible use of mini-batches when σn0\sigma_{n}\longrightarrow 0 as n+n\longrightarrow+\infty. The moment assumption on (ζn)n1(\zeta_{n})_{n\geq 1} is the convenient assumption to handle standard problems like on-line regression, logistic regression or cascade of logistic regressions used in deep learning. We emphasize that we do not make any restrictive and somewhat irrealistic boundedness assumption of the noise (ζn)n1(\zeta_{n})_{n\geq 1} or of the sequence (θn)n1(\theta_{n})_{n\geq 1} itself. Below, we will use this assumption with p=4p=4 in Theorem 1. Finally, we should observe that this assumption introduces a possible linear dependency with dd on the size of the variance of the noise.

To derive the convergence of our algorithms towards a local minima, we will need a more stringent condition on the noise sequence. We then introduce the next assumption that will replace Hσp\mathbf{H}_{\sigma}^{p} in our second main result of almost sure convergence (see Theorem 3 below).

\bullet Assumption Hσ\mathbf{H}_{\sigma}^{\infty}.

(Hσ1)(\mathbf{H}_{\sigma}^{\infty}-1). The noise sequence (ξn+1)n1(\xi_{n+1})_{n\geq 1} is centered and satisfies:

(Hσ2)(\mathbf{H}_{\sigma}^{\infty}-2). The noise sequence is elliptic uniformly in nn:

We stress that the upper bound 11 on the second order moment is not restrictive, up to a modification of the calibration of the sequence (σn)n1(\sigma_{n})_{n\geq 1}. The second assumption will be of course used to exit local traps.

We now introduce some standard assumptions on ff.

(Hf1)(\mathbf{H}_{f}-1). The function ff is positive and coercive, i.e. ff satisfies:

Demanding the lower bound of ff to be strictly positive is mostly a convenient technical constraint and not fundamentally more restrictive than the classical assumption of positivity.

(Hf2)(\mathbf{H}_{f}-2). We assume that ff satisfies the so-called Lipschitz continuous gradient property:

We emphasize that this implies the famous descent inequality:

This assumption is commonly used in optimization theory and statistics. Even though it is possible to address some more sophisticated situations (see e.g. [BBT17]), it is generally admitted that most of machine learning optimization problems fall into the Lipschitz continuous gradient framework.

(Hf3)(\mathbf{H}_{f}-3). We also assume that another constant cfc_{f} exists such that:

This last assumption prevents from some too large growth of the function ff and it is immediate to verify that (Hf3)(\mathbf{H}_{f}-3) implies that ff has a subquadratic growth, i.e. limsupx+f(x)x2<+\lim\sup_{\|x\|\longrightarrow+\infty}\frac{f(x)}{\|x\|^{2}}<+\infty. It has been widely used in the literature of stochastic algorithm (see e.g. [GPS18] and the references therein).

We finally introduce our assumptions on the step-size sequences used all along the paper that are involved in (pn)n1,(qn)n1(p_{n})_{n\geq 1},(q_{n})_{n\geq 1} and (γn)n1(\gamma_{n})_{n\geq 1}. To easily assess some convergence results with quantitative conditions on our gain sequences, we will consider the situations where:

(HSteps1)(\mathbf{H}_{\text{Steps}}-1). The sequences (pn)n1(p_{n})_{n\geq 1} and (qn)n1(q_{n})_{n\geq 1} satisfy:

(HSteps2)(\mathbf{H}_{\text{Steps}}-2). The mini-batch sequence (σn)n1(\sigma_{n})_{n\geq 1} satisfies:

(HSteps3)(\mathbf{H}_{\text{Steps}}-3). As already discussed in Section 2.2.3, the sequence (γn)n0(\gamma_{n})_{n\geq 0} satisfies:

We point out that (pn)n1(p_{n})_{n\geq 1} and (σn)n1(\sigma_{n})_{n\geq 1} may be (or not) some vanishing sequences (if r>0r>0 and p=0p_{\infty}=0 or if s>0s>0).

4 Main results

We now state our three main convergence results for the stochastic algorithm defined in Equation (1).

Assume that Hf\mathbf{H}_{f}, HSteps\mathbf{H}_{\emph{Steps}} and Hσp\mathbf{H}_{\sigma}^{p} hold for p=4p=4. Then (θn,wn)n1(\theta_{n},w_{n})_{n\geq 1} converges almost surely towards (θ,0)(\theta_{\infty},0) where f(θ)=0\nabla f(\theta_{\infty})=0.

Theorem 1 is a purely asymptotic convergence results. It provides the convergence of our adaptive algorithm (1) towards a set of critical points under mild assumptions on the noise sequence and on the function ff. We emphasize that this results holds for a standard setup on stochastic algorithms with a decreasing learning rate (γn)n1(\gamma_{n})_{n\geq 1}. We observe that the essential condition involved in this result is the convergence of the series that depend on (γn,pn,σn2)(\gamma_{n},p_{n},\sigma_{n}^{2}). In particular, when γn=γ1nβ\gamma_{n}=\gamma_{1}n^{-\beta}, we observe that Theorem 1 holds when:

From a theoretical point of view, the less restrictive situation corresponds to the choice β=1\beta=1 since the series converges as soon as σn+12pn\sigma_{n+1}^{2}p_{n} decreases like log(n)2\log(n)^{-2}. It implies that either we need to use a very lengthy decrease of the update induced by (pn)n1(p_{n})_{n\geq 1}, or use a very lengthy increase of the minibatch proportional, with a batch of size log2(n)\log^{2}(n) at step nn. Of course, this last condition holds as soon as r+2s>0r+2s>0. When β\beta is chosen lower than 11, the condition becomes r+2s>1βr+2s>1-\beta, which may lead to a larger computational cost.

Using the point of view introduced in [GL13] to assess the computational cost of non-convex stochastic optimization, it is possible to derive a more quantitative result on the sequence (θn)n1(\theta_{n})_{n\geq 1}. This result is stated in terms of the expected value of the gradient of ff all along the algorithm. A δ\delta-approximation computational cost is then the number of samples that are necessary to obtain an average value below δ\delta.

Assume that Hf\mathbf{H}_{f} and Hσp\mathbf{H}_{\sigma}^{p} hold for p=4p=4 and consider an integer N>0N>0 and τ\tau an integer sampled uniformly over {1,,N}\{1,\ldots,N\}:

If γn=γ=1dN\gamma_{n}=\gamma=\frac{1}{d\sqrt{N}} and pn=qn=1Np_{n}=q_{n}=\frac{1}{\sqrt{N}} and σn2=1\sigma_{n}^{2}=1, then

and the computational cost to obtain a δ\delta-approximation is d2δ2d^{2}\delta^{-2}.

If γn=γ=1N\gamma_{n}=\gamma=\frac{1}{\sqrt{N}} and pn=qn=1p_{n}=q_{n}=1 and σn2=1dN\sigma_{n}^{2}=\frac{1}{d\sqrt{N}}, then

and the computational cost to obtain a δ\delta-approximation is of order dδ3d\delta^{-3}.

If γn=γ=1dN\gamma_{n}=\gamma=\frac{1}{\sqrt{dN}} and pn=qn=1dNp_{n}=q_{n}=\frac{1}{\sqrt{dN}} and σn2=1\sigma_{n}^{2}=1, then

and the computational cost to obtain a δ\delta-approximation is of order dδ2d\delta^{-2}.

If γn=γ=1N\gamma_{n}=\gamma=\frac{1}{\sqrt{N}} and pn=qn=1Np_{n}=q_{n}=\frac{1}{\sqrt{N}} and σn2=1d\sigma_{n}^{2}=\frac{1}{d}, then

and the computational cost to obtain a δ\delta-approximation is of order dδ2d\delta^{-2}.

We emphasize that this last result is not a real convergence result, which is indeed impossible to derive with a constant step-size stochastic algorithm. Nevertheless, it may be seen as a benchmark result following the usages in non-convex machine learning optimization. It is a convenient way to assess a mean square convergence of stochastic optimization algorithm with non-convex landscape (see e.g. [GL13]).

We recover in this result a more quantitative result that translates both the linear effect of the dimension on the “convergence” rate and the dependency of the final bound in terms of N1/2N^{-1/2} when the algorithm is randomly stopped uniformly between iteration 11 and NN. The presence of both dd and of N1/2N^{-1/2} is not surprising as it already appears to be the minimax rate of convergence in stochastic optimization with weakly convex landscapes (see e.g. [NY83]).

If we translate the upper bound of i)i) into a complexity bound, we observe that for any δ>0\delta>0, we need to fix NN such that

When we use a mini-batch strategy with dNd\sqrt{N} samples at each iteration such that the error bound produced by iv)iv) is lower than δ\delta, we observe that NN has to be chosen of the order δ2\delta^{-2} and the overall procedure may be improved (when compared to the first setting) since we obtain a dδ3d\delta^{-3} computational cost. Finally, the otimal tuning of the algorithm seems to be the last ones, where σ2\sigma^{2} is chosen of the order d1d^{-1} and γnpnN1\gamma_{n}\propto p_{n}\propto N^{-1}, or σ2=1\sigma^{2}=1 and γ=p=q=(dN)1/2\gamma=p=q=(dN)^{-1/2}, which leads to a dδ2d\delta^{-2} computational cost. As discussed in Section 3.2, with this strategy, it seems impossible to improve the dδ2d\delta^{-2} computational cost obtained with other choices of the parameters.

A such improvement comes from a careful tuning of a Lyapunov function that is not exactly the same as the one used in these previous works. We refer to Sections 3.1 and 3.2 for further details. Finally, we also point out that when the sequence (γn)n1(\gamma_{n})_{n\geq 1} is kept fixed, as indicated in the paragraph 2.2.2 it corresponds to a choice of β2<1\beta_{2}<1 kept constant all over the time evolution (see Equation (8)) and p=q=1Np=q=\frac{1}{\sqrt{N}}. This result with this range of parameters appears to be in line with those of [WWB19, DBBU20], but in our work the assumptions on the noise sequence and on the function ff are significantly weaker.

In this paragraph, we assess the almost sure convergence of the sequence (θn)n1(\theta_{n})_{n\geq 1} towards a local minimum of ff and state that the algorithm cannot converge towards an unstable (hyperbolic) point of the dynamical system, e.g. cannot converge towards a saddle point or a local maximum of ff.

Assume that Hf\mathbf{H}_{f}, HSteps\mathbf{H}_{\emph{Steps}} and Hσ\mathbf{H}_{\sigma}^{\infty} hold. Assume that ff is twice differentiable. Suppose p=0p_{\infty}=0, qnq=O(pn)|q_{n}-q_{\infty}|=\mathcal{O}(p_{n}), γn=γ1β\gamma_{n}=\gamma_{1}^{-\beta} and let (β,r,s)(\beta,r,s) be chosen such that:

Then almost surely the sequence (θn)n1(\theta_{n})_{n\geq 1} does not converge towards a local maximum of ff.

Several remarks are necessary after our last theorem, that identifies not only the limit points as the critical points of ff but as local minimizers. Hence, our contribution should be understood as a new example of stochastic method that avoids local traps, and then compared to [Pem90, BD96, GPS18, BBHS20].

Moreover, we point out that our result holds for every initialization point and do not use any integration over (θ0,w0)(\theta_{0},w_{0}). Hence, the nature of our result is different from the ones obtained in recent contributions in the field of machine learning (see e.g. [LSJR16, LPP+17]): we establish that our stochastic algorithm converges with probability 1 to a minimizer, which is different from proving that a deterministic or randomized algorithm (gradient descent in [LPP+17] for example) randomly initialized converges to a local minimum with probability 1.

When the variance of the noise sequence is kept fixed all along the iterations (no mini-batch is used, so that s=0s=0), the previous conditions on the parameters can be summarized as: β(1/2,1)\beta\in(1/2,1); r(1β,β)r\in(1-\beta,\beta), when β<2/3\beta<2/3 and r(β/2,β)r\in(\beta/2,\beta) when β2/3\beta\geq 2/3.

Finally, we should emphasize that this last result is rather difficult to obtain when the mini-batch parameter ss is chosen strictly greater than since it translates a possible vanishing level of noise when the number of iterations is increasing. Our assumption shows that the size of the mini-batch should not grow to fast (induced by the condition s(1β)/2s\leq(1-\beta)/2) to obtain the convergence towards a local minimizer of ff. Up to our knowledge, a such explicit phenomenon is new in the stochastic algorithm community. It would deserve further numerical or theoretical investigations to identify whether this condition is necessary to convert an almost sure convergence result towards critical points into a convergence result towards a stable point of the differential system. The limiting condition seems to be β=1,s1/2\beta=1,s\leq 1/2 and r[1/2,1]r\in[1/2,1]. As a really ambitious question, we leave this problem for future developments and up to our knowledge, the maximal size of the mini-batch that guarantees convergence to local minimizer is still an unresolved question even for the SGD.

5 Organization of the paper

The rest of the paper consists in showing the proofs of the previous results.

Theorems 1 and 2 are proven in Section 3. In particular, Proposition 3.1 studies the average one-step evolution of the algorithm through several key functions. This proposition permits to derive a Lyapunov function in Section 3.2 that translates both the average quantitative result of Theorem 2 and the asymptotic convergence result of Theorem 1. The main difficulties in these two results is to derive a mean reverting effect in terms of f(θn)2\|\nabla f(\theta_{n})\|^{2} without using some extra boundedness assumption, and to assess the influence of dd on the quantitative result.

Theorem 3 is a typical result of stochastic algorithms, and is inspired from the seminal contributions of [Pem90] and [BH95]. The proof is detailed in Section 4 and the cornerstone of this proof is the use of the stable/unstable manifold Lemma that provides an ad-hoc Lyapunov function of the dynamical system, denoted by η\eta in Proposition 4.1. We also refer to the recent contribution of [BBHS20] for another typical application to stochastic algorithms. The main novelty brought in our proof is to use the mini-batch low noise level and keep the a.s. escape of local maximum true. In particular, from a technical point of view, we take advantage of the boundedness series of Proposition 3.2, which is a key ingredient in the proof of Proposition 4.3.

Almost sure convergence to the set of critical points

The purpose of this section is to prove Theorem 1 and Theorem 2. In particular, we will obtain Theorem 2 during the proof of the almost sure convergence result as a specific point of Proposition 3.2, iii)iii). The basic ingredient of our proof relies on the Robbins-Siegmund Theorem [RS71] that will be applied with the help of an ad-hoc Lyapunov function on (θn,wn)n1(\theta_{n},w_{n})_{n\geq 1}.

Below, we will pay a specific attention to the dimension dependency in the inequalities we will obtain. We first state the following proposition that will create a mean reverting effect from iteration nn to iteration n+1n+1 on the pair (θn,wn)(\theta_{n},w_{n}).

Assume that γnqn<1\|\gamma_{n}q_{n}\|_{\infty}<1, that pn<+\|p_{n}\|_{\infty}<+\infty, that Hf\mathbf{H}_{f} and Hσp\mathbf{H}_{\sigma}^{p} hold for p=4p=4.

A constant c1,εc_{1,\varepsilon} independent from nn and {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}d} exists such that for any n1n\geq 1:

then a constant c2,εc_{2,\varepsilon} independent from nn and dd exists such that:

then a constant c3,εc_{3,\varepsilon} independent from nn and dd exists such that:

\bullet Proof of i)i): We observe that:

where the last line comes from the definition of ξn+1=σn+1ζn+1\xi_{n+1}=\sigma_{n+1}\zeta_{n+1} and the fact that:

\bullet Proof of ii)ii): We develop f(θn+1)f(\theta_{n+1}): we use the descent inequality (12):

where in the last line we used a rough upper bound on 1/wn+ε1/\sqrt{w_{n}+\varepsilon} and a+b22a2+2b2|a+b|^{2}\leq 2a^{2}+2b^{2}. It is also possible to use (12) and prove the lower bound:

To deduce ii)ii), we then consider the conditional expectation in (14), use Equation (13) and assumption Hσp\mathbf{H}_{\sigma}^{p} for p=2p=2 to obtain that a constant cϵ=L2(1cf)/εc_{\epsilon}=L^{2}(1\vee c_{f})/\varepsilon exists such that:

We emphasize that at this stage, this last inequality does not create any repelling effect on the position (θn)n1(\theta_{n})_{n\geq 1} and we need to deal with the denominator (wn+ε)(w_{n}+\varepsilon), which is the purpose of iii)iii). \diamond

\bullet Proof of iii)iii): We consider the evolution of f2f^{2} from θn\theta_{n} to θn+1\theta_{n+1}. Using (14) and (15) obtained in ii)ii), we deduce that:

We now expand {mn}2\{m^{-}_{n}\}^{2} and {mn+}2\{m^{+}_{n}\}^{2}, using that aa=aa\vee-a=|a|, we get:

When taking the conditional expectation in the previous inequality we treat some of these terms separately. First, using the centering of ξn+1\xi_{n+1} we have that :

where in the second line we used Cauchy-Schwarz inequality and assumption Hσp\mathbf{H}_{\sigma}^{p} for p=2p=2 whereas the last line comes from sub-quadratic growth assumption on the function ff given by (13). This last bound can also be used to control the term L2γn+12f(θn)gn+1wn+ε2L^{2}\gamma_{n+1}^{2}f(\theta_{n})\left\|\frac{g_{n+1}}{\sqrt{w_{n}+\varepsilon}}\right\|^{2} since:

Now, in similar manner, using repeatitly these last two assumptions (Hσp\mathbf{H}_{\sigma}^{p} also with p=4p=4 ) :

Starting again with the Cauchy-Schwarz inequality for the last term we obtain :

We can now regroup the previous bounds, use equation \eqrefeq:intermediairehorrible\eqref{eq:intermediaire_horrible}, the fact that γn<1\gamma_{n}<1 and that ab1/2(a+b)\sqrt{ab}\leq 1/2(a+b) to conclude that a constant c2,εc_{2,\varepsilon}, independent of dd and nn exists such that :

\bullet Proof of iv)iv): We observe that:

where the last line comes from the fact that wni/(wni+ε)<1w_{n}^{i}/(w_{n}^{i}+\varepsilon)<1, which implies that the last term of the right hand side exists.

Using 1+a1+a/2\sqrt{1+a}\leq 1+a/2 for any a>1a>-1, we deduce that:

Now observe that the second term can easily be bounded by :

We use this last inequality and f(θn+1)mn+f(\theta_{n+1})\leq m_{n}^{+} to conclude that:

The baseline remark is that thanks to the Cauchy-Schwarz inequality, we have:

We then compute the conditional expectation of the previous terms with respect to Fn\mathcal{F}_{n}, using the previous inequality, the centering of ξn+1\xi_{n+1}, and obtain that a constant c3,εc_{3,\varepsilon} independent from dd exists such that:

We then study each terms in the large bracket separately. Using f1+f2f\leq 1+f^{2} and Hσp\mathbf{H}_{\sigma}^{p}, we have:

The second term is handled with the help of Hσp\mathbf{H}_{\sigma}^{p} (for p=3p=3) and the Cauchy-Schwarz inequality:

③ is Fn\mathcal{F}_{n} measurable and the subquadratic growth assumption given by (13) ensures that :

The term ④ is Fn\mathcal{F}_{n} measurable and we use once more that f2df\|\nabla f\|^{2}\lesssim_{d}f to obtain that:

For ⑤ we use Assumption Hσp\mathbf{H}_{\sigma}^{p} with p=2p=2 and obtain that:

⑥ is close to ③, we use Hσp\mathbf{H}_{\sigma}^{p} with p=2p=2 and obtain that:

The last line comes from the fact that as soon as ff is uniformly lower bounded by a positive constant, then (wn+ε)1/42df(θn)(wn+ε)1/42\|(w_{n}+\varepsilon)^{1/4}\|^{2}\lesssim_{d}f(\theta_{n})\|(w_{n}+\varepsilon)^{1/4}\|^{2}.

The term ⑦ is close to ⑤ and Hσp\mathbf{H}_{\sigma}^{p} yields:

For ⑧ Assumption Hσp\mathbf{H}_{\sigma}^{p} with p=4p=4 implies that:

Our bounds on ①-⑧ and the fact that (1+f(θn))22(1+f(θn)2)(1+f(\theta_{n}))^{2}\leq 2(1+f(\theta_{n})^{2}) ensure that a constant c3,εc_{3,\varepsilon} exists independent from dd such that:

Since γn+1σn+13d3/21/2(σn+12d+γn+12σn+14d2),\gamma_{n+1}\sigma_{n+1}^{3}d^{3/2}\leq 1/2(\sigma_{n+1}^{2}d+\gamma_{n+1}^{2}\sigma_{n+1}^{4}d^{2}), using the definition of tnt_{n}, we deduce that:

Proposition 3.1 will permit do derive a Lyapunov function on (θn,wn)n1(\theta_{n},w_{n})_{n\geq 1} (see the next result) which implies the convergence of

This kind of bound has also been obtained in [DBBU20] (Theorem 4) with the help of a somewhat artificial boundedness assumption of the noisy gradients, which is not used in our work. We also point out that [ZSJ+19] propose another function that generates a mean reverting term:

and the major difference with our result is the weaker 4/34/3 instead of 22 in the series. In particular, a such 4/34/3 will not allow to prove the a.s. asymptotic pseudo-trajectory result, and consequently the a.s. convergence of the trajectory towards a critical point of ff.

2 Proof of Theorem 2

Using Proposition 3.1, we are now ready to state the next important result, which will be key for the almost sure convergence of (θn)n1(\theta_{n})_{n\geq 1}.

Under the assumptions of Proposition 3.1, then:

Two constants c(θ0,w0)c(\theta_{0},w_{0}) and κ\kappa exist such that, for all n1n\geq 1

where (tn)n0(t_{n})_{n\geq 0} and (sn)n0(s_{n})_{n\geq 0} are the auxiliary sequences defined in Proposition 3.1.

If n1(γn+12+γn+1σn+12pn)<+\sum_{n\geq 1}(\gamma_{n+1}^{2}+\gamma_{n+1}\sigma_{n+1}^{2}p_{n})<+\infty , then almost surely:

\bullet Proof of i)i). Our proof relies on a Lyapunov function defined by:

Using i)i), iii)iii) and iv)iv) of Proposition 3.1 and the fact that f(θn)1+f(θn)2f(\theta_{n})\leq 1+f(\theta_{n})^{2}, we deduce that a constant κ\kappa that depends on c1,ε,c2,ε,c3,εc_{1,\varepsilon},c_{2,\varepsilon},c_{3,\varepsilon} and of the next choice of aa and bb exists such that:

We observe that for any vector uu, we have u4u2\|\sqrt{|u|}\|^{4}\geq\|u\|^{2} and that (pn)n1(p_{n})_{n\geq 1} is a bounded sequence, so that we can find bb large enough to have b>2supn1pnb>2\sup_{n\geq 1}p_{n}, and 2abpn2a\geq bp_{n} such that n1\forall n\geq 1:

We then conclude using a straightforward recursion. \diamond

\bullet Proof of ii)ii). This point proceeds with standard arguments: we use the Robbins Siegmund Lemma (see [RS71]): the series γn+12\sum\gamma_{n+1}^{2} and γn+1σn+12pn\sum\gamma_{n+1}\sigma_{n+1}^{2}p_{n} are convergent and obtain that:

More importantly, the next series are convergent:

This ends the proof of ii)ii) \diamond

We emphasize that i),ii)i),ii) are standard consequences of the Robbins-Siegmund approach on stochastic algorithms. In particular, the use of i)i) with the calibrations of (γn)1nN,(σn)1nN(\gamma_{n})_{1\leq n\leq N},(\sigma_{n})_{1\leq n\leq N} and (pn)1nN(p_{n})_{1\leq n\leq N} that are given in the statement of Theorem 2 instantaneously lead to the conclusion of the proof. We also point out that the basic fact to obtain this result is a tuning of the parameters thats leads to

Considering constant step-size sequences, it entails the following constraints on (p,γ,σ)(p,\gamma,\sigma):

whereas the size of NN needed to obtain a δ\delta approximation should verify that:

Hence, the computational cost is lower bounded by dδ2d\delta^{-2} and this bound may be achieved as soon as

which corresponds to the tuning described in iii)iii) and iv)iv) of Theorem 2.

3 Proof of Theorem 1

For the sake of convenience, from now on we denote Vn=Va,b(θn,wn)V_{n}=V_{a,b}(\theta_{n},w_{n}). Since we are now interested in purely asymptotic result, we omit the dependency with dd in the bounds we obtain hereafter. The main difficulty here is to convert the result of Proposition 3.2 ii)ii) into an a.s. convergence result on (θn)n1(\theta_{n})_{n\geq 1}. We remind the standard definition of asymptotic pseudo-trajectories of a semiflow Φ\Phi.

A continuous trajectory ZZ is a pseudo-trajectory of Φ\Phi if for any finite time horizon T>0T>0

We refer to [BH96, Ben99] for further details. We will use in particular the cornerstone result which is reminded below:

Denote τ0=0\tau_{0}=0, τn=k=1nγk\tau_{n}=\sum_{k=1}^{n}\gamma_{k} and consider (Zˉt)t0(\bar{Z}_{t})_{t\geq 0} the continuous time process corresponding to a linear interpolation of (Zn)n0(Z_{n})_{n\geq 0}, given by :

The Robbins-Siegmund Lemma ensures that the sequence (Vn)n1(V_{n})_{n\geq 1} converges a.s. to a finite random variable VV_{\infty}. Since all the terms of VnV_{n} are positive, (f(θn))n0(f(\theta_{n}))_{n\geq 0} and (wn)n0(\|w_{n}\|)_{n\geq 0} are a.s. bounded as well. The coercivity of ff implies thus that (Zn)n0(Z_{n})_{n\geq 0} is a.s. bounded. The evolution of the sequence (Zn)n0(Z_{n})_{n\geq 0} can be written as:

The next result allows us to control the rest term and ensures that the assumptions of Proposition 4.1 of [Ben99] are fulfilled, which in turn implies that (Zˉt)t0(\bar{Z}_{t})_{t\geq 0} is indeed an asymptotic pseudo trajectory of the flow induced by HH.

Under the assumptions of Proposition 3.1 and if γk+1σk+12pk<+\sum\gamma_{k+1}\sigma_{k+1}^{2}p_{k}<+\infty and (pn,qn)(p,q)(p_{n},q_{n})\longrightarrow(p_{\infty},q_{\infty}). Let N(n,t)=supk0{t+τnτk}N(n,t)=\sup_{k\geq 0}\{t+\tau_{n}\geq\tau_{k}\} and (en)n1(e_{n})_{n\geq 1} defined in Equation (22). For all T>0T>0:

We consider a finite horizon T>0T>0. In order to deal with the previous sum, we write ene_{n} as an+ΔMn+1+cna_{n}+\Delta M_{n+1}+c_{n}, with an=(0(pnp)f(θn)2(qnq)wn)a_{n}=\begin{pmatrix}0\\ (p_{n}-p_{\infty})\nabla f(\theta_{n})^{\odot 2}-(q_{n}-q_{\infty})w_{n}\end{pmatrix}, ΔMn+1=(f(θn)gn+1wn+ε0)\Delta M_{n+1}=\begin{pmatrix}\dfrac{\nabla f(\theta_{n})-g_{n+1}}{\sqrt{w_{n}+\varepsilon}}\\ 0\end{pmatrix} and cn=(0pn(f(θn)2gn+12)).c_{n}=\begin{pmatrix}0\\ p_{n}(\nabla f(\theta_{n})^{\odot 2}-g^{\odot 2}_{n+1})\end{pmatrix}. We use the fact that:

and proceed to upper bound each term on the right hand side.

\bullet The convergence of the first term is mainly a consequence of Equations (19), (20) (convergence of the Robbins-Siegmund series), of the convergence of (qn)n1(q_{n})_{n\geq 1} towards and qq_{\infty} and the fact that (pnp)n1(p_{n}-p_{\infty})_{n\geq 1} is a bounded sequence :

Using the convergence of the series (19) and (20) we conclude that, t>0\forall t>0This last limit holds regardless t<Tt<T:

\bullet To control the second term we observe that (ΔMk+1)k0(\Delta M_{k+1})_{k\geq 0} is a sequence of martingale increments with a bounded second order moment since

We start by decomposing ckc_{k} as its expected value plus a martingale increment:

Since n0γn+1pnσn+12<+\sum_{n\geq 0}\gamma_{n+1}p_{n}\sigma_{n+1}^{2}<+\infty the first sum convergences to when nn goes to infinity. The terms of the second sum are martingale increments:

Using the fact that (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}), we get a first bound on their second order moments:

Now using Hσp\mathbf{H}_{\sigma}^{p} for p=2p=2 and Inequality \eqrefeq:comparaison\eqref{eq:comparaison}:

The second term can be dealt with in a similar manner, using the assumption Hσp\mathbf{H}_{\sigma}^{p} for p=4p=4:

and we can apply once more Corollary 11 of [MP] to conclude that:

We now give the proof of the leading result of Section 3.

\bullet Identification of the possible limit points. The a.s. boundness of (Zn)n0(Z_{n})_{n\geq 0} and Lemma 3.3 show that the assumptions of Proposition 4.1 of [Ben99] hold, which implies that (Zˉt)t0(\bar{Z}_{t})_{t\geq 0} is an asymptotic pseudo-trajectory of the differential flow induced by HH almost surely. It implies in particular that (wn)n1(w_{n})_{n\geq 1} is a.s. bounded.

We shall deduct from Theorem 4 that all limit points Z=(θ,w)Z_{\infty}=(\theta_{\infty},w_{\infty}) of Zˉt\bar{Z}_{t} are stationary points for the differential equation z˙=H(z)\dot{z}=H(z) and thus that H(Z)=0H(Z_{\infty})=0. Since wn\|w_{n}\| is a.s. bounded, it implies that f(θ)=0\nabla f(\theta_{\infty})=0.

In the meantime, since f(θn)0\nabla f(\theta_{n})\to 0 when nn\to\infty, we observe that for any limit point ZZ_{\infty}, H(Z)=0H(Z_{\infty})=0 also implies that w=0w_{\infty}=0.

\bullet Convergence of (θn)n1(\theta_{n})_{n\geq 1}. Thus (since wnw_{n} is bounded) we have that wnw_{n} converges to . Hence:

Since aa and bb are non-negative and f(θn)f(\theta_{n}) positive, the last equality implies that the sequence (f(θn))n0(f(\theta_{n}))_{n\geq 0} is a.s. convergent:

Now, (θn)n0(\theta_{n})_{n\geq 0} is an a.s. bounded sequence and the set of possible limit points for its sub-sequences is connected since:

Since {θ:f(θ)=f}{θ:f(θ)=0}\{\theta:f(\theta)=f_{\infty}\}\cap\{\theta:\nabla f(\theta)=0\} is locally finite, we can conclude that (θn)n1(\theta_{n})_{n\geq 1} has a unique adherence point and is a convergent sequence:

Almost sure convergence to a local minimum

In this paragraph, we prove that the sequence (θn)n1(\theta_{n})_{n\geq 1} almost surely converges towards a local minimum of ff and cannot converge towards an unstable (hyperbolic) point of the dynamical system, e.g. cannot converge towards a saddle point or a local maximum of ff.

We begin with a simple statement that identifies the unstable points of the dynamical system

These equilibria may correspond to repulsive equilibria or hyperbolic points (saddle points). The next proposition makes this last sentence more precise and introduce the cornerstone function η\eta that measures in each neighborhood of an unstable (or saddle) point the distance of any point xx to the local stable manifold in the expanding direction.

Consider the dynamical system (27), and assume that ff is twice differentiable, then:

The equilibria are (t,0)(t,0) where tt is a critical point of ff.

If tt is a local maximum of ff, the dynamical system is unstable near (t,0)(t,0).

If tt is a local minimum of ff, the dynamical system is stable near (t,0)(t,0).

If tt is an unstable equilibria, a compact neighborhood N\mathcal{N} of (t,0)(t,0) and a function η\eta exist such that:

If E+E_{+} is the eigenspace associated to the negative eigenvalues of D2f(t)D^{2}f(t), then:

\bullet Proof of i)i). The proof of i)i) is immediate by observing that H(θ,w)=0H(\theta,w)=0 and q0q_{\infty}\neq 0 implies that f(θ)=0\nabla f(\theta)=0 and w=0w=0.

\bullet Proof of ii)ii) and iii)iii). We use a linearization of the drift around an equilibria (t,0)(t,0). Since tt is a critical point of ff, we observe that:

Consequently, we observe that (f(t+h))2=(D2f(t)h)2=O(h2)(\nabla f(t+h))^{2}=(D^{2}f(t)h)^{2}=\mathcal{O}\left(\|h\|^{2}\right), which entails:

The conclusion follows from the spectral decomposition of D2f(t)D^{2}f(t).

\bullet Proof of iv)iv). The last point is a consequence of Proposition 3.1 of [BH95] (see also Proposition 9.5 of [Ben99]). We only need to observe that the coordinates in u2u_{2} always correspond to the attractive manifold so that π+(u)=π+(u1)\pi_{+}(u)=\pi_{+}(u_{1}). ∎

2 Preliminary estimates

For a given integer n0n_{0} when (θn0,wn0)(\theta_{n_{0}},w_{n_{0}}) is in a neighborhood N\mathcal{N} (given by (iv) of Proposition 4.1 ) of an unstable point, we introduce the exit time of N\mathcal{N} defined by:

where (αn+1)n1(\alpha_{n+1})_{n\geq 1} will be made explicit later on, and the associated cumulative sum:

We prove the following upper bound on the second order moment of (Xn)nn0+1(X_{n})_{n\geq n_{0}+1}.

We decompose Xn+1X_{n+1} according to the position of nn with respect to Tn0T_{n_{0}}:

If nTn0n\geq T_{n_{0}}, there is nothing to prove.

We then consider the case when n<Tn0n<T_{n_{0}} and define m=supzNη(z)m=\sup_{z\in\mathcal{N}}\|\nabla\eta(z)\|. A first order Taylor expansion yields:

When n<Tn0n<T_{n_{0}}, the process Zn=(θn,wn)NZ_{n}=(\theta_{n},w_{n})\in\mathcal{N} so that wn2\|w_{n}\|^{2} is bounded. It remains to study the terms that involve gn+12\|g_{n+1}\|^{2} and gn+122\|g^{\odot 2}_{n+1}\|^{2}. We shall observe that:

because (σn)n1(\sigma_{n})_{n\geq 1} is bounded by definition and f\nabla f is continuous and N\mathcal{N} is compact.

Thus from the assumption Hσ\mathbf{H}_{\sigma}^{\infty} it follows that:

where KK is a constant that depends on η\nabla\eta, mm, ε\varepsilon, p\|p\|_{\infty} and q\|q\|_{\infty}. This ends the proof. ∎

Below, we will use the consequence of iv)(a)iv)-(a) of Proposition 4.1: if z=(θ,w)z=(\theta,w) is a point in N\mathcal{N}, then a constant Γ\Gamma exists such that:

We write the joint evolution of the algorithm (Zn)n1(Z_{n})_{n\geq 1} as:

where (Δn+1)n1(\Delta_{n+1})_{n\geq 1} is a sequence of martingale increment defined by:

ΔMn+1=ξn+1wn+ε\Delta M_{n+1}=-\frac{\xi_{n+1}}{\sqrt{w_{n}+\varepsilon}}

We introduce below the sequence (νn)n1(\nu_{n})_{n\geq 1} that stands for the convergence rate of (pn)n1(p_{n})_{n\geq 1} towards pp_{\infty} and (qn)n1(q_{n})_{n\geq 1} towards qq_{\infty}:

We start once more from the decomposition of (Xn)nn0+1(X_{n})_{n\geq n_{0}+1} before or after Tn0T_{n_{0}} and observe that:

The definition of the stopping time Tn0T_{n_{0}} ensures that when n<Tn0n<T_{n_{0}}, ZnZ_{n} belongs to the neighborhood N\mathcal{N} so we can apply Equation \eqrefIneg:alpha\eqref{Ineg: alpha} and obtain the following lower bound:

where the last inequality comes from (iv)c of Proposition 4.1 (Equation (28)). Using that η\|\nabla\eta\| is upper bounded on N\mathcal{N} by mm, the definition of Tn0T_{n_{0}} leads to:

This inequality associated with the Cauchy-Schwarz inequality implies that:

Observing that ξn+12=i=1dξn+1,i4i=1dξn+1,i2=ξn+12\|\xi_{n+1}^{\odot 2}\|=\sqrt{\sum_{i=1}^{d}\xi_{n+1,i}^{4}}\leq\sum_{i=1}^{d}\xi_{n+1,i}^{2}=\|\xi_{n+1}\|^{2}, we deduce that:

We then define k=sup(θ,w)Nf(θ)2+w<+k^{\prime}=\sup_{(\theta,w)\in\mathcal{N}}\|\nabla f(\theta)^{\odot 2}\|+\|w\|<+\infty (because N\mathcal{N} is compact and f\nabla f is continuous). We deduce that:

Concerning the last term, it is sufficient to show that the expected value is bounded. We start by using the fact that (a+b+c)24(a2+b2+c2)(a+b+c)^{2}\leq 4(a^{2}+b^{2}+c^{2}) to split the squared norm and to obtain that:

Since HH is continuous, (H(Zn)2)n0n<Tn0(\|H(Z_{n})\|^{2})_{n_{0}\leq n<T_{n_{0}}} is a bounded sequence. Moreover, we have seen that when n<Tn0n<T_{n_{0}}:

Again, using the compactness of N\mathcal{N}, the continuity of f\nabla f and the assumption Hσ\mathbf{H}_{\sigma}^{\infty} on the noise (ζn)n1(\zeta_{n})_{n\geq 1}, we deduct that K2>0K_{2}>0 exists such that:

We now gather all the terms and use the fact that the sequences (σn)n0(\sigma_{n})_{n\geq 0}, (pn)n0(p_{n})_{n\geq 0} and (νn)n0(\nu_{n})_{n\geq 0} are all bounded, to conclude that a constant K3K_{3} exists such that:

We now study the evolution of Sn2S_{n}^{2}.

Assume that δn=c(νn+pnσn+12+γn+1)\delta_{n}=c(\nu_{n}+p_{n}\sigma_{n+1}^{2}+\gamma_{n+1}) is such that γn+1δn2=o(γn+12σn+12αn2)\gamma_{n+1}\delta_{n}^{2}=o(\gamma_{n+1}^{2}\sigma_{n+1}^{2}\wedge\alpha_{n}^{2}) then:

where in the last line we used the reverting effect translated in Equation (28) and the Cauchy-Schwarz inequality. Since η\eta is positive, we then obtain that:

We denote the positive part of any real value aa by a+\lfloor a\rfloor_{+} and we use that aba+b+a\geq b\Longrightarrow\lfloor a\rfloor_{+}\geq\lfloor b\rfloor_{+} and the inequality

Considering Equation (38), we then observe that:

Once more the regularity of η\nabla\eta, the compactness of N\mathcal{N} and the definition of UnU_{n}, guarantee that a constant κ>0\kappa>0 exists such that:

Computing the conditional expectation and using the arguments of (36), we have

Using that when n<Tn0n<T_{n_{0}}, ZnNZ_{n}\in\mathcal{N}, we can apply iv)(b)iv)-(b) of Proposition 4.1 so that:

where π+\pi_{+} is the orthogonal projection on E+E_{+}, the eigenspace associated to the negative eigenvalues of D2f(t)D^{2}f(t). For nn large enough, the almost sure convergence of (wn)n0(w_{n})_{n\geq 0} to and our elliptic assumption Hσii)\mathbf{H}_{\sigma}^{\infty}-ii) on the sequence (ξn+1)n0(\xi_{n+1})_{n\geq 0} yield:

Decomposing now Xn+1=Xn+1+Xn+1+X_{n+1}=\lfloor X_{n+1}\rfloor_{+}-\lfloor-X_{n+1}\rfloor_{+}, we then deduce that:

The last line is justified by the assumption γn+1δn2=o(γn+12σn+12)\gamma_{n+1}\delta_{n}^{2}=o(\gamma_{n+1}^{2}\sigma_{n+1}^{2}) which ensures that δn=o(σn)\delta_{n}=o(\sigma_{n}). We shall also observe that:

Since γn+1δn2=o(γn+12σn+12αn2)\gamma_{n+1}\delta_{n}^{2}=o(\gamma_{n+1}^{2}\sigma_{n+1}^{2}\wedge\alpha_{n}^{2}), we can now conclude by inserting the previous bounds in the inequality \eqrefineg:Sn1\eqref{ineg: Sn 1}.

3 End of the proof

We mimic the strategy of Lemma 9.6 of [Ben99] and of Theorem 3.2 of [GPS18] with the help of the two sequences defined in (31) and (30).

We summarize the preliminary results proven in the previous subsection as they will we used in what follows:

Proposition 4.2 yields that if δn=cδ(νn+pnσn+12+γn+12)\delta_{n}=c_{\delta}(\nu_{n}+p_{n}\sigma_{n+1}^{2}+\gamma_{n+1}^{2}), then

Proposition 4.2 yields: if γn+1δn2=o(γn+12σn+12αn2)\gamma_{n+1}\delta_{n}^{2}=o(\gamma_{n+1}^{2}\sigma_{n+1}^{2}\wedge\alpha_{n}^{2}) (which also means that δn=o(σn)\delta_{n}=o(\sigma_{n})), then

For any integer qq, we consider an integer nqn\geq q and introduce the two sequences (un)nq(u_{n})_{n\geq q} and (Uˉn)nq(\bar{U}_{n})_{n\geq q} defined by:

With the framework of Theorem 3 in mind we suppose that

with β(1/2,1)\beta\in(1/2,1), s0s\geq 0 and r>0r>0. Let us point out other restrictions on the choices of these parameters in light of previous assumptions:

A first restriction follows directly from (HSteps3)(\mathbf{H}_{\text{Steps}}-3), namely that

(to ensure that γn+1pnσn+12<+\sum\gamma_{n+1}p_{n}\sigma_{n+1}^{2}<+\infty). Furthermore, Theorem 1 demands that γn/pn0\gamma_{n}/p_{n}\to 0, implying that

To ease notations it is convenient to assume that the sequence (qn)(q_{n}) converges to its limit q>0q_{\infty}>0 at least as fast as (pn)(p_{n}) goes to : νn=O(pn).\nu_{n}=\mathcal{O}(p_{n}). With this in mind, the condition imposed by Proposition 4.2, δn=o(γn+1σn+1)\delta_{n}=o(\sqrt{\gamma_{n+1}}\sigma_{n+1}) translates to r+s>β/2r+s>\beta/2, r>β/2+sr>\beta/2+s and 2β>β/2+s2\beta>\beta/2+s, ie β>2/3s\beta>2/3s. As soon β>s\beta>s all these conditions come down to:

Other restrictions will be added in the following propositions and are summarized in Theorem 3.

For any positive real value b>0b>0 and any integer q>0q>0, we consider the sequence of stopping times (Tbq)q0(T_{b}^{q})_{q\geq 0} defined by:

The stopping times TbqT_{b}^{q} stands for the first time the sequence (Sn)n1(S_{n})_{n\geq 1} becomes larger than the threshold (un)n1(\sqrt{u_{n}})_{n\geq 1}, which converges towards zero with a controlled rate. We prove the following result.

Assume that σi2σ12i2s\sigma_{i}^{2}\sim\sigma_{1}^{2}i^{-2s} with 0s<1/20\leq s<1/2 and γi=γ1iβ\gamma_{i}=\gamma_{1}i^{-\beta}, r>β/2+sr>\beta/2+s and choose (αn)n1(\alpha_{n})_{n\geq 1} as n0:αn+1=γn+1σn+1\forall n\geq 0\,:\alpha_{n+1}=\gamma_{n+1}\sigma_{n+1}. Then a small enough b>0b>0 exists and a large enough qq such that:

Our starting point is the lower bound given by Proposition 4.2 and we observe that a small enough a>0a>0 exists such that:

We consider an integer nq+1n\geq q+1 and apply the Optional Stopping Theorem and verify that:

where the last inequality is a consequence of the definition of the stopping time TbqT_{b}^{q}. Since (un)n0(u_{n})_{n\geq 0} is a decreasing sequence, we then have that:

and we are led to upper bound the term {XnTbq}2\{X_{n\wedge T_{b}^{q}}\}^{2}.

The definition of (Xn)n1(X_{n})_{n\geq 1} yields:

Using a similar argument as the one used in the proof of Proposition 4.2, a Taylor expansion associated with the smoothness of η\eta and ff leads to:

This last bound together with inequality (41) implies that a constant C>0C>0 exists such that:

We therefore deduce an upper bound on the probability that the stopping time is larger than nn:

We then take the limit n+n\longrightarrow+\infty in the previous inequality and obtain that:

because limn+un=0\lim_{n\longrightarrow+\infty}u_{n}=0. Choosing now αn+1=γn+1σn+1\alpha_{n+1}=\gamma_{n+1}\sigma_{n+1}, we then observe that uq=iqγi+12σi+12u_{q}=\sum_{i\geq q}\gamma_{i+1}^{2}\sigma_{i+1}^{2} and then the second term on the right hand side goes to zero as soon as:

Since we have chosen γq=γ1qβ\gamma_{q}=\gamma_{1}q^{-\beta} and σq=σ1qs\sigma_{q}=\sigma_{1}q^{-s}, we verify that a constant υ\upsilon exists such that:

Therefore, the condition γq2=o(uq)\gamma_{q}^{2}=o(u_{q}) boils down to 2β<12(β+s)s<1/2-2\beta<1-2(\beta+s)\Leftrightarrow s<1/2.

Hence, when s<1/2s<1/2 we can conclude the proof of the proposition by setting bb small enough (b<a/3Cb<a/3C for example). ∎

The next result states that SnS_{n} may remain larger than 12buq\frac{1}{2}\sqrt{bu_{q}} with a positive probability when SqbuqS_{q}\geq\sqrt{bu_{q}}. For this purpose, we introduce

that stands for the first time (Sn)nq(S_{n})_{n\geq q} comes back below the threshold buq\sqrt{bu_{q}}.

Assume that αn+1=σn+1γn+1\alpha_{n+1}=\sigma_{n+1}\gamma_{n+1} and β+sr<12\beta+s-r<\frac{1}{2}, β12s\beta\leq 1-2s and that νn=o(γnσn)\nu_{n}=o(\sqrt{\gamma_{n}}\sigma_{n}). Then there exits qq large enough and a constant c>0c>0 such that

Since δnpnσn2+νn+γn\delta_{n}\sim p_{n}\sigma_{n}^{2}+\nu_{n}+\gamma_{n}, we conclude with our settings that: δnnr+nβ.\delta_{n}\sim n^{-r}+n^{-\beta}. Moreover, the choice αi+1=γi+1σi+1\alpha_{i+1}=\gamma_{i+1}\sigma_{i+1} implies that:

We then observe that δn=o(uq)\delta_{n}=o(\sqrt{u_{q}}) as soon as:

since β>2(β+s)12\beta>\frac{2(\beta+s)-1}{2} always holds when s<12s<\frac{1}{2}.

From our previous remark, we observe that for qq large enough if SqS_{q} is greater than 12buq\frac{1}{2}\sqrt{bu_{q}} then SqδqS_{q}\geq\delta_{q}, so Proposition 4.2 implies that (SnS)nq(S_{n\wedge\mathcal{S}})_{n\geq q} is a submartingale. For such a choice of nn and qq, we can use the Doob decomposition and write that:

where InI_{n} is an increasing Fn\mathcal{F}_{n} predictable process with Iq=0I_{q}=0 and WnW_{n} is a martingale. We then observe that

Furthermore, if SqbuqS_{q}\geq\sqrt{bu_{q}}, then WqbuqW_{q}\geq\sqrt{bu_{q}}, which entails:

Using the fact that (Wn)nq(W_{n})_{n\geq q} is a martingale and the definition of SnS_{n}, one can verify that:

The upper bound obtained in Proposition 4.2 is not sharp enough to be directly applied here, so in order to deal with the term on the right hand side we return to the upper bound obtained in the proof of Proposition 4.2 which gives for all ii:

The sequence (σi)i0(\sigma_{i})_{i\geq 0} is bounded, (pi)i0(p_{i})_{i\geq 0} converges to and Theorem 1 shows that (f(θi))i0(\|\nabla f(\theta_{i})\|)_{i\geq 0} converge almost surely to , so there exists q>0q>0 such that iq\forall i\geq q:

Inserting this into our previous bound (43) we get that (for qq large enough and iqi\geq q)

Since αi+1=γi+1σi+1\alpha_{i+1}=\gamma_{i+1}\sigma_{i+1} we have that:

The first series of the right-hand side is handled with the simple bound i=qn1γi+12σi+12q12(β+s)\sum_{i=q}^{n-1}\gamma_{i+1}^{2}\sigma_{i+1}^{2}\lesssim q^{1-2(\beta+s)}. In dealing with the second term of the right-hand side we use the almost sure convergence of the series stated in (18) of Proposition 3.2 and the fact that (γn)n0(\gamma_{n})_{n\geq 0} is a decreasing sequence:

Thus there exists a constant a>0a>0 such that for all nqn\geq q:

Setting t=a(uq+γq)/ht=a(u_{q}+\gamma_{q})/h in the last term we have :

Now when h=1/2buqh=1/2\sqrt{bu_{q}} we obtain that :

As soon as γquq\gamma_{q}\lesssim u_{q}, meaning that β12(β+s)-\beta\leq 1-2(\beta+s), there exists a constant c>0c>0 such that for qq large enough:

Inserting this bound in \eqrefineg:Sinf\eqref{ineg: Sinf} ends the proof.

At this point, Propositions 4.3 and 4.3 enable us to prove that the sequence (Sn)(S_{n}) does not converge to a.s. using the same arguments as [Ben99] and then we can conclude the proof of Theorem 3.

When the variance of the noise sequence does not converge to (s=0s=0), the previous conditions on the parameters can be summarized as : β(1/2,1)\beta\in(1/2,1); r(1β,β)r\in(1-\beta,\beta), when β<2/3\beta<2/3 and r(β/2,β)r\in(\beta/2,\beta) for β2/3\beta\geq 2/3.

Step 1: SnS_{n} does not converge to a.s. Let G\mathcal{G} denote the event:

In the meantime, if S=+\mathcal{S}=+\infty then (Sn)(S_{n}) does not converge to , so {S=+}G\{\mathcal{S}=+\infty\}\subset\mathcal{G}. For qq large enough, such that Proposition 4.3 holds, and for all nqn\geq q :

Thus, if we consider all the integers nn larger than qq, we obtain:

We now apply Proposition 4.3 and obtain that:

Step 2: the algorithm escapes any neighborhood of an unstable point in a finite time a.s.

Theorem 1 ensures that (θn,wn)(\theta_{n},w_{n}) converges a.s to a point (θ,0)(\theta_{\infty},0). This together with the regularity of the function η\eta implies that the sequence SnS_{n} goes to η(θ,0)\eta(\theta_{\infty},0) when n+n\to+\infty. Since N\mathcal{N} is compact, the limit point (θ,0)(\theta_{\infty},0) belongs to N\mathcal{N} and according to Proposition 4.1 c) there exists k>0k>0 such that:

As seen in the proof of Theorem 1, the limit point (θ,0)(\theta_{\infty},0) is almost surely an equilibrium point for the dynamical system driven by HH, so H(θ,0)=0H(\theta_{\infty},0)=0. As a result we have that

References