Asymptotic study of stochastic adaptive algorithm in non-convex landscape

Sébastien Gadat, Ioana Gavra

Introduction

whose early success in the sixties has been at least rejuvenated if not resurrected with the development of massive learning problems, in the last fifteen years. We refer among other to [BB08, MB11] or to [Bac14] and the references therein for various applications in machine learning. Although being one of the state-of-the-art method to handle massive datasets, SGD suffers from several issues: difficulty to tune the step-size sequence or dependence on the gradient flow that may be lazy in flat areas, which is especially the case when looking at non-convex neural network problems.

Some popular improvements are commonly patched to the vanilla SGD, and among others we refer to the popular acceleration obtained with the Polyak-Ruppert averaging [PJ92, Rup88, MB11, CCGB17, GP20], variance reduction with mini-batch strategies (see e.g. [LRSB+12, JZ13]).

While these two last improvements do not modify the underlying gradient flow, other strategies rely on a modification of the dynamical system exploiting acceleration brought by momentum with second order terms. The first historical example is the Heavy Ball with Friction optimization based on the seminal contribution [Pol64] and then translated into a stochastic framework (see e.g. [GPS18, SGD20, LR20]). Another second example is the Nesterov Accelerated Gradient Descent (see e.g. [Nes83]) translated and studied in the noisy situation recently in a large number of works (see among other [GL16, JNJ18]).

To the best of our knowledge, there is little convergence mathematical results on adaptive algorithms: [BdSG20] studies the deterministic dynamical system behind adaptive algorithms and obtained long-time behaviour of the trajectories or the value function following ideas of [CEG09a, SBC16]. More recently, [BB20] (that is more closely related to us) obtains the almost sure convergence of their algorithms towards critical points with a parametrization that is different from our but the authors leave as an open problem the important question of the convergence towards a minimizer of $f$ .The same week we sent our paper on Arxiv, [BBHS20] also published some results on the trap avoidance of adaptive algorithms but not consider the mini-batch effect that is known to be a crucial ingredient for the efficiency of adaptive methods. We refer to Theorem 2 and 3 below for the conditions we obtained on the mini-batch sequence. Finally, some recent contributions in machine learning [WWB19, ZSJ+19, DBBU20] address some “convergence” questions for adaptive algorithms with constant step-size. They provide a non-asymptotic study with a step-size that is tuned according to the finite horizon of simulation. Even though these results are of major interest from a numerical point of view, they do not really answer the question of convergence from a trajectorial point of view (see Section 2.2 below). The objective of this work may be seen as modest at the moment: we aim to study the asymptotic behaviour of Adagrad and Rmsprop i.e. we aim to show the almost sure convergence towards a local minimizer of the objective function $f$ . However limited at first sight, we will see that the convergence of the trajectories outside local traps is already challenging, especially when a mini-batch strategy is used.

Adaptive algorithms and main results

The algorithm we consider in this paper use the vectorial division/multiplication notations introduced in Adagrad (see [DHS11]) and now widely used in machine learning. The vectorial division $\frac{u}{v}$ and multiplication $u\cdot v$ are the coordinate per coordinate operations introduced by:

In the meantime, the notation $u^{\odot 2}$ corresponds to the coordinate per coordinate square:

whereas $\sqrt{u}$ denotes the coordinate per coordinate square root:

where $(g_{n})_{n\geq 1}$ corresponds to a noisy stochastic evaluation of the gradient of the function $f$ , corrupted by an additive noise sequence $(\xi_{n+1})_{n\geq 1}$ :

Following the initial vectorial notations, we emphasize that (1) means that for any coordinate $i\in\{1,\ldots,d\}$ , the position $(\theta_{n})_{n\geq 1}$ and scaling factor $(w_{n})_{n\geq 1}$ are updated according to:

2 Link with other parametrizations

We discuss here on our choice of the Adagrad/Rmsprop parametrization (1) using the one of [BdSG20], and its link with the standard parametrization introduced in [DHS11] or [HSS12] and used in later works [DBBU20, ZSJ+19] for Adam and in [WWB19] for Adagrad. We have chosen to use this formulation, which is inspired from the limiting O.D.E. of the continuous time adaptive gradient system following previous works on accelerated or second order dynamics and among other we refer to Memory gradient diffusion [GP14], Ruppert-Polyak averaging [GP20], Heavy Ball systems [AGR00, CEG09a, CEG09b, GPS18] or more generally Nesterov acceleration [Nes04, SBC16, ACR19] and dissipative systems [Har91, AABR02].

The pioneering works [DHS11] and [HSS12] use the following parametrization:

for $\beta_{2}\in(0,1)$ and when no heavy ball momentum (see e.g. [Pol64]) is used in the algorithm (which is also the case we are considering in this work). Notice that $\beta_{2}$ may depend on the current iteration in a second stage and for the sake of completeness, we consider a general sequence $(\beta_{2}(n))_{n\geq 1}$ .

We introduce the natural normalizing sequence $(S_{n})_{n\geq 1}$ , defined by $S_{0}=1$ and the following recursion:

We are led to introduce $\widetilde{w}_{n}=v_{n}/S_{n}$ and $\widetilde{\varepsilon}_{n}=\widetilde{\varepsilon}/S_{n}$ and we observe that:

whereas the second coordinate evolves according to:

2.2 Two-time scale parametrization

We deduce that the popular parametrization introduced in the seminal contributions of Adagrad, Rmsprop or ADAM and the one we used in our paper are equivalent and fall into our framework described in Equation (1) with possibly two time-scales on the system $(\widetilde{\theta}_{n},\widetilde{w}_{n})_{n\geq 1}$ :

The choice of the sequence $(\beta_{2}(n))_{n\geq 1}$ is key to understand the stochastic algorithm we obtain in (4).

$\bullet$ Case constant $\beta_{2}=1$ (Adagrad of [DHS11]) This case is certainly the easiest to understand since the natural rescaling $S_{n}$ of the sequence $(v_{n})_{n\geq 1}$ is $S_{n}=n$ . In this case, we recover a joint evolution:

which entails $\gamma_{n+1}=\frac{\alpha_{n+1}}{\sqrt{n}}$ and $\widetilde{\gamma}_{n+1}=\frac{1}{n+1}$ . With the choice of [DBBU20], we then obtain a constant step-size stochastic algorithm for the coordinate $(\theta_{n})_{n\geq 1}$ associated with a uniform Cesaro averaging on the sequence of past squared gradients $(g_{n+1}^{\odot 2})_{n\geq 1}$ .

$\bullet$ Case constant $\beta_{2}\in(0,1)$ (Adam of [KB15] with no momentum). Since $S_{n}$ converges exponentially fast towards $(1-\beta_{2})^{-1}$ , the system is close to:

which entails $\gamma_{n+1}=(1-\beta_{2})\alpha_{n+1}$ and $\widetilde{\gamma}_{n+1}=(1-\beta_{2})$ .

$\bullet$ Case $\beta_{2}(n)=1-bn^{-\beta}$ with $b\in(0,1)$ . This last case where the sequence goes to $1$ with $n$ corresponds to an intermediary situation between $\beta_{2}=1$ and $\beta_{2}<1$ , this transition being parametrized by $\beta\in[0,+\infty]$ . We shall introduce the sequence of products:

It is well known (see e.g. [BCG20] Lemma 5.2) that when $\beta=1$ ,

where $\Lambda$ can be made explicit in terms of the Riemann zeta function. We then conclude the following behaviour of $(S_{n})_{n\geq 1}$ (see Appendix B of [GPS18]) that:

which implies that the joint evolution shall be written as:

which entails $\gamma_{n+1}=\alpha_{n+1}n^{-\frac{\beta\wedge 1}{2}}$ and $\widetilde{\gamma}_{n+1}=n^{-(\beta\wedge 1)}$ .

In all the situations above, we point out that we obtain some standard choices for the sequences $(\gamma_{n+1})_{n\geq 1}$ and $(\widetilde{\gamma}_{n+1})_{n\geq 1}$ involved in our two-time scale system (5).

2.3 Final remark on step-size sequences

In this work, we have chosen to restrict our study to a single time-scale parametrization with decreasing sequences $(\gamma_{n+1})_{n\geq 1}=(\widetilde{\gamma}_{n+1})_{n\geq 1}$ within the standard setup of stochastic algorithms:

This single time-scale restriction implies that $\sqrt{S_{n}}=\alpha_{n+1}$ , so that when we choose in (1) $p_{n}=q_{n}=1$ and $\gamma_{n+1}=\gamma_{1}(n+1)^{-\beta}$ , our algorithm is strictly equivalent to the one initially introduced in (2) with $S_{n}\propto n^{-\beta}$ and $\alpha_{n+1}=n^{-\beta/2}$ . In particular, if $\beta<1$ , it corresponds to $\beta_{2}(n)=1-bn^{-\beta}$ while if $\beta=1$ , it corresponds to $\beta_{2}(n)=1$ .

We leave the more sophisticated general study of the two time-scale algorithm for future investiations and refer to [Bor97], [MP06] or [BCG20, CG20] for other examples of such two time-scale stochastic algorithms in various (but simpler) situations.

3 Assumptions and convenient notations

We introduce the canonical filtration associated to our random sequence $\mathcal{F}_{n}=\sigma\left((\theta_{k},w_{k})_{1\leq k\leq n}\right)$ and list below the main assumptions used in our work.

We use the symbols $\lesssim_{d},\gtrsim_{d}$ to refer to inequalities up to a multiplicative constant that are independent from the dimension $d$ : for two positive sequences $(u_{n})_{n\geq 0}$ and $(v_{n})_{n\geq 0}$ , we write

and the constant $C$ is independent from the dimension of the ambient space $d$ . We will also use the notation $u_{n}=\mathcal{O}_{d}(v_{n})$ when $u_{n}\lesssim_{d}v_{n}$ . We also use the symbol $\lesssim$ that refers to an inequality up to a multiplicative constant that can depend on $d$ , for the proof of the local trap avoidance since for this result we are not interested in a quantitative effect of the dimension.

We first describe our main assumption on the sequence $(g_{n})_{n\geq 1}$ .

$\bullet$ Assumption $\mathbf{H}_{\sigma}^{p}$ . We assume that the sequence $(g_{n})_{n\geq 1}$ used in (1) provides an unbiased estimation of the true gradient of $f$ at position $\theta_{n}$ , i.e. we assume that:

We furthermore assume that the noise sequence $(\xi_{n+1})_{n\geq 1}$ satisfies:

where $c$ is a positive constant independent from $d$ . Assumption $\mathbf{H}_{\sigma}^{p}$ stands for a classical framework in stochastic optimization methods: $(\sigma_{n})_{n\geq 1}$ is an auxiliary sequence that translastes a possible use of mini-batches when $\sigma_{n}\longrightarrow 0$ as $n\longrightarrow+\infty$ . The moment assumption on $(\zeta_{n})_{n\geq 1}$ is the convenient assumption to handle standard problems like on-line regression, logistic regression or cascade of logistic regressions used in deep learning. We emphasize that we do not make any restrictive and somewhat irrealistic boundedness assumption of the noise $(\zeta_{n})_{n\geq 1}$ or of the sequence $(\theta_{n})_{n\geq 1}$ itself. Below, we will use this assumption with $p=4$ in Theorem 1. Finally, we should observe that this assumption introduces a possible linear dependency with $d$ on the size of the variance of the noise.

To derive the convergence of our algorithms towards a local minima, we will need a more stringent condition on the noise sequence. We then introduce the next assumption that will replace $\mathbf{H}_{\sigma}^{p}$ in our second main result of almost sure convergence (see Theorem 3 below).

$\bullet$ Assumption $\mathbf{H}_{\sigma}^{\infty}$ .

$(\mathbf{H}_{\sigma}^{\infty}-1)$ . The noise sequence $(\xi_{n+1})_{n\geq 1}$ is centered and satisfies:

$(\mathbf{H}_{\sigma}^{\infty}-2)$ . The noise sequence is elliptic uniformly in $n$ :

We stress that the upper bound $1$ on the second order moment is not restrictive, up to a modification of the calibration of the sequence $(\sigma_{n})_{n\geq 1}$ . The second assumption will be of course used to exit local traps.

We now introduce some standard assumptions on $f$ .

$(\mathbf{H}_{f}-1)$ . The function $f$ is positive and coercive, i.e. $f$ satisfies:

Demanding the lower bound of $f$ to be strictly positive is mostly a convenient technical constraint and not fundamentally more restrictive than the classical assumption of positivity.

$(\mathbf{H}_{f}-2)$ . We assume that $f$ satisfies the so-called Lipschitz continuous gradient property:

We emphasize that this implies the famous descent inequality:

This assumption is commonly used in optimization theory and statistics. Even though it is possible to address some more sophisticated situations (see e.g. [BBT17]), it is generally admitted that most of machine learning optimization problems fall into the Lipschitz continuous gradient framework.

$(\mathbf{H}_{f}-3)$ . We also assume that another constant $c_{f}$ exists such that:

This last assumption prevents from some too large growth of the function $f$ and it is immediate to verify that $(\mathbf{H}_{f}-3)$ implies that $f$ has a subquadratic growth, i.e. $\lim\sup_{\|x\|\longrightarrow+\infty}\frac{f(x)}{\|x\|^{2}}<+\infty$ . It has been widely used in the literature of stochastic algorithm (see e.g. [GPS18] and the references therein).

We finally introduce our assumptions on the step-size sequences used all along the paper that are involved in $(p_{n})_{n\geq 1},(q_{n})_{n\geq 1}$ and $(\gamma_{n})_{n\geq 1}$ . To easily assess some convergence results with quantitative conditions on our gain sequences, we will consider the situations where:

$(\mathbf{H}_{\text{Steps}}-1)$ . The sequences $(p_{n})_{n\geq 1}$ and $(q_{n})_{n\geq 1}$ satisfy:

$(\mathbf{H}_{\text{Steps}}-2)$ . The mini-batch sequence $(\sigma_{n})_{n\geq 1}$ satisfies:

$(\mathbf{H}_{\text{Steps}}-3)$ . As already discussed in Section 2.2.3, the sequence $(\gamma_{n})_{n\geq 0}$ satisfies:

We point out that $(p_{n})_{n\geq 1}$ and $(\sigma_{n})_{n\geq 1}$ may be (or not) some vanishing sequences (if $r>0$ and $p_{\infty}=0$ or if $s>0$ ).

4 Main results

We now state our three main convergence results for the stochastic algorithm defined in Equation (1).

Assume that $\mathbf{H}_{f}$ , $\mathbf{H}_{\emph{Steps}}$ and $\mathbf{H}_{\sigma}^{p}$ hold for $p=4$ . Then $(\theta_{n},w_{n})_{n\geq 1}$ converges almost surely towards $(\theta_{\infty},0)$ where $\nabla f(\theta_{\infty})=0$ .

Theorem 1 is a purely asymptotic convergence results. It provides the convergence of our adaptive algorithm (1) towards a set of critical points under mild assumptions on the noise sequence and on the function $f$ . We emphasize that this results holds for a standard setup on stochastic algorithms with a decreasing learning rate $(\gamma_{n})_{n\geq 1}$ . We observe that the essential condition involved in this result is the convergence of the series that depend on $(\gamma_{n},p_{n},\sigma_{n}^{2})$ . In particular, when $\gamma_{n}=\gamma_{1}n^{-\beta}$ , we observe that Theorem 1 holds when:

From a theoretical point of view, the less restrictive situation corresponds to the choice $\beta=1$ since the series converges as soon as $\sigma_{n+1}^{2}p_{n}$ decreases like $\log(n)^{-2}$ . It implies that either we need to use a very lengthy decrease of the update induced by $(p_{n})_{n\geq 1}$ , or use a very lengthy increase of the minibatch proportional, with a batch of size $\log^{2}(n)$ at step $n$ . Of course, this last condition holds as soon as $r+2s>0$ . When $\beta$ is chosen lower than $1$ , the condition becomes $r+2s>1-\beta$ , which may lead to a larger computational cost.

Using the point of view introduced in [GL13] to assess the computational cost of non-convex stochastic optimization, it is possible to derive a more quantitative result on the sequence $(\theta_{n})_{n\geq 1}$ . This result is stated in terms of the expected value of the gradient of $f$ all along the algorithm. A $\delta$ -approximation computational cost is then the number of samples that are necessary to obtain an average value below $\delta$ .

Assume that $\mathbf{H}_{f}$ and $\mathbf{H}_{\sigma}^{p}$ hold for $p=4$ and consider an integer $N>0$ and $\tau$ an integer sampled uniformly over $\{1,\ldots,N\}$ :

If $\gamma_{n}=\gamma=\frac{1}{d\sqrt{N}}$ and $p_{n}=q_{n}=\frac{1}{\sqrt{N}}$ and $\sigma_{n}^{2}=1$ , then

and the computational cost to obtain a $\delta$ -approximation is $d^{2}\delta^{-2}$ .

If $\gamma_{n}=\gamma=\frac{1}{\sqrt{N}}$ and $p_{n}=q_{n}=1$ and $\sigma_{n}^{2}=\frac{1}{d\sqrt{N}}$ , then

and the computational cost to obtain a $\delta$ -approximation is of order $d\delta^{-3}$ .

If $\gamma_{n}=\gamma=\frac{1}{\sqrt{dN}}$ and $p_{n}=q_{n}=\frac{1}{\sqrt{dN}}$ and $\sigma_{n}^{2}=1$ , then

and the computational cost to obtain a $\delta$ -approximation is of order $d\delta^{-2}$ .

If $\gamma_{n}=\gamma=\frac{1}{\sqrt{N}}$ and $p_{n}=q_{n}=\frac{1}{\sqrt{N}}$ and $\sigma_{n}^{2}=\frac{1}{d}$ , then

and the computational cost to obtain a $\delta$ -approximation is of order $d\delta^{-2}$ .

We emphasize that this last result is not a real convergence result, which is indeed impossible to derive with a constant step-size stochastic algorithm. Nevertheless, it may be seen as a benchmark result following the usages in non-convex machine learning optimization. It is a convenient way to assess a mean square convergence of stochastic optimization algorithm with non-convex landscape (see e.g. [GL13]).

We recover in this result a more quantitative result that translates both the linear effect of the dimension on the “convergence” rate and the dependency of the final bound in terms of $N^{-1/2}$ when the algorithm is randomly stopped uniformly between iteration $1$ and $N$ . The presence of both $d$ and of $N^{-1/2}$ is not surprising as it already appears to be the minimax rate of convergence in stochastic optimization with weakly convex landscapes (see e.g. [NY83]).

If we translate the upper bound of $i)$ into a complexity bound, we observe that for any $\delta>0$ , we need to fix $N$ such that

When we use a mini-batch strategy with $d\sqrt{N}$ samples at each iteration such that the error bound produced by $iv)$ is lower than $\delta$ , we observe that $N$ has to be chosen of the order $\delta^{-2}$ and the overall procedure may be improved (when compared to the first setting) since we obtain a $d\delta^{-3}$ computational cost. Finally, the otimal tuning of the algorithm seems to be the last ones, where $\sigma^{2}$ is chosen of the order $d^{-1}$ and $\gamma_{n}\propto p_{n}\propto N^{-1}$ , or $\sigma^{2}=1$ and $\gamma=p=q=(dN)^{-1/2}$ , which leads to a $d\delta^{-2}$ computational cost. As discussed in Section 3.2, with this strategy, it seems impossible to improve the $d\delta^{-2}$ computational cost obtained with other choices of the parameters.

A such improvement comes from a careful tuning of a Lyapunov function that is not exactly the same as the one used in these previous works. We refer to Sections 3.1 and 3.2 for further details. Finally, we also point out that when the sequence $(\gamma_{n})_{n\geq 1}$ is kept fixed, as indicated in the paragraph 2.2.2 it corresponds to a choice of $\beta_{2}<1$ kept constant all over the time evolution (see Equation (8)) and $p=q=\frac{1}{\sqrt{N}}$ . This result with this range of parameters appears to be in line with those of [WWB19, DBBU20], but in our work the assumptions on the noise sequence and on the function $f$ are significantly weaker.

In this paragraph, we assess the almost sure convergence of the sequence $(\theta_{n})_{n\geq 1}$ towards a local minimum of $f$ and state that the algorithm cannot converge towards an unstable (hyperbolic) point of the dynamical system, e.g. cannot converge towards a saddle point or a local maximum of $f$ .

Assume that $\mathbf{H}_{f}$ , $\mathbf{H}_{\emph{Steps}}$ and $\mathbf{H}_{\sigma}^{\infty}$ hold. Assume that $f$ is twice differentiable. Suppose $p_{\infty}=0$ , $|q_{n}-q_{\infty}|=\mathcal{O}(p_{n})$ , $\gamma_{n}=\gamma_{1}^{-\beta}$ and let $(\beta,r,s)$ be chosen such that:

Then almost surely the sequence $(\theta_{n})_{n\geq 1}$ does not converge towards a local maximum of $f$ .

Several remarks are necessary after our last theorem, that identifies not only the limit points as the critical points of $f$ but as local minimizers. Hence, our contribution should be understood as a new example of stochastic method that avoids local traps, and then compared to [Pem90, BD96, GPS18, BBHS20].

Moreover, we point out that our result holds for every initialization point and do not use any integration over $(\theta_{0},w_{0})$ . Hence, the nature of our result is different from the ones obtained in recent contributions in the field of machine learning (see e.g. [LSJR16, LPP+17]): we establish that our stochastic algorithm converges with probability 1 to a minimizer, which is different from proving that a deterministic or randomized algorithm (gradient descent in [LPP+17] for example) randomly initialized converges to a local minimum with probability 1.

When the variance of the noise sequence is kept fixed all along the iterations (no mini-batch is used, so that $s=0$ ), the previous conditions on the parameters can be summarized as: $\beta\in(1/2,1)$ ; $r\in(1-\beta,\beta)$ , when $\beta<2/3$ and $r\in(\beta/2,\beta)$ when $\beta\geq 2/3$ .

Finally, we should emphasize that this last result is rather difficult to obtain when the mini-batch parameter $s$ is chosen strictly greater than since it translates a possible vanishing level of noise when the number of iterations is increasing. Our assumption shows that the size of the mini-batch should not grow to fast (induced by the condition $s\leq(1-\beta)/2$ ) to obtain the convergence towards a local minimizer of $f$ . Up to our knowledge, a such explicit phenomenon is new in the stochastic algorithm community. It would deserve further numerical or theoretical investigations to identify whether this condition is necessary to convert an almost sure convergence result towards critical points into a convergence result towards a stable point of the differential system. The limiting condition seems to be $\beta=1,s\leq 1/2$ and $r\in[1/2,1]$ . As a really ambitious question, we leave this problem for future developments and up to our knowledge, the maximal size of the mini-batch that guarantees convergence to local minimizer is still an unresolved question even for the SGD.

5 Organization of the paper

The rest of the paper consists in showing the proofs of the previous results.

Theorems 1 and 2 are proven in Section 3. In particular, Proposition 3.1 studies the average one-step evolution of the algorithm through several key functions. This proposition permits to derive a Lyapunov function in Section 3.2 that translates both the average quantitative result of Theorem 2 and the asymptotic convergence result of Theorem 1. The main difficulties in these two results is to derive a mean reverting effect in terms of $\|\nabla f(\theta_{n})\|^{2}$ without using some extra boundedness assumption, and to assess the influence of $d$ on the quantitative result.

Theorem 3 is a typical result of stochastic algorithms, and is inspired from the seminal contributions of [Pem90] and [BH95]. The proof is detailed in Section 4 and the cornerstone of this proof is the use of the stable/unstable manifold Lemma that provides an ad-hoc Lyapunov function of the dynamical system, denoted by $\eta$ in Proposition 4.1. We also refer to the recent contribution of [BBHS20] for another typical application to stochastic algorithms. The main novelty brought in our proof is to use the mini-batch low noise level and keep the a.s. escape of local maximum true. In particular, from a technical point of view, we take advantage of the boundedness series of Proposition 3.2, which is a key ingredient in the proof of Proposition 4.3.

Almost sure convergence to the set of critical points

The purpose of this section is to prove Theorem 1 and Theorem 2. In particular, we will obtain Theorem 2 during the proof of the almost sure convergence result as a specific point of Proposition 3.2, $iii)$ . The basic ingredient of our proof relies on the Robbins-Siegmund Theorem [RS71] that will be applied with the help of an ad-hoc Lyapunov function on $(\theta_{n},w_{n})_{n\geq 1}$ .

Below, we will pay a specific attention to the dimension dependency in the inequalities we will obtain. We first state the following proposition that will create a mean reverting effect from iteration $n$ to iteration $n+1$ on the pair $(\theta_{n},w_{n})$ .

Assume that $\|\gamma_{n}q_{n}\|_{\infty}<1$ , that $\|p_{n}\|_{\infty}<+\infty$ , that $\mathbf{H}_{f}$ and $\mathbf{H}_{\sigma}^{p}$ hold for $p=4$ .

A constant $c_{1,\varepsilon}$ independent from $n$ and ${\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}d}$ exists such that for any $n\geq 1$ :

then a constant $c_{2,\varepsilon}$ independent from $n$ and $d$ exists such that:

then a constant $c_{3,\varepsilon}$ independent from $n$ and $d$ exists such that:

$\bullet$ Proof of $i)$ : We observe that:

where the last line comes from the definition of $\xi_{n+1}=\sigma_{n+1}\zeta_{n+1}$ and the fact that:

$\bullet$ Proof of $ii)$ : We develop $f(\theta_{n+1})$ : we use the descent inequality (12):

where in the last line we used a rough upper bound on $1/\sqrt{w_{n}+\varepsilon}$ and $|a+b|^{2}\leq 2a^{2}+2b^{2}$ . It is also possible to use (12) and prove the lower bound:

To deduce $ii)$ , we then consider the conditional expectation in (14), use Equation (13) and assumption $\mathbf{H}_{\sigma}^{p}$ for $p=2$ to obtain that a constant $c_{\epsilon}=L^{2}(1\vee c_{f})/\varepsilon$ exists such that:

We emphasize that at this stage, this last inequality does not create any repelling effect on the position $(\theta_{n})_{n\geq 1}$ and we need to deal with the denominator $(w_{n}+\varepsilon)$ , which is the purpose of $iii)$ . $\diamond$

$\bullet$ Proof of $iii)$ : We consider the evolution of $f^{2}$ from $\theta_{n}$ to $\theta_{n+1}$ . Using (14) and (15) obtained in $ii)$ , we deduce that:

We now expand $\{m^{-}_{n}\}^{2}$ and $\{m^{+}_{n}\}^{2}$ , using that $a\vee-a=|a|$ , we get:

When taking the conditional expectation in the previous inequality we treat some of these terms separately. First, using the centering of $\xi_{n+1}$ we have that :

where in the second line we used Cauchy-Schwarz inequality and assumption $\mathbf{H}_{\sigma}^{p}$ for $p=2$ whereas the last line comes from sub-quadratic growth assumption on the function $f$ given by (13). This last bound can also be used to control the term $L^{2}\gamma_{n+1}^{2}f(\theta_{n})\left\|\frac{g_{n+1}}{\sqrt{w_{n}+\varepsilon}}\right\|^{2}$ since:

Now, in similar manner, using repeatitly these last two assumptions ( $\mathbf{H}_{\sigma}^{p}$ also with $p=4$ ) :

Starting again with the Cauchy-Schwarz inequality for the last term we obtain :

We can now regroup the previous bounds, use equation $\eqref{eq:intermediaire_horrible}$ , the fact that $\gamma_{n}<1$ and that $\sqrt{ab}\leq 1/2(a+b)$ to conclude that a constant $c_{2,\varepsilon}$ , independent of $d$ and $n$ exists such that :

$\bullet$ Proof of $iv)$ : We observe that:

where the last line comes from the fact that $w_{n}^{i}/(w_{n}^{i}+\varepsilon)<1$ , which implies that the last term of the right hand side exists.

Using $\sqrt{1+a}\leq 1+a/2$ for any $a>-1$ , we deduce that:

Now observe that the second term can easily be bounded by :

We use this last inequality and $f(\theta_{n+1})\leq m_{n}^{+}$ to conclude that:

The baseline remark is that thanks to the Cauchy-Schwarz inequality, we have:

We then compute the conditional expectation of the previous terms with respect to $\mathcal{F}_{n}$ , using the previous inequality, the centering of $\xi_{n+1}$ , and obtain that a constant $c_{3,\varepsilon}$ independent from $d$ exists such that:

We then study each terms in the large bracket separately. Using $f\leq 1+f^{2}$ and $\mathbf{H}_{\sigma}^{p}$ , we have:

The second term is handled with the help of $\mathbf{H}_{\sigma}^{p}$ (for $p=3$ ) and the Cauchy-Schwarz inequality:

③ is $\mathcal{F}_{n}$ measurable and the subquadratic growth assumption given by (13) ensures that :

The term ④ is $\mathcal{F}_{n}$ measurable and we use once more that $\|\nabla f\|^{2}\lesssim_{d}f$ to obtain that:

For ⑤ we use Assumption $\mathbf{H}_{\sigma}^{p}$ with $p=2$ and obtain that:

⑥ is close to ③, we use $\mathbf{H}_{\sigma}^{p}$ with $p=2$ and obtain that:

The last line comes from the fact that as soon as $f$ is uniformly lower bounded by a positive constant, then $\|(w_{n}+\varepsilon)^{1/4}\|^{2}\lesssim_{d}f(\theta_{n})\|(w_{n}+\varepsilon)^{1/4}\|^{2}$ .

The term ⑦ is close to ⑤ and $\mathbf{H}_{\sigma}^{p}$ yields:

For ⑧ Assumption $\mathbf{H}_{\sigma}^{p}$ with $p=4$ implies that:

Our bounds on $①-⑧$ and the fact that $(1+f(\theta_{n}))^{2}\leq 2(1+f(\theta_{n})^{2})$ ensure that a constant $c_{3,\varepsilon}$ exists independent from $d$ such that:

Since $\gamma_{n+1}\sigma_{n+1}^{3}d^{3/2}\leq 1/2(\sigma_{n+1}^{2}d+\gamma_{n+1}^{2}\sigma_{n+1}^{4}d^{2}),$ using the definition of $t_{n}$ , we deduce that:

Proposition 3.1 will permit do derive a Lyapunov function on $(\theta_{n},w_{n})_{n\geq 1}$ (see the next result) which implies the convergence of

This kind of bound has also been obtained in [DBBU20] (Theorem 4) with the help of a somewhat artificial boundedness assumption of the noisy gradients, which is not used in our work. We also point out that [ZSJ+19] propose another function that generates a mean reverting term:

and the major difference with our result is the weaker $4/3$ instead of $2$ in the series. In particular, a such $4/3$ will not allow to prove the a.s. asymptotic pseudo-trajectory result, and consequently the a.s. convergence of the trajectory towards a critical point of $f$ .

2 Proof of Theorem 2

Using Proposition 3.1, we are now ready to state the next important result, which will be key for the almost sure convergence of $(\theta_{n})_{n\geq 1}$ .

Under the assumptions of Proposition 3.1, then:

Two constants $c(\theta_{0},w_{0})$ and $\kappa$ exist such that, for all $n\geq 1$

where $(t_{n})_{n\geq 0}$ and $(s_{n})_{n\geq 0}$ are the auxiliary sequences defined in Proposition 3.1.

If $\sum_{n\geq 1}(\gamma_{n+1}^{2}+\gamma_{n+1}\sigma_{n+1}^{2}p_{n})<+\infty$ , then almost surely:

$\bullet$ Proof of $i)$ . Our proof relies on a Lyapunov function defined by:

Using $i)$ , $iii)$ and $iv)$ of Proposition 3.1 and the fact that $f(\theta_{n})\leq 1+f(\theta_{n})^{2}$ , we deduce that a constant $\kappa$ that depends on $c_{1,\varepsilon},c_{2,\varepsilon},c_{3,\varepsilon}$ and of the next choice of $a$ and $b$ exists such that:

We observe that for any vector $u$ , we have $\|\sqrt{|u|}\|^{4}\geq\|u\|^{2}$ and that $(p_{n})_{n\geq 1}$ is a bounded sequence, so that we can find $b$ large enough to have $b>2\sup_{n\geq 1}p_{n}$ , and $2a\geq bp_{n}$ such that $\forall n\geq 1$ :

We then conclude using a straightforward recursion. $\diamond$

$\bullet$ Proof of $ii)$ . This point proceeds with standard arguments: we use the Robbins Siegmund Lemma (see [RS71]): the series $\sum\gamma_{n+1}^{2}$ and $\sum\gamma_{n+1}\sigma_{n+1}^{2}p_{n}$ are convergent and obtain that:

More importantly, the next series are convergent:

This ends the proof of $ii)$ $\diamond$ ∎

We emphasize that $i),ii)$ are standard consequences of the Robbins-Siegmund approach on stochastic algorithms. In particular, the use of $i)$ with the calibrations of $(\gamma_{n})_{1\leq n\leq N},(\sigma_{n})_{1\leq n\leq N}$ and $(p_{n})_{1\leq n\leq N}$ that are given in the statement of Theorem 2 instantaneously lead to the conclusion of the proof. We also point out that the basic fact to obtain this result is a tuning of the parameters thats leads to

Considering constant step-size sequences, it entails the following constraints on $(p,\gamma,\sigma)$ :

whereas the size of $N$ needed to obtain a $\delta$ approximation should verify that:

Hence, the computational cost is lower bounded by $d\delta^{-2}$ and this bound may be achieved as soon as

which corresponds to the tuning described in $iii)$ and $iv)$ of Theorem 2.

3 Proof of Theorem 1

For the sake of convenience, from now on we denote $V_{n}=V_{a,b}(\theta_{n},w_{n})$ . Since we are now interested in purely asymptotic result, we omit the dependency with $d$ in the bounds we obtain hereafter. The main difficulty here is to convert the result of Proposition 3.2 $ii)$ into an a.s. convergence result on $(\theta_{n})_{n\geq 1}$ . We remind the standard definition of asymptotic pseudo-trajectories of a semiflow $\Phi$ .

A continuous trajectory $Z$ is a pseudo-trajectory of $\Phi$ if for any finite time horizon $T>0$

We refer to [BH96, Ben99] for further details. We will use in particular the cornerstone result which is reminded below:

Denote $\tau_{0}=0$ , $\tau_{n}=\sum_{k=1}^{n}\gamma_{k}$ and consider $(\bar{Z}_{t})_{t\geq 0}$ the continuous time process corresponding to a linear interpolation of $(Z_{n})_{n\geq 0}$ , given by :

The Robbins-Siegmund Lemma ensures that the sequence $(V_{n})_{n\geq 1}$ converges a.s. to a finite random variable $V_{\infty}$ . Since all the terms of $V_{n}$ are positive, $(f(\theta_{n}))_{n\geq 0}$ and $(\|w_{n}\|)_{n\geq 0}$ are a.s. bounded as well. The coercivity of $f$ implies thus that $(Z_{n})_{n\geq 0}$ is a.s. bounded. The evolution of the sequence $(Z_{n})_{n\geq 0}$ can be written as:

The next result allows us to control the rest term and ensures that the assumptions of Proposition 4.1 of [Ben99] are fulfilled, which in turn implies that $(\bar{Z}_{t})_{t\geq 0}$ is indeed an asymptotic pseudo trajectory of the flow induced by $H$ .

Under the assumptions of Proposition 3.1 and if $\sum\gamma_{k+1}\sigma_{k+1}^{2}p_{k}<+\infty$ and $(p_{n},q_{n})\longrightarrow(p_{\infty},q_{\infty})$ . Let $N(n,t)=\sup_{k\geq 0}\{t+\tau_{n}\geq\tau_{k}\}$ and $(e_{n})_{n\geq 1}$ defined in Equation (22). For all $T>0$ :

We consider a finite horizon $T>0$ . In order to deal with the previous sum, we write $e_{n}$ as $a_{n}+\Delta M_{n+1}+c_{n}$ , with $a_{n}=\begin{pmatrix}0\\ (p_{n}-p_{\infty})\nabla f(\theta_{n})^{\odot 2}-(q_{n}-q_{\infty})w_{n}\end{pmatrix}$ , $\Delta M_{n+1}=\begin{pmatrix}\dfrac{\nabla f(\theta_{n})-g_{n+1}}{\sqrt{w_{n}+\varepsilon}}\\ 0\end{pmatrix}$ and $c_{n}=\begin{pmatrix}0\\ p_{n}(\nabla f(\theta_{n})^{\odot 2}-g^{\odot 2}_{n+1})\end{pmatrix}.$ We use the fact that:

and proceed to upper bound each term on the right hand side.

$\bullet$ The convergence of the first term is mainly a consequence of Equations (19), (20) (convergence of the Robbins-Siegmund series), of the convergence of $(q_{n})_{n\geq 1}$ towards and $q_{\infty}$ and the fact that $(p_{n}-p_{\infty})_{n\geq 1}$ is a bounded sequence :

Using the convergence of the series (19) and (20) we conclude that, $\forall t>0$ This last limit holds regardless $t<T$ :

$\bullet$ To control the second term we observe that $(\Delta M_{k+1})_{k\geq 0}$ is a sequence of martingale increments with a bounded second order moment since

We start by decomposing $c_{k}$ as its expected value plus a martingale increment:

Since $\sum_{n\geq 0}\gamma_{n+1}p_{n}\sigma_{n+1}^{2}<+\infty$ the first sum convergences to when $n$ goes to infinity. The terms of the second sum are martingale increments:

Using the fact that $(a+b)^{2}\leq 2(a^{2}+b^{2})$ , we get a first bound on their second order moments:

Now using $\mathbf{H}_{\sigma}^{p}$ for $p=2$ and Inequality $\eqref{eq:comparaison}$ :

The second term can be dealt with in a similar manner, using the assumption $\mathbf{H}_{\sigma}^{p}$ for $p=4$ :

and we can apply once more Corollary 11 of [MP] to conclude that:

We now give the proof of the leading result of Section 3.

$\bullet$ Identification of the possible limit points. The a.s. boundness of $(Z_{n})_{n\geq 0}$ and Lemma 3.3 show that the assumptions of Proposition 4.1 of [Ben99] hold, which implies that $(\bar{Z}_{t})_{t\geq 0}$ is an asymptotic pseudo-trajectory of the differential flow induced by $H$ almost surely. It implies in particular that $(w_{n})_{n\geq 1}$ is a.s. bounded.

We shall deduct from Theorem 4 that all limit points $Z_{\infty}=(\theta_{\infty},w_{\infty})$ of $\bar{Z}_{t}$ are stationary points for the differential equation $\dot{z}=H(z)$ and thus that $H(Z_{\infty})=0$ . Since $\|w_{n}\|$ is a.s. bounded, it implies that $\nabla f(\theta_{\infty})=0$ .

In the meantime, since $\nabla f(\theta_{n})\to 0$ when $n\to\infty$ , we observe that for any limit point $Z_{\infty}$ , $H(Z_{\infty})=0$ also implies that $w_{\infty}=0$ .

$\bullet$ Convergence of $(\theta_{n})_{n\geq 1}$ . Thus (since $w_{n}$ is bounded) we have that $w_{n}$ converges to . Hence:

Since $a$ and $b$ are non-negative and $f(\theta_{n})$ positive, the last equality implies that the sequence $(f(\theta_{n}))_{n\geq 0}$ is a.s. convergent:

Now, $(\theta_{n})_{n\geq 0}$ is an a.s. bounded sequence and the set of possible limit points for its sub-sequences is connected since:

Since $\{\theta:f(\theta)=f_{\infty}\}\cap\{\theta:\nabla f(\theta)=0\}$ is locally finite, we can conclude that $(\theta_{n})_{n\geq 1}$ has a unique adherence point and is a convergent sequence:

Almost sure convergence to a local minimum

In this paragraph, we prove that the sequence $(\theta_{n})_{n\geq 1}$ almost surely converges towards a local minimum of $f$ and cannot converge towards an unstable (hyperbolic) point of the dynamical system, e.g. cannot converge towards a saddle point or a local maximum of $f$ .

We begin with a simple statement that identifies the unstable points of the dynamical system

These equilibria may correspond to repulsive equilibria or hyperbolic points (saddle points). The next proposition makes this last sentence more precise and introduce the cornerstone function $\eta$ that measures in each neighborhood of an unstable (or saddle) point the distance of any point $x$ to the local stable manifold in the expanding direction.

Consider the dynamical system (27), and assume that $f$ is twice differentiable, then:

The equilibria are $(t,0)$ where $t$ is a critical point of $f$ .

If $t$ is a local maximum of $f$ , the dynamical system is unstable near $(t,0)$ .

If $t$ is a local minimum of $f$ , the dynamical system is stable near $(t,0)$ .

If $t$ is an unstable equilibria, a compact neighborhood $\mathcal{N}$ of $(t,0)$ and a function $\eta$ exist such that:

If $E_{+}$ is the eigenspace associated to the negative eigenvalues of $D^{2}f(t)$ , then:

$\bullet$ Proof of $i)$ . The proof of $i)$ is immediate by observing that $H(\theta,w)=0$ and $q_{\infty}\neq 0$ implies that $\nabla f(\theta)=0$ and $w=0$ .

$\bullet$ Proof of $ii)$ and $iii)$ . We use a linearization of the drift around an equilibria $(t,0)$ . Since $t$ is a critical point of $f$ , we observe that:

Consequently, we observe that $(\nabla f(t+h))^{2}=(D^{2}f(t)h)^{2}=\mathcal{O}\left(\|h\|^{2}\right)$ , which entails:

The conclusion follows from the spectral decomposition of $D^{2}f(t)$ .

$\bullet$ Proof of $iv)$ . The last point is a consequence of Proposition 3.1 of [BH95] (see also Proposition 9.5 of [Ben99]). We only need to observe that the coordinates in $u_{2}$ always correspond to the attractive manifold so that $\pi_{+}(u)=\pi_{+}(u_{1})$ . ∎

2 Preliminary estimates

For a given integer $n_{0}$ when $(\theta_{n_{0}},w_{n_{0}})$ is in a neighborhood $\mathcal{N}$ (given by (iv) of Proposition 4.1 ) of an unstable point, we introduce the exit time of $\mathcal{N}$ defined by:

where $(\alpha_{n+1})_{n\geq 1}$ will be made explicit later on, and the associated cumulative sum:

We prove the following upper bound on the second order moment of $(X_{n})_{n\geq n_{0}+1}$ .

We decompose $X_{n+1}$ according to the position of $n$ with respect to $T_{n_{0}}$ :

If $n\geq T_{n_{0}}$ , there is nothing to prove.

We then consider the case when $n<T_{n_{0}}$ and define $m=\sup_{z\in\mathcal{N}}\|\nabla\eta(z)\|$ . A first order Taylor expansion yields:

When $n<T_{n_{0}}$ , the process $Z_{n}=(\theta_{n},w_{n})\in\mathcal{N}$ so that $\|w_{n}\|^{2}$ is bounded. It remains to study the terms that involve $\|g_{n+1}\|^{2}$ and $\|g^{\odot 2}_{n+1}\|^{2}$ . We shall observe that:

because $(\sigma_{n})_{n\geq 1}$ is bounded by definition and $\nabla f$ is continuous and $\mathcal{N}$ is compact.

Thus from the assumption $\mathbf{H}_{\sigma}^{\infty}$ it follows that:

where $K$ is a constant that depends on $\nabla\eta$ , $m$ , $\varepsilon$ , $\|p\|_{\infty}$ and $\|q\|_{\infty}$ . This ends the proof. ∎

Below, we will use the consequence of $iv)-(a)$ of Proposition 4.1: if $z=(\theta,w)$ is a point in $\mathcal{N}$ , then a constant $\Gamma$ exists such that:

We write the joint evolution of the algorithm $(Z_{n})_{n\geq 1}$ as:

where $(\Delta_{n+1})_{n\geq 1}$ is a sequence of martingale increment defined by:

$\Delta M_{n+1}=-\frac{\xi_{n+1}}{\sqrt{w_{n}+\varepsilon}}$

We introduce below the sequence $(\nu_{n})_{n\geq 1}$ that stands for the convergence rate of $(p_{n})_{n\geq 1}$ towards $p_{\infty}$ and $(q_{n})_{n\geq 1}$ towards $q_{\infty}$ :

We start once more from the decomposition of $(X_{n})_{n\geq n_{0}+1}$ before or after $T_{n_{0}}$ and observe that:

The definition of the stopping time $T_{n_{0}}$ ensures that when $n<T_{n_{0}}$ , $Z_{n}$ belongs to the neighborhood $\mathcal{N}$ so we can apply Equation $\eqref{Ineg: alpha}$ and obtain the following lower bound:

where the last inequality comes from (iv)c of Proposition 4.1 (Equation (28)). Using that $\|\nabla\eta\|$ is upper bounded on $\mathcal{N}$ by $m$ , the definition of $T_{n_{0}}$ leads to:

This inequality associated with the Cauchy-Schwarz inequality implies that:

Observing that $\|\xi_{n+1}^{\odot 2}\|=\sqrt{\sum_{i=1}^{d}\xi_{n+1,i}^{4}}\leq\sum_{i=1}^{d}\xi_{n+1,i}^{2}=\|\xi_{n+1}\|^{2}$ , we deduce that:

We then define $k^{\prime}=\sup_{(\theta,w)\in\mathcal{N}}\|\nabla f(\theta)^{\odot 2}\|+\|w\|<+\infty$ (because $\mathcal{N}$ is compact and $\nabla f$ is continuous). We deduce that:

Concerning the last term, it is sufficient to show that the expected value is bounded. We start by using the fact that $(a+b+c)^{2}\leq 4(a^{2}+b^{2}+c^{2})$ to split the squared norm and to obtain that:

Since $H$ is continuous, $(\|H(Z_{n})\|^{2})_{n_{0}\leq n<T_{n_{0}}}$ is a bounded sequence. Moreover, we have seen that when $n<T_{n_{0}}$ :

Again, using the compactness of $\mathcal{N}$ , the continuity of $\nabla f$ and the assumption $\mathbf{H}_{\sigma}^{\infty}$ on the noise $(\zeta_{n})_{n\geq 1}$ , we deduct that $K_{2}>0$ exists such that:

We now gather all the terms and use the fact that the sequences $(\sigma_{n})_{n\geq 0}$ , $(p_{n})_{n\geq 0}$ and $(\nu_{n})_{n\geq 0}$ are all bounded, to conclude that a constant $K_{3}$ exists such that:

We now study the evolution of $S_{n}^{2}$ .

Assume that $\delta_{n}=c(\nu_{n}+p_{n}\sigma_{n+1}^{2}+\gamma_{n+1})$ is such that $\gamma_{n+1}\delta_{n}^{2}=o(\gamma_{n+1}^{2}\sigma_{n+1}^{2}\wedge\alpha_{n}^{2})$ then:

where in the last line we used the reverting effect translated in Equation (28) and the Cauchy-Schwarz inequality. Since $\eta$ is positive, we then obtain that:

We denote the positive part of any real value $a$ by $\lfloor a\rfloor_{+}$ and we use that $a\geq b\Longrightarrow\lfloor a\rfloor_{+}\geq\lfloor b\rfloor_{+}$ and the inequality

Considering Equation (38), we then observe that:

Once more the regularity of $\nabla\eta$ , the compactness of $\mathcal{N}$ and the definition of $U_{n}$ , guarantee that a constant $\kappa>0$ exists such that:

Computing the conditional expectation and using the arguments of (36), we have

Using that when $n<T_{n_{0}}$ , $Z_{n}\in\mathcal{N}$ , we can apply $iv)-(b)$ of Proposition 4.1 so that:

where $\pi_{+}$ is the orthogonal projection on $E_{+}$ , the eigenspace associated to the negative eigenvalues of $D^{2}f(t)$ . For $n$ large enough, the almost sure convergence of $(w_{n})_{n\geq 0}$ to and our elliptic assumption $\mathbf{H}_{\sigma}^{\infty}-ii)$ on the sequence $(\xi_{n+1})_{n\geq 0}$ yield:

Decomposing now $X_{n+1}=\lfloor X_{n+1}\rfloor_{+}-\lfloor-X_{n+1}\rfloor_{+}$ , we then deduce that:

The last line is justified by the assumption $\gamma_{n+1}\delta_{n}^{2}=o(\gamma_{n+1}^{2}\sigma_{n+1}^{2})$ which ensures that $\delta_{n}=o(\sigma_{n})$ . We shall also observe that:

Since $\gamma_{n+1}\delta_{n}^{2}=o(\gamma_{n+1}^{2}\sigma_{n+1}^{2}\wedge\alpha_{n}^{2})$ , we can now conclude by inserting the previous bounds in the inequality $\eqref{ineg: Sn 1}$ .

3 End of the proof

We mimic the strategy of Lemma 9.6 of [Ben99] and of Theorem 3.2 of [GPS18] with the help of the two sequences defined in (31) and (30).

We summarize the preliminary results proven in the previous subsection as they will we used in what follows:

Proposition 4.2 yields that if $\delta_{n}=c_{\delta}(\nu_{n}+p_{n}\sigma_{n+1}^{2}+\gamma_{n+1}^{2})$ , then

Proposition 4.2 yields: if $\gamma_{n+1}\delta_{n}^{2}=o(\gamma_{n+1}^{2}\sigma_{n+1}^{2}\wedge\alpha_{n}^{2})$ (which also means that $\delta_{n}=o(\sigma_{n})$ ), then

For any integer $q$ , we consider an integer $n\geq q$ and introduce the two sequences $(u_{n})_{n\geq q}$ and $(\bar{U}_{n})_{n\geq q}$ defined by:

With the framework of Theorem 3 in mind we suppose that

with $\beta\in(1/2,1)$ , $s\geq 0$ and $r>0$ . Let us point out other restrictions on the choices of these parameters in light of previous assumptions:

A first restriction follows directly from $(\mathbf{H}_{\text{Steps}}-3)$ , namely that

(to ensure that $\sum\gamma_{n+1}p_{n}\sigma_{n+1}^{2}<+\infty$ ). Furthermore, Theorem 1 demands that $\gamma_{n}/p_{n}\to 0$ , implying that

To ease notations it is convenient to assume that the sequence $(q_{n})$ converges to its limit $q_{\infty}>0$ at least as fast as $(p_{n})$ goes to : $\nu_{n}=\mathcal{O}(p_{n}).$ With this in mind, the condition imposed by Proposition 4.2, $\delta_{n}=o(\sqrt{\gamma_{n+1}}\sigma_{n+1})$ translates to $r+s>\beta/2$ , $r>\beta/2+s$ and $2\beta>\beta/2+s$ , ie $\beta>2/3s$ . As soon $\beta>s$ all these conditions come down to:

Other restrictions will be added in the following propositions and are summarized in Theorem 3.

For any positive real value $b>0$ and any integer $q>0$ , we consider the sequence of stopping times $(T_{b}^{q})_{q\geq 0}$ defined by:

The stopping times $T_{b}^{q}$ stands for the first time the sequence $(S_{n})_{n\geq 1}$ becomes larger than the threshold $(\sqrt{u_{n}})_{n\geq 1}$ , which converges towards zero with a controlled rate. We prove the following result.

Assume that $\sigma_{i}^{2}\sim\sigma_{1}^{2}i^{-2s}$ with $0\leq s<1/2$ and $\gamma_{i}=\gamma_{1}i^{-\beta}$ , $r>\beta/2+s$ and choose $(\alpha_{n})_{n\geq 1}$ as $\forall n\geq 0\,:\alpha_{n+1}=\gamma_{n+1}\sigma_{n+1}$ . Then a small enough $b>0$ exists and a large enough $q$ such that:

Our starting point is the lower bound given by Proposition 4.2 and we observe that a small enough $a>0$ exists such that:

We consider an integer $n\geq q+1$ and apply the Optional Stopping Theorem and verify that:

where the last inequality is a consequence of the definition of the stopping time $T_{b}^{q}$ . Since $(u_{n})_{n\geq 0}$ is a decreasing sequence, we then have that:

and we are led to upper bound the term $\{X_{n\wedge T_{b}^{q}}\}^{2}$ .

The definition of $(X_{n})_{n\geq 1}$ yields:

Using a similar argument as the one used in the proof of Proposition 4.2, a Taylor expansion associated with the smoothness of $\eta$ and $f$ leads to:

This last bound together with inequality (41) implies that a constant $C>0$ exists such that:

We therefore deduce an upper bound on the probability that the stopping time is larger than $n$ :

We then take the limit $n\longrightarrow+\infty$ in the previous inequality and obtain that:

because $\lim_{n\longrightarrow+\infty}u_{n}=0$ . Choosing now $\alpha_{n+1}=\gamma_{n+1}\sigma_{n+1}$ , we then observe that $u_{q}=\sum_{i\geq q}\gamma_{i+1}^{2}\sigma_{i+1}^{2}$ and then the second term on the right hand side goes to zero as soon as:

Since we have chosen $\gamma_{q}=\gamma_{1}q^{-\beta}$ and $\sigma_{q}=\sigma_{1}q^{-s}$ , we verify that a constant $\upsilon$ exists such that:

Therefore, the condition $\gamma_{q}^{2}=o(u_{q})$ boils down to $-2\beta<1-2(\beta+s)\Leftrightarrow s<1/2$ .

Hence, when $s<1/2$ we can conclude the proof of the proposition by setting $b$ small enough ( $b<a/3C$ for example). ∎

The next result states that $S_{n}$ may remain larger than $\frac{1}{2}\sqrt{bu_{q}}$ with a positive probability when $S_{q}\geq\sqrt{bu_{q}}$ . For this purpose, we introduce

that stands for the first time $(S_{n})_{n\geq q}$ comes back below the threshold $\sqrt{bu_{q}}$ .

Assume that $\alpha_{n+1}=\sigma_{n+1}\gamma_{n+1}$ and $\beta+s-r<\frac{1}{2}$ , $\beta\leq 1-2s$ and that $\nu_{n}=o(\sqrt{\gamma_{n}}\sigma_{n})$ . Then there exits $q$ large enough and a constant $c>0$ such that

Since $\delta_{n}\sim p_{n}\sigma_{n}^{2}+\nu_{n}+\gamma_{n}$ , we conclude with our settings that: $\delta_{n}\sim n^{-r}+n^{-\beta}.$ Moreover, the choice $\alpha_{i+1}=\gamma_{i+1}\sigma_{i+1}$ implies that:

We then observe that $\delta_{n}=o(\sqrt{u_{q}})$ as soon as:

since $\beta>\frac{2(\beta+s)-1}{2}$ always holds when $s<\frac{1}{2}$ .

From our previous remark, we observe that for $q$ large enough if $S_{q}$ is greater than $\frac{1}{2}\sqrt{bu_{q}}$ then $S_{q}\geq\delta_{q}$ , so Proposition 4.2 implies that $(S_{n\wedge\mathcal{S}})_{n\geq q}$ is a submartingale. For such a choice of $n$ and $q$ , we can use the Doob decomposition and write that:

where $I_{n}$ is an increasing $\mathcal{F}_{n}$ predictable process with $I_{q}=0$ and $W_{n}$ is a martingale. We then observe that

Furthermore, if $S_{q}\geq\sqrt{bu_{q}}$ , then $W_{q}\geq\sqrt{bu_{q}}$ , which entails:

Using the fact that $(W_{n})_{n\geq q}$ is a martingale and the definition of $S_{n}$ , one can verify that:

The upper bound obtained in Proposition 4.2 is not sharp enough to be directly applied here, so in order to deal with the term on the right hand side we return to the upper bound obtained in the proof of Proposition 4.2 which gives for all $i$ :

The sequence $(\sigma_{i})_{i\geq 0}$ is bounded, $(p_{i})_{i\geq 0}$ converges to and Theorem 1 shows that $(\|\nabla f(\theta_{i})\|)_{i\geq 0}$ converge almost surely to , so there exists $q>0$ such that $\forall i\geq q$ :

Inserting this into our previous bound (43) we get that (for $q$ large enough and $i\geq q$ )

Since $\alpha_{i+1}=\gamma_{i+1}\sigma_{i+1}$ we have that:

The first series of the right-hand side is handled with the simple bound $\sum_{i=q}^{n-1}\gamma_{i+1}^{2}\sigma_{i+1}^{2}\lesssim q^{1-2(\beta+s)}$ . In dealing with the second term of the right-hand side we use the almost sure convergence of the series stated in (18) of Proposition 3.2 and the fact that $(\gamma_{n})_{n\geq 0}$ is a decreasing sequence:

Thus there exists a constant $a>0$ such that for all $n\geq q$ :

Setting $t=a(u_{q}+\gamma_{q})/h$ in the last term we have :

Now when $h=1/2\sqrt{bu_{q}}$ we obtain that :

As soon as $\gamma_{q}\lesssim u_{q}$ , meaning that $-\beta\leq 1-2(\beta+s)$ , there exists a constant $c>0$ such that for $q$ large enough:

Inserting this bound in $\eqref{ineg: Sinf}$ ends the proof.

At this point, Propositions 4.3 and 4.3 enable us to prove that the sequence $(S_{n})$ does not converge to a.s. using the same arguments as [Ben99] and then we can conclude the proof of Theorem 3.

When the variance of the noise sequence does not converge to ( $s=0$ ), the previous conditions on the parameters can be summarized as : $\beta\in(1/2,1)$ ; $r\in(1-\beta,\beta)$ , when $\beta<2/3$ and $r\in(\beta/2,\beta)$ for $\beta\geq 2/3$ .

Step 1: $S_{n}$ does not converge to a.s. Let $\mathcal{G}$ denote the event:

In the meantime, if $\mathcal{S}=+\infty$ then $(S_{n})$ does not converge to , so $\{\mathcal{S}=+\infty\}\subset\mathcal{G}$ . For $q$ large enough, such that Proposition 4.3 holds, and for all $n\geq q$ :

Thus, if we consider all the integers $n$ larger than $q$ , we obtain:

We now apply Proposition 4.3 and obtain that:

Step 2: the algorithm escapes any neighborhood of an unstable point in a finite time a.s.

Theorem 1 ensures that $(\theta_{n},w_{n})$ converges a.s to a point $(\theta_{\infty},0)$ . This together with the regularity of the function $\eta$ implies that the sequence $S_{n}$ goes to $\eta(\theta_{\infty},0)$ when $n\to+\infty$ . Since $\mathcal{N}$ is compact, the limit point $(\theta_{\infty},0)$ belongs to $\mathcal{N}$ and according to Proposition 4.1 c) there exists $k>0$ such that:

As seen in the proof of Theorem 1, the limit point $(\theta_{\infty},0)$ is almost surely an equilibrium point for the dynamical system driven by $H$ , so $H(\theta_{\infty},0)=0$ . As a result we have that