Asymptotic study of stochastic adaptive algorithm in non-convex landscape
Sébastien Gadat, Ioana Gavra
Introduction
whose early success in the sixties has been at least rejuvenated if not resurrected with the development of massive learning problems, in the last fifteen years. We refer among other to [BB08, MB11] or to [Bac14] and the references therein for various applications in machine learning. Although being one of the state-of-the-art method to handle massive datasets, SGD suffers from several issues: difficulty to tune the step-size sequence or dependence on the gradient flow that may be lazy in flat areas, which is especially the case when looking at non-convex neural network problems.
Some popular improvements are commonly patched to the vanilla SGD, and among others we refer to the popular acceleration obtained with the Polyak-Ruppert averaging [PJ92, Rup88, MB11, CCGB17, GP20], variance reduction with mini-batch strategies (see e.g. [LRSB+12, JZ13]).
While these two last improvements do not modify the underlying gradient flow, other strategies rely on a modification of the dynamical system exploiting acceleration brought by momentum with second order terms. The first historical example is the Heavy Ball with Friction optimization based on the seminal contribution [Pol64] and then translated into a stochastic framework (see e.g. [GPS18, SGD20, LR20]). Another second example is the Nesterov Accelerated Gradient Descent (see e.g. [Nes83]) translated and studied in the noisy situation recently in a large number of works (see among other [GL16, JNJ18]).
To the best of our knowledge, there is little convergence mathematical results on adaptive algorithms: [BdSG20] studies the deterministic dynamical system behind adaptive algorithms and obtained long-time behaviour of the trajectories or the value function following ideas of [CEG09a, SBC16]. More recently, [BB20] (that is more closely related to us) obtains the almost sure convergence of their algorithms towards critical points with a parametrization that is different from our but the authors leave as an open problem the important question of the convergence towards a minimizer of .The same week we sent our paper on Arxiv, [BBHS20] also published some results on the trap avoidance of adaptive algorithms but not consider the mini-batch effect that is known to be a crucial ingredient for the efficiency of adaptive methods. We refer to Theorem 2 and 3 below for the conditions we obtained on the mini-batch sequence. Finally, some recent contributions in machine learning [WWB19, ZSJ+19, DBBU20] address some “convergence” questions for adaptive algorithms with constant step-size. They provide a non-asymptotic study with a step-size that is tuned according to the finite horizon of simulation. Even though these results are of major interest from a numerical point of view, they do not really answer the question of convergence from a trajectorial point of view (see Section 2.2 below). The objective of this work may be seen as modest at the moment: we aim to study the asymptotic behaviour of Adagrad and Rmsprop i.e. we aim to show the almost sure convergence towards a local minimizer of the objective function . However limited at first sight, we will see that the convergence of the trajectories outside local traps is already challenging, especially when a mini-batch strategy is used.
Adaptive algorithms and main results
The algorithm we consider in this paper use the vectorial division/multiplication notations introduced in Adagrad (see [DHS11]) and now widely used in machine learning. The vectorial division and multiplication are the coordinate per coordinate operations introduced by:
In the meantime, the notation corresponds to the coordinate per coordinate square:
whereas denotes the coordinate per coordinate square root:
where corresponds to a noisy stochastic evaluation of the gradient of the function , corrupted by an additive noise sequence :
Following the initial vectorial notations, we emphasize that (1) means that for any coordinate , the position and scaling factor are updated according to:
2 Link with other parametrizations
We discuss here on our choice of the Adagrad/Rmsprop parametrization (1) using the one of [BdSG20], and its link with the standard parametrization introduced in [DHS11] or [HSS12] and used in later works [DBBU20, ZSJ+19] for Adam and in [WWB19] for Adagrad. We have chosen to use this formulation, which is inspired from the limiting O.D.E. of the continuous time adaptive gradient system following previous works on accelerated or second order dynamics and among other we refer to Memory gradient diffusion [GP14], Ruppert-Polyak averaging [GP20], Heavy Ball systems [AGR00, CEG09a, CEG09b, GPS18] or more generally Nesterov acceleration [Nes04, SBC16, ACR19] and dissipative systems [Har91, AABR02].
The pioneering works [DHS11] and [HSS12] use the following parametrization:
for and when no heavy ball momentum (see e.g. [Pol64]) is used in the algorithm (which is also the case we are considering in this work). Notice that may depend on the current iteration in a second stage and for the sake of completeness, we consider a general sequence .
We introduce the natural normalizing sequence , defined by and the following recursion:
We are led to introduce and and we observe that:
whereas the second coordinate evolves according to:
2.2 Two-time scale parametrization
We deduce that the popular parametrization introduced in the seminal contributions of Adagrad, Rmsprop or ADAM and the one we used in our paper are equivalent and fall into our framework described in Equation (1) with possibly two time-scales on the system :
The choice of the sequence is key to understand the stochastic algorithm we obtain in (4).
Case constant (Adagrad of [DHS11]) This case is certainly the easiest to understand since the natural rescaling of the sequence is . In this case, we recover a joint evolution:
which entails and . With the choice of [DBBU20], we then obtain a constant step-size stochastic algorithm for the coordinate associated with a uniform Cesaro averaging on the sequence of past squared gradients .
Case constant (Adam of [KB15] with no momentum). Since converges exponentially fast towards , the system is close to:
which entails and .
Case with . This last case where the sequence goes to with corresponds to an intermediary situation between and , this transition being parametrized by . We shall introduce the sequence of products:
It is well known (see e.g. [BCG20] Lemma 5.2) that when ,
where can be made explicit in terms of the Riemann zeta function. We then conclude the following behaviour of (see Appendix B of [GPS18]) that:
which implies that the joint evolution shall be written as:
which entails and .
In all the situations above, we point out that we obtain some standard choices for the sequences and involved in our two-time scale system (5).
2.3 Final remark on step-size sequences
In this work, we have chosen to restrict our study to a single time-scale parametrization with decreasing sequences within the standard setup of stochastic algorithms:
This single time-scale restriction implies that , so that when we choose in (1) and , our algorithm is strictly equivalent to the one initially introduced in (2) with and . In particular, if , it corresponds to while if , it corresponds to .
We leave the more sophisticated general study of the two time-scale algorithm for future investiations and refer to [Bor97], [MP06] or [BCG20, CG20] for other examples of such two time-scale stochastic algorithms in various (but simpler) situations.
3 Assumptions and convenient notations
We introduce the canonical filtration associated to our random sequence and list below the main assumptions used in our work.
We use the symbols to refer to inequalities up to a multiplicative constant that are independent from the dimension : for two positive sequences and , we write
and the constant is independent from the dimension of the ambient space . We will also use the notation when . We also use the symbol that refers to an inequality up to a multiplicative constant that can depend on , for the proof of the local trap avoidance since for this result we are not interested in a quantitative effect of the dimension.
We first describe our main assumption on the sequence .
Assumption . We assume that the sequence used in (1) provides an unbiased estimation of the true gradient of at position , i.e. we assume that:
We furthermore assume that the noise sequence satisfies:
where is a positive constant independent from . Assumption stands for a classical framework in stochastic optimization methods: is an auxiliary sequence that translastes a possible use of mini-batches when as . The moment assumption on is the convenient assumption to handle standard problems like on-line regression, logistic regression or cascade of logistic regressions used in deep learning. We emphasize that we do not make any restrictive and somewhat irrealistic boundedness assumption of the noise or of the sequence itself. Below, we will use this assumption with in Theorem 1. Finally, we should observe that this assumption introduces a possible linear dependency with on the size of the variance of the noise.
To derive the convergence of our algorithms towards a local minima, we will need a more stringent condition on the noise sequence. We then introduce the next assumption that will replace in our second main result of almost sure convergence (see Theorem 3 below).
Assumption .
. The noise sequence is centered and satisfies:
. The noise sequence is elliptic uniformly in :
We stress that the upper bound on the second order moment is not restrictive, up to a modification of the calibration of the sequence . The second assumption will be of course used to exit local traps.
We now introduce some standard assumptions on .
. The function is positive and coercive, i.e. satisfies:
Demanding the lower bound of to be strictly positive is mostly a convenient technical constraint and not fundamentally more restrictive than the classical assumption of positivity.
. We assume that satisfies the so-called Lipschitz continuous gradient property:
We emphasize that this implies the famous descent inequality:
This assumption is commonly used in optimization theory and statistics. Even though it is possible to address some more sophisticated situations (see e.g. [BBT17]), it is generally admitted that most of machine learning optimization problems fall into the Lipschitz continuous gradient framework.
. We also assume that another constant exists such that:
This last assumption prevents from some too large growth of the function and it is immediate to verify that implies that has a subquadratic growth, i.e. . It has been widely used in the literature of stochastic algorithm (see e.g. [GPS18] and the references therein).
We finally introduce our assumptions on the step-size sequences used all along the paper that are involved in and . To easily assess some convergence results with quantitative conditions on our gain sequences, we will consider the situations where:
. The sequences and satisfy:
. The mini-batch sequence satisfies:
. As already discussed in Section 2.2.3, the sequence satisfies:
We point out that and may be (or not) some vanishing sequences (if and or if ).
4 Main results
We now state our three main convergence results for the stochastic algorithm defined in Equation (1).
Assume that , and hold for . Then converges almost surely towards where .
Theorem 1 is a purely asymptotic convergence results. It provides the convergence of our adaptive algorithm (1) towards a set of critical points under mild assumptions on the noise sequence and on the function . We emphasize that this results holds for a standard setup on stochastic algorithms with a decreasing learning rate . We observe that the essential condition involved in this result is the convergence of the series that depend on . In particular, when , we observe that Theorem 1 holds when:
From a theoretical point of view, the less restrictive situation corresponds to the choice since the series converges as soon as decreases like . It implies that either we need to use a very lengthy decrease of the update induced by , or use a very lengthy increase of the minibatch proportional, with a batch of size at step . Of course, this last condition holds as soon as . When is chosen lower than , the condition becomes , which may lead to a larger computational cost.
Using the point of view introduced in [GL13] to assess the computational cost of non-convex stochastic optimization, it is possible to derive a more quantitative result on the sequence . This result is stated in terms of the expected value of the gradient of all along the algorithm. A -approximation computational cost is then the number of samples that are necessary to obtain an average value below .
Assume that and hold for and consider an integer and an integer sampled uniformly over :
If and and , then
and the computational cost to obtain a -approximation is .
If and and , then
and the computational cost to obtain a -approximation is of order .
If and and , then
and the computational cost to obtain a -approximation is of order .
If and and , then
and the computational cost to obtain a -approximation is of order .
We emphasize that this last result is not a real convergence result, which is indeed impossible to derive with a constant step-size stochastic algorithm. Nevertheless, it may be seen as a benchmark result following the usages in non-convex machine learning optimization. It is a convenient way to assess a mean square convergence of stochastic optimization algorithm with non-convex landscape (see e.g. [GL13]).
We recover in this result a more quantitative result that translates both the linear effect of the dimension on the “convergence” rate and the dependency of the final bound in terms of when the algorithm is randomly stopped uniformly between iteration and . The presence of both and of is not surprising as it already appears to be the minimax rate of convergence in stochastic optimization with weakly convex landscapes (see e.g. [NY83]).
If we translate the upper bound of into a complexity bound, we observe that for any , we need to fix such that
When we use a mini-batch strategy with samples at each iteration such that the error bound produced by is lower than , we observe that has to be chosen of the order and the overall procedure may be improved (when compared to the first setting) since we obtain a computational cost. Finally, the otimal tuning of the algorithm seems to be the last ones, where is chosen of the order and , or and , which leads to a computational cost. As discussed in Section 3.2, with this strategy, it seems impossible to improve the computational cost obtained with other choices of the parameters.
A such improvement comes from a careful tuning of a Lyapunov function that is not exactly the same as the one used in these previous works. We refer to Sections 3.1 and 3.2 for further details. Finally, we also point out that when the sequence is kept fixed, as indicated in the paragraph 2.2.2 it corresponds to a choice of kept constant all over the time evolution (see Equation (8)) and . This result with this range of parameters appears to be in line with those of [WWB19, DBBU20], but in our work the assumptions on the noise sequence and on the function are significantly weaker.
In this paragraph, we assess the almost sure convergence of the sequence towards a local minimum of and state that the algorithm cannot converge towards an unstable (hyperbolic) point of the dynamical system, e.g. cannot converge towards a saddle point or a local maximum of .
Assume that , and hold. Assume that is twice differentiable. Suppose , , and let be chosen such that:
Then almost surely the sequence does not converge towards a local maximum of .
Several remarks are necessary after our last theorem, that identifies not only the limit points as the critical points of but as local minimizers. Hence, our contribution should be understood as a new example of stochastic method that avoids local traps, and then compared to [Pem90, BD96, GPS18, BBHS20].
Moreover, we point out that our result holds for every initialization point and do not use any integration over . Hence, the nature of our result is different from the ones obtained in recent contributions in the field of machine learning (see e.g. [LSJR16, LPP+17]): we establish that our stochastic algorithm converges with probability 1 to a minimizer, which is different from proving that a deterministic or randomized algorithm (gradient descent in [LPP+17] for example) randomly initialized converges to a local minimum with probability 1.
When the variance of the noise sequence is kept fixed all along the iterations (no mini-batch is used, so that ), the previous conditions on the parameters can be summarized as: ; , when and when .
Finally, we should emphasize that this last result is rather difficult to obtain when the mini-batch parameter is chosen strictly greater than since it translates a possible vanishing level of noise when the number of iterations is increasing. Our assumption shows that the size of the mini-batch should not grow to fast (induced by the condition ) to obtain the convergence towards a local minimizer of . Up to our knowledge, a such explicit phenomenon is new in the stochastic algorithm community. It would deserve further numerical or theoretical investigations to identify whether this condition is necessary to convert an almost sure convergence result towards critical points into a convergence result towards a stable point of the differential system. The limiting condition seems to be and . As a really ambitious question, we leave this problem for future developments and up to our knowledge, the maximal size of the mini-batch that guarantees convergence to local minimizer is still an unresolved question even for the SGD.
5 Organization of the paper
The rest of the paper consists in showing the proofs of the previous results.
Theorems 1 and 2 are proven in Section 3. In particular, Proposition 3.1 studies the average one-step evolution of the algorithm through several key functions. This proposition permits to derive a Lyapunov function in Section 3.2 that translates both the average quantitative result of Theorem 2 and the asymptotic convergence result of Theorem 1. The main difficulties in these two results is to derive a mean reverting effect in terms of without using some extra boundedness assumption, and to assess the influence of on the quantitative result.
Theorem 3 is a typical result of stochastic algorithms, and is inspired from the seminal contributions of [Pem90] and [BH95]. The proof is detailed in Section 4 and the cornerstone of this proof is the use of the stable/unstable manifold Lemma that provides an ad-hoc Lyapunov function of the dynamical system, denoted by in Proposition 4.1. We also refer to the recent contribution of [BBHS20] for another typical application to stochastic algorithms. The main novelty brought in our proof is to use the mini-batch low noise level and keep the a.s. escape of local maximum true. In particular, from a technical point of view, we take advantage of the boundedness series of Proposition 3.2, which is a key ingredient in the proof of Proposition 4.3.
Almost sure convergence to the set of critical points
The purpose of this section is to prove Theorem 1 and Theorem 2. In particular, we will obtain Theorem 2 during the proof of the almost sure convergence result as a specific point of Proposition 3.2, . The basic ingredient of our proof relies on the Robbins-Siegmund Theorem [RS71] that will be applied with the help of an ad-hoc Lyapunov function on .
Below, we will pay a specific attention to the dimension dependency in the inequalities we will obtain. We first state the following proposition that will create a mean reverting effect from iteration to iteration on the pair .
Assume that , that , that and hold for .
A constant independent from and {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}d} exists such that for any :
then a constant independent from and exists such that:
then a constant independent from and exists such that:
Proof of : We observe that:
where the last line comes from the definition of and the fact that:
Proof of : We develop : we use the descent inequality (12):
where in the last line we used a rough upper bound on and . It is also possible to use (12) and prove the lower bound:
To deduce , we then consider the conditional expectation in (14), use Equation (13) and assumption for to obtain that a constant exists such that:
We emphasize that at this stage, this last inequality does not create any repelling effect on the position and we need to deal with the denominator , which is the purpose of .
Proof of : We consider the evolution of from to . Using (14) and (15) obtained in , we deduce that:
We now expand and , using that , we get:
When taking the conditional expectation in the previous inequality we treat some of these terms separately. First, using the centering of we have that :
where in the second line we used Cauchy-Schwarz inequality and assumption for whereas the last line comes from sub-quadratic growth assumption on the function given by (13). This last bound can also be used to control the term since:
Now, in similar manner, using repeatitly these last two assumptions ( also with ) :
Starting again with the Cauchy-Schwarz inequality for the last term we obtain :
We can now regroup the previous bounds, use equation , the fact that and that to conclude that a constant , independent of and exists such that :
Proof of : We observe that:
where the last line comes from the fact that , which implies that the last term of the right hand side exists.
Using for any , we deduce that:
Now observe that the second term can easily be bounded by :
We use this last inequality and to conclude that:
The baseline remark is that thanks to the Cauchy-Schwarz inequality, we have:
We then compute the conditional expectation of the previous terms with respect to , using the previous inequality, the centering of , and obtain that a constant independent from exists such that:
We then study each terms in the large bracket separately. Using and , we have:
The second term is handled with the help of (for ) and the Cauchy-Schwarz inequality:
③ is measurable and the subquadratic growth assumption given by (13) ensures that :
The term ④ is measurable and we use once more that to obtain that:
For ⑤ we use Assumption with and obtain that:
⑥ is close to ③, we use with and obtain that:
The last line comes from the fact that as soon as is uniformly lower bounded by a positive constant, then .
The term ⑦ is close to ⑤ and yields:
For ⑧ Assumption with implies that:
Our bounds on and the fact that ensure that a constant exists independent from such that:
Since using the definition of , we deduce that:
Proposition 3.1 will permit do derive a Lyapunov function on (see the next result) which implies the convergence of
This kind of bound has also been obtained in [DBBU20] (Theorem 4) with the help of a somewhat artificial boundedness assumption of the noisy gradients, which is not used in our work. We also point out that [ZSJ+19] propose another function that generates a mean reverting term:
and the major difference with our result is the weaker instead of in the series. In particular, a such will not allow to prove the a.s. asymptotic pseudo-trajectory result, and consequently the a.s. convergence of the trajectory towards a critical point of .
2 Proof of Theorem 2
Using Proposition 3.1, we are now ready to state the next important result, which will be key for the almost sure convergence of .
Under the assumptions of Proposition 3.1, then:
Two constants and exist such that, for all
where and are the auxiliary sequences defined in Proposition 3.1.
If , then almost surely:
Proof of . Our proof relies on a Lyapunov function defined by:
Using , and of Proposition 3.1 and the fact that , we deduce that a constant that depends on and of the next choice of and exists such that:
We observe that for any vector , we have and that is a bounded sequence, so that we can find large enough to have , and such that :
We then conclude using a straightforward recursion.
Proof of . This point proceeds with standard arguments: we use the Robbins Siegmund Lemma (see [RS71]): the series and are convergent and obtain that:
More importantly, the next series are convergent:
This ends the proof of ∎
We emphasize that are standard consequences of the Robbins-Siegmund approach on stochastic algorithms. In particular, the use of with the calibrations of and that are given in the statement of Theorem 2 instantaneously lead to the conclusion of the proof. We also point out that the basic fact to obtain this result is a tuning of the parameters thats leads to
Considering constant step-size sequences, it entails the following constraints on :
whereas the size of needed to obtain a approximation should verify that:
Hence, the computational cost is lower bounded by and this bound may be achieved as soon as
which corresponds to the tuning described in and of Theorem 2.
3 Proof of Theorem 1
For the sake of convenience, from now on we denote . Since we are now interested in purely asymptotic result, we omit the dependency with in the bounds we obtain hereafter. The main difficulty here is to convert the result of Proposition 3.2 into an a.s. convergence result on . We remind the standard definition of asymptotic pseudo-trajectories of a semiflow .
A continuous trajectory is a pseudo-trajectory of if for any finite time horizon
We refer to [BH96, Ben99] for further details. We will use in particular the cornerstone result which is reminded below:
Denote , and consider the continuous time process corresponding to a linear interpolation of , given by :
The Robbins-Siegmund Lemma ensures that the sequence converges a.s. to a finite random variable . Since all the terms of are positive, and are a.s. bounded as well. The coercivity of implies thus that is a.s. bounded. The evolution of the sequence can be written as:
The next result allows us to control the rest term and ensures that the assumptions of Proposition 4.1 of [Ben99] are fulfilled, which in turn implies that is indeed an asymptotic pseudo trajectory of the flow induced by .
Under the assumptions of Proposition 3.1 and if and . Let and defined in Equation (22). For all :
We consider a finite horizon . In order to deal with the previous sum, we write as , with , and We use the fact that:
and proceed to upper bound each term on the right hand side.
The convergence of the first term is mainly a consequence of Equations (19), (20) (convergence of the Robbins-Siegmund series), of the convergence of towards and and the fact that is a bounded sequence :
Using the convergence of the series (19) and (20) we conclude that, This last limit holds regardless :
To control the second term we observe that is a sequence of martingale increments with a bounded second order moment since
We start by decomposing as its expected value plus a martingale increment:
Since the first sum convergences to when goes to infinity. The terms of the second sum are martingale increments:
Using the fact that , we get a first bound on their second order moments:
Now using for and Inequality :
The second term can be dealt with in a similar manner, using the assumption for :
and we can apply once more Corollary 11 of [MP] to conclude that:
We now give the proof of the leading result of Section 3.
Identification of the possible limit points. The a.s. boundness of and Lemma 3.3 show that the assumptions of Proposition 4.1 of [Ben99] hold, which implies that is an asymptotic pseudo-trajectory of the differential flow induced by almost surely. It implies in particular that is a.s. bounded.
We shall deduct from Theorem 4 that all limit points of are stationary points for the differential equation and thus that . Since is a.s. bounded, it implies that .
In the meantime, since when , we observe that for any limit point , also implies that .
Convergence of . Thus (since is bounded) we have that converges to . Hence:
Since and are non-negative and positive, the last equality implies that the sequence is a.s. convergent:
Now, is an a.s. bounded sequence and the set of possible limit points for its sub-sequences is connected since:
Since is locally finite, we can conclude that has a unique adherence point and is a convergent sequence:
Almost sure convergence to a local minimum
In this paragraph, we prove that the sequence almost surely converges towards a local minimum of and cannot converge towards an unstable (hyperbolic) point of the dynamical system, e.g. cannot converge towards a saddle point or a local maximum of .
We begin with a simple statement that identifies the unstable points of the dynamical system
These equilibria may correspond to repulsive equilibria or hyperbolic points (saddle points). The next proposition makes this last sentence more precise and introduce the cornerstone function that measures in each neighborhood of an unstable (or saddle) point the distance of any point to the local stable manifold in the expanding direction.
Consider the dynamical system (27), and assume that is twice differentiable, then:
The equilibria are where is a critical point of .
If is a local maximum of , the dynamical system is unstable near .
If is a local minimum of , the dynamical system is stable near .
If is an unstable equilibria, a compact neighborhood of and a function exist such that:
If is the eigenspace associated to the negative eigenvalues of , then:
Proof of . The proof of is immediate by observing that and implies that and .
Proof of and . We use a linearization of the drift around an equilibria . Since is a critical point of , we observe that:
Consequently, we observe that , which entails:
The conclusion follows from the spectral decomposition of .
Proof of . The last point is a consequence of Proposition 3.1 of [BH95] (see also Proposition 9.5 of [Ben99]). We only need to observe that the coordinates in always correspond to the attractive manifold so that . ∎
2 Preliminary estimates
For a given integer when is in a neighborhood (given by (iv) of Proposition 4.1 ) of an unstable point, we introduce the exit time of defined by:
where will be made explicit later on, and the associated cumulative sum:
We prove the following upper bound on the second order moment of .
We decompose according to the position of with respect to :
If , there is nothing to prove.
We then consider the case when and define . A first order Taylor expansion yields:
When , the process so that is bounded. It remains to study the terms that involve and . We shall observe that:
because is bounded by definition and is continuous and is compact.
Thus from the assumption it follows that:
where is a constant that depends on , , , and . This ends the proof. ∎
Below, we will use the consequence of of Proposition 4.1: if is a point in , then a constant exists such that:
We write the joint evolution of the algorithm as:
where is a sequence of martingale increment defined by:
We introduce below the sequence that stands for the convergence rate of towards and towards :
We start once more from the decomposition of before or after and observe that:
The definition of the stopping time ensures that when , belongs to the neighborhood so we can apply Equation and obtain the following lower bound:
where the last inequality comes from (iv)c of Proposition 4.1 (Equation (28)). Using that is upper bounded on by , the definition of leads to:
This inequality associated with the Cauchy-Schwarz inequality implies that:
Observing that , we deduce that:
We then define (because is compact and is continuous). We deduce that:
Concerning the last term, it is sufficient to show that the expected value is bounded. We start by using the fact that to split the squared norm and to obtain that:
Since is continuous, is a bounded sequence. Moreover, we have seen that when :
Again, using the compactness of , the continuity of and the assumption on the noise , we deduct that exists such that:
We now gather all the terms and use the fact that the sequences , and are all bounded, to conclude that a constant exists such that:
We now study the evolution of .
Assume that is such that then:
where in the last line we used the reverting effect translated in Equation (28) and the Cauchy-Schwarz inequality. Since is positive, we then obtain that:
We denote the positive part of any real value by and we use that and the inequality
Considering Equation (38), we then observe that:
Once more the regularity of , the compactness of and the definition of , guarantee that a constant exists such that:
Computing the conditional expectation and using the arguments of (36), we have
Using that when , , we can apply of Proposition 4.1 so that:
where is the orthogonal projection on , the eigenspace associated to the negative eigenvalues of . For large enough, the almost sure convergence of to and our elliptic assumption on the sequence yield:
Decomposing now , we then deduce that:
The last line is justified by the assumption which ensures that . We shall also observe that:
Since , we can now conclude by inserting the previous bounds in the inequality .
3 End of the proof
We mimic the strategy of Lemma 9.6 of [Ben99] and of Theorem 3.2 of [GPS18] with the help of the two sequences defined in (31) and (30).
We summarize the preliminary results proven in the previous subsection as they will we used in what follows:
Proposition 4.2 yields that if , then
Proposition 4.2 yields: if (which also means that ), then
For any integer , we consider an integer and introduce the two sequences and defined by:
With the framework of Theorem 3 in mind we suppose that
with , and . Let us point out other restrictions on the choices of these parameters in light of previous assumptions:
A first restriction follows directly from , namely that
(to ensure that ). Furthermore, Theorem 1 demands that , implying that
To ease notations it is convenient to assume that the sequence converges to its limit at least as fast as goes to : With this in mind, the condition imposed by Proposition 4.2, translates to , and , ie . As soon all these conditions come down to:
Other restrictions will be added in the following propositions and are summarized in Theorem 3.
For any positive real value and any integer , we consider the sequence of stopping times defined by:
The stopping times stands for the first time the sequence becomes larger than the threshold , which converges towards zero with a controlled rate. We prove the following result.
Assume that with and , and choose as . Then a small enough exists and a large enough such that:
Our starting point is the lower bound given by Proposition 4.2 and we observe that a small enough exists such that:
We consider an integer and apply the Optional Stopping Theorem and verify that:
where the last inequality is a consequence of the definition of the stopping time . Since is a decreasing sequence, we then have that:
and we are led to upper bound the term .
The definition of yields:
Using a similar argument as the one used in the proof of Proposition 4.2, a Taylor expansion associated with the smoothness of and leads to:
This last bound together with inequality (41) implies that a constant exists such that:
We therefore deduce an upper bound on the probability that the stopping time is larger than :
We then take the limit in the previous inequality and obtain that:
because . Choosing now , we then observe that and then the second term on the right hand side goes to zero as soon as:
Since we have chosen and , we verify that a constant exists such that:
Therefore, the condition boils down to .
Hence, when we can conclude the proof of the proposition by setting small enough ( for example). ∎
The next result states that may remain larger than with a positive probability when . For this purpose, we introduce
that stands for the first time comes back below the threshold .
Assume that and , and that . Then there exits large enough and a constant such that
Since , we conclude with our settings that: Moreover, the choice implies that:
We then observe that as soon as:
since always holds when .
From our previous remark, we observe that for large enough if is greater than then , so Proposition 4.2 implies that is a submartingale. For such a choice of and , we can use the Doob decomposition and write that:
where is an increasing predictable process with and is a martingale. We then observe that
Furthermore, if , then , which entails:
Using the fact that is a martingale and the definition of , one can verify that:
The upper bound obtained in Proposition 4.2 is not sharp enough to be directly applied here, so in order to deal with the term on the right hand side we return to the upper bound obtained in the proof of Proposition 4.2 which gives for all :
The sequence is bounded, converges to and Theorem 1 shows that converge almost surely to , so there exists such that :
Inserting this into our previous bound (43) we get that (for large enough and )
Since we have that:
The first series of the right-hand side is handled with the simple bound . In dealing with the second term of the right-hand side we use the almost sure convergence of the series stated in (18) of Proposition 3.2 and the fact that is a decreasing sequence:
Thus there exists a constant such that for all :
Setting in the last term we have :
Now when we obtain that :
As soon as , meaning that , there exists a constant such that for large enough:
Inserting this bound in ends the proof.
At this point, Propositions 4.3 and 4.3 enable us to prove that the sequence does not converge to a.s. using the same arguments as [Ben99] and then we can conclude the proof of Theorem 3.
When the variance of the noise sequence does not converge to (), the previous conditions on the parameters can be summarized as : ; , when and for .
Step 1: does not converge to a.s. Let denote the event:
In the meantime, if then does not converge to , so . For large enough, such that Proposition 4.3 holds, and for all :
Thus, if we consider all the integers larger than , we obtain:
We now apply Proposition 4.3 and obtain that:
Step 2: the algorithm escapes any neighborhood of an unstable point in a finite time a.s.
Theorem 1 ensures that converges a.s to a point . This together with the regularity of the function implies that the sequence goes to when . Since is compact, the limit point belongs to and according to Proposition 4.1 c) there exists such that:
As seen in the proof of Theorem 1, the limit point is almost surely an equilibrium point for the dynamical system driven by , so . As a result we have that