To alleviate these limitations, the popular Adam algorithm adjusts the learning rate coordinate-wise, as a function of the past values of the squared gradient vectors’ coordinates. The algorithm thus combines the assets of inertial methods with an adaptive per-coordinate learning rate selection. Finally, the algorithm includes a so-called bias correction step. Acting on the current estimate of the gradient vector, this step is especially useful during the early iterations.
Despite the growing popularity of the algorithm, only few works investigate its behavior from a theoretical point of view (see the discussion in Section 6). The present paper studies the convergence of Adam from a dynamical system viewpoint.
We introduce a continuous-time version of the Adam algorithm under the form of a non-autonomous ordinary differential equation (ODE). Building on the existence of an explicit Lyapunov function for the ODE, we show the existence of a unique global solution to the ODE. This first result turns out to be non-trivial due to the irregularity of the vector field. We then establish the convergence of the continuous-time Adam trajectory to the set of critical points of the objective function F. The proposed continuous-time version of Adam provides useful insights on the effect of the bias correction step. It is shown that, close to the origin, the objective function F is non-increasing along the Adam trajectory, suggesting that early iterations of Adam can only improve the initial guess.
Under a Łojasiewicz-type condition, we prove that the solution to the ODE converges to a single critical point of the objective function F. We provide convergence rates in this case.
In discrete time, we first analyze the Adam iterates in the constant stepsize regime as originally introduced in . In this work, it is shown that the discrete-time Adam iterates shadow the behavior of the non-autonomous ODE in the asymptotic regime where the stepsize parameter γ of Adam is small. More precisely, we consider the interpolated process zγ(t) which consists of a piecewise linear interpolation of the Adam iterates. The random process zγ is indexed by the parameter γ, which is assumed constant during the whole run of the algorithm. In the space of continuous functions on [0,+∞) equipped with the topology of uniform convergence on compact sets, we establish that zγ converges in probability to the solution to the non-autonomous ODE when γ tends to zero.
Under a stability condition, we prove the asymptotic ergodic convergence of the probability of the discrete-time Adam iterates to approach the set of critical points of the objective function in the doubly asymptotic regime where n→∞ then γ→0.
Beyond the original constant stepsize Adam , we propose a decreasing stepsize version of the algorithm. We provide sufficient conditions ensuring the stability and the almost sure convergence of the iterates towards the critical points of the objective function.
We establish a convergence rate of the stochastic iterates of the decreasing stepsize algorithm under the form of a conditional central limit theorem.
We claim that our analysis can be easily extended to other adaptive algorithms such as e.g. RmsProp or AdaGrad and AmsGrad (see Section 6).
The paper is organized as follows. In Section 2, we present the Adam algorithm and the main assumptions. Our main results are stated in Sections 3 to 5. We provide a review of related works in Section 6. The rest of the paper addresses the proofs of our results (Sections 7 to 9).
The Adam Algorithm
It satisfies: zn=Tγ,α,β(n,zn−1,ξn), for every n≥1, where for every z=(x,m,v) in Z+, ξ∈Ξ,
The iterates zn form a non-homogeneous Markov chain, because Tγ,α,β(n,z,ξ) depends on n. This is due to the so-called debiasing step, which consists of replacing mn,vn in Algorithm 1 by their “debiased” versions m^n,v^n. The motivation becomes clear when expanding the expression:
From this equation, it is observed that, m^n forms a convex combination of the past gradients. This is unlike mn, which may be small during the first iterations.
For almost every ξ, the map f(.,ξ) is continuously differentiable.
2 Asymptotic Regime
We address the constant stepsize regime, where γ is fixed along the iterations (the default value recommended in is γ=0.001). As opposed to the decreasing stepsize context, the sequence znγ:=zn cannot in general converge as n tends to infinity, in an almost sure sense. Instead, we investigate the asymptotic behavior of the family of processes (n↦znγ)γ>0 indexed by γ, in the regime where γ→0. We use the so-called ODE method (see e.g. ). The interpolated process zγ is the piecewise linear function defined on [0,+∞)→Z+ for all t∈[nγ,(n+1)γ) by:
We establish the convergence in probability of the family of random processes (zγ)γ>0 as γ tends to zero, towards a deterministic continuous-time system defined by an ODE. The latter ODE, which we provide below at Eq. (ODE), will be referred to as the continuous-time version of Adam.
the initialization being chosen as z0γ=(x0,0,0).
Continuous-Time System
In order to gain insight into the behavior of the sequence (znγ) defined by (5), it is convenient to rewrite the Adam iterations under the following equivalent form, for every n≥1:
where we define for every γ>0, z∈Z+,
where a,b are the constants defined in Assumption 2.2. We prove that, for any fixed (t,z), the quantity h(t,z) coincides with the limit of hγ(⌊t/γ⌋,z) as γ↓0. This remark along with Eq. (6) suggests that, as γ↓0, the interpolated process zγ shadows the non-autonomous differential equation
2 Existence, Uniqueness, Convergence
Since h(.,z) is non-continuous at point zero for a fixed z∈Z+, and since h(t,.) is not locally Lipschitz continuous for a fixed t>0, the existence and uniqueness of the solution to (ODE) do not stem directly from off-the-shelf theorems.
Let x0 be fixed. A continuous map z:[0,+∞)→Z+ is said to be a global solution to (ODE) with initial condition (x0,0,0) if z is continuously differentiable on (0,+∞), if Eq. (ODE) holds for all t>0, and if z(0)=(x0,0,0).
Let Assumptions 2.1 to 2.2 hold true. There exists a unique global solution z:[0,+∞)→Z+ to (ODE) with initial condition (x0,0,0). Moreover, z([0,+∞)) is a bounded subset of Z+.
On the other hand, we note that a solution may not exist for an initial point(x0,m0,v0) with arbitrary (non-zero) values of m0,v0.
Let Assumptions 2.1 to 2.2 hold true. Assume that F(S) has an empty interior. Let z:t↦(x(t),m(t),v(t)) be the global solution to (ODE) with the initial condition (x0,0,0). Then, the set S is non-empty and limt→∞d(x(t),S)=0, limt→∞m(t)=0, limt→∞S(x(t))−v(t)=0.
Then, t↦V(t,z(t)) is decreasing if z(⋅) is the global solution to (ODE).
Cost decrease at the origin. As F itself is not a Lyapunov function for (ODE), there is no guarantee that F(x(t)) is decreasing w.r.t. t. Nevertheless, the statement holds at the origin. Indeed, it can be shown that limt↓0V(t,z(t))=F(x0) (see Prop. 7.7). As a consequence,
In other words, the (continuous-time) Adam procedure can only improve the initial guess x0. This is the consequence of the so-called bias correction steps in Adam (see Algorithm 1) . If these debiasing steps were deleted in the Adam iterations, the early stages of the algorithm could degrade the initial estimate x0.
Derivatives at the origin. The proof of Th. 3.1 reveals that the initial derivative is given by x˙(0)=−∇F(x0)/(ε+S(x0)) (see Lemma 7.1). In the absence of debiasing steps, the initial derivative x˙(0) would be a function of the initial parameters m0, v0, and the user would be required to tune these hyperparameters. No such tuning is required thanks to the debiasing step. When ε is small and when the variance of ∇f(x0,ξ) is small (i.e., S(x0)≃∇F(x0)⊙2), the initial derivative x˙(0) is approximately equal to −∇F(x0)/∣∇F(x0)∣. This suggests that in the early stages of the algorithm, the Adam iterations are comparable to the sign variant of the gradient descent, the properties of which were discussed in previous works, see .
3 Convergence rates
In this paragraph, we establish the convergence to a single critical point of F and quantify the convergence rate, using the following assumption . {assumption}[Łojasiewicz property] For any x∗∈S, there exist c>0,σ>0 and θ∈(0,21] s.t.
Assumption 3.3 holds for real-analytic functions and semialgebraic functions. We refer to for a discussion and a review of applications. We will call any θ satisfying (12) for some c,σ>0, as a Łojasiewicz exponent of f at x∗. The next result establishes the convergence of the function x(t) generated by the ODE to a single critical point of f, and provides the convergence rate as a function of the Łojasiewicz exponent of f at this critical point. The proof is provided in subsection 7.4.
Moreover, if θ∈(0,21] is a Łojasiewicz exponent of f at x∗, there exists a constant C>0 s.t. for all t≥0,
Discrete-Time System: Convergence of Adam
The sequence (ξn:n≥1) is iid, with the same distribution as ξ.
Let p>0. Assume either one of the following conditions.
The value of p will be specified in the sequel, in the statement of the results. Clearly, Assumption 4 ii is stronger than Assumption 4 i. We shall use either the latter or the former in our statements.
Convergence in the long run. When the stepsize γ is constant, the sequence (xnγ) cannot converge in the almost sure sense as n→∞. Convergence may only hold in the doubly asymptotic regime where n→∞ then γ→0.
A Decreasing Stepsize Adam Algorithm
Adam inherently uses constant stepsizes. Consequently, the iterates (5) do not converge in the almost sure sense. In order to achieve convergence, we introduce in this section a decreasing stepsize version of Adam. The iterations are given in Algorithm 2.
where for every n,k, ρn,k=αn⋯αk+1(1−αk). A similar equation holds for v^n.
2 Almost sure convergence
∑nγn=+∞ and ∑nγn2<+∞,
There exist a,b s.t. 0<b<4a, γn−1(1−αn)→a and γn−1(1−βn)→b .
Th. 5.1 establishes the almost sure convergence of xn to the set of critical points of F, under the assumption that the sequence ((xn,mn,vn)) is a.s. bounded. The next result provides a sufficient condition under which almost sure boundedness holds. {assumption} The following holds.
We assume the condition: limsupn→∞(γn1−(1−αn+11−αn+2)γn+11)<2(a−4b), which is satisfied for instance if b<4a and 1−αn+1=aγn.
3 Central Limit Theorem
Let x∗∈S. There exists a neighborhood V of x∗ s.t.
F is twice continuously differentiable on V, and the Hessian ∇2F(x∗) of F at x∗ is positive definite.
S is continuously differentiable on V.
Define D:=diag((ε+S1(x∗))−1,⋯,(ε+Sd(x∗))−1). Let P be an orthogonal matrix s.t. the following spectral decomposition holds:
where λ1,⋯,λd are the (positive) eigenvalues of D1/2∇2F(x∗)D1/2. Define
where Id represents the d×d identity matrix and ∇S(x∗) is the Jacobian matrix of S at x∗. The largest real part of the eigenvalues of H coincides with −L, where
There exist κ∈(0,1], γ0>0, s.t. the sequence (γn) satisfies γn=γ0/(n+1)κ for all n. If κ=1, we assume moreover that γ0>2L1.
The sequences (γn1(γn1−αn−a)) and (γn1(γn1−βn−b)) are bounded.
The variable vn has an impact on the limiting covariance Σ1 through its limit S(x∗) (used to define D), but the fluctuations of vn and the parameter b have no effect on Σ1. As a matter of fact, Σ1 coincides with the limiting covariance matrix that would have been obtained by considering iterates of the form
which can be interpreted as a preconditioned version of the stochastic heavy ball algorithm . Of course, the above iterates are not implementable because the preconditioning matrix D is unknown.
When a is large, Σ1 is close to the matrix Σ1(0) obtained when letting a→+∞ in Eq. (17). The matrix Σ1(0) is the solution to the Lyapunov equation
The matrix Σ1(0) can be interpreted as the asymptotic covariance matrix of the x-variable in the absence of the inertial term (that is, when one considers RmsProp instead of Adam). The matrix Σ1(0) approximates Σ1 in the sense that Σ1=Σ1(0)+a1Δ+O(a21) for some symmetric matrix Δ which can be explicited. The matrix Δ is neither positive nor negative definite in general. This suggests that the question of the potential benefit of the presence of an inertial term is in general problem dependent.
In the statement of Th. 5.3, the conditioning event {zn→z∗} can be replaced by the event {xn→x∗} under the additional assumption that ∑nγn2<+∞.
Related Works
Although the idea of adapting the (per-coordinate) learning rates as a function of past gradient values is not new (see e.g. variable metric methods such as the BFGS algorithms), AdaGrad led the way to a new class of algorithms that are sometimes referred to as adaptive gradient methods. AdaGrad consists of dividing the learning rate by the square root of the sum of previous gradients squared componentwise. The idea was to give larger learning rates to highly informative but infrequent features instead of using a fixed predetermined schedule. However, in practice, the division by the cumulative sum of squared gradients may generate small learning rates, thus freezing the iterates too early. Several works proposed heuristical ways to set the learning rates using a less aggressive policy. The work introduced an unpublished, yet popular, algorithm referred to as RmsProp where the cumulative sum used in AdaGrad is replaced by a moving average of squared gradients. Adam combines the advantages of both AdaGrad, RmsProp and inertial methods.
As opposed to AdaGrad, for which theoretical convergence guarantees exist , Adam is comparatively less studied. The initial paper suggests a O(T1) average regret bound in the convex setting, but exhibits a counterexample in contradiction with this statement. The latter counterexample implies that the average regret bound of Adam does not converge to zero. A first way to overcome the problem is to modify the Adam iterations themselves in order to obtain a vanishing average regret. This led to propose a variant called AmsGrad with the aim to recover, at least in the convex case, the sought guarantees. The work interprets Adam as a variance-adapted sign descent combining an update direction given by the sign and a magnitude controlled by a variance adaptation principle. A “noiseless” version of Adam is considered in . Under quite specific values of the Adam-hyperparameters, it is shown that for every δ>0, there exists some time instant for which the norm of the gradient of the objective at the current iterate is no larger than δ. The recent paper provides a similar result for AmsGrad and AdaGrad, but the generalization to Adam is subject to conditions which are not easily verifiable. The paper provides a convergence result for RmsProp using the objective function F as a Lyapunov function. However, our work suggests that unlike RmsProp, Adam does not admit F as a Lyapunov function. This makes the approach of hardly generalizable to Adam. Moreover, considers biased gradient estimates instead of the debiased estimates used in Adam.
In the present work, we study the behavior of an ODE, interpreted as the limit in probability of the (interpolated) Adam iterates as the stepsize tends to zero. Closely related continuous-time dynamical systems are also studied in . We leverage the idea of approximating a discrete time stochastic system by a deterministic continuous one, often referred to as the ODE method. A recent work fruitfully exploits this method to study a stochastic version of the celebrated heavy ball algorithm. We refer to for the reader interested in the non-differentiable setting with an analysis of the stochastic subgradient algorithm for non-smooth non-convex objective functions.
Concomitant to the present paper, Da Silva and Gazeau (posted only four weeks after the first version of the present work) study the asymptotic behavior of a similar dynamical system as the one introduced here. They establish several results in continuous time, such as avoidance of traps as well as convergence rates in the convex case; such aspects are out of the scope of this paper. However, the question of the convergence of the (discrete-time) iterates is left open. In the current paper, we also exhibit a Lyapunov function which allows, amongst others, to draw useful conclusions on the effect of the debiasing step of Adam . Finally, studies a slightly modified version of Adam allowing to recover an ODE with a locally Lipschitz continuous vector field, whereas the original Adam algorithm leads to an ODE with an irregular vector field. This technical issue is tackled in the present paper.
Proofs of Section 3
When η=0, Eq. (ODEη) boils down to the equation of interest (ODE). The choice η∈(0,+∞) will be revealed useful to prove Th. 3.1. Indeed, for η>0, a solution to Eq. (ODEη) can be shown to exist (on some interval) due to the continuity of the map h(.+η,.). Considering a family of such solutions indexed by η∈(0,1], the idea is to prove the existence of a solution to (ODE) as a cluster point of the latter family when η↓0. Indeed, as the family is shown to be equicontinuous, such a cluster point does exist thanks to the Arzelà-Ascoli theorem. When η=+∞, Eq. (ODEη) rewrites
where h∞(z):=limt→∞h(t,z). It is useful to note that for (x,m,v)∈Z+,
Contrary to Eq. (ODE), Eq. (ODE∞) defines an autonomous ODE. The latter admits a unique global solution for any initial condition in Z+, and defines a dynamical system D. We shall exhibit a strict Lyapunov function for this dynamical system D, and deduce that any solution to (ODE∞) converges to the set of equilibria of D as t→∞. On the otherhand, we will prove that the solution to (ODE) with a proper initial condition is a so-called asymptotic pseudotrajectory (APT) of D. Due to the existence of a strict Lyapunov function, the APT shall inherit the convergence behavior of the autonomous system as t→∞, which will prove Th. 3.2.
It is convenient to extend the map h:(0,+∞)×Z+→Z on (0,+∞)×Z→Z by setting h(t,(x,m,v)):=h(t,(x,m,∣v∣)) for every t>0, (x,m,v)∈Z. Similarly, we extend h∞ as h∞((x,m,v)):=h∞((x,m,∣v∣)). For any T∈(0,+∞] and any η∈[0,+∞], we say that a map z:[0,T)→Z is a solution to (ODEη) on [0,T) with initial condition z0∈Z+, if z is continuous on [0,T), continuously differentiable on (0,T), and if (ODEη) holds for all t∈(0,T). When T=+∞, we say that the solution is global. We denote by ZTη(z0) the subset of C([0,T),Z) formed by the solutions to (ODEη) on [0,T) with initial condition z0. For any K⊂Z+, we define ZTη(K):=⋃z∈KZTη(z).
Let Assumptions 2, 7.1 and 7.1 hold. For every η∈[0,+∞], T∈(0,+∞], z0∈Z+, z∈ZTη(z0), it holds that z((0,T))⊂Z+∗.
Set z(t)=(x(t),m(t),v(t)) for all t. Consider i∈{1,…,d}. Assume by contradiction that there exists t0∈(0,T) s.t. vi(t0)<0. Set τ:=sup{t∈[0,t0]:vi(t)≥0}. Clearly, τ<t0 and vi(τ)=0 by the continuity of vi. Since vi(t)≤0 for all t∈(τ,t0], it holds that v˙i(t)=b(Si(x(t))−vi(t)) is nonnegative for all t∈(τ,t0]. This contradicts the fact that vi(τ)>vi(t0). Thus, vi(t)≥0 for all t∈[0,T). Now assume by contradiction that there exists t∈(0,T) s.t. vi(t)=0. Then, v˙i(t)=bSi(x(t))>0. Thus, limδ↓0−δvi(t−δ)=bSi(x(t)). In particular, there exists δ>0 s.t. vi(t−δ)≤−2δbSi(x(t)). This contradicts the first point.
Recall the definitions of V and U from Eqs. (9) and (10). Clearly, U∞(v):=limt→∞U(t,v)=a(ε+v) is well defined for every v∈[0,+∞)d. Hence, we can also define V∞(z):=limt→∞V(t,z) for every z∈Z+.
Let Assumptions 7.1 and 7.1 hold. Assume that 0<b≤4a. Consider (t,z)∈(0,+∞)×Z+∗ and set z=(x,m,v). Then, V and V∞ are differentiable at points (t,z) and z respectively. Moreover, ⟨∇V∞(z),h∞(z)⟩≤−εU∞(v)am2 and
We only prove the second point, the proof of the first point follows the same line. Consider (t,z)∈(0,+∞)×Z+∗. We decompose ⟨∇V(t,z),(1,h(t,z))⟩=∂tV(t,z)+⟨∇zV(t,z),h(t,z)⟩. After tedious but straightforward derivations, we get:
where U(t,vi)=a(1−e−at)(ε+1−e−btvi) and ⟨∇zV(t,z),h(t,z)⟩ is equal to:
2 Proof of Th. 3.1
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b≤4a. For every z0∈Z0, there exists a compact set K⊂Z+ s.t. for all η∈[0,+∞), all T∈(0,+∞] and all z∈ZTη(z0), {eˉ(t+η,z(t)):t∈(0,T)}⊂K. Moreover, choosing z0 of the form z0=(x0,0,0) and z(t)=(x(t),m(t),v(t)), it holds that F(x(t))≤F(x0) for all t∈[0,T).
Let η∈[0,+∞). Consider a solution zη(t)=(xη(t),mη(t),vη(t)) as in the statement, defined on some interval [0,T). Define m^η(t)=mη(t)/(1−e−a(t+η)), v^η(t)=vη(t)/(1−e−b(t+η)). By Lemma 7.3, t↦V(t+η,z(t)) is continuous on [0,T), and continuously differentiable on (0,T). By Lemma 7.5, V˙(t+η,zη(t))≤0 for all t>0. As a consequence, t↦V(t+η,zη(t)) is non-increasing on [0,T). Thus, for all t≥0, F(xη(t))≤limt′↓0V(t′+η,zη(t′)). Note that V(t+η,zη(t))≤F(xη(t))+21∑i=1da(1−e−a(t+η))εmη,i(t)2. If η>0, every term in the sum in the righthand side tends to zero, upon noting that mη(t)→0 as t→0. The statement still holds if η=0. Indeed, by Lemma 7.1, for a given i∈{1,…,d}, there exists δ>0 s.t. for all 0<t<δ, mη,i(t)2≤2a2(∂iF(x0))2t2 and 1−e−at≥(at)/2. As a consequence, each term of the sum is no larger than 4(∂iF(x0))2t/ε, which tends to zero as t→0. We conclude that for all t≥0, F(xη(t))≤F(x0). In particular, {xη(t):t∈[0,T)}⊂{F≤F(x0)}, the latter set being bounded by Assumption 2.
As V(t+η,zη(t))≤F(x0), we obtain: F(x0)≥F(xη(t))+21∥mη(t)∥U(t+η,vη(t))−12. Thus, F(x0)≥infF+2a(ε+M)1∥mη(t)∥2. Therefore, mη(.) is bounded on [0,T), uniformly in η. The same holds for m^η by using the mean value theorem in the same way as for v^η. The proof is complete.
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b≤4a. Let K be a compact subset of Z+. Then, there exists an other compact set K′⊂Z+ s.t. for every T∈(0,+∞] and every z∈ZT∞(K), z([0,T))⊂K′.
The proof follows the same line as Prop. 7.7 and is omitted.
For any K⊂Z+, define vmin(K):=inf{vk:(x,m,v)∈K,i∈{1,…,d}}.
Under Assumptions 2, 2, 7.1 and 7.1, the following holds true.
For every compact set K⊂Z+, there exists c>0, s.t. for every z∈Z∞∞(K), of the form z(t)=(x(t),m(t),v(t)), vi(t)≥cmin(1,2cvmin(K)+t)(∀t≥0,∀i∈{1,…,d}).
For every z0∈Z0, there exists c>0 s.t. for every η∈[0,+∞) and every z∈Z∞η(z0), vi(t)≥cmin(1,t)(∀t≥0,∀i∈{1,…,d}).
Set κ1:=0.5(vmin+bSminτ). Note that vi(τ)≥κ1. Define Smin′:=inf{Si(x):i∈{1,…,d},(x,m,v)∈K′}. Note that Smin′>0 by Assumptions 7.1 and 2. Finally, define κ=0.5min(κ1,Smin′). By contradiction, assume that the set {t≥τ:vi(t)<κ} is non-empty, and denote by τ′ its infimum. It is clear that τ′>τ and vi(τ′)=κ. Thus, b−1v˙i(τ′)=Si(x(τ′))−κ. We obtain that b−1v˙i(τ′)≥0.5Smin′>0. As a consequence, there exists t∈(τ,τ′) s.t. vi(t)<vi(τ′). This contradicts the definition of τ′. We have shown that for all t≥τ, vi(t)≥κ. Putting this together with Eq. (21) and using that κ≤vmin+bSminτ, we conclude that: ∀t≥0,vi(t)≥min(κ,2vmin+2bSmint). Setting c:=min(κ,bSmin/2), the result follows.
We prove the second point. By Prop. 7.7, there exists a compact set K⊂Z+ s.t. for every η≥0, every z∈Z∞η(x0) of the form z(t)=(x(t),m(t),v(t)) satisfies {(x(t),m^(t),v^(t)):t≥0}⊂K, where m^(t)=m(t)/(1−e−a(t+h)) and v^(t)=v(t)/(1−e−b(t+h)). Denote by LS the Lipschitz constant of S on the set {x:(x,m,v)∈K}. Introduce the constants M1:=sup{∥m/(ε+v)∥∞:(x,m,v)∈K}, M2:=sup{∥S(x)∥∞:(x,m,v)∈K′}. These constants being introduced, the rest of the proof follows the same line as the proof of the first point.
2.2 Existence
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b≤4a. For every z0∈Z+,Z∞∞(z0)=∅. For every (z0,η)∈Z0×(0,+∞),Z∞η(z0)=∅.
We prove the first point (the proof of the second point follows the same line). Under Assumptions 7.1 and 7.1, h∞ is continuous. Therefore, Cauchy-Peano’s theorem guarantees the existence of a solution to the (ODE) issued from z0, which we can extend over a maximal interval of existence [0,Tmax).We conclude that the solution is global (Tmax=+∞) using the boundedness of the solution given by Prop. 7.9.
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b≤4a. Consider z0∈Z0. Denote by (zη:η∈(0,+∞)) a family of functions on [0,+∞)→Z+ s.t. for every η>0, zη∈Z∞η(z0). Then, (zη)η>0 is equicontinuous.
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b≤4a. For every z0∈Z0, Z∞0(z0)=∅ i.e., (ODE) admits a global solution issued from z0.
and the righthand side converges to zero. As a consequence, for all t>s, z(t)=z(s)+∫sth(u,z(u))du. Moreover, z(0)=z0. This proves that z∈Z∞0(z0).
2.3 Uniqueness
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume b≤4a. For every z0∈Z0, Z∞0(z0) is a singleton i.e., there exists a unique global solution to (ODE) with initial condition z0.
Consider solutions z and z′ in Z∞0(z0). We denote by (x(t),m(t),v(t)) the blocks of z(t), and we define (x′(t),m′(t),v′(t)) similarly. For all t>0, we define m^(t):=m(t)/(1−e−at), v^(t):=v(t)/(1−e−bt), and we define m^′(t) and v^′(t) similarly. By Prop. 7.7, there exists a compact set K⊂Z+ s.t. (x(t),m^(t),v^(t)) and (x′(t),m^′(t),v^′(t)) are both in K for all t>0. We denote by LS and L∇F the Lipschitz constants of S and ∇F on the compact set {x:(x,m,v)∈K}. These constants are finite by Assumptions 7.1 and 7.1. We define M:=sup{∥m∥∞:(x,m,v)∈K}. Define ux(t):=∥x(t)−x′(t)∥2, um(t):=∥m^(t)−m^′(t)∥2 and uv(t):=∥v^(t)−v^′(t)∥2. Let δ>0. Define: u(δ)(t):=ux(t)+δum(t)+δuv(t). By the chain rule and the Cauchy-Schwarz inequality, u˙x(t)≤2∥x(t)−x′(t)∥∥ε+v^(t)m^(t)−ε+v^′(t)m^′(t)∥. Thus, using Lemma 7.11, there exists c>0 s.t.
For any δ>0, 2∥x(t)−x′(t)∥∥m^(t)−m^′(t)∥≤δ−1/2(ux(t)+δum(t))≤δ−1/2u(δ)(t). Similarly, 2∥x(t)−x′(t)∥∥v^(t)−v^′(t)∥≤δ−1/2u(δ)(t). Thus, for any δ>0,
We now study um(t). For all t>0, we obtain after some algebra: dtdm^(t)=a(∇F(x(t))−m^(t))/(1−e−at). Therefore, u˙m(t)≤1−e−at2aL∇F∥m^(t)−m^′(t)∥∥x(t)−x′(t)∥. For any θ>0, it holds that 2∥m^(t)−m^′(t)∥∥x(t)−x′(t)∥≤θux(t)+θ−1um(t). In particular, letting θ:=2L∇F, we obtain that for all δ>0,
where the last inequality is due to the fact that y/(1−e−y)≤1+y for all y>0. Using the exact same arguments, we also obtain that
We now choose any δ s.t. 4δ≤1/max(LS2,L∇F2). Then, Eq. (23) and (24) respectively imply that δu˙m(t)≤0.5(a+t−1)u(δ)(t) and δu˙v(t)≤0.5(b+t−1)u(δ)(t). Summing these inequalities along with Eq. (22), we obtain that for every t>0, u˙(δ)(t)≤ψ(t)u(δ)(t), where: ψ(t):=2a+b+εδ1+2ε2δcmin(1,t)M+t1. From Grönwall’s inequality, it holds that for every t>s>0, u(δ)(t)≤u(δ)(s)exp(∫stψ(s′)ds′). We first consider the case where t≤1. We set c1:=(a+b)/2+(εδ)−1 and c2:=M/(ε2δc). With these notations, ∫stψ(s′)ds′≤c1t+c2t+lnst. Therefore, u(δ)(t)≤su(δ)(s)exp(c1t+c2t+lnt). By Lemma 7.1, recall that x˙(0) and x˙′(0) are both well defined (and coincide). Thus,
By Lemma 7.1, z and z′ are differentiable at point zero. Then, the above inequality gives limsups↓0s∣m^i(s)−m^i′(s)∣≤4(L∇F∨1)∥z˙(0)∥ and limsups↓0s2um(s)≤16d(L∇F2∨1)∥z˙(0)∥2.Therefore, um(s)/s converges to zero as s↓0. By similar arguments, it can be shown that limsups↓0uv(s)/s2≤16d(LS2∨1)∥z˙(0)∥2, thus limuv(s)/s=0. Finally, we obtain that u(δ)(s)/s converges to zero as s↓0. Letting s tend to zero, we obtain that for every t≤1, u(δ)(t)=0. Setting s=1 and t>1, and noting that ψ is integrable on [1,t], it follows that u(δ)(t)=0 for all t>1. This proves that z=z′.
We recall that a semiflow Φ on the space (E,d) is a continuous map Φ from [0,+∞)×E to E defined by (t,x)↦Φ(t,x)=Φt(x) such that Φ0 is the identity and Φt+s=Φt∘Φs for all (t,s)∈[0,+∞)2.
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b≤4a. The map Z∞∞ is single-valued on Z+→C([0,+∞),Z+) i.e., there exists a unique global solution to (ODE∞) starting from any given point in Z+. Moreover, the following map is a semiflow:
The result is a direct consequence of Lemma 7.19.
3 Proof of Th. 3.2
Consider a semiflow Ψ on (E,d) and a map z:[0,+∞)→E. Assume the following:
Ψ admits a strict Lyapunov function V.
The set ΛΨ of equilibrium points of Ψ is compact.
V(ΛΨ) has an empty interior.
Then, ⋂t≥0z([t,∞)) is a compact connected subset of ΛΨ .
For every δ>0 and every z=(x,m,v)∈Z+, define:
where we recall that V∞(z):=limt→∞V(t,z) for every z∈Z+ and V is defined by Eq.(9). Consider the set E:=h∞−1({0}) of all equilibrium points of (ODE∞), namely: E={(x,m,v)∈Z+:∇F(x)=0,m=0,v=S(x)}. The set E is non-empty by Assumption 2.
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b≤4a. Let K⊂Z+ be a compact set. Define K′:={Φ(t,z):t≥0,z∈K}. Let Φ:[0,+∞)×K′→K′ be the restriction of the semiflow Φ to K′ i.e., Φ(t,z)=Φ(t,z) for all t≥0,z∈K′. Then,
Φ is well defined and is a semiflow on K′.
The set of equilibrium points of Φ is equal to E∩K′.
There exists δ>0 s.t. Wδ is a strict Lyapunov function for the semiflow Φ.
Denote by LS the Lipschitz constant of S on {x:(x,m,v)∈K′}. For every t>0,
Using that 2∥m(t)∥∥S(x(t))−v(t)∥≤bεLS∥m(t)∥2+LSbε∥S(x(t))−v(t)∥2, we obtain LH(t)≤−b∥S(x(t))−v(t)∥2+bε2LS2∥m(t)∥2. Hence, for every t>0,
where M(δ):=ε(ε+C1)−2−bε2δLS2−δ(2a+εL∇F). Choosing δ s.t. M(δ)>0,
where c:=min{M(δ),2aδ,δb}. It can easily be seen that for every z∈K′, t↦Wδ(Φt(z)) is Lipschitz continuous, hence absolutely continuous. Its derivative almost everywhere coincides with LWδ, which is non-positive. Thus, Wδ is a Lyapunov function for Φ. We prove that the Lyapunov function is strict. Consider z∈K′ s.t. Wδ(Φt(z))=Wδ(z) for all t>0. The derivative almost everywhere of t↦Wδ(Φt(z)) is identically zero, and by Eq. (27), this implies that −c(∥mt∥2+∥∇F(xt)∥2+∥S(xt)−vt∥2) is equal to zero for every t a.e. (hence, for every t, by continuity of Φ). In particular for t=0, m=∇F(x)=0 and S(x)−v=0. Hence, z∈h∞−1({0}).
Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b≤4a. For every z∈Z+, limt→∞d(Φ(z,t),E)=0.
Use Prop. 7.24 with K:={z}. and [15, Th. 2.1.7].
3.2 Asymptotic Behavior of the Solution to (ODE)
Let Assumptions 2, 2, 7.1 and 7.1 hold true. Assume that 0<b≤4a. Then, for every z0∈Z0, Z∞0(z0) is an asymptotic pseudotrajectory of the semiflow Φ given by (25).
Consider z0∈Z0, T∈(0,+∞) and define z:=Z∞0(z0). Consider t≥1. For every s≥0, define Δt(s):=∥z(t+s)−Φ(z(t))(s)∥. The aim is to prove that sups∈[0,T]Δt(s) tends to zero as t→∞. Putting together Prop. 7.7 and Lemma 7.11, the set K:={z(t):t≥1} is a compact subset of Z+∗. Define C(t):=sups≥0supz′∈K∥h(t+s,z′)−h∞(z′)∥. It can be shown that limt→∞C(t)=0. We obtain that for every s∈[0,T], Δt(s)≤TC(t)+∫0s∥h∞(z(t+s′))−h∞(Φ(z(t))(s′))∥ds′. By Lemma 7.11, K′:=⋃z′∈Φ(K)z′([0,∞)) is a compact subset of Z+∗. It is immediately seen from the definition that h∞ is Lipschitz continuous on every compact subset of Z+∗, hence on K∪K′. Therefore, there exists a constant L, independent from t,s, s.t. Δt(s)≤TC(t)+∫0sLΔt(s′)ds′(∀t≥1,∀s∈[0,T]). Using Grönwall’s lemma, it holds that for all s∈[0,T], Δt(s)≤TC(t)eLs. As a consequence, sups∈[0,T]Δt(s)≤TC(t)eLT and the righthand side converges to zero as t→∞.
End of the Proof of Th. 3.2
By Prop. 7.7, the set K:=Z∞0(z0)([0,∞)) is a compact subset of Z+. Define K′:={Φ(t,z):t≥0,z∈K}, and let Φ:[0,+∞)×K′→K′ be the restriction Φ to K′. By Prop. 7.24, there exists δ>0 s.t. Wδ is a strict Lyapunov function for the semiflow Φ. Moreover, the set of equilibrium points coincides with E∩K′. In particular, the equilibrium points of Φ form a compact set. By Prop. 7.28, Z∞0(z0) is an APT of Φ. Note that every z∈E can be written under the form z=(x,0,S(x)) for some x∈S. From the definition of Wδ in (26), Wδ(z)=Wδ(x,0,S(x))=V∞(x,0,S(x))=F(x). Since F(S) is assumed to have an empty interior, the same holds for Wδ(E∩K′). By Prop. 7.23, ⋂t≥0Z∞0(z0)([t,∞))⊂E∩K′. The set in the righthand side coincides with the set of limits of convergent sequences of the form Z∞0(z0)(tn) for tn→∞. As Z∞0(z0)([0,∞)) is a bounded set, d(Z∞0(z0)(t),E) tends to zero.
4 Proof of Th. 3.3
The proof follows the path of [16, Th. 10.1.6, Th. 10.2.3], but requires specific adaptations to deal with the dynamical system at hand. Define for all δ>0, t>0, and z=(x,m,v),
Upper-bound on wδ(t). From Eq. (9), we obtain that for every t≥1, V(t,z(t))≤∣F(x(t))∣+2aε(1−e−a)∥m(t)∥2.Using ⟨∇F(x),m⟩≤(∥∇F(x)∥2+∥m∥2)/2, we obtain that there exists a constant c1 (depending on δ) s.t. for every t≥1,
Upper-bound on dtdwδ(t). The function wδ is absolutely continuous on [1,+∞). Moreover, there exist δ>0, c2>0 (both depending on x0) s.t. for every t≥1 a.e.,
The proof of Eq. (30) uses arguments that are similar to the ones used in the proof of Prop. 7.24 (just use Lemma 7.5 to bound the derivative of the first term in Eq. (28)). For this reason, it is omitted.
Putting together (29) and (30) using the Łojasiewicz condition. By Prop. 7.23 and 7.28, the set L:=⋃s≥0{z(t):t≥s} is a compact connected subset of E={(x,0,S(x)):∇F(x)=0}. The set U:={x:(x,0,S(x))∈L} is a compact and connected subset of S. Using Assumption 3.3 and [16, Lemma 2.1.6], there exist constants σ,c>0 and θ∈(0,21], s.t. ∥∇F(x)∥≥c∣F(x)∣1−θ for all x s.t. d(x,U)<σ. As d(x(t),U)→0, there exists T≥1 s.t. for all t≥T, ∥∇F(x(t))∥≥c∣F(x(t))∣1−θ. Thus, we may replace the term ∥∇F(x(t))∥2 in the righthand side of Eq. (30) using ∥∇F(x(t))∥2≥21∥∇F(x(t))∥2+21∣F(x(t))∣2(1−θ). Upon noting that 2(1−θ)≥1, we thus obtain that there exists a constant c3 and some T′≥1 s.t. for t≥T′ a.e.,
Putting this inequality together with Eq. (29), we obtain that for some constant c4>0 and for all t≥T′ a.e., dtdwδ(t)≤−c4wδ(t)2(1−θ).
End of the proof. Following the arguments of [16, Th. 10.1.6], by integrating the preceding inequality, over [T′,t], we obtain wδ(t)≤c5t−1−2θ1 for t≥T′ in the case where θ<21, whereas wδ(t) decays exponentially if θ=21. From now on, we focus on the case θ<21. By definition of (ODE), ∥x˙(t)∥2≤∥m(t)∥2/((1−e−aT′)2ε2) for all t≥T′. Since Eq. (30) implies ∥m(t)∥2≤−w˙δ(t)/c2, we deduce that there exists c,c′>0 s.t. for all t≥T′, ∫t2t∥x˙(s)∥2ds≤cwδ(t)≤c′t−1−2θ1.Applying [16, Lemma 2.1.5], it follows that ∫t∞∥x˙(s)∥2ds≤ct−1−2θθ for some other constant c. Therefore x∗:=limt→+∞x(t) exists by Cauchy’s criterion and for all t≥T′, ∥x(t)−x∗∥≤ct−1−2θθ . Finally, since x(t)→a, we remark that, using the same arguments, the global Łojasiewicz exponent θ can be replaced by any Łojasiewicz exponent of f at x∗. When θ=21, the proof follows the same line.
Proofs of Section 4
Let Assumptions 2.1, 2.2 and 4 hold true. There exists γˉ0>0 s.t. for every R>0, there exists s>0,
Let R>0. We denote by (Hγ,x,Hγ,m,Hγ,v) the block components of Hγ. There exists a constant Cs depending only on s s.t. ∥Hγ∥1+s≤Cs(∥Hγ,x∥1+s+∥Hγ,m∥1+s+∥Hγ,v∥1+s). Hence, it is sufficient to prove that Eq. (31) holds respectively when replacing Hγ with each of Hγ,x,Hγ,m,Hγ,v. Consider z=(x,m,v) in Z+. We write: ∥Hγ,x(n+1,z,ξ)∥≤ε−1(∥1−αˉ(γ)nm∥+∥∇f(x,ξ)∥). Thus, for every z s.t. ∥eγ(n,z)∥≤R, there exists a constant C depending only on ε, R and s s.t. ∥Hγ,x(n+1,z,ξ)∥1+s≤C(1+∥∇f(x,ξ)∥1+s). By Assumption 4, (31) holds for Hγ,x instead of Hγ. Similar arguments hold for Hγ,m and Hγ,v upon noting that the functions γ↦(1−αˉ(γ))/γ and γ↦(1−βˉ(γ))/γ are bounded under Assumption 2.2.
Choose γˉ0>0 and s>0 as in Lemma 8.1. For every γ≤γˉ0,
By Lemma 8.1, the righthand side is finite and does not depend on (n,γ).
For every γ,R>0, we define zγ,R:=Xγ(zγ,R)=Xγ∘BR(zγ). Namely, zγ,R is the interpolated process associated with the sequence (znγ,R). It is a random variable on C([0,+∞),Z). We recall that Fn is the σ-algebra generated by the r.v. (ξk:1≤k≤n). For every γ,n,R, we use the notation: Δ0γ,R:=0 and
It is an immediate consequence of Lemma 8.3 and [6, Lemma 6.2]
The proof of the following lemma is omitted.
By Eq. (34), the RHS of the above inequality tends to zero as γ→0. The proof is complete.
2 Proof of Th. 13
In the sequel, we will use the notation Pˉγ,x as a shorthand notation for Pˉγ,δx where δx is the Dirac measure at some point x∈X. Finally, let Ψ be a semiflow on X. A Markov kernel P is Feller if Pf is continuous for every bounded continuous f. {assumption} Let ν∈P(X).
For every γ, Pˉγ is Feller.
For every δ>0, for every compact set K⊂X, for every t>0, limγ→0supx∈KPˉγ,x(∥X⌊t/γ⌋−Ψt(x)∥>δ)=0.
Let BCΨ be the Birkhof center of Ψ i.e., the closure of the set of recurrent points.
We omit the proof of this result which follows a similar reasoning to [6, Th. 5.5 and Proof in section 8.4] and makes use of results from .
Proofs of Section 5
The following lemma will be useful in the proofs.
Let the sequence (rn) be defined as in Algorithm 2. Assume that 0≤αn≤1 for all n and that (1−αn)/γn→a>0 as n→+∞. Then,
The sequence (rn) is nondecreasing and converges to 1.
Under Assumption 5.3 1, for every ϵ>0, for sufficiently large n, we have rn−1≤e−2(1−κ)aγ0n1−κ if κ∈(0,1) and rn−1≤n−aγ0/(1+ϵ) if κ=1.
A similar lemma holds for the sequence (rˉn).
We define zˉn=(xn−1,mn,vn) (note the shift in the index of the variable x). We have
where h∞ is defined in Eq. (18) and where we set
and ςn+1=(ςn+1x,ςn+1m,ςn+1v) with the components defined by: ςn+1x=ε+vnmn−γn+1γnε+v^nm^n, ςn+1m=(γn+11−αn+1−a)(∇F(xn)−mn)+a(∇F(xn)−∇F(xn−1)) and ςn+1v=(γn+11−βn+1−b)(S(xn)−vn)+b(S(xn)−S(xn−1)) . We prove that ςn→0 a.s. Using the triangular inequality,
(where τn=∑k=0nγk), is almost surely a bounded APT of the semiflow Φˉ defined by (ODE∞). The proof is concluded by applying Prop. 7.23 and Prop. 7.24.
2 Proof of Prop. 5.2
As infF>−∞, one can assume without loss of generality that F≥0. In the sequel, C denotes some positive constant which may change from line to line. We define an:=(1−αn+1)/γn and Pn:=2anrn1⟨mn⊙2,ε+v^n1⟩. We have an→a and rn→1. By Assumption 5.1-1,
We set un:=1−anan+1 and Dn:=ε+v^nrn−1, so that Pn=2an1⟨Dn,mn⊙2⟩. We can write:
We estimate the vector Dn+1−Dn. Using that (rn−1) is non-increasing,
Remarking that vn+1≥βn+1vn, recalling that (rˉn) is nondecreasing and using the update rules of vn and rˉn, we obtain after some algebra
It is easy to see that cn/γn→b/2. Thus, for any δ>0, cn+1≤(b+2δ)γn/2 for all n large enough. Using also that v^n+1/(ε+v^n+1)≤1, we obtain that Dn+1−Dn≤2b+2δγnDn. Substituting this inequality in Eq. (36), we get
As an→a, we have an−4b+2δ≥a−4b+δ for all n large enough. Hence,
Using the Cauchy-Schwartz inequality and Assumption 5.1 2, it is easy to show the inequality ⟨∇F(xn),ε+v^nm^n⟩≤C(1+F(xn)+Pn). Moreover, using the componentwise inequality (∇fn+1−mn)⊙2≤2∇fn+1⊙2+2mn⊙2 along with Assumption 5.1 2, we obtain
Putting all pieces together with Eq. (35),
Define Vn:=(1−Cγn−12)F(xn−1)+(1−un−1)Pn where the constant C is fixed so that Eq. (38) holds. Then,
By Assumption 5.1, limsupnun−1/γn<2a−b/2 and for δ small enough, we obtain
3 Proof of Th. 5.3
We use [20, Th. 1]. All the assumptions in the latter can be verified in our case, at the exception of a positive definiteness condition on the limiting covariance matrix, which corresponds, in our case, to the matrix Q given by Eq. (16). As Q is not positive definite, it is strictly speaking not possible to just cite and apply [20, Th. 1]. Nevertheless, a detailed inspection of the proofs of shows that only a minor adaptation is needed in order to cover the present case. Therefore, proving the convergence result of from scratch is worthless. It is sufficient to verify the assumptions of [20, Th. 1] (except the definiteness of Q) and then to point out the specific part of the proof of which requires some adaptation.
Let zn=(xn,mn,vn) be the output of Algorithm 2. Define z∗=(x∗,0,S(x∗)). Define ηn+1:=(0,a(∇fn+1−∇F(xn)),b(∇fn+1⊙2−S(xn))). We have
where ϵn+1:=(ϵn+11,ϵn+12,ϵn+13), whose components are given by
Here, ηn+1 is a martingale increment noise and ϵn+1=(ϵn+11,ϵn+12,ϵn+13) is a remainder term. The aim is to check the assumptions (A1.1) to (A1.3) of , where the role of the quantities (h, εn, rn, σn, α, ρ, β) in is respectively played by the quantities (h∞, ηn, ϵn, γn, κ, 1, 1) of the present paper.
It is sufficient to verify the latter for ϵni (i=1,2,3) in place of ϵn. The map (m,v)↦m/(ε+v) is Lipschitz continuous in a neighborhood of (0,S(x∗)) by Assumption 2. Thus, for M small enough, there exists a constant C s.t. if ∥zn−z∗∥≤M, then ∥ϵn+11∥≤Crn+1−1mn+1−mn+Crˉn+1−1vn+1−vn. Using the triangular inequality and the fact that rn+1,rˉn+1 are bounded sequences away from zero, there exists an other constant C s.t.
Using Lemma 9.1 under Assumption 5.3 (note that γ0>1/2L≥1/a when κ=1), we obtain that the sequence ∣rn−1∣/γn is bounded, thus ∣rn+1−1∣≤Cγn+1. The sequence (1−αn)/γn being also bounded, it holds that