Convergence and Dynamical Behavior of the ADAM Algorithm for Non-Convex Stochastic Optimization

Anas Barakat, Pascal Bianchi

Introduction

To alleviate these limitations, the popular Adam algorithm adjusts the learning rate coordinate-wise, as a function of the past values of the squared gradient vectors’ coordinates. The algorithm thus combines the assets of inertial methods with an adaptive per-coordinate learning rate selection. Finally, the algorithm includes a so-called bias correction step. Acting on the current estimate of the gradient vector, this step is especially useful during the early iterations.

Despite the growing popularity of the algorithm, only few works investigate its behavior from a theoretical point of view (see the discussion in Section 6). The present paper studies the convergence of Adam from a dynamical system viewpoint.

We introduce a continuous-time version of the Adam algorithm under the form of a non-autonomous ordinary differential equation (ODE). Building on the existence of an explicit Lyapunov function for the ODE, we show the existence of a unique global solution to the ODE. This first result turns out to be non-trivial due to the irregularity of the vector field. We then establish the convergence of the continuous-time Adam trajectory to the set of critical points of the objective function FF. The proposed continuous-time version of Adam provides useful insights on the effect of the bias correction step. It is shown that, close to the origin, the objective function FF is non-increasing along the Adam trajectory, suggesting that early iterations of Adam can only improve the initial guess.

Under a Łojasiewicz-type condition, we prove that the solution to the ODE converges to a single critical point of the objective function FF. We provide convergence rates in this case.

In discrete time, we first analyze the Adam iterates in the constant stepsize regime as originally introduced in . In this work, it is shown that the discrete-time Adam iterates shadow the behavior of the non-autonomous ODE in the asymptotic regime where the stepsize parameter γ\gamma of Adam is small. More precisely, we consider the interpolated process zγ(t){\mathsf{z}}^{\gamma}(t) which consists of a piecewise linear interpolation of the Adam iterates. The random process zγ{\mathsf{z}}^{\gamma} is indexed by the parameter γ\gamma, which is assumed constant during the whole run of the algorithm. In the space of continuous functions on [0,+)[0,+\infty) equipped with the topology of uniform convergence on compact sets, we establish that zγ{\mathsf{z}}^{\gamma} converges in probability to the solution to the non-autonomous ODE when γ\gamma tends to zero.

Under a stability condition, we prove the asymptotic ergodic convergence of the probability of the discrete-time Adam iterates to approach the set of critical points of the objective function in the doubly asymptotic regime where nn\to\infty then γ0\gamma\to 0.

Beyond the original constant stepsize Adam , we propose a decreasing stepsize version of the algorithm. We provide sufficient conditions ensuring the stability and the almost sure convergence of the iterates towards the critical points of the objective function.

We establish a convergence rate of the stochastic iterates of the decreasing stepsize algorithm under the form of a conditional central limit theorem.

We claim that our analysis can be easily extended to other adaptive algorithms such as e.g. RmsProp or AdaGrad and AmsGrad (see Section 6).

The paper is organized as follows. In Section 2, we present the Adam algorithm and the main assumptions. Our main results are stated in Sections 3 to 5. We provide a review of related works in Section 6. The rest of the paper addresses the proofs of our results (Sections 7 to 9).

The Adam Algorithm

It satisfies: zn=Tγ,α,β(n,zn1,ξn),z_{n}=T_{\gamma,\alpha,\beta}(n,z_{n-1},\xi_{n})\,, for every n1n\geq 1, where for every z=(x,m,v)z=(x,m,v) in Z+{{\mathcal{Z}}}_{+}, ξΞ\xi\in\Xi,

The iterates znz_{n} form a non-homogeneous Markov chain, because Tγ,α,β(n,z,ξ)T_{\gamma,\alpha,\beta}(n,z,\xi) depends on nn. This is due to the so-called debiasing step, which consists of replacing mn,vnm_{n},v_{n} in Algorithm 1 by their “debiased” versions m^n,v^n\hat{m}_{n},\hat{v}_{n}. The motivation becomes clear when expanding the expression:

From this equation, it is observed that, m^n\hat{m}_{n} forms a convex combination of the past gradients. This is unlike mnm_{n}, which may be small during the first iterations.

For almost every ξ\xi, the map f(.,ξ)f(\,.\,,\xi) is continuously differentiable.

2 Asymptotic Regime

We address the constant stepsize regime, where γ\gamma is fixed along the iterations (the default value recommended in is γ=0.001\gamma=0.001). As opposed to the decreasing stepsize context, the sequence znγ:=znz_{n}^{\gamma}:=z_{n} cannot in general converge as nn tends to infinity, in an almost sure sense. Instead, we investigate the asymptotic behavior of the family of processes (nznγ)γ>0(n\mapsto z_{n}^{\gamma})_{\gamma>0} indexed by γ\gamma, in the regime where γ0\gamma\to 0. We use the so-called ODE method (see e.g. ). The interpolated process zγ{\mathsf{z}}^{\gamma} is the piecewise linear function defined on [0,+)Z+[0,+\infty)\to{{\mathcal{Z}}}_{+} for all t[nγ,(n+1)γ)t\in[n\gamma,(n+1)\gamma) by:

We establish the convergence in probability of the family of random processes (zγ)γ>0({\mathsf{z}}^{\gamma})_{\gamma>0} as γ\gamma tends to zero, towards a deterministic continuous-time system defined by an ODE. The latter ODE, which we provide below at Eq. (ODE), will be referred to as the continuous-time version of Adam.

the initialization being chosen as z0γ=(x0,0,0)z_{0}^{\gamma}=(x_{0},0,0).

Continuous-Time System

In order to gain insight into the behavior of the sequence (znγ)(z_{n}^{\gamma}) defined by (5), it is convenient to rewrite the Adam iterations under the following equivalent form, for every n1n\geq 1:

where we define for every γ>0\gamma>0, zZ+z\in{{\mathcal{Z}}}_{+},

where a,ba,b are the constants defined in Assumption 2.2. We prove that, for any fixed (t,z)(t,z), the quantity h(t,z)h(t,z) coincides with the limit of hγ(t/γ,z)h_{\gamma}(\lfloor t/\gamma\rfloor,z) as γ0\gamma\downarrow 0. This remark along with Eq. (6) suggests that, as γ0\gamma\downarrow 0, the interpolated process zγ{\mathsf{z}}^{\gamma} shadows the non-autonomous differential equation

2 Existence, Uniqueness, Convergence

Since h(.,z)h(\,.\,,z) is non-continuous at point zero for a fixed zZ+z\in{{\mathcal{Z}}}_{+}, and since h(t,.)h(t,\,.\,) is not locally Lipschitz continuous for a fixed t > 0t~{}>~{}0, the existence and uniqueness of the solution to (ODE) do not stem directly from off-the-shelf theorems.

Let x0x_{0} be fixed. A continuous map z:[0,+)Z+z:[0,+\infty)\to{{\mathcal{Z}}}_{+} is said to be a global solution to (ODE) with initial condition (x0,0,0)(x_{0},0,0) if zz is continuously differentiable on (0,+)(0,+\infty), if Eq. (ODE) holds for all t>0t>0, and if z(0)=(x0,0,0)z(0)=(x_{0},0,0).

Let Assumptions 2.1 to 2.2 hold true. There exists a unique global solution z:[0,+)Z+z:[0,+\infty)\to{{\mathcal{Z}}}_{+} to (ODE) with initial condition (x0,0,0)(x_{0},0,0). Moreover, z([0,+))z([0,+\infty)) is a bounded subset of Z+{{\mathcal{Z}}}_{+}.

On the other hand, we note that a solution may not exist for an initial point(x0,m0,v0)(x_{0},m_{0},v_{0}) with arbitrary (non-zero) values of m0,v0m_{0},v_{0}.

Let Assumptions 2.1 to 2.2 hold true. Assume that F(S)F({{\mathcal{S}}}) has an empty interior. Let z:t(x(t),m(t),v(t))z:t\mapsto(x(t),m(t),v(t)) be the global solution to (ODE) with the initial condition (x0,0,0)(x_{0},0,0). Then, the set S{{\mathcal{S}}} is non-empty and limtd(x(t),S)=0\lim_{t\to\infty}{{\mathsf{d}}}(x(t),{{\mathcal{S}}})=0, limtm(t)=0\lim_{t\to\infty}m(t)=0, limtS(x(t))v(t)=0\lim_{t\to\infty}S(x(t))-v(t)=0.

Then, tV(t,z(t))t\mapsto V(t,z(t)) is decreasing if z()z(\cdot) is the global solution to (ODE).

Cost decrease at the origin. As FF itself is not a Lyapunov function for (ODE), there is no guarantee that F(x(t))F(x(t)) is decreasing w.r.t. tt. Nevertheless, the statement holds at the origin. Indeed, it can be shown that limt0V(t,z(t))=F(x0)\lim_{t\downarrow 0}V(t,z(t))=F(x_{0}) (see Prop. 7.7). As a consequence,

In other words, the (continuous-time) Adam procedure can only improve the initial guess x0x_{0}. This is the consequence of the so-called bias correction steps in Adam (see Algorithm 1) . If these debiasing steps were deleted in the Adam iterations, the early stages of the algorithm could degrade the initial estimate x0x_{0}.

Derivatives at the origin. The proof of Th. 3.1 reveals that the initial derivative is given by x˙(0)=F(x0)/(ε+S(x0))\dot{x}(0)=-\nabla F(x_{0})/(\varepsilon+\sqrt{S(x_{0})}) (see Lemma 7.1). In the absence of debiasing steps, the initial derivative x˙(0)\dot{x}(0) would be a function of the initial parameters m0m_{0}, v0v_{0}, and the user would be required to tune these hyperparameters. No such tuning is required thanks to the debiasing step. When ε\varepsilon is small and when the variance of f(x0,ξ)\nabla f(x_{0},\xi) is small (i.e., S(x0)F(x0)2S(x_{0})\simeq\nabla F(x_{0})^{\odot 2}), the initial derivative x˙(0)\dot{x}(0) is approximately equal to F(x0)/F(x0)-\nabla F(x_{0})/|\nabla F(x_{0})|. This suggests that in the early stages of the algorithm, the Adam iterations are comparable to the sign variant of the gradient descent, the properties of which were discussed in previous works, see .

3 Convergence rates

In this paragraph, we establish the convergence to a single critical point of FF and quantify the convergence rate, using the following assumption . {assumption}[Łojasiewicz property] For any xSx^{*}\in{\mathcal{S}}, there exist c>0,σ>0c>0\,,\sigma>0 and θ(0,12]\theta\in(0,\frac{1}{2}] s.t.

Assumption 3.3 holds for real-analytic functions and semialgebraic functions. We refer to for a discussion and a review of applications. We will call any θ\theta satisfying (12) for some c,σ>0c,\sigma>0, as a Łojasiewicz exponent of ff at xx^{*}. The next result establishes the convergence of the function x(t)x(t) generated by the ODE to a single critical point of ff, and provides the convergence rate as a function of the Łojasiewicz exponent of ff at this critical point. The proof is provided in subsection 7.4.

Moreover, if θ(0,12]\theta\in(0,\frac{1}{2}] is a Łojasiewicz exponent of ff at xx^{*}, there exists a constant C>0C>0 s.t. for all t0t\geq 0,

Discrete-Time System: Convergence of Adam

The sequence (ξn:n1)(\xi_{n}:n\geq 1) is iid, with the same distribution as ξ\xi.

Let p>0p>0. Assume either one of the following conditions.

The value of pp will be specified in the sequel, in the statement of the results. Clearly, Assumption 4 ii is stronger than Assumption 4 i. We shall use either the latter or the former in our statements.

Convergence in the long run. When the stepsize γ\gamma is constant, the sequence (xnγ)(x_{n}^{\gamma}) cannot converge in the almost sure sense as nn\to\infty. Convergence may only hold in the doubly asymptotic regime where nn\to\infty then γ0\gamma\to 0.

A Decreasing Stepsize Adam Algorithm

Adam inherently uses constant stepsizes. Consequently, the iterates (5) do not converge in the almost sure sense. In order to achieve convergence, we introduce in this section a decreasing stepsize version of Adam. The iterations are given in Algorithm 2.

where for every n,kn,k, ρn,k=αnαk+1(1αk)\rho_{n,k}=\alpha_{n}\cdots\alpha_{k+1}(1-\alpha_{k}). A similar equation holds for v^n\hat{v}_{n}.

2 Almost sure convergence

nγn=+\sum_{n}\gamma_{n}=+\infty and nγn2<+\sum_{n}\gamma_{n}^{2}<+\infty,

There exist a,ba,b s.t. 0<b<4a0<b<4a, γn1(1αn)a\gamma_{n}^{-1}(1-\alpha_{n})\to a and γn1(1βn)b\gamma_{n}^{-1}(1-\beta_{n})\to b .

Th. 5.1 establishes the almost sure convergence of xnx_{n} to the set of critical points of FF, under the assumption that the sequence ((xn,mn,vn))((x_{n},m_{n},v_{n})) is a.s. bounded. The next result provides a sufficient condition under which almost sure boundedness holds. {assumption} The following holds.

We assume the condition: limsupn(1γn(1αn+21αn+1)1γn+1)<2(ab4),\lim\sup_{n\to\infty}\left(\frac{1}{\gamma_{n}}-\left(\frac{1-\alpha_{n+2}}{1-\alpha_{n+1}}\right)\frac{1}{\gamma_{n+1}}\right)<2(a-\frac{b}{4})\,, which is satisfied for instance if b<4ab<4a and 1αn+1=aγn1-\alpha_{n+1}=a\gamma_{n}.

3 Central Limit Theorem

Let xSx^{*}\in{{\mathcal{S}}}. There exists a neighborhood V{{\mathcal{V}}} of xx^{*} s.t.

FF is twice continuously differentiable on V{{\mathcal{V}}}, and the Hessian 2F(x)\nabla^{2}F(x^{*}) of FF at xx^{*} is positive definite.

SS is continuously differentiable on V{{\mathcal{V}}}.

Define D:=diag((ε+S1(x))1,,(ε+Sd(x))1).D:=\textrm{diag}\left((\varepsilon+\sqrt{S_{1}(x^{*})})^{-1},\cdots,(\varepsilon+\sqrt{S_{d}(x^{*})})^{-1}\right)\,. Let PP be an orthogonal matrix s.t. the following spectral decomposition holds:

where λ1,,λd\lambda_{1},\cdots,\lambda_{d} are the (positive) eigenvalues of D1/22F(x)D1/2D^{1/2}\nabla^{2}F(x^{*})D^{1/2}. Define

where IdI_{d} represents the d×dd\times d identity matrix and S(x)\nabla S(x^{*}) is the Jacobian matrix of SS at xx^{*}. The largest real part of the eigenvalues of HH coincides with L-L, where

There exist κ(0,1]\kappa\in(0,1], γ0>0\gamma_{0}>0, s.t. the sequence (γn)(\gamma_{n}) satisfies γn=γ0/(n+1)κ\gamma_{n}={\gamma_{0}}/{(n+1)^{\kappa}} for all nn. If κ=1\kappa=1, we assume moreover that γ0>12L\gamma_{0}>\frac{1}{2L}.

The sequences (1γn(1αnγna))\left(\frac{1}{\gamma_{n}}(\frac{1-\alpha_{n}}{\gamma_{n}}-a)\right) and (1γn(1βnγnb))\left(\frac{1}{\gamma_{n}}(\frac{1-\beta_{n}}{\gamma_{n}}-b)\right) are bounded.

The variable vnv_{n} has an impact on the limiting covariance Σ1\Sigma_{1} through its limit S(x)S(x^{*}) (used to define DD), but the fluctuations of vnv_{n} and the parameter bb have no effect on Σ1\Sigma_{1}. As a matter of fact, Σ1\Sigma_{1} coincides with the limiting covariance matrix that would have been obtained by considering iterates of the form

which can be interpreted as a preconditioned version of the stochastic heavy ball algorithm . Of course, the above iterates are not implementable because the preconditioning matrix DD is unknown.

When aa is large, Σ1\Sigma_{1} is close to the matrix Σ1(0)\Sigma_{1}^{(0)} obtained when letting a+a\to+\infty in Eq. (17). The matrix Σ1(0)\Sigma_{1}^{(0)} is the solution to the Lyapunov equation

The matrix Σ1(0)\Sigma_{1}^{(0)} can be interpreted as the asymptotic covariance matrix of the xx-variable in the absence of the inertial term (that is, when one considers RmsProp instead of Adam). The matrix Σ1(0)\Sigma_{1}^{(0)} approximates Σ1\Sigma_{1} in the sense that Σ1=Σ1(0)+1aΔ+O(1a2)\Sigma_{1}=\Sigma_{1}^{(0)}+\frac{1}{a}\Delta+O(\frac{1}{a^{2}}) for some symmetric matrix Δ\Delta which can be explicited. The matrix Δ\Delta is neither positive nor negative definite in general. This suggests that the question of the potential benefit of the presence of an inertial term is in general problem dependent.

In the statement of Th. 5.3, the conditioning event {znz}\{z_{n}\to z^{*}\} can be replaced by the event {xnx}\{x_{n}\to x^{*}\} under the additional assumption that nγn2<+\sum_{n}\gamma_{n}^{2}<+\infty.

Related Works

Although the idea of adapting the (per-coordinate) learning rates as a function of past gradient values is not new (see e.g. variable metric methods such as the BFGS algorithms), AdaGrad led the way to a new class of algorithms that are sometimes referred to as adaptive gradient methods. AdaGrad consists of dividing the learning rate by the square root of the sum of previous gradients squared componentwise. The idea was to give larger learning rates to highly informative but infrequent features instead of using a fixed predetermined schedule. However, in practice, the division by the cumulative sum of squared gradients may generate small learning rates, thus freezing the iterates too early. Several works proposed heuristical ways to set the learning rates using a less aggressive policy. The work introduced an unpublished, yet popular, algorithm referred to as RmsProp where the cumulative sum used in AdaGrad is replaced by a moving average of squared gradients. Adam combines the advantages of both AdaGrad, RmsProp and inertial methods.

As opposed to AdaGrad, for which theoretical convergence guarantees exist , Adam is comparatively less studied. The initial paper suggests a O(1T)\mathcal{O}(\frac{1}{\sqrt{T}}) average regret bound in the convex setting, but exhibits a counterexample in contradiction with this statement. The latter counterexample implies that the average regret bound of Adam does not converge to zero. A first way to overcome the problem is to modify the Adam iterations themselves in order to obtain a vanishing average regret. This led to propose a variant called AmsGrad with the aim to recover, at least in the convex case, the sought guarantees. The work interprets Adam as a variance-adapted sign descent combining an update direction given by the sign and a magnitude controlled by a variance adaptation principle. A “noiseless” version of Adam is considered in . Under quite specific values of the Adam-hyperparameters, it is shown that for every δ>0\delta>0, there exists some time instant for which the norm of the gradient of the objective at the current iterate is no larger than δ\delta. The recent paper provides a similar result for AmsGrad and AdaGrad, but the generalization to Adam is subject to conditions which are not easily verifiable. The paper provides a convergence result for RmsProp using the objective function FF as a Lyapunov function. However, our work suggests that unlike RmsProp, Adam does not admit FF as a Lyapunov function. This makes the approach of hardly generalizable to Adam. Moreover, considers biased gradient estimates instead of the debiased estimates used in Adam.

In the present work, we study the behavior of an ODE, interpreted as the limit in probability of the (interpolated) Adam iterates as the stepsize tends to zero. Closely related continuous-time dynamical systems are also studied in . We leverage the idea of approximating a discrete time stochastic system by a deterministic continuous one, often referred to as the ODE method. A recent work fruitfully exploits this method to study a stochastic version of the celebrated heavy ball algorithm. We refer to for the reader interested in the non-differentiable setting with an analysis of the stochastic subgradient algorithm for non-smooth non-convex objective functions.

Concomitant to the present paper, Da Silva and Gazeau (posted only four weeks after the first version of the present work) study the asymptotic behavior of a similar dynamical system as the one introduced here. They establish several results in continuous time, such as avoidance of traps as well as convergence rates in the convex case; such aspects are out of the scope of this paper. However, the question of the convergence of the (discrete-time) iterates is left open. In the current paper, we also exhibit a Lyapunov function which allows, amongst others, to draw useful conclusions on the effect of the debiasing step of Adam . Finally, studies a slightly modified version of Adam allowing to recover an ODE with a locally Lipschitz continuous vector field, whereas the original Adam algorithm leads to an ODE with an irregular vector field. This technical issue is tackled in the present paper.

Proofs of Section 3

When η=0\eta=0, Eq. (ODEη) boils down to the equation of interest (ODE). The choice η(0,+)\eta\in(0,+\infty) will be revealed useful to prove Th. 3.1. Indeed, for η>0\eta>0, a solution to Eq. (ODEη) can be shown to exist (on some interval) due to the continuity of the map h(.+η,.)h(\,.+\eta,\,.\,). Considering a family of such solutions indexed by η(0,1]\eta\in(0,1], the idea is to prove the existence of a solution to (ODE) as a cluster point of the latter family when η0\eta\downarrow 0. Indeed, as the family is shown to be equicontinuous, such a cluster point does exist thanks to the Arzelà-Ascoli theorem. When η=+\eta=+\infty, Eq. (ODEη) rewrites

where h(z):=limth(t,z)h_{\infty}(z):=\lim_{t\to\infty}h(t,z). It is useful to note that for (x,m,v)Z+(x,m,v)\in{{\mathcal{Z}}}_{+},

Contrary to Eq. (ODE), Eq. (ODE∞) defines an autonomous ODE. The latter admits a unique global solution for any initial condition in Z+{{\mathcal{Z}}}_{+}, and defines a dynamical system D{{\mathcal{D}}}. We shall exhibit a strict Lyapunov function for this dynamical system D{{\mathcal{D}}}, and deduce that any solution to (ODE∞) converges to the set of equilibria of D{{\mathcal{D}}} as tt\to\infty. On the otherhand, we will prove that the solution to (ODE) with a proper initial condition is a so-called asymptotic pseudotrajectory (APT) of D{{\mathcal{D}}}. Due to the existence of a strict Lyapunov function, the APT shall inherit the convergence behavior of the autonomous system as tt\to\infty, which will prove Th. 3.2.

It is convenient to extend the map h:(0,+)×Z+Zh:(0,+\infty)\times{{\mathcal{Z}}}_{+}\to{{\mathcal{Z}}} on (0,+)×ZZ(0,+\infty)\times{{\mathcal{Z}}}\to{{\mathcal{Z}}} by setting h(t,(x,m,v)):=h(t,(x,m,v))h(t,(x,m,v)):=h(t,(x,m,|v|)) for every t>0t>0, (x,m,v)Z(x,m,v)\in{{\mathcal{Z}}}. Similarly, we extend hh_{\infty} as h((x,m,v)):=h((x,m,v))h_{\infty}((x,m,v)):=h_{\infty}((x,m,|v|)). For any T(0,+]T\in(0,+\infty] and any η[0,+]\eta\in[0,+\infty], we say that a map z:[0,T)Zz:[0,T)\to{{\mathcal{Z}}} is a solution to (ODEη) on [0,T)[0,T) with initial condition z0Z+z_{0}\in{{\mathcal{Z}}}_{+}, if zz is continuous on [0,T)[0,T), continuously differentiable on (0,T)(0,T), and if (ODEη) holds for all t(0,T)t\in(0,T). When T=+T=+\infty, we say that the solution is global. We denote by ZTη(z0)Z^{\eta}_{T}(z_{0}) the subset of C([0,T),Z)C([0,T),{{\mathcal{Z}}}) formed by the solutions to (ODEη) on [0,T)[0,T) with initial condition z0z_{0}. For any KZ+K\subset{{\mathcal{Z}}}_{+}, we define ZTη(K):=zKZTη(z)Z^{\eta}_{T}(K):=\bigcup_{z\in K}Z^{\eta}_{T}(z).

Let Assumptions 2, 7.1 and 7.1 hold. For every η[0,+]\eta\in[0,+\infty], T(0,+]T\in(0,+\infty], z0Z+z_{0}\in{{\mathcal{Z}}}_{+}, zZTη(z0)z\in Z_{T}^{\eta}(z_{0}), it holds that z((0,T))Z+z((0,T))\subset{{\mathcal{Z}}}_{+}^{*}.

Set z(t)=(x(t),m(t),v(t))z(t)=(x(t),m(t),v(t)) for all tt. Consider i{1,,d}i\in\{1,\dots,d\}. Assume by contradiction that there exists t0(0,T)t_{0}\in(0,T) s.t. vi(t0)<0v_{i}(t_{0})<0. Set τ:=sup{t[0,t0]:vi(t)0}\tau:=\sup\{t\in[0,t_{0}]:v_{i}(t)\geq 0\}. Clearly, τ<t0\tau<t_{0} and vi(τ)=0v_{i}(\tau)=0 by the continuity of viv_{i}. Since vi(t)0v_{i}(t)\leq 0 for all t(τ,t0]t\in(\tau,t_{0}], it holds that v˙i(t)=b(Si(x(t))vi(t))\dot{v}_{i}(t)=b(S_{i}(x(t))-v_{i}(t)) is nonnegative for all t(τ,t0]t\in(\tau,t_{0}]. This contradicts the fact that vi(τ)>vi(t0)v_{i}(\tau)>v_{i}(t_{0}). Thus, vi(t)0v_{i}(t)\geq 0 for all t[0,T)t\in[0,T). Now assume by contradiction that there exists t(0,T)t\in(0,T) s.t. vi(t)=0v_{i}(t)=0. Then, v˙i(t)=bSi(x(t))>0\dot{v}_{i}(t)=bS_{i}(x(t))>0. Thus, limδ0vi(tδ)δ=bSi(x(t)).\lim_{\delta\downarrow 0}\frac{v_{i}(t-\delta)}{-\delta}=bS_{i}(x(t))\,. In particular, there exists δ>0\delta>0 s.t. vi(tδ)δb2Si(x(t)).v_{i}(t-\delta)\leq-\frac{\delta b}{2}S_{i}(x(t))\,. This contradicts the first point.

Recall the definitions of VV and UU from Eqs. (9) and (10). Clearly, U(v):=limtU(t,v)=a(ε+v)U_{\infty}(v):=\lim_{t\to\infty}U(t,v)=a(\varepsilon+\sqrt{v}) is well defined for every v[0,+)dv\in[0,+\infty)^{d}. Hence, we can also define V(z):=limtV(t,z)V_{\infty}(z):=\lim_{t\to\infty}V(t,z) for every zZ+z\in{{\mathcal{Z}}}_{+}.

Let Assumptions 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. Consider (t,z)(0,+)×Z+(t,z)\in(0,+\infty)\times{{\mathcal{Z}}}_{+}^{*} and set z=(x,m,v)z=(x,m,v). Then, VV and VV_{\infty} are differentiable at points (t,z)(t,z) and zz respectively. Moreover, V(z),h(z)εamU(v)2\langle\nabla V_{\infty}(z),h_{\infty}(z)\rangle\leq-\varepsilon\left\|\frac{am}{U_{\infty}(v)}\right\|^{2}\, and

We only prove the second point, the proof of the first point follows the same line. Consider (t,z)(0,+)×Z+(t,z)\in(0,+\infty)\times{{\mathcal{Z}}}_{+}^{*}. We decompose V(t,z),(1,h(t,z))=tV(t,z)+zV(t,z),h(t,z)\langle\nabla V(t,z),(1,h(t,z))\rangle=\partial_{t}V(t,z)+\langle\nabla_{z}V(t,z),h(t,z)\rangle. After tedious but straightforward derivations, we get:

where U(t,vi)=a(1eat)(ε+vi1ebt)U(t,v_{i})=a(1-e^{-at})\left(\varepsilon+\sqrt{\frac{v_{i}}{1-e^{-bt}}}\right) and zV(t,z),h(t,z)\langle\nabla_{z}V(t,z),h(t,z)\rangle is equal to:

2 Proof of Th. 3.1

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. For every z0Z0z_{0}\in{{\mathcal{Z}}}_{0}, there exists a compact set KZ+K\subset{{\mathcal{Z}}}_{+} s.t. for all η[0,+)\eta\in[0,+\infty), all T(0,+]T\in(0,+\infty] and all zZTη(z0)z\in Z_{T}^{\eta}(z_{0}), {eˉ(t+η,z(t)):t(0,T)}K.\left\{\bar{e}(t+\eta,z(t)):t\in(0,T)\right\}\subset K\,. Moreover, choosing z0z_{0} of the form z0=(x0,0,0)z_{0}=(x_{0},0,0) and z(t)=(x(t),m(t),v(t))z(t)=(x(t),m(t),v(t)), it holds that F(x(t))F(x0)F(x(t))\leq F(x_{0}) for all t[0,T)t\in[0,T).

Let η[0,+)\eta\in[0,+\infty). Consider a solution zη(t)=(xη(t),mη(t),vη(t))z_{\eta}(t)=(x_{\eta}(t),m_{\eta}(t),v_{\eta}(t)) as in the statement, defined on some interval [0,T)[0,T). Define m^η(t)=mη(t)/(1ea(t+η))\hat{m}_{\eta}(t)=m_{\eta}(t)/(1-e^{-a(t+\eta)}), v^η(t)=vη(t)/(1eb(t+η))\hat{v}_{\eta}(t)=v_{\eta}(t)/(1-e^{-b(t+\eta)}). By Lemma 7.3, tV(t+η,z(t))t\mapsto V(t+\eta,z(t)) is continuous on [0,T)[0,T), and continuously differentiable on (0,T)(0,T). By Lemma 7.5, V˙(t+η,zη(t))0\dot{V}(t+\eta,z_{\eta}(t))\leq 0 for all t>0t>0. As a consequence, tV(t+η,zη(t))t\mapsto V(t+\eta,z_{\eta}(t)) is non-increasing on [0,T)[0,T). Thus, for all t0t\geq 0, F(xη(t))limt0V(t+η,zη(t))F(x_{\eta}(t))\leq\lim_{t^{\prime}\downarrow 0}V(t^{\prime}+\eta,z_{\eta}(t^{\prime})). Note that V(t+η,zη(t))F(xη(t))+12i=1dmη,i(t)2a(1ea(t+η))ε.V(t+\eta,z_{\eta}(t))\leq F(x_{\eta}(t))+\frac{1}{2}\sum_{i=1}^{d}\frac{m_{\eta,i}(t)^{2}}{a(1-e^{-a(t+\eta)})\varepsilon}\,. If η>0\eta>0, every term in the sum in the righthand side tends to zero, upon noting that mη(t)0m_{\eta}(t)\to 0 as t0t\to 0. The statement still holds if η=0\eta=0. Indeed, by Lemma 7.1, for a given i{1,,d}i\in\{1,\dots,d\}, there exists δ>0\delta>0 s.t. for all 0<t<δ0<t<\delta, mη,i(t)22a2(iF(x0))2t2m_{\eta,i}(t)^{2}\leq 2a^{2}(\partial_{i}F(x_{0}))^{2}t^{2} and 1eat(at)/21-e^{-at}\geq(at)/2. As a consequence, each term of the sum is no larger than 4(iF(x0))2t/ε4(\partial_{i}F(x_{0}))^{2}t/\varepsilon, which tends to zero as t0t\to 0. We conclude that for all t0t\geq 0, F(xη(t))F(x0)F(x_{\eta}(t))\leq F(x_{0}). In particular, {xη(t):t[0,T)}{FF(x0)}\{x_{\eta}(t):t\in[0,T)\}\subset\{F\leq F(x_{0})\}, the latter set being bounded by Assumption 2.

As V(t+η,zη(t))F(x0)V(t+\eta,z_{\eta}(t))\leq F(x_{0}), we obtain: F(x0)F(xη(t))+12mη(t)U(t+η,vη(t))12F(x_{0})\geq F(x_{\eta}(t))+\frac{1}{2}\left\|m_{\eta}(t)\right\|^{2}_{U(t+\eta,v_{\eta}(t))^{-1}}. Thus, F(x0)infF+12a(ε+M)mη(t)2F(x_{0})\geq\inf F+\frac{1}{2a(\varepsilon+\sqrt{M})}\|m_{\eta}(t)\|^{2}\,. Therefore, mη(.)m_{\eta}(\,.\,) is bounded on [0,T)[0,T), uniformly in η\eta. The same holds for m^η\hat{m}_{\eta} by using the mean value theorem in the same way as for v^η\hat{v}_{\eta}. The proof is complete.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. Let KK be a compact subset of Z+{{\mathcal{Z}}}_{+}. Then, there exists an other compact set KZ+K^{\prime}\subset{{\mathcal{Z}}}_{+} s.t. for every T(0,+]T\in(0,+\infty] and every zZT(K)z\in Z_{T}^{\infty}(K), z([0,T))Kz([0,T))\subset K^{\prime}.

The proof follows the same line as Prop. 7.7 and is omitted.

For any KZ+K\subset{{\mathcal{Z}}}_{+}, define vmin(K):=inf{vk:(x,m,v)K,i{1,,d}}v_{\min}(K):=\inf\{v_{k}:(x,m,v)\in K,i\in\{1,\dots,d\}\}.

Under Assumptions 2, 2, 7.1 and 7.1, the following holds true.

For every compact set KZ+K\subset{{\mathcal{Z}}}_{+}, there exists c>0c>0, s.t. for every zZ(K)z\in Z^{\infty}_{\infty}(K), of the form z(t)=(x(t),m(t),v(t))z(t)=(x(t),m(t),v(t)), vi(t)cmin(1,vmin(K)2c+t)(t0,i{1,,d}).v_{i}(t)\geq c\min\left(1,\frac{v_{\min}(K)}{2c}+t\right)\qquad(\forall t\geq 0,\forall i\in\{1,\dots,d\})\,.

For every z0Z0z_{0}\in{{\mathcal{Z}}}_{0}, there exists c>0c>0 s.t. for every η[0,+)\eta\in[0,+\infty) and every zZη(z0)z\in Z_{\infty}^{\eta}(z_{0}), vi(t)cmin(1,t)(t0,i{1,,d}).v_{i}(t)\geq c\min(1,t)\qquad(\forall t\geq 0,\forall i\in\{1,\dots,d\})\,.

Set κ1:=0.5(vmin+bSminτ)\kappa_{1}:=0.5(v_{\min}+bS_{\min}\tau). Note that vi(τ)κ1v_{i}(\tau)\geq\kappa_{1}. Define Smin:=inf{Si(x):i{1,,d},(x,m,v)K}.S_{\min}^{\prime}:=\inf\{S_{i}(x):i\in\{1,\dots,d\},(x,m,v)\in K^{\prime}\}\,. Note that Smin>0S_{\min}^{\prime}>0 by Assumptions 7.1 and 2. Finally, define κ=0.5min(κ1,Smin)\kappa=0.5\min(\kappa_{1},S_{\min}^{\prime}). By contradiction, assume that the set {tτ:vi(t)<κ}\{t\geq\tau:v_{i}(t)<\kappa\} is non-empty, and denote by τ\tau^{\prime} its infimum. It is clear that τ>τ\tau^{\prime}>\tau and vi(τ)=κv_{i}(\tau^{\prime})=\kappa. Thus, b1v˙i(τ)=Si(x(τ))κb^{-1}\dot{v}_{i}(\tau^{\prime})=S_{i}(x(\tau^{\prime}))-\kappa. We obtain that b1v˙i(τ)0.5Smin>0b^{-1}\dot{v}_{i}(\tau^{\prime})\geq 0.5S_{\min}^{\prime}>0. As a consequence, there exists t(τ,τ)t\in(\tau,\tau^{\prime}) s.t. vi(t)<vi(τ)v_{i}(t)<v_{i}(\tau^{\prime}). This contradicts the definition of τ\tau^{\prime}. We have shown that for all tτt\geq\tau, vi(t)κv_{i}(t)\geq\kappa. Putting this together with Eq. (21) and using that κvmin+bSminτ\kappa\leq v_{\min}+bS_{\min}\tau, we conclude that: t0, vi(t)min(κ,vmin2+bSmint2).\forall t\geq 0,\ v_{i}(t)\geq\min\left(\kappa\,,\frac{v_{\min}}{2}+\frac{bS_{\min}t}{2}\right)\,. Setting c:=min(κ,bSmin/2)c:=\min(\kappa,bS_{\min}/2), the result follows.

We prove the second point. By Prop. 7.7, there exists a compact set KZ+K\subset{{\mathcal{Z}}}_{+} s.t. for every η0\eta\geq 0, every zZη(x0)z\in Z_{\infty}^{\eta}(x_{0}) of the form z(t)=(x(t),m(t),v(t))z(t)=(x(t),m(t),v(t)) satisfies {(x(t),m^(t),v^(t)):t0}K\{(x(t),\hat{m}(t),\hat{v}(t)):t\geq 0\}\subset K, where m^(t)=m(t)/(1ea(t+h))\hat{m}(t)=m(t)/(1-e^{-a(t+h)}) and v^(t)=v(t)/(1eb(t+h))\hat{v}(t)=v(t)/(1-e^{-b(t+h)}). Denote by LSL_{S} the Lipschitz constant of SS on the set {x:(x,m,v)K}\{x:(x,m,v)\in K\}. Introduce the constants M1:=sup{m/(ε+v):(x,m,v)K}M_{1}:=\sup\{\|m/(\varepsilon+\sqrt{v})\|_{\infty}:(x,m,v)\in K\}, M2:=sup{S(x):(x,m,v)K}M_{2}:=\sup\{\|S(x)\|_{\infty}:(x,m,v)\in K^{\prime}\}. These constants being introduced, the rest of the proof follows the same line as the proof of the first point.

2.2 Existence

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. For every z0Z+z_{0}\in{{\mathcal{Z}}}_{+},Z(z0)Z_{\infty}^{\infty}(z_{0})\neq\emptyset. For every (z0,η)Z0×(0,+)(z_{0},\eta)\in{{\mathcal{Z}}}_{0}\times(0,+\infty),Zη(z0)Z_{\infty}^{\eta}(z_{0})\neq\emptyset.

We prove the first point (the proof of the second point follows the same line). Under Assumptions 7.1 and 7.1, hh_{\infty} is continuous. Therefore, Cauchy-Peano’s theorem guarantees the existence of a solution to the (ODE) issued from z0z_{0}, which we can extend over a maximal interval of existence [0,Tmax)[0,T_{\max}).We conclude that the solution is global (Tmax=+T_{\max}=+\infty) using the boundedness of the solution given by Prop. 7.9.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. Consider z0Z0z_{0}\in{{\mathcal{Z}}}_{0}. Denote by (zη:η(0,+))(z_{\eta}:\eta\in(0,+\infty)) a family of functions on [0,+)Z+[0,+\infty)\to{{\mathcal{Z}}}_{+} s.t. for every η>0\eta>0, zηZη(z0)z_{\eta}\in Z_{\infty}^{\eta}(z_{0}). Then, (zη)η>0(z_{\eta})_{\eta>0} is equicontinuous.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. For every z0Z0z_{0}\in{{\mathcal{Z}}}_{0}, Z0(z0)Z_{\infty}^{0}(z_{0})\neq\emptyset i.e., (ODE) admits a global solution issued from z0z_{0}.

and the righthand side converges to zero. As a consequence, for all t>st>s, z(t)=z(s)+sth(u,z(u))du.z(t)=z(s)+\int_{s}^{t}h(u,z(u))du\,. Moreover, z(0)=z0z(0)=z_{0}. This proves that zZ0(z0)z\in Z^{0}_{\infty}(z_{0}).

2.3 Uniqueness

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume b4ab\leq 4a. For every z0Z0z_{0}\in{{\mathcal{Z}}}_{0}, Z0(z0)Z_{\infty}^{0}(z_{0}) is a singleton i.e., there exists a unique global solution to (ODE) with initial condition z0z_{0}.

Consider solutions zz and zz^{\prime} in Z0(z0)Z^{0}_{\infty}(z_{0}). We denote by (x(t),m(t),v(t))(x(t),m(t),v(t)) the blocks of z(t)z(t), and we define (x(t),m(t),v(t))(x^{\prime}(t),m^{\prime}(t),v^{\prime}(t)) similarly. For all t>0t>0, we define m^(t):=m(t)/(1eat)\hat{m}(t):=m(t)/(1-e^{-at}), v^(t):=v(t)/(1ebt)\hat{v}(t):=v(t)/(1-e^{-bt}), and we define m^(t)\hat{m}^{\prime}(t) and v^(t)\hat{v}^{\prime}(t) similarly. By Prop. 7.7, there exists a compact set KZ+K\subset{{\mathcal{Z}}}_{+} s.t. (x(t),m^(t),v^(t))(x(t),\hat{m}(t),\hat{v}(t)) and (x(t),m^(t),v^(t))(x^{\prime}(t),\hat{m}^{\prime}(t),\hat{v}^{\prime}(t)) are both in KK for all t>0t>0. We denote by LSL_{S} and LFL_{\nabla F} the Lipschitz constants of SS and F\nabla F on the compact set {x:(x,m,v)K}\{x:(x,m,v)\in K\}. These constants are finite by Assumptions 7.1 and 7.1. We define M:=sup{m:(x,m,v)K}M:=\sup\{\|m\|_{\infty}:(x,m,v)\in K\}. Define ux(t):=x(t)x(t)2u_{x}(t):=\|x(t)-x^{\prime}(t)\|^{2}, um(t):=m^(t)m^(t)2u_{m}(t):=\|\hat{m}(t)-\hat{m}^{\prime}(t)\|^{2} and uv(t):=v^(t)v^(t)2u_{v}(t):=\|\hat{v}(t)-\hat{v}^{\prime}(t)\|^{2}. Let δ>0\delta>0. Define: u(δ)(t):=ux(t)+δum(t)+δuv(t).u^{(\delta)}(t):=u_{x}(t)+\delta u_{m}(t)+\delta u_{v}(t)\,. By the chain rule and the Cauchy-Schwarz inequality, u˙x(t)2x(t)x(t)m^(t)ε+v^(t)m^(t)ε+v^(t)\dot{u}_{x}(t)\leq 2\|x(t)-x^{\prime}(t)\|\|\frac{\hat{m}(t)}{\varepsilon+\sqrt{\hat{v}(t)}}-\frac{\hat{m}^{\prime}(t)}{\varepsilon+\sqrt{\hat{v}^{\prime}(t)}}\|. Thus, using Lemma 7.11, there exists c>0c>0 s.t.

For any δ>0\delta>0, 2x(t)x(t)m^(t)m^(t)δ1/2(ux(t)+δum(t))δ1/2u(δ)(t)2\|x(t)-x^{\prime}(t)\|\,\|\hat{m}(t)-\hat{m}^{\prime}(t)\|\leq\delta^{-1/2}(u_{x}(t)+\delta u_{m}(t))\leq\delta^{-1/2}u^{(\delta)}(t). Similarly, 2x(t)x(t)v^(t)v^(t)δ1/2u(δ)(t)2\|x(t)-x^{\prime}(t)\|\,\|\hat{v}(t)-\hat{v}^{\prime}(t)\|\leq\delta^{-1/2}u^{(\delta)}(t). Thus, for any δ>0\delta>0,

We now study um(t)u_{m}(t). For all t>0t>0, we obtain after some algebra: ddtm^(t)=a(F(x(t))m^(t))/(1eat).\frac{d}{dt}\hat{m}(t)=a(\nabla F(x(t))-\hat{m}(t))/(1-e^{-at})\,. Therefore, u˙m(t)2aLF1eatm^(t)m^(t)x(t)x(t).\dot{u}_{m}(t)\leq\frac{2aL_{\nabla F}}{1-e^{-at}}\|\hat{m}(t)-\hat{m}^{\prime}(t)\|\,\|x(t)-x^{\prime}(t)\|\,. For any θ>0\theta>0, it holds that 2m^(t)m^(t)x(t)x(t)θux(t)+θ1um(t)2\|\hat{m}(t)-\hat{m}^{\prime}(t)\|\,\|x(t)-x^{\prime}(t)\|\leq\theta u_{x}(t)+\theta^{-1}u_{m}(t). In particular, letting θ:=2LF\theta:=2L_{\nabla F}, we obtain that for all δ>0\delta>0,

where the last inequality is due to the fact that y/(1ey)1+yy/(1-e^{-y})\leq 1+y for all y>0y>0. Using the exact same arguments, we also obtain that

We now choose any δ\delta s.t. 4δ1/max(LS2,LF2)4\delta\leq 1/\max(L_{S}^{2},L_{\nabla F}^{2}). Then, Eq. (23) and (24) respectively imply that δu˙m(t)0.5(a+t1)u(δ)(t)\delta\dot{u}_{m}(t)\leq 0.5(a+t^{-1})u^{(\delta)}(t) and δu˙v(t)0.5(b+t1)u(δ)(t)\delta\dot{u}_{v}(t)\leq 0.5(b+t^{-1})u^{(\delta)}(t). Summing these inequalities along with Eq. (22), we obtain that for every t>0t>0, u˙(δ)(t)ψ(t)u(δ)(t)\dot{u}^{(\delta)}(t)\leq\psi(t)u^{(\delta)}(t)\,, where: ψ(t):=a+b2+1εδ+M2ε2δcmin(1,t)+1t.\psi(t):=\frac{a+b}{2}+\frac{1}{\varepsilon\sqrt{\delta}}+\frac{M}{2\varepsilon^{2}\sqrt{\delta c\min(1,t)}}+\frac{1}{t}\,. From Grönwall’s inequality, it holds that for every t>s>0t>s>0, u(δ)(t)u(δ)(s)exp(stψ(s)ds)u^{(\delta)}(t)\leq u^{(\delta)}(s)\exp\left(\int_{s}^{t}\psi(s^{\prime})ds^{\prime}\right)\,. We first consider the case where t1t\leq 1. We set c1:=(a+b)/2+(εδ)1c_{1}:=(a+b)/2+(\varepsilon\sqrt{\delta})^{-1} and c2:=M/(ε2δc)c_{2}:=M/(\varepsilon^{2}\sqrt{\delta c}). With these notations, stψ(s)dsc1t+c2t+lnts.\int_{s}^{t}\psi(s^{\prime})ds^{\prime}\leq c_{1}t+c_{2}\sqrt{t}+\ln\frac{t}{s}\,. Therefore, u(δ)(t)u(δ)(s)sexp(c1t+c2t+lnt)u^{(\delta)}(t)\leq\frac{u^{(\delta)}(s)}{s}\exp\left(c_{1}t+c_{2}\sqrt{t}+\ln t\right)\,. By Lemma 7.1, recall that x˙(0)\dot{x}(0) and x˙(0)\dot{x}^{\prime}(0) are both well defined (and coincide). Thus,

By Lemma 7.1, zz and zz^{\prime} are differentiable at point zero. Then, the above inequality gives lim sups0m^i(s)m^i(s)s4(LF1)z˙(0)\limsup_{s\downarrow 0}\frac{|\hat{m}_{i}(s)-\hat{m}_{i}^{\prime}(s)|}{s}\leq 4(L_{\nabla F}\vee 1)\|\dot{z}(0)\| and lim sups0um(s)s216d(LF21)z˙(0)2.\limsup_{s\downarrow 0}\frac{u_{m}(s)}{s^{2}}\leq 16d(L_{\nabla F}^{2}\vee 1)\|\dot{z}(0)\|^{2}\,.Therefore, um(s)/su_{m}(s)/s converges to zero as s0s\downarrow 0. By similar arguments, it can be shown that lim sups0uv(s)/s216d(LS21)z˙(0)2\limsup_{s\downarrow 0}{u_{v}(s)}/{s^{2}}\leq 16d(L_{S}^{2}\vee 1)\|\dot{z}(0)\|^{2}, thus limuv(s)/s=0\lim u_{v}(s)/s=0. Finally, we obtain that u(δ)(s)/s{u^{(\delta)}(s)}/{s} converges to zero as s0s\downarrow 0. Letting ss tend to zero, we obtain that for every t1t\leq 1, u(δ)(t)=0u^{(\delta)}(t)=0. Setting s=1s=1 and t>1t>1, and noting that ψ\psi is integrable on [1,t][1,t], it follows that u(δ)(t)=0u^{(\delta)}(t)=0 for all t>1t>1. This proves that z=zz=z^{\prime}.

We recall that a semiflow Φ\Phi on the space (E,d)(E,{{\mathsf{d}}}) is a continuous map Φ\Phi from [0,+)×E[0,+\infty)\times E to EE defined by (t,x)Φ(t,x)=Φt(x)(t,x)\mapsto\Phi(t,x)=\Phi_{t}(x) such that Φ0\Phi_{0} is the identity and Φt+s=ΦtΦs\Phi_{t+s}=\Phi_{t}\circ\Phi_{s} for all (t,s)[0,+)2(t,s)\in[0,+\infty)^{2}.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. The map ZZ_{\infty}^{\infty} is single-valued on Z+C([0,+),Z+){{\mathcal{Z}}}_{+}\to C([0,+\infty),{{\mathcal{Z}}}_{+}) i.e., there exists a unique global solution to (ODE∞) starting from any given point in Z+{{\mathcal{Z}}}_{+}. Moreover, the following map is a semiflow:

The result is a direct consequence of Lemma 7.19.

3 Proof of Th. 3.2

Consider a semiflow Ψ\Psi on (E,d)(E,d) and a map z:[0,+)Ez:[0,+\infty)\to E. Assume the following:

Ψ\Psi admits a strict Lyapunov function V{{\mathsf{V}}}.

The set ΛΨ\Lambda_{\Psi} of equilibrium points of Ψ\Psi is compact.

V(ΛΨ){{\mathsf{V}}}(\Lambda_{\Psi}) has an empty interior.

Then, t0z([t,))\bigcap_{t\geq 0}\overline{z([t,\infty))} is a compact connected subset of ΛΨ\Lambda_{\Psi} .

For every δ>0\delta>0 and every z=(x,m,v)Z+z=(x,m,v)\in{{\mathcal{Z}}}_{+}, define:

where we recall that V(z):=limtV(t,z)V_{\infty}(z):=\lim_{t\to\infty}V(t,z) for every zZ+z\in{{\mathcal{Z}}}_{+} and VV is defined by Eq.(9). Consider the set E:=h1({0}){{\mathcal{E}}}:=h_{\infty}^{-1}(\{0\}) of all equilibrium points of (ODE∞), namely: E={(x,m,v)Z+:F(x)=0,m=0,v=S(x)}{{\mathcal{E}}}=\{(x,m,v)\in{{\mathcal{Z}}}_{+}:\nabla F(x)=0,m=0,v=S(x)\}\,. The set E{{\mathcal{E}}} is non-empty by Assumption 2.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. Let KZ+K\subset{{\mathcal{Z}}}_{+} be a compact set. Define K:={Φ(t,z):t0,zK}K^{\prime}:=\overline{\{\Phi(t,z):t\geq 0,z\in K\}}. Let Φ:[0,+)×KK\overline{\Phi}:[0,+\infty)\times K^{\prime}\to K^{\prime} be the restriction of the semiflow Φ\Phi to KK^{\prime} i.e., Φ(t,z)=Φ(t,z)\overline{\Phi}(t,z)=\Phi(t,z) for all t0,zKt\geq 0,z\in K^{\prime}. Then,

Φ\overline{\Phi} is well defined and is a semiflow on KK^{\prime}.

The set of equilibrium points of Φ\overline{\Phi} is equal to EK{{\mathcal{E}}}\cap K^{\prime}.

There exists δ>0\delta>0 s.t. WδW_{\delta} is a strict Lyapunov function for the semiflow Φ\overline{\Phi}.

Denote by LSL_{S} the Lipschitz constant of SS on {x:(x,m,v)K}\{x:(x,m,v)\in K^{\prime}\}. For every t>0t>0,

Using that 2m(t)S(x(t))v(t)LSbεm(t)2+bεLSS(x(t))v(t)22\|m(t)\|\|S(x(t))-v(t)\|\leq\frac{L_{S}}{b\varepsilon}\|m(t)\|^{2}+\frac{b\varepsilon}{L_{S}}\|S(x(t))-v(t)\|^{2}, we obtain LH(t)bS(x(t))v(t)2+LS2bε2m(t)2.{{\mathcal{L}}}_{H}(t)\leq-b\|S(x(t))-v(t)\|^{2}+\frac{L_{S}^{2}}{b\varepsilon^{2}}\|m(t)\|^{2}\,. Hence, for every t>0t>0,

where M(δ):=ε(ε+C1)2δLS2bε2δ(a2+LFε).M(\delta):=\varepsilon(\varepsilon+\sqrt{C_{1}})^{-2}-\frac{\delta L_{S}^{2}}{b\varepsilon^{2}}-\delta\left(\frac{a}{2}+\frac{L_{\nabla F}}{\varepsilon}\right)\,. Choosing δ\delta s.t. M(δ)>0M(\delta)>0,

where c:=min{M(δ),aδ2,δb}c:=\min\{M(\delta),\frac{a\delta}{2},\delta b\}. It can easily be seen that for every zKz\in K^{\prime}, tWδ(Φt(z))t\mapsto W_{\delta}(\overline{\Phi}_{t}(z)) is Lipschitz continuous, hence absolutely continuous. Its derivative almost everywhere coincides with LWδ{{\mathcal{L}}}_{W_{\delta}}, which is non-positive. Thus, WδW_{\delta} is a Lyapunov function for Φ\overline{\Phi}. We prove that the Lyapunov function is strict. Consider zKz\in K^{\prime} s.t. Wδ(Φt(z))=Wδ(z)W_{\delta}(\overline{\Phi}_{t}(z))=W_{\delta}(z) for all t>0t>0. The derivative almost everywhere of tWδ(Φt(z))t\mapsto W_{\delta}(\overline{\Phi}_{t}(z)) is identically zero, and by Eq. (27), this implies that c(mt2+F(xt)2+S(xt)vt2)-c\left(\|m_{t}\|^{2}+\|\nabla F(x_{t})\|^{2}+\|S(x_{t})-v_{t}\|^{2}\right) is equal to zero for every tt a.e. (hence, for every tt, by continuity of Φ\overline{\Phi}). In particular for t=0t=0, m=F(x)=0m=\nabla F(x)=0 and S(x)v=0S(x)-v=0. Hence, zh1({0})z\in h_{\infty}^{-1}(\{0\}).

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that 0<b4a0<b\leq 4a. For every zZ+z\in{{\mathcal{Z}}}_{+}, limtd(Φ(z,t),E)=0.\lim_{t\to\infty}{{\mathsf{d}}}(\Phi(z,t),{{\mathcal{E}}})=0\,.

Use Prop. 7.24 with K:={z}K:=\{z\}. and [15, Th. 2.1.7].

3.2 Asymptotic Behavior of the Solution to (ODE)

Let Assumptions 2, 2, 7.1 and 7.1 hold true. Assume that 0<b4a0<b\leq 4a. Then, for every z0Z0z_{0}\in{{\mathcal{Z}}}_{0}, Z0(z0)Z^{0}_{\infty}(z_{0}) is an asymptotic pseudotrajectory of the semiflow Φ\Phi given by (25).

Consider z0Z0z_{0}\in{{\mathcal{Z}}}_{0}, T(0,+)T\in(0,+\infty) and define z:=Z0(z0)z:=Z_{\infty}^{0}(z_{0}). Consider t1t\geq 1. For every s0s\geq 0, define Δt(s):=z(t+s)Φ(z(t))(s)\Delta_{t}(s):=\|z(t+s)-\Phi(z(t))(s)\|. The aim is to prove that sups[0,T]Δt(s)\sup_{s\in[0,T]}\Delta_{t}(s) tends to zero as tt\to\infty. Putting together Prop. 7.7 and Lemma 7.11, the set K:={z(t):t1}K:=\overline{\{z(t):t\geq 1\}} is a compact subset of Z+{{\mathcal{Z}}}_{+}^{*}. Define C(t):=sups0supzKh(t+s,z)h(z)C(t):=\sup_{s\geq 0}\sup_{z^{\prime}\in K}\|h(t+s,z^{\prime})-h_{\infty}(z^{\prime})\|. It can be shown that limtC(t)=0\lim_{t\to\infty}C(t)=0. We obtain that for every s[0,T]s\in[0,T], Δt(s)TC(t)+0sh(z(t+s))h(Φ(z(t))(s))ds.\Delta_{t}(s)\leq TC(t)+\int_{0}^{s}\|h_{\infty}(z(t+s^{\prime}))-h_{\infty}(\Phi(z(t))(s^{\prime}))\|ds^{\prime}\,. By Lemma 7.11, K:=zΦ(K)z([0,))K^{\prime}:=\overline{\bigcup_{z^{\prime}\in\Phi(K)}z^{\prime}([0,\infty))} is a compact subset of Z+{{\mathcal{Z}}}_{+}^{*}. It is immediately seen from the definition that hh_{\infty} is Lipschitz continuous on every compact subset of Z+{{\mathcal{Z}}}_{+}^{*}, hence on KKK\cup K^{\prime}. Therefore, there exists a constant LL, independent from t,st,s, s.t. Δt(s)TC(t)+0sLΔt(s)ds(t1,s[0,T]).\Delta_{t}(s)\leq TC(t)+\int_{0}^{s}L\Delta_{t}(s^{\prime})ds^{\prime}\qquad(\forall t\geq 1,\forall s\in[0,T])\,. Using Grönwall’s lemma, it holds that for all s[0,T]s\in[0,T], Δt(s)TC(t)eLs.\Delta_{t}(s)\leq TC(t)e^{Ls}\,. As a consequence, sups[0,T]Δt(s)TC(t)eLT\sup_{s\in[0,T]}\Delta_{t}(s)\leq TC(t)e^{LT} and the righthand side converges to zero as tt\to\infty.

End of the Proof of Th. 3.2

By Prop. 7.7, the set K:=Z0(z0)([0,))K:=\overline{Z^{0}_{\infty}(z_{0})([0,\infty))} is a compact subset of Z+{{\mathcal{Z}}}_{+}. Define K:={Φ(t,z):t0,zK},K^{\prime}:=\overline{\{\Phi(t,z):t\geq 0,z\in K\}}\,, and let Φ:[0,+)×KK\overline{\Phi}:[0,+\infty)\times K^{\prime}\to K^{\prime} be the restriction Φ\Phi to KK^{\prime}. By Prop. 7.24, there exists δ>0\delta>0 s.t. WδW_{\delta} is a strict Lyapunov function for the semiflow Φ\overline{\Phi}. Moreover, the set of equilibrium points coincides with EK{{\mathcal{E}}}\cap K^{\prime}. In particular, the equilibrium points of Φ\overline{\Phi} form a compact set. By Prop. 7.28, Z0(z0)Z^{0}_{\infty}(z_{0}) is an APT of Φ\overline{\Phi}. Note that every zEz\in{{\mathcal{E}}} can be written under the form z=(x,0,S(x))z=(x,0,S(x)) for some xSx\in{{\mathcal{S}}}. From the definition of WδW_{\delta} in (26), Wδ(z)=Wδ(x,0,S(x))=V(x,0,S(x))=F(x)W_{\delta}(z)=W_{\delta}(x,0,S(x))=V_{\infty}(x,0,S(x))=F(x). Since F(S)F({{\mathcal{S}}}) is assumed to have an empty interior, the same holds for Wδ(EK)W_{\delta}({{\mathcal{E}}}\cap K^{\prime}). By Prop. 7.23, t0Z0(z0)([t,))EK.\bigcap_{t\geq 0}\overline{Z^{0}_{\infty}(z_{0})([t,\infty))}\subset{{\mathcal{E}}}\cap K^{\prime}\,. The set in the righthand side coincides with the set of limits of convergent sequences of the form Z0(z0)(tn)Z^{0}_{\infty}(z_{0})(t_{n}) for tnt_{n}\to\infty. As Z0(z0)([0,))Z^{0}_{\infty}(z_{0})([0,\infty)) is a bounded set, d(Z0(z0)(t),E){{\mathsf{d}}}(Z^{0}_{\infty}(z_{0})(t),{{\mathcal{E}}}) tends to zero.

4 Proof of Th. 3.3

The proof follows the path of [16, Th. 10.1.6, Th. 10.2.3], but requires specific adaptations to deal with the dynamical system at hand. Define for all δ>0\delta>0, t>0t>0, and z=(x,m,v)z=(x,m,v),

Upper-bound on wδ(t)w_{\delta}(t). From Eq. (9), we obtain that for every t1t\geq 1, V(t,z(t))F(x(t))+m(t)22aε(1ea).V(t,z(t))\leq|F(x(t))|+\frac{\|m(t)\|^{2}}{2a\varepsilon(1-e^{-a})}\,.Using F(x),m(F(x)2+m2)/2\langle\nabla F(x),m\rangle\leq(\|\nabla F(x)\|^{2}+\|m\|^{2})/2, we obtain that there exists a constant c1c_{1} (depending on δ\delta) s.t. for every t1t\geq 1,

Upper-bound on ddtwδ(t)\frac{d}{dt}w_{\delta}(t). The function wδw_{\delta} is absolutely continuous on [1,+)[1,+\infty). Moreover, there exist δ>0\delta>0, c2>0c_{2}>0 (both depending on x0x_{0}) s.t. for every t1t\geq 1 a.e.,

The proof of Eq. (30) uses arguments that are similar to the ones used in the proof of Prop. 7.24 (just use Lemma 7.5 to bound the derivative of the first term in Eq. (28)). For this reason, it is omitted.

Putting together (29) and (30) using the Łojasiewicz condition. By Prop. 7.23 and 7.28, the set L:=s0{z(t):ts}L:=\overline{\bigcup_{s\geq 0}\{z(t):t\geq s\}}\, is a compact connected subset of E={(x,0,S(x)):F(x)=0}{{\mathcal{E}}}=\{(x,0,S(x)):\nabla F(x)=0\}. The set U:={x:(x,0,S(x))L}\mathcal{U}:=\{x:(x,0,S(x))\in L\} is a compact and connected subset of S{{\mathcal{S}}}. Using Assumption 3.3 and [16, Lemma 2.1.6], there exist constants σ,c>0\sigma,c>0 and θ(0,12]\theta\in(0,\frac{1}{2}], s.t. F(x)cF(x)1θ\|\nabla F(x)\|\geq c|F(x)|^{1-\theta}\, for all xx s.t. d(x,U)<σ{{\mathsf{d}}}(x,\mathcal{U})<\sigma\,. As d(x(t),U)0{{\mathsf{d}}}(x(t),\mathcal{U})\to 0, there exists T1T\geq 1 s.t. for all tTt\geq T, F(x(t))cF(x(t))1θ\|\nabla F(x(t))\|\geq c|F(x(t))|^{1-\theta}. Thus, we may replace the term F(x(t))2\|\nabla F(x(t))\|^{2} in the righthand side of Eq. (30) using F(x(t))212F(x(t))2+12F(x(t))2(1θ)\|\nabla F(x(t))\|^{2}\geq\frac{1}{2}\|\nabla F(x(t))\|^{2}+\frac{1}{2}|F(x(t))|^{2(1-\theta)}. Upon noting that 2(1θ)12(1-\theta)\geq 1, we thus obtain that there exists a constant c3c_{3} and some T1T^{\prime}\geq 1 s.t. for tTt\geq T^{\prime} a.e.,

Putting this inequality together with Eq. (29), we obtain that for some constant c4>0c_{4}>0 and for all tTt\geq T^{\prime} a.e., ddtwδ(t)c4wδ(t)2(1θ).\frac{d}{dt}w_{\delta}(t)\leq-c_{4}w_{\delta}(t)^{2(1-\theta)}\,.

End of the proof. Following the arguments of [16, Th. 10.1.6], by integrating the preceding inequality, over [T,t][T^{\prime},t], we obtain wδ(t)c5t112θw_{\delta}(t)\leq c_{5}t^{-\frac{1}{1-2\theta}} for tTt\geq T^{\prime} in the case where θ<12\theta<\frac{1}{2}, whereas wδ(t)w_{\delta}(t) decays exponentially if θ=12\theta=\frac{1}{2}. From now on, we focus on the case θ<12\theta<\frac{1}{2}. By definition of (ODE), x˙(t)2m(t)2/((1eaT)2ε2)\|\dot{x}(t)\|^{2}\leq\|m(t)\|^{2}/((1-e^{-aT^{\prime}})^{2}\varepsilon^{2}) for all tTt\geq T^{\prime}. Since Eq. (30) implies m(t)2w˙δ(t)/c2\|m(t)\|^{2}\leq-\dot{w}_{\delta}(t)/c_{2}, we deduce that there exists c,c>0c,c^{\prime}>0 s.t. for all tTt\geq T^{\prime}, t2tx˙(s)2dscwδ(t)ct112θ.\int_{t}^{2t}\|\dot{x}(s)\|^{2}ds\leq cw_{\delta}(t)\leq c^{\prime}t^{-\frac{1}{1-2\theta}}\,.Applying [16, Lemma 2.1.5], it follows that tx˙(s)2dsctθ12θ\int_{t}^{\infty}\|\dot{x}(s)\|^{2}ds\leq ct^{-\frac{\theta}{1-2\theta}} for some other constant cc. Therefore x:=limt+x(t)x^{*}:=\lim_{t\to+\infty}x(t) exists by Cauchy’s criterion and for all tTt\geq T^{\prime}, x(t)xctθ12θ\|x(t)-x^{*}\|\leq ct^{-\frac{\theta}{1-2\theta}} . Finally, since x(t)ax(t)\to a, we remark that, using the same arguments, the global Łojasiewicz exponent θ\theta can be replaced by any Łojasiewicz exponent of ff at xx^{*}. When θ=12\theta=\frac{1}{2}, the proof follows the same line.

Proofs of Section 4

Let Assumptions 2.1, 2.2 and 4 hold true. There exists γˉ0>0\bar{\gamma}_{0}>0 s.t. for every R>0R>0, there exists s>0s>0,

Let R>0R>0. We denote by (Hγ,x,Hγ,m,Hγ,v)(H_{\gamma,{\mathsf{x}}},H_{\gamma,{\mathsf{m}}},H_{\gamma,{\mathsf{v}}}) the block components of HγH_{\gamma}. There exists a constant CsC_{s} depending only on ss s.t. Hγ1+sCs(Hγ,x1+s+Hγ,m1+s+Hγ,v1+s)\|H_{\gamma}\|^{1+s}\leq C_{s}(\|H_{\gamma,{\mathsf{x}}}\|^{1+s}+\|H_{\gamma,{\mathsf{m}}}\|^{1+s}+\|H_{\gamma,{\mathsf{v}}}\|^{1+s}). Hence, it is sufficient to prove that Eq. (31) holds respectively when replacing HγH_{\gamma} with each of Hγ,x,Hγ,m,Hγ,vH_{\gamma,{\mathsf{x}}},H_{\gamma,{\mathsf{m}}},H_{\gamma,{\mathsf{v}}}. Consider z=(x,m,v)z=(x,m,v) in Z+{{\mathcal{Z}}}_{+}. We write: Hγ,x(n+1,z,ξ)ε1(m1αˉ(γ)n+f(x,ξ)).\|H_{\gamma,{\mathsf{x}}}(n+1,z,\xi)\|\leq\varepsilon^{-1}(\|\frac{m}{1-\bar{\alpha}(\gamma)^{n}}\|+\|\nabla f(x,\xi)\|)\,. Thus, for every zz s.t. eγ(n,z)R\|e_{\gamma}(n,z)\|\leq R, there exists a constant CC depending only on ε\varepsilon, RR and ss s.t. Hγ,x(n+1,z,ξ)1+sC(1+f(x,ξ)1+s)\|H_{\gamma,{\mathsf{x}}}(n+1,z,\xi)\|^{1+s}\leq C(1+\|\nabla f(x,\xi)\|^{1+s}). By Assumption 4, (31) holds for Hγ,xH_{\gamma,{\mathsf{x}}} instead of HγH_{\gamma}. Similar arguments hold for Hγ,mH_{\gamma,{\mathsf{m}}} and Hγ,vH_{\gamma,{\mathsf{v}}} upon noting that the functions γ(1αˉ(γ))/γ\gamma\mapsto(1-\bar{\alpha}(\gamma))/\gamma and γ(1βˉ(γ))/γ\gamma\mapsto(1-\bar{\beta}(\gamma))/\gamma are bounded under Assumption 2.2.

Choose γˉ0>0\bar{\gamma}_{0}>0 and s>0s>0 as in Lemma 8.1. For every γγˉ0\gamma\leq\bar{\gamma}_{0},

By Lemma 8.1, the righthand side is finite and does not depend on (n,γ)(n,\gamma).

For every γ,R>0\gamma,R>0, we define zγ,R:=Xγ(zγ,R)=XγBR(zγ){\mathsf{z}}^{\gamma,R}:={\mathsf{X}}_{\gamma}(z^{\gamma,R})={\mathsf{X}}_{\gamma}\circ B_{R}(z^{\gamma}). Namely, zγ,R{\mathsf{z}}^{\gamma,R} is the interpolated process associated with the sequence (znγ,R)(z^{\gamma,R}_{n}). It is a random variable on C([0,+),Z)C([0,+\infty),{{\mathcal{Z}}}). We recall that Fn{{\mathcal{F}}}_{n} is the σ\sigma-algebra generated by the r.v. (ξk:1kn)(\xi_{k}:1\leq k\leq n). For every γ,n,R\gamma,n,R, we use the notation: Δ0γ,R:=0\Delta_{0}^{\gamma,R}:=0 and

It is an immediate consequence of Lemma 8.3 and [6, Lemma 6.2]

The proof of the following lemma is omitted.

By Eq. (34), the RHS of the above inequality tends to zero as γ0\gamma\to 0. The proof is complete.

2 Proof of Th. 13

In the sequel, we will use the notation Pˉγ,x\bar{P}^{\gamma,x} as a shorthand notation for Pˉγ,δx\bar{P}^{\gamma,\delta_{x}} where δx\delta_{x} is the Dirac measure at some point xXx\in{\mathsf{X}}. Finally, let Ψ\Psi be a semiflow on X{\mathsf{X}}. A Markov kernel PP is Feller if PfPf is continuous for every bounded continuous ff. {assumption} Let νP(X)\nu\in{{\mathcal{P}}}({\mathsf{X}}).

For every γ\gamma, Pˉγ\bar{P}_{\gamma} is Feller.

For every δ>0\delta>0, for every compact set KXK\subset{\mathsf{X}}, for every t>0t>0, limγ0supxKPˉγ,x(Xt/γΨt(x)>δ)=0.\lim_{\gamma\to 0}\sup_{x\in K}\bar{P}^{\gamma,x}\left(\|X_{\lfloor t/\gamma\rfloor}-\Psi_{t}(x)\|>\delta\right)=0\,.

Let BCΨBC_{\Psi} be the Birkhof center of Ψ\Psi i.e., the closure of the set of recurrent points.

We omit the proof of this result which follows a similar reasoning to [6, Th. 5.5 and Proof in section 8.4] and makes use of results from .

Proofs of Section 5

The following lemma will be useful in the proofs.

Let the sequence (rn)(r_{n}) be defined as in Algorithm 2. Assume that 0αn10\leq\alpha_{n}\leq 1 for all nn and that (1αn)/γna>0(1-\alpha_{n})/\gamma_{n}\to a>0 as n+n\to+\infty. Then,

The sequence (rn)(r_{n}) is nondecreasing and converges to 11.

Under Assumption 5.3 1, for every ϵ>0\epsilon>0, for sufficiently large nn, we have rn1eaγ02(1κ)n1κr_{n}-1\leq e^{-\frac{a\gamma_{0}}{2(1-\kappa)}n^{1-\kappa}} if κ(0,1)\kappa\in(0,1) and rn1naγ0/(1+ϵ)r_{n}-1\leq n^{-a\gamma_{0}/(1+\epsilon)} if κ=1\kappa=1.

A similar lemma holds for the sequence (rˉn)(\bar{r}_{n}).

We define zˉn=(xn1,mn,vn)\bar{z}_{n}=(x_{n-1},m_{n},v_{n}) (note the shift in the index of the variable xx). We have

where hh_{\infty} is defined in Eq. (18) and where we set

and ςn+1=(ςn+1x,ςn+1m,ςn+1v)\varsigma_{n+1}=(\varsigma_{n+1}^{x},\varsigma_{n+1}^{m},\varsigma_{n+1}^{v}) with the components defined by: ςn+1x=mnε+vnγnγn+1m^nε+v^n\varsigma_{n+1}^{x}=\frac{m_{n}}{\varepsilon+\sqrt{v_{n}}}-\frac{\gamma_{n}}{\gamma_{n+1}}\frac{\hat{m}_{n}}{\varepsilon+\sqrt{\hat{v}_{n}}}, ςn+1m=(1αn+1γn+1a)(F(xn)mn)+a(F(xn)F(xn1))\varsigma_{n+1}^{m}=\left(\frac{1-\alpha_{n+1}}{\gamma_{n+1}}-a\right)(\nabla F(x_{n})-m_{n})+a(\nabla F(x_{n})-\nabla F(x_{n-1})) and ςn+1v=(1βn+1γn+1b)(S(xn)vn)+b(S(xn)S(xn1))\varsigma_{n+1}^{v}=\left(\frac{1-\beta_{n+1}}{\gamma_{n+1}}-b\right)(S(x_{n})-v_{n})+b(S(x_{n})-S(x_{n-1})) . We prove that ςn0\varsigma_{n}\to 0 a.s. Using the triangular inequality,

(where τn=k=0nγk\tau_{n}=\sum_{k=0}^{n}\gamma_{k}), is almost surely a bounded APT of the semiflow Φˉ\bar{\Phi} defined by (ODE∞). The proof is concluded by applying Prop. 7.23 and Prop. 7.24.

2 Proof of Prop. 5.2

As infF>\inf F>-\infty, one can assume without loss of generality that F0F\geq 0. In the sequel, CC denotes some positive constant which may change from line to line. We define an:=(1αn+1)/γna_{n}:=(1-\alpha_{n+1})/\gamma_{n} and Pn:=12anrnmn2,1ε+v^nP_{n}:=\frac{1}{2a_{n}r_{n}}\langle m_{n}^{\odot 2},\frac{1}{\varepsilon+\sqrt{\hat{v}_{n}}}\rangle. We have anaa_{n}\to a and rn1r_{n}\to 1. By Assumption 5.1-1,

We set un:=1an+1anu_{n}:=1-\frac{a_{n+1}}{a_{n}} and Dn:=rn1ε+v^nD_{n}:=\frac{r_{n}^{-1}}{\varepsilon+\sqrt{\hat{v}_{n}}}, so that Pn=12anDn,mn2P_{n}=\frac{1}{2a_{n}}\langle D_{n},m_{n}^{\odot 2}\rangle. We can write:

We estimate the vector Dn+1DnD_{n+1}-D_{n}. Using that (rn1)(r_{n}^{-1}) is non-increasing,

Remarking that vn+1βn+1vnv_{n+1}\geq\beta_{n+1}v_{n}, recalling that (rˉn)(\bar{r}_{n}) is nondecreasing and using the update rules of vnv_{n} and rˉn\bar{r}_{n}, we obtain after some algebra

It is easy to see that cn/γnb/2c_{n}/\gamma_{n}\to b/2. Thus, for any δ>0\delta>0, cn+1(b+2δ)γn/2c_{n+1}\leq(b+2\delta)\gamma_{n}/2 for all nn large enough. Using also that v^n+1/(ε+v^n+1)1\sqrt{\hat{v}_{n+1}}/(\varepsilon+\sqrt{\hat{v}_{n+1}})\leq 1, we obtain that Dn+1Dnb+2δ2γnDn.D_{n+1}-D_{n}\leq\frac{b+2\delta}{2}\gamma_{n}D_{n}\,. Substituting this inequality in Eq. (36), we get

As anaa_{n}\to a, we have anb+2δ4ab+δ4a_{n}-\frac{b+2\delta}{4}\geq a-\frac{b+\delta}{4} for all nn large enough. Hence,

Using the Cauchy-Schwartz inequality and Assumption 5.1 2, it is easy to show the inequality F(xn),m^nε+v^nC(1+F(xn)+Pn)\langle\nabla F(x_{n}),\frac{\hat{m}_{n}}{\varepsilon+\sqrt{\hat{v}_{n}}}\rangle\leq C(1+F(x_{n})+P_{n}). Moreover, using the componentwise inequality (fn+1mn)22fn+12+2mn2(\nabla f_{n+1}-m_{n})^{\odot 2}\leq 2\nabla f_{n+1}^{\odot 2}+2m_{n}^{\odot 2} along with Assumption 5.1 2, we obtain

Putting all pieces together with Eq. (35),

Define Vn:=(1Cγn12)F(xn1)+(1un1)PnV_{n}:=(1-C\gamma_{n-1}^{2})F(x_{n-1})+(1-u_{n-1})P_{n} where the constant CC is fixed so that Eq. (38) holds. Then,

By Assumption 5.1, limsupnun1/γn<2ab/2\lim\sup_{n}u_{n-1}/\gamma_{n}<2a-b/2 and for δ\delta small enough, we obtain

3 Proof of Th. 5.3

We use [20, Th. 1]. All the assumptions in the latter can be verified in our case, at the exception of a positive definiteness condition on the limiting covariance matrix, which corresponds, in our case, to the matrix QQ given by Eq. (16). As QQ is not positive definite, it is strictly speaking not possible to just cite and apply [20, Th. 1]. Nevertheless, a detailed inspection of the proofs of shows that only a minor adaptation is needed in order to cover the present case. Therefore, proving the convergence result of from scratch is worthless. It is sufficient to verify the assumptions of [20, Th. 1] (except the definiteness of QQ) and then to point out the specific part of the proof of which requires some adaptation.

Let zn=(xn,mn,vn)z_{n}=(x_{n},m_{n},v_{n}) be the output of Algorithm 2. Define z=(x,0,S(x))z^{*}=(x^{*},0,S(x^{*})). Define ηn+1:=(0,a(fn+1F(xn)),b(fn+12S(xn)))\eta_{n+1}:=(0,a(\nabla f_{n+1}-\nabla F(x_{n})),b(\nabla f_{n+1}^{\odot 2}-S(x_{n}))). We have

where ϵn+1:=(ϵn+11,ϵn+12,ϵn+13)\epsilon_{n+1}:=(\epsilon_{n+1}^{1},\epsilon_{n+1}^{2},\epsilon^{3}_{n+1}), whose components are given by

Here, ηn+1\eta_{n+1} is a martingale increment noise and ϵn+1=(ϵn+11,ϵn+12,ϵn+13)\epsilon_{n+1}=(\epsilon_{n+1}^{1},\epsilon_{n+1}^{2},\epsilon^{3}_{n+1}) is a remainder term. The aim is to check the assumptions (A1.1) to (A1.3) of , where the role of the quantities (hh, εn\varepsilon_{n}, rnr_{n}, σn\sigma_{n}, α\alpha, ρ\rho, β\beta) in is respectively played by the quantities (hh_{\infty}, ηn\eta_{n}, ϵn\epsilon_{n}, γn\gamma_{n}, κ\kappa, 11, 11) of the present paper.

It is sufficient to verify the latter for ϵni\epsilon^{i}_{n} (i=1,2,3i=1,2,3) in place of ϵn\epsilon_{n}. The map (m,v)m/(ε+v)(m,v)\mapsto m/(\varepsilon+\sqrt{v}) is Lipschitz continuous in a neighborhood of (0,S(x))(0,S(x^{*})) by Assumption 2. Thus, for MM small enough, there exists a constant CC s.t. if znzM\|z_{n}-z^{*}\|\leq M, then ϵn+11Crn+11mn+1mn+Crˉn+11vn+1vn.\|\epsilon_{n+1}^{1}\|\leq C\left\|r_{n+1}^{-1}m_{n+1}-m_{n}\right\|+C\left\|\bar{r}_{n+1}^{-1}v_{n+1}-v_{n}\right\|\,. Using the triangular inequality and the fact that rn+1,rˉn+1r_{n+1},\bar{r}_{n+1} are bounded sequences away from zero, there exists an other constant CC s.t.

Using Lemma 9.1 under Assumption 5.3 (note that γ0>1/2L1/a\gamma_{0}>1/2L\geq 1/a when κ=1\kappa=1), we obtain that the sequence rn1/γn|r_{n}-1|/\gamma_{n} is bounded, thus rn+11Cγn+1|r_{n+1}-1|\leq C\gamma_{n+1}. The sequence (1αn)/γn(1-\alpha_{n})/\gamma_{n} being also bounded, it holds that

References