Convergence and Dynamical Behavior of the ADAM Algorithm for Non-Convex Stochastic Optimization

Anas Barakat, Pascal Bianchi

Introduction

To alleviate these limitations, the popular Adam algorithm adjusts the learning rate coordinate-wise, as a function of the past values of the squared gradient vectors’ coordinates. The algorithm thus combines the assets of inertial methods with an adaptive per-coordinate learning rate selection. Finally, the algorithm includes a so-called bias correction step. Acting on the current estimate of the gradient vector, this step is especially useful during the early iterations.

Despite the growing popularity of the algorithm, only few works investigate its behavior from a theoretical point of view (see the discussion in Section 6). The present paper studies the convergence of Adam from a dynamical system viewpoint.

We introduce a continuous-time version of the Adam algorithm under the form of a non-autonomous ordinary differential equation (ODE). Building on the existence of an explicit Lyapunov function for the ODE, we show the existence of a unique global solution to the ODE. This first result turns out to be non-trivial due to the irregularity of the vector field. We then establish the convergence of the continuous-time Adam trajectory to the set of critical points of the objective function $F$ . The proposed continuous-time version of Adam provides useful insights on the effect of the bias correction step. It is shown that, close to the origin, the objective function $F$ is non-increasing along the Adam trajectory, suggesting that early iterations of Adam can only improve the initial guess.

Under a Łojasiewicz-type condition, we prove that the solution to the ODE converges to a single critical point of the objective function $F$ . We provide convergence rates in this case.

In discrete time, we first analyze the Adam iterates in the constant stepsize regime as originally introduced in . In this work, it is shown that the discrete-time Adam iterates shadow the behavior of the non-autonomous ODE in the asymptotic regime where the stepsize parameter $\gamma$ of Adam is small. More precisely, we consider the interpolated process ${\mathsf{z}}^{\gamma}(t)$ which consists of a piecewise linear interpolation of the Adam iterates. The random process ${\mathsf{z}}^{\gamma}$ is indexed by the parameter $\gamma$ , which is assumed constant during the whole run of the algorithm. In the space of continuous functions on $[0,+\infty)$ equipped with the topology of uniform convergence on compact sets, we establish that ${\mathsf{z}}^{\gamma}$ converges in probability to the solution to the non-autonomous ODE when $\gamma$ tends to zero.

Under a stability condition, we prove the asymptotic ergodic convergence of the probability of the discrete-time Adam iterates to approach the set of critical points of the objective function in the doubly asymptotic regime where $n\to\infty$ then $\gamma\to 0$ .

Beyond the original constant stepsize Adam , we propose a decreasing stepsize version of the algorithm. We provide sufficient conditions ensuring the stability and the almost sure convergence of the iterates towards the critical points of the objective function.

We establish a convergence rate of the stochastic iterates of the decreasing stepsize algorithm under the form of a conditional central limit theorem.

We claim that our analysis can be easily extended to other adaptive algorithms such as e.g. RmsProp or AdaGrad and AmsGrad (see Section 6).

The paper is organized as follows. In Section 2, we present the Adam algorithm and the main assumptions. Our main results are stated in Sections 3 to 5. We provide a review of related works in Section 6. The rest of the paper addresses the proofs of our results (Sections 7 to 9).

The Adam Algorithm

It satisfies: $z_{n}=T_{\gamma,\alpha,\beta}(n,z_{n-1},\xi_{n})\,,$ for every $n\geq 1$ , where for every $z=(x,m,v)$ in ${{\mathcal{Z}}}_{+}$ , $\xi\in\Xi$ ,

The iterates $z_{n}$ form a non-homogeneous Markov chain, because $T_{\gamma,\alpha,\beta}(n,z,\xi)$ depends on $n$ . This is due to the so-called debiasing step, which consists of replacing $m_{n},v_{n}$ in Algorithm 1 by their “debiased” versions $\hat{m}_{n},\hat{v}_{n}$ . The motivation becomes clear when expanding the expression:

From this equation, it is observed that, $\hat{m}_{n}$ forms a convex combination of the past gradients. This is unlike $m_{n}$ , which may be small during the first iterations.

For almost every $\xi$ , the map $f(\,.\,,\xi)$ is continuously differentiable.

2 Asymptotic Regime

We address the constant stepsize regime, where $\gamma$ is fixed along the iterations (the default value recommended in is $\gamma=0.001$ ). As opposed to the decreasing stepsize context, the sequence $z_{n}^{\gamma}:=z_{n}$ cannot in general converge as $n$ tends to infinity, in an almost sure sense. Instead, we investigate the asymptotic behavior of the family of processes $(n\mapsto z_{n}^{\gamma})_{\gamma>0}$ indexed by $\gamma$ , in the regime where $\gamma\to 0$ . We use the so-called ODE method (see e.g. ). The interpolated process ${\mathsf{z}}^{\gamma}$ is the piecewise linear function defined on $[0,+\infty)\to{{\mathcal{Z}}}_{+}$ for all $t\in[n\gamma,(n+1)\gamma)$ by:

We establish the convergence in probability of the family of random processes $({\mathsf{z}}^{\gamma})_{\gamma>0}$ as $\gamma$ tends to zero, towards a deterministic continuous-time system defined by an ODE. The latter ODE, which we provide below at Eq. (ODE), will be referred to as the continuous-time version of Adam.

the initialization being chosen as $z_{0}^{\gamma}=(x_{0},0,0)$ .

Continuous-Time System

In order to gain insight into the behavior of the sequence $(z_{n}^{\gamma})$ defined by (5), it is convenient to rewrite the Adam iterations under the following equivalent form, for every $n\geq 1$ :

where we define for every $\gamma>0$ , $z\in{{\mathcal{Z}}}_{+}$ ,

where $a,b$ are the constants defined in Assumption 2.2. We prove that, for any fixed $(t,z)$ , the quantity $h(t,z)$ coincides with the limit of $h_{\gamma}(\lfloor t/\gamma\rfloor,z)$ as $\gamma\downarrow 0$ . This remark along with Eq. (6) suggests that, as $\gamma\downarrow 0$ , the interpolated process ${\mathsf{z}}^{\gamma}$ shadows the non-autonomous differential equation

2 Existence, Uniqueness, Convergence

Since $h(\,.\,,z)$ is non-continuous at point zero for a fixed $z\in{{\mathcal{Z}}}_{+}$ , and since $h(t,\,.\,)$ is not locally Lipschitz continuous for a fixed $t~{}>~{}0$ , the existence and uniqueness of the solution to (ODE) do not stem directly from off-the-shelf theorems.

Let $x_{0}$ be fixed. A continuous map $z:[0,+\infty)\to{{\mathcal{Z}}}_{+}$ is said to be a global solution to (ODE) with initial condition $(x_{0},0,0)$ if $z$ is continuously differentiable on $(0,+\infty)$ , if Eq. (ODE) holds for all $t>0$ , and if $z(0)=(x_{0},0,0)$ .

Let Assumptions 2.1 to 2.2 hold true. There exists a unique global solution $z:[0,+\infty)\to{{\mathcal{Z}}}_{+}$ to (ODE) with initial condition $(x_{0},0,0)$ . Moreover, $z([0,+\infty))$ is a bounded subset of ${{\mathcal{Z}}}_{+}$ .

On the other hand, we note that a solution may not exist for an initial point $(x_{0},m_{0},v_{0})$ with arbitrary (non-zero) values of $m_{0},v_{0}$ .

Let Assumptions 2.1 to 2.2 hold true. Assume that $F({{\mathcal{S}}})$ has an empty interior. Let $z:t\mapsto(x(t),m(t),v(t))$ be the global solution to (ODE) with the initial condition $(x_{0},0,0)$ . Then, the set ${{\mathcal{S}}}$ is non-empty and $\lim_{t\to\infty}{{\mathsf{d}}}(x(t),{{\mathcal{S}}})=0$ , $\lim_{t\to\infty}m(t)=0$ , $\lim_{t\to\infty}S(x(t))-v(t)=0$ .

Then, $t\mapsto V(t,z(t))$ is decreasing if $z(\cdot)$ is the global solution to (ODE).

Cost decrease at the origin. As $F$ itself is not a Lyapunov function for (ODE), there is no guarantee that $F(x(t))$ is decreasing w.r.t. $t$ . Nevertheless, the statement holds at the origin. Indeed, it can be shown that $\lim_{t\downarrow 0}V(t,z(t))=F(x_{0})$ (see Prop. 7.7). As a consequence,

In other words, the (continuous-time) Adam procedure can only improve the initial guess $x_{0}$ . This is the consequence of the so-called bias correction steps in Adam (see Algorithm 1) . If these debiasing steps were deleted in the Adam iterations, the early stages of the algorithm could degrade the initial estimate $x_{0}$ .

Derivatives at the origin. The proof of Th. 3.1 reveals that the initial derivative is given by $\dot{x}(0)=-\nabla F(x_{0})/(\varepsilon+\sqrt{S(x_{0})})$ (see Lemma 7.1). In the absence of debiasing steps, the initial derivative $\dot{x}(0)$ would be a function of the initial parameters $m_{0}$ , $v_{0}$ , and the user would be required to tune these hyperparameters. No such tuning is required thanks to the debiasing step. When $\varepsilon$ is small and when the variance of $\nabla f(x_{0},\xi)$ is small (i.e., $S(x_{0})\simeq\nabla F(x_{0})^{\odot 2}$ ), the initial derivative $\dot{x}(0)$ is approximately equal to $-\nabla F(x_{0})/|\nabla F(x_{0})|$ . This suggests that in the early stages of the algorithm, the Adam iterations are comparable to the sign variant of the gradient descent, the properties of which were discussed in previous works, see .

3 Convergence rates

In this paragraph, we establish the convergence to a single critical point of $F$ and quantify the convergence rate, using the following assumption . {assumption}[Łojasiewicz property] For any $x^{*}\in{\mathcal{S}}$ , there exist $c>0\,,\sigma>0$ and $\theta\in(0,\frac{1}{2}]$ s.t.

Assumption 3.3 holds for real-analytic functions and semialgebraic functions. We refer to for a discussion and a review of applications. We will call any $\theta$ satisfying (12) for some $c,\sigma>0$ , as a Łojasiewicz exponent of $f$ at $x^{*}$ . The next result establishes the convergence of the function $x(t)$ generated by the ODE to a single critical point of $f$ , and provides the convergence rate as a function of the Łojasiewicz exponent of $f$ at this critical point. The proof is provided in subsection 7.4.

Moreover, if $\theta\in(0,\frac{1}{2}]$ is a Łojasiewicz exponent of $f$ at $x^{*}$ , there exists a constant $C>0$ s.t. for all $t\geq 0$ ,

Discrete-Time System: Convergence of Adam

The sequence $(\xi_{n}:n\geq 1)$ is iid, with the same distribution as $\xi$ .

Let $p>0$ . Assume either one of the following conditions.

The value of $p$ will be specified in the sequel, in the statement of the results. Clearly, Assumption 4 ii is stronger than Assumption 4 i. We shall use either the latter or the former in our statements.

Convergence in the long run. When the stepsize $\gamma$ is constant, the sequence $(x_{n}^{\gamma})$ cannot converge in the almost sure sense as $n\to\infty$ . Convergence may only hold in the doubly asymptotic regime where $n\to\infty$ then $\gamma\to 0$ .

A Decreasing Stepsize Adam Algorithm

Adam inherently uses constant stepsizes. Consequently, the iterates (5) do not converge in the almost sure sense. In order to achieve convergence, we introduce in this section a decreasing stepsize version of Adam. The iterations are given in Algorithm 2.

where for every $n,k$ , $\rho_{n,k}=\alpha_{n}\cdots\alpha_{k+1}(1-\alpha_{k})$ . A similar equation holds for $\hat{v}_{n}$ .

2 Almost sure convergence

$\sum_{n}\gamma_{n}=+\infty$ and $\sum_{n}\gamma_{n}^{2}<+\infty$ ,

There exist $a,b$ s.t. $0<b<4a$ , $\gamma_{n}^{-1}(1-\alpha_{n})\to a$ and $\gamma_{n}^{-1}(1-\beta_{n})\to b$ .

Th. 5.1 establishes the almost sure convergence of $x_{n}$ to the set of critical points of $F$ , under the assumption that the sequence $((x_{n},m_{n},v_{n}))$ is a.s. bounded. The next result provides a sufficient condition under which almost sure boundedness holds. {assumption} The following holds.

We assume the condition: $\lim\sup_{n\to\infty}\left(\frac{1}{\gamma_{n}}-\left(\frac{1-\alpha_{n+2}}{1-\alpha_{n+1}}\right)\frac{1}{\gamma_{n+1}}\right)<2(a-\frac{b}{4})\,,$ which is satisfied for instance if $b<4a$ and $1-\alpha_{n+1}=a\gamma_{n}$ .

3 Central Limit Theorem

Let $x^{*}\in{{\mathcal{S}}}$ . There exists a neighborhood ${{\mathcal{V}}}$ of $x^{*}$ s.t.

$F$ is twice continuously differentiable on ${{\mathcal{V}}}$ , and the Hessian $\nabla^{2}F(x^{*})$ of $F$ at $x^{*}$ is positive definite.

$S$ is continuously differentiable on ${{\mathcal{V}}}$ .

Define $D:=\textrm{diag}\left((\varepsilon+\sqrt{S_{1}(x^{*})})^{-1},\cdots,(\varepsilon+\sqrt{S_{d}(x^{*})})^{-1}\right)\,.$ Let $P$ be an orthogonal matrix s.t. the following spectral decomposition holds:

where $\lambda_{1},\cdots,\lambda_{d}$ are the (positive) eigenvalues of $D^{1/2}\nabla^{2}F(x^{*})D^{1/2}$ . Define

where $I_{d}$ represents the $d\times d$ identity matrix and $\nabla S(x^{*})$ is the Jacobian matrix of $S$ at $x^{*}$ . The largest real part of the eigenvalues of $H$ coincides with $-L$ , where

There exist $\kappa\in(0,1]$ , $\gamma_{0}>0$ , s.t. the sequence $(\gamma_{n})$ satisfies $\gamma_{n}={\gamma_{0}}/{(n+1)^{\kappa}}$ for all $n$ . If $\kappa=1$ , we assume moreover that $\gamma_{0}>\frac{1}{2L}$ .

The sequences $\left(\frac{1}{\gamma_{n}}(\frac{1-\alpha_{n}}{\gamma_{n}}-a)\right)$ and $\left(\frac{1}{\gamma_{n}}(\frac{1-\beta_{n}}{\gamma_{n}}-b)\right)$ are bounded.

The variable $v_{n}$ has an impact on the limiting covariance $\Sigma_{1}$ through its limit $S(x^{*})$ (used to define $D$ ), but the fluctuations of $v_{n}$ and the parameter $b$ have no effect on $\Sigma_{1}$ . As a matter of fact, $\Sigma_{1}$ coincides with the limiting covariance matrix that would have been obtained by considering iterates of the form

which can be interpreted as a preconditioned version of the stochastic heavy ball algorithm . Of course, the above iterates are not implementable because the preconditioning matrix $D$ is unknown.

When $a$ is large, $\Sigma_{1}$ is close to the matrix $\Sigma_{1}^{(0)}$ obtained when letting $a\to+\infty$ in Eq. (17). The matrix $\Sigma_{1}^{(0)}$ is the solution to the Lyapunov equation

The matrix $\Sigma_{1}^{(0)}$ can be interpreted as the asymptotic covariance matrix of the $x$ -variable in the absence of the inertial term (that is, when one considers RmsProp instead of Adam). The matrix $\Sigma_{1}^{(0)}$ approximates $\Sigma_{1}$ in the sense that $\Sigma_{1}=\Sigma_{1}^{(0)}+\frac{1}{a}\Delta+O(\frac{1}{a^{2}})$ for some symmetric matrix $\Delta$ which can be explicited. The matrix $\Delta$ is neither positive nor negative definite in general. This suggests that the question of the potential benefit of the presence of an inertial term is in general problem dependent.

In the statement of Th. 5.3, the conditioning event $\{z_{n}\to z^{*}\}$ can be replaced by the event $\{x_{n}\to x^{*}\}$ under the additional assumption that $\sum_{n}\gamma_{n}^{2}<+\infty$ .

Related Works

Although the idea of adapting the (per-coordinate) learning rates as a function of past gradient values is not new (see e.g. variable metric methods such as the BFGS algorithms), AdaGrad led the way to a new class of algorithms that are sometimes referred to as adaptive gradient methods. AdaGrad consists of dividing the learning rate by the square root of the sum of previous gradients squared componentwise. The idea was to give larger learning rates to highly informative but infrequent features instead of using a fixed predetermined schedule. However, in practice, the division by the cumulative sum of squared gradients may generate small learning rates, thus freezing the iterates too early. Several works proposed heuristical ways to set the learning rates using a less aggressive policy. The work introduced an unpublished, yet popular, algorithm referred to as RmsProp where the cumulative sum used in AdaGrad is replaced by a moving average of squared gradients. Adam combines the advantages of both AdaGrad, RmsProp and inertial methods.

As opposed to AdaGrad, for which theoretical convergence guarantees exist , Adam is comparatively less studied. The initial paper suggests a $\mathcal{O}(\frac{1}{\sqrt{T}})$ average regret bound in the convex setting, but exhibits a counterexample in contradiction with this statement. The latter counterexample implies that the average regret bound of Adam does not converge to zero. A first way to overcome the problem is to modify the Adam iterations themselves in order to obtain a vanishing average regret. This led to propose a variant called AmsGrad with the aim to recover, at least in the convex case, the sought guarantees. The work interprets Adam as a variance-adapted sign descent combining an update direction given by the sign and a magnitude controlled by a variance adaptation principle. A “noiseless” version of Adam is considered in . Under quite specific values of the Adam-hyperparameters, it is shown that for every $\delta>0$ , there exists some time instant for which the norm of the gradient of the objective at the current iterate is no larger than $\delta$ . The recent paper provides a similar result for AmsGrad and AdaGrad, but the generalization to Adam is subject to conditions which are not easily verifiable. The paper provides a convergence result for RmsProp using the objective function $F$ as a Lyapunov function. However, our work suggests that unlike RmsProp, Adam does not admit $F$ as a Lyapunov function. This makes the approach of hardly generalizable to Adam. Moreover, considers biased gradient estimates instead of the debiased estimates used in Adam.

In the present work, we study the behavior of an ODE, interpreted as the limit in probability of the (interpolated) Adam iterates as the stepsize tends to zero. Closely related continuous-time dynamical systems are also studied in . We leverage the idea of approximating a discrete time stochastic system by a deterministic continuous one, often referred to as the ODE method. A recent work fruitfully exploits this method to study a stochastic version of the celebrated heavy ball algorithm. We refer to for the reader interested in the non-differentiable setting with an analysis of the stochastic subgradient algorithm for non-smooth non-convex objective functions.

Concomitant to the present paper, Da Silva and Gazeau (posted only four weeks after the first version of the present work) study the asymptotic behavior of a similar dynamical system as the one introduced here. They establish several results in continuous time, such as avoidance of traps as well as convergence rates in the convex case; such aspects are out of the scope of this paper. However, the question of the convergence of the (discrete-time) iterates is left open. In the current paper, we also exhibit a Lyapunov function which allows, amongst others, to draw useful conclusions on the effect of the debiasing step of Adam . Finally, studies a slightly modified version of Adam allowing to recover an ODE with a locally Lipschitz continuous vector field, whereas the original Adam algorithm leads to an ODE with an irregular vector field. This technical issue is tackled in the present paper.

Proofs of Section 3

When $\eta=0$ , Eq. (ODEη) boils down to the equation of interest (ODE). The choice $\eta\in(0,+\infty)$ will be revealed useful to prove Th. 3.1. Indeed, for $\eta>0$ , a solution to Eq. (ODEη) can be shown to exist (on some interval) due to the continuity of the map $h(\,.+\eta,\,.\,)$ . Considering a family of such solutions indexed by $\eta\in(0,1]$ , the idea is to prove the existence of a solution to (ODE) as a cluster point of the latter family when $\eta\downarrow 0$ . Indeed, as the family is shown to be equicontinuous, such a cluster point does exist thanks to the Arzelà-Ascoli theorem. When $\eta=+\infty$ , Eq. (ODEη) rewrites

where $h_{\infty}(z):=\lim_{t\to\infty}h(t,z)$ . It is useful to note that for $(x,m,v)\in{{\mathcal{Z}}}_{+}$ ,

Contrary to Eq. (ODE), Eq. (ODE∞) defines an autonomous ODE. The latter admits a unique global solution for any initial condition in ${{\mathcal{Z}}}_{+}$ , and defines a dynamical system ${{\mathcal{D}}}$ . We shall exhibit a strict Lyapunov function for this dynamical system ${{\mathcal{D}}}$ , and deduce that any solution to (ODE∞) converges to the set of equilibria of ${{\mathcal{D}}}$ as $t\to\infty$ . On the otherhand, we will prove that the solution to (ODE) with a proper initial condition is a so-called asymptotic pseudotrajectory (APT) of ${{\mathcal{D}}}$ . Due to the existence of a strict Lyapunov function, the APT shall inherit the convergence behavior of the autonomous system as $t\to\infty$ , which will prove Th. 3.2.

It is convenient to extend the map $h:(0,+\infty)\times{{\mathcal{Z}}}_{+}\to{{\mathcal{Z}}}$ on $(0,+\infty)\times{{\mathcal{Z}}}\to{{\mathcal{Z}}}$ by setting $h(t,(x,m,v)):=h(t,(x,m,|v|))$ for every $t>0$ , $(x,m,v)\in{{\mathcal{Z}}}$ . Similarly, we extend $h_{\infty}$ as $h_{\infty}((x,m,v)):=h_{\infty}((x,m,|v|))$ . For any $T\in(0,+\infty]$ and any $\eta\in[0,+\infty]$ , we say that a map $z:[0,T)\to{{\mathcal{Z}}}$ is a solution to (ODEη) on $[0,T)$ with initial condition $z_{0}\in{{\mathcal{Z}}}_{+}$ , if $z$ is continuous on $[0,T)$ , continuously differentiable on $(0,T)$ , and if (ODEη) holds for all $t\in(0,T)$ . When $T=+\infty$ , we say that the solution is global. We denote by $Z^{\eta}_{T}(z_{0})$ the subset of $C([0,T),{{\mathcal{Z}}})$ formed by the solutions to (ODEη) on $[0,T)$ with initial condition $z_{0}$ . For any $K\subset{{\mathcal{Z}}}_{+}$ , we define $Z^{\eta}_{T}(K):=\bigcup_{z\in K}Z^{\eta}_{T}(z)$ .

Let Assumptions 2, 7.1 and 7.1 hold. For every $\eta\in[0,+\infty]$ , $T\in(0,+\infty]$ , $z_{0}\in{{\mathcal{Z}}}_{+}$ , $z\in Z_{T}^{\eta}(z_{0})$ , it holds that $z((0,T))\subset{{\mathcal{Z}}}_{+}^{*}$ .

Set $z(t)=(x(t),m(t),v(t))$ for all $t$ . Consider $i\in\{1,\dots,d\}$ . Assume by contradiction that there exists $t_{0}\in(0,T)$ s.t. $v_{i}(t_{0})<0$ . Set $\tau:=\sup\{t\in[0,t_{0}]:v_{i}(t)\geq 0\}$ . Clearly, $\tau<t_{0}$ and $v_{i}(\tau)=0$ by the continuity of $v_{i}$ . Since $v_{i}(t)\leq 0$ for all $t\in(\tau,t_{0}]$ , it holds that $\dot{v}_{i}(t)=b(S_{i}(x(t))-v_{i}(t))$ is nonnegative for all $t\in(\tau,t_{0}]$ . This contradicts the fact that $v_{i}(\tau)>v_{i}(t_{0})$ . Thus, $v_{i}(t)\geq 0$ for all $t\in[0,T)$ . Now assume by contradiction that there exists $t\in(0,T)$ s.t. $v_{i}(t)=0$ . Then, $\dot{v}_{i}(t)=bS_{i}(x(t))>0$ . Thus, $\lim_{\delta\downarrow 0}\frac{v_{i}(t-\delta)}{-\delta}=bS_{i}(x(t))\,.$ In particular, there exists $\delta>0$ s.t. $v_{i}(t-\delta)\leq-\frac{\delta b}{2}S_{i}(x(t))\,.$ This contradicts the first point.

Recall the definitions of $V$ and $U$ from Eqs. (9) and (10). Clearly, $U_{\infty}(v):=\lim_{t\to\infty}U(t,v)=a(\varepsilon+\sqrt{v})$ is well defined for every $v\in[0,+\infty)^{d}$ . Hence, we can also define $V_{\infty}(z):=\lim_{t\to\infty}V(t,z)$ for every $z\in{{\mathcal{Z}}}_{+}$ .

Let Assumptions 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . Consider $(t,z)\in(0,+\infty)\times{{\mathcal{Z}}}_{+}^{*}$ and set $z=(x,m,v)$ . Then, $V$ and $V_{\infty}$ are differentiable at points $(t,z)$ and $z$ respectively. Moreover, $\langle\nabla V_{\infty}(z),h_{\infty}(z)\rangle\leq-\varepsilon\left\|\frac{am}{U_{\infty}(v)}\right\|^{2}\,$ and

We only prove the second point, the proof of the first point follows the same line. Consider $(t,z)\in(0,+\infty)\times{{\mathcal{Z}}}_{+}^{*}$ . We decompose $\langle\nabla V(t,z),(1,h(t,z))\rangle=\partial_{t}V(t,z)+\langle\nabla_{z}V(t,z),h(t,z)\rangle$ . After tedious but straightforward derivations, we get:

where $U(t,v_{i})=a(1-e^{-at})\left(\varepsilon+\sqrt{\frac{v_{i}}{1-e^{-bt}}}\right)$ and $\langle\nabla_{z}V(t,z),h(t,z)\rangle$ is equal to:

2 Proof of Th. 3.1

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . For every $z_{0}\in{{\mathcal{Z}}}_{0}$ , there exists a compact set $K\subset{{\mathcal{Z}}}_{+}$ s.t. for all $\eta\in[0,+\infty)$ , all $T\in(0,+\infty]$ and all $z\in Z_{T}^{\eta}(z_{0})$ , $\left\{\bar{e}(t+\eta,z(t)):t\in(0,T)\right\}\subset K\,.$ Moreover, choosing $z_{0}$ of the form $z_{0}=(x_{0},0,0)$ and $z(t)=(x(t),m(t),v(t))$ , it holds that $F(x(t))\leq F(x_{0})$ for all $t\in[0,T)$ .

Let $\eta\in[0,+\infty)$ . Consider a solution $z_{\eta}(t)=(x_{\eta}(t),m_{\eta}(t),v_{\eta}(t))$ as in the statement, defined on some interval $[0,T)$ . Define $\hat{m}_{\eta}(t)=m_{\eta}(t)/(1-e^{-a(t+\eta)})$ , $\hat{v}_{\eta}(t)=v_{\eta}(t)/(1-e^{-b(t+\eta)})$ . By Lemma 7.3, $t\mapsto V(t+\eta,z(t))$ is continuous on $[0,T)$ , and continuously differentiable on $(0,T)$ . By Lemma 7.5, $\dot{V}(t+\eta,z_{\eta}(t))\leq 0$ for all $t>0$ . As a consequence, $t\mapsto V(t+\eta,z_{\eta}(t))$ is non-increasing on $[0,T)$ . Thus, for all $t\geq 0$ , $F(x_{\eta}(t))\leq\lim_{t^{\prime}\downarrow 0}V(t^{\prime}+\eta,z_{\eta}(t^{\prime}))$ . Note that $V(t+\eta,z_{\eta}(t))\leq F(x_{\eta}(t))+\frac{1}{2}\sum_{i=1}^{d}\frac{m_{\eta,i}(t)^{2}}{a(1-e^{-a(t+\eta)})\varepsilon}\,.$ If $\eta>0$ , every term in the sum in the righthand side tends to zero, upon noting that $m_{\eta}(t)\to 0$ as $t\to 0$ . The statement still holds if $\eta=0$ . Indeed, by Lemma 7.1, for a given $i\in\{1,\dots,d\}$ , there exists $\delta>0$ s.t. for all $0<t<\delta$ , $m_{\eta,i}(t)^{2}\leq 2a^{2}(\partial_{i}F(x_{0}))^{2}t^{2}$ and $1-e^{-at}\geq(at)/2$ . As a consequence, each term of the sum is no larger than $4(\partial_{i}F(x_{0}))^{2}t/\varepsilon$ , which tends to zero as $t\to 0$ . We conclude that for all $t\geq 0$ , $F(x_{\eta}(t))\leq F(x_{0})$ . In particular, $\{x_{\eta}(t):t\in[0,T)\}\subset\{F\leq F(x_{0})\}$ , the latter set being bounded by Assumption 2.

As $V(t+\eta,z_{\eta}(t))\leq F(x_{0})$ , we obtain: $F(x_{0})\geq F(x_{\eta}(t))+\frac{1}{2}\left\|m_{\eta}(t)\right\|^{2}_{U(t+\eta,v_{\eta}(t))^{-1}}$ . Thus, $F(x_{0})\geq\inf F+\frac{1}{2a(\varepsilon+\sqrt{M})}\|m_{\eta}(t)\|^{2}\,$ . Therefore, $m_{\eta}(\,.\,)$ is bounded on $[0,T)$ , uniformly in $\eta$ . The same holds for $\hat{m}_{\eta}$ by using the mean value theorem in the same way as for $\hat{v}_{\eta}$ . The proof is complete.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . Let $K$ be a compact subset of ${{\mathcal{Z}}}_{+}$ . Then, there exists an other compact set $K^{\prime}\subset{{\mathcal{Z}}}_{+}$ s.t. for every $T\in(0,+\infty]$ and every $z\in Z_{T}^{\infty}(K)$ , $z([0,T))\subset K^{\prime}$ .

The proof follows the same line as Prop. 7.7 and is omitted.

For any $K\subset{{\mathcal{Z}}}_{+}$ , define $v_{\min}(K):=\inf\{v_{k}:(x,m,v)\in K,i\in\{1,\dots,d\}\}$ .

Under Assumptions 2, 2, 7.1 and 7.1, the following holds true.

For every compact set $K\subset{{\mathcal{Z}}}_{+}$ , there exists $c>0$ , s.t. for every $z\in Z^{\infty}_{\infty}(K)$ , of the form $z(t)=(x(t),m(t),v(t))$ , $v_{i}(t)\geq c\min\left(1,\frac{v_{\min}(K)}{2c}+t\right)\qquad(\forall t\geq 0,\forall i\in\{1,\dots,d\})\,.$

For every $z_{0}\in{{\mathcal{Z}}}_{0}$ , there exists $c>0$ s.t. for every $\eta\in[0,+\infty)$ and every $z\in Z_{\infty}^{\eta}(z_{0})$ , $v_{i}(t)\geq c\min(1,t)\qquad(\forall t\geq 0,\forall i\in\{1,\dots,d\})\,.$

Set $\kappa_{1}:=0.5(v_{\min}+bS_{\min}\tau)$ . Note that $v_{i}(\tau)\geq\kappa_{1}$ . Define $S_{\min}^{\prime}:=\inf\{S_{i}(x):i\in\{1,\dots,d\},(x,m,v)\in K^{\prime}\}\,.$ Note that $S_{\min}^{\prime}>0$ by Assumptions 7.1 and 2. Finally, define $\kappa=0.5\min(\kappa_{1},S_{\min}^{\prime})$ . By contradiction, assume that the set $\{t\geq\tau:v_{i}(t)<\kappa\}$ is non-empty, and denote by $\tau^{\prime}$ its infimum. It is clear that $\tau^{\prime}>\tau$ and $v_{i}(\tau^{\prime})=\kappa$ . Thus, $b^{-1}\dot{v}_{i}(\tau^{\prime})=S_{i}(x(\tau^{\prime}))-\kappa$ . We obtain that $b^{-1}\dot{v}_{i}(\tau^{\prime})\geq 0.5S_{\min}^{\prime}>0$ . As a consequence, there exists $t\in(\tau,\tau^{\prime})$ s.t. $v_{i}(t)<v_{i}(\tau^{\prime})$ . This contradicts the definition of $\tau^{\prime}$ . We have shown that for all $t\geq\tau$ , $v_{i}(t)\geq\kappa$ . Putting this together with Eq. (21) and using that $\kappa\leq v_{\min}+bS_{\min}\tau$ , we conclude that: $\forall t\geq 0,\ v_{i}(t)\geq\min\left(\kappa\,,\frac{v_{\min}}{2}+\frac{bS_{\min}t}{2}\right)\,.$ Setting $c:=\min(\kappa,bS_{\min}/2)$ , the result follows.

We prove the second point. By Prop. 7.7, there exists a compact set $K\subset{{\mathcal{Z}}}_{+}$ s.t. for every $\eta\geq 0$ , every $z\in Z_{\infty}^{\eta}(x_{0})$ of the form $z(t)=(x(t),m(t),v(t))$ satisfies $\{(x(t),\hat{m}(t),\hat{v}(t)):t\geq 0\}\subset K$ , where $\hat{m}(t)=m(t)/(1-e^{-a(t+h)})$ and $\hat{v}(t)=v(t)/(1-e^{-b(t+h)})$ . Denote by $L_{S}$ the Lipschitz constant of $S$ on the set $\{x:(x,m,v)\in K\}$ . Introduce the constants $M_{1}:=\sup\{\|m/(\varepsilon+\sqrt{v})\|_{\infty}:(x,m,v)\in K\}$ , $M_{2}:=\sup\{\|S(x)\|_{\infty}:(x,m,v)\in K^{\prime}\}$ . These constants being introduced, the rest of the proof follows the same line as the proof of the first point.

2.2 Existence

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . For every $z_{0}\in{{\mathcal{Z}}}_{+}$ , $Z_{\infty}^{\infty}(z_{0})\neq\emptyset$ . For every $(z_{0},\eta)\in{{\mathcal{Z}}}_{0}\times(0,+\infty)$ , $Z_{\infty}^{\eta}(z_{0})\neq\emptyset$ .

We prove the first point (the proof of the second point follows the same line). Under Assumptions 7.1 and 7.1, $h_{\infty}$ is continuous. Therefore, Cauchy-Peano’s theorem guarantees the existence of a solution to the (ODE) issued from $z_{0}$ , which we can extend over a maximal interval of existence $[0,T_{\max})$ .We conclude that the solution is global ( $T_{\max}=+\infty$ ) using the boundedness of the solution given by Prop. 7.9.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . Consider $z_{0}\in{{\mathcal{Z}}}_{0}$ . Denote by $(z_{\eta}:\eta\in(0,+\infty))$ a family of functions on $[0,+\infty)\to{{\mathcal{Z}}}_{+}$ s.t. for every $\eta>0$ , $z_{\eta}\in Z_{\infty}^{\eta}(z_{0})$ . Then, $(z_{\eta})_{\eta>0}$ is equicontinuous.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . For every $z_{0}\in{{\mathcal{Z}}}_{0}$ , $Z_{\infty}^{0}(z_{0})\neq\emptyset$ i.e., (ODE) admits a global solution issued from $z_{0}$ .

and the righthand side converges to zero. As a consequence, for all $t>s$ , $z(t)=z(s)+\int_{s}^{t}h(u,z(u))du\,.$ Moreover, $z(0)=z_{0}$ . This proves that $z\in Z^{0}_{\infty}(z_{0})$ .

2.3 Uniqueness

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume $b\leq 4a$ . For every $z_{0}\in{{\mathcal{Z}}}_{0}$ , $Z_{\infty}^{0}(z_{0})$ is a singleton i.e., there exists a unique global solution to (ODE) with initial condition $z_{0}$ .

Consider solutions $z$ and $z^{\prime}$ in $Z^{0}_{\infty}(z_{0})$ . We denote by $(x(t),m(t),v(t))$ the blocks of $z(t)$ , and we define $(x^{\prime}(t),m^{\prime}(t),v^{\prime}(t))$ similarly. For all $t>0$ , we define $\hat{m}(t):=m(t)/(1-e^{-at})$ , $\hat{v}(t):=v(t)/(1-e^{-bt})$ , and we define $\hat{m}^{\prime}(t)$ and $\hat{v}^{\prime}(t)$ similarly. By Prop. 7.7, there exists a compact set $K\subset{{\mathcal{Z}}}_{+}$ s.t. $(x(t),\hat{m}(t),\hat{v}(t))$ and $(x^{\prime}(t),\hat{m}^{\prime}(t),\hat{v}^{\prime}(t))$ are both in $K$ for all $t>0$ . We denote by $L_{S}$ and $L_{\nabla F}$ the Lipschitz constants of $S$ and $\nabla F$ on the compact set $\{x:(x,m,v)\in K\}$ . These constants are finite by Assumptions 7.1 and 7.1. We define $M:=\sup\{\|m\|_{\infty}:(x,m,v)\in K\}$ . Define $u_{x}(t):=\|x(t)-x^{\prime}(t)\|^{2}$ , $u_{m}(t):=\|\hat{m}(t)-\hat{m}^{\prime}(t)\|^{2}$ and $u_{v}(t):=\|\hat{v}(t)-\hat{v}^{\prime}(t)\|^{2}$ . Let $\delta>0$ . Define: $u^{(\delta)}(t):=u_{x}(t)+\delta u_{m}(t)+\delta u_{v}(t)\,.$ By the chain rule and the Cauchy-Schwarz inequality, $\dot{u}_{x}(t)\leq 2\|x(t)-x^{\prime}(t)\|\|\frac{\hat{m}(t)}{\varepsilon+\sqrt{\hat{v}(t)}}-\frac{\hat{m}^{\prime}(t)}{\varepsilon+\sqrt{\hat{v}^{\prime}(t)}}\|$ . Thus, using Lemma 7.11, there exists $c>0$ s.t.

For any $\delta>0$ , $2\|x(t)-x^{\prime}(t)\|\,\|\hat{m}(t)-\hat{m}^{\prime}(t)\|\leq\delta^{-1/2}(u_{x}(t)+\delta u_{m}(t))\leq\delta^{-1/2}u^{(\delta)}(t)$ . Similarly, $2\|x(t)-x^{\prime}(t)\|\,\|\hat{v}(t)-\hat{v}^{\prime}(t)\|\leq\delta^{-1/2}u^{(\delta)}(t)$ . Thus, for any $\delta>0$ ,

We now study $u_{m}(t)$ . For all $t>0$ , we obtain after some algebra: $\frac{d}{dt}\hat{m}(t)=a(\nabla F(x(t))-\hat{m}(t))/(1-e^{-at})\,.$ Therefore, $\dot{u}_{m}(t)\leq\frac{2aL_{\nabla F}}{1-e^{-at}}\|\hat{m}(t)-\hat{m}^{\prime}(t)\|\,\|x(t)-x^{\prime}(t)\|\,.$ For any $\theta>0$ , it holds that $2\|\hat{m}(t)-\hat{m}^{\prime}(t)\|\,\|x(t)-x^{\prime}(t)\|\leq\theta u_{x}(t)+\theta^{-1}u_{m}(t)$ . In particular, letting $\theta:=2L_{\nabla F}$ , we obtain that for all $\delta>0$ ,

where the last inequality is due to the fact that $y/(1-e^{-y})\leq 1+y$ for all $y>0$ . Using the exact same arguments, we also obtain that

We now choose any $\delta$ s.t. $4\delta\leq 1/\max(L_{S}^{2},L_{\nabla F}^{2})$ . Then, Eq. (23) and (24) respectively imply that $\delta\dot{u}_{m}(t)\leq 0.5(a+t^{-1})u^{(\delta)}(t)$ and $\delta\dot{u}_{v}(t)\leq 0.5(b+t^{-1})u^{(\delta)}(t)$ . Summing these inequalities along with Eq. (22), we obtain that for every $t>0$ , $\dot{u}^{(\delta)}(t)\leq\psi(t)u^{(\delta)}(t)\,$ , where: $\psi(t):=\frac{a+b}{2}+\frac{1}{\varepsilon\sqrt{\delta}}+\frac{M}{2\varepsilon^{2}\sqrt{\delta c\min(1,t)}}+\frac{1}{t}\,.$ From Grönwall’s inequality, it holds that for every $t>s>0$ , $u^{(\delta)}(t)\leq u^{(\delta)}(s)\exp\left(\int_{s}^{t}\psi(s^{\prime})ds^{\prime}\right)\,$ . We first consider the case where $t\leq 1$ . We set $c_{1}:=(a+b)/2+(\varepsilon\sqrt{\delta})^{-1}$ and $c_{2}:=M/(\varepsilon^{2}\sqrt{\delta c})$ . With these notations, $\int_{s}^{t}\psi(s^{\prime})ds^{\prime}\leq c_{1}t+c_{2}\sqrt{t}+\ln\frac{t}{s}\,.$ Therefore, $u^{(\delta)}(t)\leq\frac{u^{(\delta)}(s)}{s}\exp\left(c_{1}t+c_{2}\sqrt{t}+\ln t\right)\,$ . By Lemma 7.1, recall that $\dot{x}(0)$ and $\dot{x}^{\prime}(0)$ are both well defined (and coincide). Thus,

By Lemma 7.1, $z$ and $z^{\prime}$ are differentiable at point zero. Then, the above inequality gives $\limsup_{s\downarrow 0}\frac{|\hat{m}_{i}(s)-\hat{m}_{i}^{\prime}(s)|}{s}\leq 4(L_{\nabla F}\vee 1)\|\dot{z}(0)\|$ and $\limsup_{s\downarrow 0}\frac{u_{m}(s)}{s^{2}}\leq 16d(L_{\nabla F}^{2}\vee 1)\|\dot{z}(0)\|^{2}\,.$ Therefore, $u_{m}(s)/s$ converges to zero as $s\downarrow 0$ . By similar arguments, it can be shown that $\limsup_{s\downarrow 0}{u_{v}(s)}/{s^{2}}\leq 16d(L_{S}^{2}\vee 1)\|\dot{z}(0)\|^{2}$ , thus $\lim u_{v}(s)/s=0$ . Finally, we obtain that ${u^{(\delta)}(s)}/{s}$ converges to zero as $s\downarrow 0$ . Letting $s$ tend to zero, we obtain that for every $t\leq 1$ , $u^{(\delta)}(t)=0$ . Setting $s=1$ and $t>1$ , and noting that $\psi$ is integrable on $[1,t]$ , it follows that $u^{(\delta)}(t)=0$ for all $t>1$ . This proves that $z=z^{\prime}$ .

We recall that a semiflow $\Phi$ on the space $(E,{{\mathsf{d}}})$ is a continuous map $\Phi$ from $[0,+\infty)\times E$ to $E$ defined by $(t,x)\mapsto\Phi(t,x)=\Phi_{t}(x)$ such that $\Phi_{0}$ is the identity and $\Phi_{t+s}=\Phi_{t}\circ\Phi_{s}$ for all $(t,s)\in[0,+\infty)^{2}$ .

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . The map $Z_{\infty}^{\infty}$ is single-valued on ${{\mathcal{Z}}}_{+}\to C([0,+\infty),{{\mathcal{Z}}}_{+})$ i.e., there exists a unique global solution to (ODE∞) starting from any given point in ${{\mathcal{Z}}}_{+}$ . Moreover, the following map is a semiflow:

The result is a direct consequence of Lemma 7.19.

3 Proof of Th. 3.2

Consider a semiflow $\Psi$ on $(E,d)$ and a map $z:[0,+\infty)\to E$ . Assume the following:

$\Psi$ admits a strict Lyapunov function ${{\mathsf{V}}}$ .

The set $\Lambda_{\Psi}$ of equilibrium points of $\Psi$ is compact.

${{\mathsf{V}}}(\Lambda_{\Psi})$ has an empty interior.

Then, $\bigcap_{t\geq 0}\overline{z([t,\infty))}$ is a compact connected subset of $\Lambda_{\Psi}$ .

For every $\delta>0$ and every $z=(x,m,v)\in{{\mathcal{Z}}}_{+}$ , define:

where we recall that $V_{\infty}(z):=\lim_{t\to\infty}V(t,z)$ for every $z\in{{\mathcal{Z}}}_{+}$ and $V$ is defined by Eq.(9). Consider the set ${{\mathcal{E}}}:=h_{\infty}^{-1}(\{0\})$ of all equilibrium points of (ODE∞), namely: ${{\mathcal{E}}}=\{(x,m,v)\in{{\mathcal{Z}}}_{+}:\nabla F(x)=0,m=0,v=S(x)\}\,$ . The set ${{\mathcal{E}}}$ is non-empty by Assumption 2.

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . Let $K\subset{{\mathcal{Z}}}_{+}$ be a compact set. Define $K^{\prime}:=\overline{\{\Phi(t,z):t\geq 0,z\in K\}}$ . Let $\overline{\Phi}:[0,+\infty)\times K^{\prime}\to K^{\prime}$ be the restriction of the semiflow $\Phi$ to $K^{\prime}$ i.e., $\overline{\Phi}(t,z)=\Phi(t,z)$ for all $t\geq 0,z\in K^{\prime}$ . Then,

$\overline{\Phi}$ is well defined and is a semiflow on $K^{\prime}$ .

The set of equilibrium points of $\overline{\Phi}$ is equal to ${{\mathcal{E}}}\cap K^{\prime}$ .

There exists $\delta>0$ s.t. $W_{\delta}$ is a strict Lyapunov function for the semiflow $\overline{\Phi}$ .

Denote by $L_{S}$ the Lipschitz constant of $S$ on $\{x:(x,m,v)\in K^{\prime}\}$ . For every $t>0$ ,

Using that $2\|m(t)\|\|S(x(t))-v(t)\|\leq\frac{L_{S}}{b\varepsilon}\|m(t)\|^{2}+\frac{b\varepsilon}{L_{S}}\|S(x(t))-v(t)\|^{2}$ , we obtain ${{\mathcal{L}}}_{H}(t)\leq-b\|S(x(t))-v(t)\|^{2}+\frac{L_{S}^{2}}{b\varepsilon^{2}}\|m(t)\|^{2}\,.$ Hence, for every $t>0$ ,

where $M(\delta):=\varepsilon(\varepsilon+\sqrt{C_{1}})^{-2}-\frac{\delta L_{S}^{2}}{b\varepsilon^{2}}-\delta\left(\frac{a}{2}+\frac{L_{\nabla F}}{\varepsilon}\right)\,.$ Choosing $\delta$ s.t. $M(\delta)>0$ ,

where $c:=\min\{M(\delta),\frac{a\delta}{2},\delta b\}$ . It can easily be seen that for every $z\in K^{\prime}$ , $t\mapsto W_{\delta}(\overline{\Phi}_{t}(z))$ is Lipschitz continuous, hence absolutely continuous. Its derivative almost everywhere coincides with ${{\mathcal{L}}}_{W_{\delta}}$ , which is non-positive. Thus, $W_{\delta}$ is a Lyapunov function for $\overline{\Phi}$ . We prove that the Lyapunov function is strict. Consider $z\in K^{\prime}$ s.t. $W_{\delta}(\overline{\Phi}_{t}(z))=W_{\delta}(z)$ for all $t>0$ . The derivative almost everywhere of $t\mapsto W_{\delta}(\overline{\Phi}_{t}(z))$ is identically zero, and by Eq. (27), this implies that $-c\left(\|m_{t}\|^{2}+\|\nabla F(x_{t})\|^{2}+\|S(x_{t})-v_{t}\|^{2}\right)$ is equal to zero for every $t$ a.e. (hence, for every $t$ , by continuity of $\overline{\Phi}$ ). In particular for $t=0$ , $m=\nabla F(x)=0$ and $S(x)-v=0$ . Hence, $z\in h_{\infty}^{-1}(\{0\})$ .

Let Assumptions 2, 2, 7.1 and 7.1 hold. Assume that $0<b\leq 4a$ . For every $z\in{{\mathcal{Z}}}_{+}$ , $\lim_{t\to\infty}{{\mathsf{d}}}(\Phi(z,t),{{\mathcal{E}}})=0\,.$

Use Prop. 7.24 with $K:=\{z\}$ . and [15, Th. 2.1.7].

3.2 Asymptotic Behavior of the Solution to (ODE)

Let Assumptions 2, 2, 7.1 and 7.1 hold true. Assume that $0<b\leq 4a$ . Then, for every $z_{0}\in{{\mathcal{Z}}}_{0}$ , $Z^{0}_{\infty}(z_{0})$ is an asymptotic pseudotrajectory of the semiflow $\Phi$ given by (25).

Consider $z_{0}\in{{\mathcal{Z}}}_{0}$ , $T\in(0,+\infty)$ and define $z:=Z_{\infty}^{0}(z_{0})$ . Consider $t\geq 1$ . For every $s\geq 0$ , define $\Delta_{t}(s):=\|z(t+s)-\Phi(z(t))(s)\|$ . The aim is to prove that $\sup_{s\in[0,T]}\Delta_{t}(s)$ tends to zero as $t\to\infty$ . Putting together Prop. 7.7 and Lemma 7.11, the set $K:=\overline{\{z(t):t\geq 1\}}$ is a compact subset of ${{\mathcal{Z}}}_{+}^{*}$ . Define $C(t):=\sup_{s\geq 0}\sup_{z^{\prime}\in K}\|h(t+s,z^{\prime})-h_{\infty}(z^{\prime})\|$ . It can be shown that $\lim_{t\to\infty}C(t)=0$ . We obtain that for every $s\in[0,T]$ , $\Delta_{t}(s)\leq TC(t)+\int_{0}^{s}\|h_{\infty}(z(t+s^{\prime}))-h_{\infty}(\Phi(z(t))(s^{\prime}))\|ds^{\prime}\,.$ By Lemma 7.11, $K^{\prime}:=\overline{\bigcup_{z^{\prime}\in\Phi(K)}z^{\prime}([0,\infty))}$ is a compact subset of ${{\mathcal{Z}}}_{+}^{*}$ . It is immediately seen from the definition that $h_{\infty}$ is Lipschitz continuous on every compact subset of ${{\mathcal{Z}}}_{+}^{*}$ , hence on $K\cup K^{\prime}$ . Therefore, there exists a constant $L$ , independent from $t,s$ , s.t. $\Delta_{t}(s)\leq TC(t)+\int_{0}^{s}L\Delta_{t}(s^{\prime})ds^{\prime}\qquad(\forall t\geq 1,\forall s\in[0,T])\,.$ Using Grönwall’s lemma, it holds that for all $s\in[0,T]$ , $\Delta_{t}(s)\leq TC(t)e^{Ls}\,.$ As a consequence, $\sup_{s\in[0,T]}\Delta_{t}(s)\leq TC(t)e^{LT}$ and the righthand side converges to zero as $t\to\infty$ .

End of the Proof of Th. 3.2

By Prop. 7.7, the set $K:=\overline{Z^{0}_{\infty}(z_{0})([0,\infty))}$ is a compact subset of ${{\mathcal{Z}}}_{+}$ . Define $K^{\prime}:=\overline{\{\Phi(t,z):t\geq 0,z\in K\}}\,,$ and let $\overline{\Phi}:[0,+\infty)\times K^{\prime}\to K^{\prime}$ be the restriction $\Phi$ to $K^{\prime}$ . By Prop. 7.24, there exists $\delta>0$ s.t. $W_{\delta}$ is a strict Lyapunov function for the semiflow $\overline{\Phi}$ . Moreover, the set of equilibrium points coincides with ${{\mathcal{E}}}\cap K^{\prime}$ . In particular, the equilibrium points of $\overline{\Phi}$ form a compact set. By Prop. 7.28, $Z^{0}_{\infty}(z_{0})$ is an APT of $\overline{\Phi}$ . Note that every $z\in{{\mathcal{E}}}$ can be written under the form $z=(x,0,S(x))$ for some $x\in{{\mathcal{S}}}$ . From the definition of $W_{\delta}$ in (26), $W_{\delta}(z)=W_{\delta}(x,0,S(x))=V_{\infty}(x,0,S(x))=F(x)$ . Since $F({{\mathcal{S}}})$ is assumed to have an empty interior, the same holds for $W_{\delta}({{\mathcal{E}}}\cap K^{\prime})$ . By Prop. 7.23, $\bigcap_{t\geq 0}\overline{Z^{0}_{\infty}(z_{0})([t,\infty))}\subset{{\mathcal{E}}}\cap K^{\prime}\,.$ The set in the righthand side coincides with the set of limits of convergent sequences of the form $Z^{0}_{\infty}(z_{0})(t_{n})$ for $t_{n}\to\infty$ . As $Z^{0}_{\infty}(z_{0})([0,\infty))$ is a bounded set, ${{\mathsf{d}}}(Z^{0}_{\infty}(z_{0})(t),{{\mathcal{E}}})$ tends to zero.

4 Proof of Th. 3.3

The proof follows the path of [16, Th. 10.1.6, Th. 10.2.3], but requires specific adaptations to deal with the dynamical system at hand. Define for all $\delta>0$ , $t>0$ , and $z=(x,m,v)$ ,

Upper-bound on $w_{\delta}(t)$ . From Eq. (9), we obtain that for every $t\geq 1$ , $V(t,z(t))\leq|F(x(t))|+\frac{\|m(t)\|^{2}}{2a\varepsilon(1-e^{-a})}\,.$ Using $\langle\nabla F(x),m\rangle\leq(\|\nabla F(x)\|^{2}+\|m\|^{2})/2$ , we obtain that there exists a constant $c_{1}$ (depending on $\delta$ ) s.t. for every $t\geq 1$ ,

Upper-bound on $\frac{d}{dt}w_{\delta}(t)$ . The function $w_{\delta}$ is absolutely continuous on $[1,+\infty)$ . Moreover, there exist $\delta>0$ , $c_{2}>0$ (both depending on $x_{0}$ ) s.t. for every $t\geq 1$ a.e.,

The proof of Eq. (30) uses arguments that are similar to the ones used in the proof of Prop. 7.24 (just use Lemma 7.5 to bound the derivative of the first term in Eq. (28)). For this reason, it is omitted.

Putting together (29) and (30) using the Łojasiewicz condition. By Prop. 7.23 and 7.28, the set $L:=\overline{\bigcup_{s\geq 0}\{z(t):t\geq s\}}\,$ is a compact connected subset of ${{\mathcal{E}}}=\{(x,0,S(x)):\nabla F(x)=0\}$ . The set $\mathcal{U}:=\{x:(x,0,S(x))\in L\}$ is a compact and connected subset of ${{\mathcal{S}}}$ . Using Assumption 3.3 and [16, Lemma 2.1.6], there exist constants $\sigma,c>0$ and $\theta\in(0,\frac{1}{2}]$ , s.t. $\|\nabla F(x)\|\geq c|F(x)|^{1-\theta}\,$ for all $x$ s.t. ${{\mathsf{d}}}(x,\mathcal{U})<\sigma\,$ . As ${{\mathsf{d}}}(x(t),\mathcal{U})\to 0$ , there exists $T\geq 1$ s.t. for all $t\geq T$ , $\|\nabla F(x(t))\|\geq c|F(x(t))|^{1-\theta}$ . Thus, we may replace the term $\|\nabla F(x(t))\|^{2}$ in the righthand side of Eq. (30) using $\|\nabla F(x(t))\|^{2}\geq\frac{1}{2}\|\nabla F(x(t))\|^{2}+\frac{1}{2}|F(x(t))|^{2(1-\theta)}$ . Upon noting that $2(1-\theta)\geq 1$ , we thus obtain that there exists a constant $c_{3}$ and some $T^{\prime}\geq 1$ s.t. for $t\geq T^{\prime}$ a.e.,

Putting this inequality together with Eq. (29), we obtain that for some constant $c_{4}>0$ and for all $t\geq T^{\prime}$ a.e., $\frac{d}{dt}w_{\delta}(t)\leq-c_{4}w_{\delta}(t)^{2(1-\theta)}\,.$

End of the proof. Following the arguments of [16, Th. 10.1.6], by integrating the preceding inequality, over $[T^{\prime},t]$ , we obtain $w_{\delta}(t)\leq c_{5}t^{-\frac{1}{1-2\theta}}$ for $t\geq T^{\prime}$ in the case where $\theta<\frac{1}{2}$ , whereas $w_{\delta}(t)$ decays exponentially if $\theta=\frac{1}{2}$ . From now on, we focus on the case $\theta<\frac{1}{2}$ . By definition of (ODE), $\|\dot{x}(t)\|^{2}\leq\|m(t)\|^{2}/((1-e^{-aT^{\prime}})^{2}\varepsilon^{2})$ for all $t\geq T^{\prime}$ . Since Eq. (30) implies $\|m(t)\|^{2}\leq-\dot{w}_{\delta}(t)/c_{2}$ , we deduce that there exists $c,c^{\prime}>0$ s.t. for all $t\geq T^{\prime}$ , $\int_{t}^{2t}\|\dot{x}(s)\|^{2}ds\leq cw_{\delta}(t)\leq c^{\prime}t^{-\frac{1}{1-2\theta}}\,.$ Applying [16, Lemma 2.1.5], it follows that $\int_{t}^{\infty}\|\dot{x}(s)\|^{2}ds\leq ct^{-\frac{\theta}{1-2\theta}}$ for some other constant $c$ . Therefore $x^{*}:=\lim_{t\to+\infty}x(t)$ exists by Cauchy’s criterion and for all $t\geq T^{\prime}$ , $\|x(t)-x^{*}\|\leq ct^{-\frac{\theta}{1-2\theta}}$ . Finally, since $x(t)\to a$ , we remark that, using the same arguments, the global Łojasiewicz exponent $\theta$ can be replaced by any Łojasiewicz exponent of $f$ at $x^{*}$ . When $\theta=\frac{1}{2}$ , the proof follows the same line.

Proofs of Section 4

Let Assumptions 2.1, 2.2 and 4 hold true. There exists $\bar{\gamma}_{0}>0$ s.t. for every $R>0$ , there exists $s>0$ ,

Let $R>0$ . We denote by $(H_{\gamma,{\mathsf{x}}},H_{\gamma,{\mathsf{m}}},H_{\gamma,{\mathsf{v}}})$ the block components of $H_{\gamma}$ . There exists a constant $C_{s}$ depending only on $s$ s.t. $\|H_{\gamma}\|^{1+s}\leq C_{s}(\|H_{\gamma,{\mathsf{x}}}\|^{1+s}+\|H_{\gamma,{\mathsf{m}}}\|^{1+s}+\|H_{\gamma,{\mathsf{v}}}\|^{1+s})$ . Hence, it is sufficient to prove that Eq. (31) holds respectively when replacing $H_{\gamma}$ with each of $H_{\gamma,{\mathsf{x}}},H_{\gamma,{\mathsf{m}}},H_{\gamma,{\mathsf{v}}}$ . Consider $z=(x,m,v)$ in ${{\mathcal{Z}}}_{+}$ . We write: $\|H_{\gamma,{\mathsf{x}}}(n+1,z,\xi)\|\leq\varepsilon^{-1}(\|\frac{m}{1-\bar{\alpha}(\gamma)^{n}}\|+\|\nabla f(x,\xi)\|)\,.$ Thus, for every $z$ s.t. $\|e_{\gamma}(n,z)\|\leq R$ , there exists a constant $C$ depending only on $\varepsilon$ , $R$ and $s$ s.t. $\|H_{\gamma,{\mathsf{x}}}(n+1,z,\xi)\|^{1+s}\leq C(1+\|\nabla f(x,\xi)\|^{1+s})$ . By Assumption 4, (31) holds for $H_{\gamma,{\mathsf{x}}}$ instead of $H_{\gamma}$ . Similar arguments hold for $H_{\gamma,{\mathsf{m}}}$ and $H_{\gamma,{\mathsf{v}}}$ upon noting that the functions $\gamma\mapsto(1-\bar{\alpha}(\gamma))/\gamma$ and $\gamma\mapsto(1-\bar{\beta}(\gamma))/\gamma$ are bounded under Assumption 2.2.

Choose $\bar{\gamma}_{0}>0$ and $s>0$ as in Lemma 8.1. For every $\gamma\leq\bar{\gamma}_{0}$ ,

By Lemma 8.1, the righthand side is finite and does not depend on $(n,\gamma)$ .

For every $\gamma,R>0$ , we define ${\mathsf{z}}^{\gamma,R}:={\mathsf{X}}_{\gamma}(z^{\gamma,R})={\mathsf{X}}_{\gamma}\circ B_{R}(z^{\gamma})$ . Namely, ${\mathsf{z}}^{\gamma,R}$ is the interpolated process associated with the sequence $(z^{\gamma,R}_{n})$ . It is a random variable on $C([0,+\infty),{{\mathcal{Z}}})$ . We recall that ${{\mathcal{F}}}_{n}$ is the $\sigma$ -algebra generated by the r.v. $(\xi_{k}:1\leq k\leq n)$ . For every $\gamma,n,R$ , we use the notation: $\Delta_{0}^{\gamma,R}:=0$ and

It is an immediate consequence of Lemma 8.3 and [6, Lemma 6.2]

The proof of the following lemma is omitted.

By Eq. (34), the RHS of the above inequality tends to zero as $\gamma\to 0$ . The proof is complete.

2 Proof of Th. 13

In the sequel, we will use the notation $\bar{P}^{\gamma,x}$ as a shorthand notation for $\bar{P}^{\gamma,\delta_{x}}$ where $\delta_{x}$ is the Dirac measure at some point $x\in{\mathsf{X}}$ . Finally, let $\Psi$ be a semiflow on ${\mathsf{X}}$ . A Markov kernel $P$ is Feller if $Pf$ is continuous for every bounded continuous $f$ . {assumption} Let $\nu\in{{\mathcal{P}}}({\mathsf{X}})$ .

For every $\gamma$ , $\bar{P}_{\gamma}$ is Feller.

For every $\delta>0$ , for every compact set $K\subset{\mathsf{X}}$ , for every $t>0$ , $\lim_{\gamma\to 0}\sup_{x\in K}\bar{P}^{\gamma,x}\left(\|X_{\lfloor t/\gamma\rfloor}-\Psi_{t}(x)\|>\delta\right)=0\,.$

Let $BC_{\Psi}$ be the Birkhof center of $\Psi$ i.e., the closure of the set of recurrent points.

We omit the proof of this result which follows a similar reasoning to [6, Th. 5.5 and Proof in section 8.4] and makes use of results from .

Proofs of Section 5

The following lemma will be useful in the proofs.

Let the sequence $(r_{n})$ be defined as in Algorithm 2. Assume that $0\leq\alpha_{n}\leq 1$ for all $n$ and that $(1-\alpha_{n})/\gamma_{n}\to a>0$ as $n\to+\infty$ . Then,

The sequence $(r_{n})$ is nondecreasing and converges to $1$ .

Under Assumption 5.3 1, for every $\epsilon>0$ , for sufficiently large $n$ , we have $r_{n}-1\leq e^{-\frac{a\gamma_{0}}{2(1-\kappa)}n^{1-\kappa}}$ if $\kappa\in(0,1)$ and $r_{n}-1\leq n^{-a\gamma_{0}/(1+\epsilon)}$ if $\kappa=1$ .

A similar lemma holds for the sequence $(\bar{r}_{n})$ .

We define $\bar{z}_{n}=(x_{n-1},m_{n},v_{n})$ (note the shift in the index of the variable $x$ ). We have

where $h_{\infty}$ is defined in Eq. (18) and where we set

and $\varsigma_{n+1}=(\varsigma_{n+1}^{x},\varsigma_{n+1}^{m},\varsigma_{n+1}^{v})$ with the components defined by: $\varsigma_{n+1}^{x}=\frac{m_{n}}{\varepsilon+\sqrt{v_{n}}}-\frac{\gamma_{n}}{\gamma_{n+1}}\frac{\hat{m}_{n}}{\varepsilon+\sqrt{\hat{v}_{n}}}$ , $\varsigma_{n+1}^{m}=\left(\frac{1-\alpha_{n+1}}{\gamma_{n+1}}-a\right)(\nabla F(x_{n})-m_{n})+a(\nabla F(x_{n})-\nabla F(x_{n-1}))$ and $\varsigma_{n+1}^{v}=\left(\frac{1-\beta_{n+1}}{\gamma_{n+1}}-b\right)(S(x_{n})-v_{n})+b(S(x_{n})-S(x_{n-1}))$ . We prove that $\varsigma_{n}\to 0$ a.s. Using the triangular inequality,

(where $\tau_{n}=\sum_{k=0}^{n}\gamma_{k}$ ), is almost surely a bounded APT of the semiflow $\bar{\Phi}$ defined by (ODE∞). The proof is concluded by applying Prop. 7.23 and Prop. 7.24.

2 Proof of Prop. 5.2

As $\inf F>-\infty$ , one can assume without loss of generality that $F\geq 0$ . In the sequel, $C$ denotes some positive constant which may change from line to line. We define $a_{n}:=(1-\alpha_{n+1})/\gamma_{n}$ and $P_{n}:=\frac{1}{2a_{n}r_{n}}\langle m_{n}^{\odot 2},\frac{1}{\varepsilon+\sqrt{\hat{v}_{n}}}\rangle$ . We have $a_{n}\to a$ and $r_{n}\to 1$ . By Assumption 5.1-1,

We set $u_{n}:=1-\frac{a_{n+1}}{a_{n}}$ and $D_{n}:=\frac{r_{n}^{-1}}{\varepsilon+\sqrt{\hat{v}_{n}}}$ , so that $P_{n}=\frac{1}{2a_{n}}\langle D_{n},m_{n}^{\odot 2}\rangle$ . We can write:

We estimate the vector $D_{n+1}-D_{n}$ . Using that $(r_{n}^{-1})$ is non-increasing,

Remarking that $v_{n+1}\geq\beta_{n+1}v_{n}$ , recalling that $(\bar{r}_{n})$ is nondecreasing and using the update rules of $v_{n}$ and $\bar{r}_{n}$ , we obtain after some algebra

It is easy to see that $c_{n}/\gamma_{n}\to b/2$ . Thus, for any $\delta>0$ , $c_{n+1}\leq(b+2\delta)\gamma_{n}/2$ for all $n$ large enough. Using also that $\sqrt{\hat{v}_{n+1}}/(\varepsilon+\sqrt{\hat{v}_{n+1}})\leq 1$ , we obtain that $D_{n+1}-D_{n}\leq\frac{b+2\delta}{2}\gamma_{n}D_{n}\,.$ Substituting this inequality in Eq. (36), we get

As $a_{n}\to a$ , we have $a_{n}-\frac{b+2\delta}{4}\geq a-\frac{b+\delta}{4}$ for all $n$ large enough. Hence,

Using the Cauchy-Schwartz inequality and Assumption 5.1 2, it is easy to show the inequality $\langle\nabla F(x_{n}),\frac{\hat{m}_{n}}{\varepsilon+\sqrt{\hat{v}_{n}}}\rangle\leq C(1+F(x_{n})+P_{n})$ . Moreover, using the componentwise inequality $(\nabla f_{n+1}-m_{n})^{\odot 2}\leq 2\nabla f_{n+1}^{\odot 2}+2m_{n}^{\odot 2}$ along with Assumption 5.1 2, we obtain

Putting all pieces together with Eq. (35),

Define $V_{n}:=(1-C\gamma_{n-1}^{2})F(x_{n-1})+(1-u_{n-1})P_{n}$ where the constant $C$ is fixed so that Eq. (38) holds. Then,

By Assumption 5.1, $\lim\sup_{n}u_{n-1}/\gamma_{n}<2a-b/2$ and for $\delta$ small enough, we obtain

3 Proof of Th. 5.3

We use [20, Th. 1]. All the assumptions in the latter can be verified in our case, at the exception of a positive definiteness condition on the limiting covariance matrix, which corresponds, in our case, to the matrix $Q$ given by Eq. (16). As $Q$ is not positive definite, it is strictly speaking not possible to just cite and apply [20, Th. 1]. Nevertheless, a detailed inspection of the proofs of shows that only a minor adaptation is needed in order to cover the present case. Therefore, proving the convergence result of from scratch is worthless. It is sufficient to verify the assumptions of [20, Th. 1] (except the definiteness of $Q$ ) and then to point out the specific part of the proof of which requires some adaptation.

Let $z_{n}=(x_{n},m_{n},v_{n})$ be the output of Algorithm 2. Define $z^{*}=(x^{*},0,S(x^{*}))$ . Define $\eta_{n+1}:=(0,a(\nabla f_{n+1}-\nabla F(x_{n})),b(\nabla f_{n+1}^{\odot 2}-S(x_{n})))$ . We have

where $\epsilon_{n+1}:=(\epsilon_{n+1}^{1},\epsilon_{n+1}^{2},\epsilon^{3}_{n+1})$ , whose components are given by

Here, $\eta_{n+1}$ is a martingale increment noise and $\epsilon_{n+1}=(\epsilon_{n+1}^{1},\epsilon_{n+1}^{2},\epsilon^{3}_{n+1})$ is a remainder term. The aim is to check the assumptions (A1.1) to (A1.3) of , where the role of the quantities ( $h$ , $\varepsilon_{n}$ , $r_{n}$ , $\sigma_{n}$ , $\alpha$ , $\rho$ , $\beta$ ) in is respectively played by the quantities ( $h_{\infty}$ , $\eta_{n}$ , $\epsilon_{n}$ , $\gamma_{n}$ , $\kappa$ , $1$ , $1$ ) of the present paper.

It is sufficient to verify the latter for $\epsilon^{i}_{n}$ ( $i=1,2,3$ ) in place of $\epsilon_{n}$ . The map $(m,v)\mapsto m/(\varepsilon+\sqrt{v})$ is Lipschitz continuous in a neighborhood of $(0,S(x^{*}))$ by Assumption 2. Thus, for $M$ small enough, there exists a constant $C$ s.t. if $\|z_{n}-z^{*}\|\leq M$ , then $\|\epsilon_{n+1}^{1}\|\leq C\left\|r_{n+1}^{-1}m_{n+1}-m_{n}\right\|+C\left\|\bar{r}_{n+1}^{-1}v_{n+1}-v_{n}\right\|\,.$ Using the triangular inequality and the fact that $r_{n+1},\bar{r}_{n+1}$ are bounded sequences away from zero, there exists an other constant $C$ s.t.

Using Lemma 9.1 under Assumption 5.3 (note that $\gamma_{0}>1/2L\geq 1/a$ when $\kappa=1$ ), we obtain that the sequence $|r_{n}-1|/\gamma_{n}$ is bounded, thus $|r_{n+1}-1|\leq C\gamma_{n+1}$ . The sequence $(1-\alpha_{n})/\gamma_{n}$ being also bounded, it holds that