A general system of differential equations to model first order adaptive algorithms

André Belotto da Silva, Maxime Gazeau

Introduction

Optimization is at the core of many machine learning problems. Estimating the model parameters can often be formulated in terms of an unconstrained optimization problem of the form

The emergence of deep learning has spawned the recent popularity of a special class of optimizers to solve (1.1): first order adaptive optimization algorithms (RMSprop , Adagrad, Adadelta , Adam ) were originally designed to solve unconstrained optimization problem (minimizing an empirical risk in supervised learning). It is commonly observed that the value of the training loss decays faster than for stochastic gradient descent and they have become a default method of choice for training feed-forward and recurrent neural network . However, recent research paper suggests to not use Adam as it diverges for a simple example. As of today, there is no consensus on the benefit of adaptive algorithms over other methods and no guidance on how to choose the many hyper-parameters of the model.

Despite its obvious efficiency in deep learning, the reasons of their success are unclear and a large number of fundamental questions are still unanswered. Our work started from the belief that these algorithms are not intrinsically better than gradient descent but rather well suited to the subclass of non-convex function given by standard deep learning architecture. Studying the convergence of the discrete and stochastic adaptive algorithms for non-convex functional is far too complex and general to get insightful explanation about their efficiency in deep learning. We, therefore, start by studying a deterministic and continuous equation, and we prove that in simple cases (such as convex functional), adaptive algorithm are not converging faster than gradient descent. In particular, the key insights of our analysis are:

The convergence rate is nonlinear –in the sense that it depends on the variables– and depends on the history of the dynamics. Initialization is therefore of crucial importance.

With the standard choices of hyperparameters, adaptivity degrades the rate of convergence to the global minimum of a convex function compared to gradient descent.

These observations are crucial to unwind the mystery of adaptive algorithms and the next questions to ask are now obvious:

Does adaptivity reduces the variance (compared to SGD) and speed up the training for convex functional?

Is the fast training observed in deep learning induced by the specificity of the loss surface and common initialization scheme for the weights?

The main contribution of this paper is to provide a theoretical framework to study deterministic adaptive algorithms. Inspired by the history of gradient descent and stochastic gradient descent, we analyse discrete adaptive optimization algorithms by introducing their continuous time counterparts (equation (2.1)), with a focus on Adam (equation (4.3)). The techniques and analysis are similar for other algorithms and includes classical accelerated method. The continuous time analysis provides conditions on the hyper-parameters of several algorithms (in particular for Adam, see $\S\S$ 4.2) that guarantee convergence of their trajectories. The convergence analysis of the continuous system brings important informations on the discrete dynamics in the limit of large batch sizes and small learning rates. Our study is, nevertheless, far more general than Adam. We provide practitioners a wide range of options to develop and test adaptive-type algorithms, and our analysis give certain guidelines which are discussed in the body of the paper and summarized in $\S\S$ 6.4.

This work is intended, furthermore, to serve as a solid foundation for the posterior study in the discrete and stochastic settings, but in this paper we put an emphasize on the deterministic equation to understand the fundamental properties of adaptive algorithms.

In section 2 we introduce two general continuous dynamical system (2.1) and (2.2) whose forward Euler approximation (2.3) matches a large class of first order methods, summarized in tables 1 and 2. The connection with adaptive and accelerated algorithms is made precise in section 4. In particular, Adam differential equation (4.3) is derived in subsection 4.2. Section 2.3 starts with a first basic energy functional (2.4) (which is inspired on the works of ). Basic properties of the ODE (2.1) are presented in subsection 2.4, including Theorem 2.3 on the existence and uniqueness of solutions and Theorem 2.4 which provides the relation between continuous and discrete systems. We note that a sharper result (Theorem 5.2) on the existence and uniqueness of solutions of ODE (2.1) is presented in section 5.

Section 3 contains the statement of our main results, on the asymptotic behaviour of the continuous deterministic trajectories of the ODE (2.1). In the non-convex setting we prove, under mild assumptions, that the trajectories converge to the critical locus of $f$ (see Theorem 3.1). This result is supplemented with the analysis of sufficient conditions in order to avoid convergence to saddle or local maximum points of the loss function $f$ (see Theorem 3.2). For convex functions, we design a Lyapunov functional (3.1) and obtain a rate of convergence to at least a neighbourhood of the critical locus (see Theorem 3.4). The rate of convergence crucially depends on the behaviour over time of $\nabla f$ and on the term $v$ (see (3.2) and the subsequent discussion). In particular, this indicates that the efficiency of adaptive algorithms is highly dependant on the loss function. In sections 4, we specialize the convergence results to Adam, AdaFom, Heavy Ball, Nesterov, Adagrad and RMSProp. In particular, Corollary 4.1 provides new results on the convergence of the dynamics of Adam, while Corollary 4.4 recovers previously known convergence rates of Nesterov accelerated method. We stress that sections 3 and 4 can be read independently. In Section 5, we prove existence and uniqueness of solutions at $t_{0}=0$ when the functions $h,p,q,r$ are not defined at zero. In Section 6 we provide several empirical observations on adaptive algorithms which are inspired by the continuous analysis, and we collect guidelines for designing new adaptive algorithms in $\S\S$ 6.4. Most proofs supporting the paper are postponed to the Appendix.

Gradient descent , which only depends on the partial derivatives of $f$ , is the simplest discrete algorithm to address the optimization problem above

Another popular iterative approach to solve the above smooth optimization is the proximal point algorithm

These discrete methods can be studied solely from the standpoint of optimization performance. It can be proved that both algorithms converge to a critical point ( $\nabla f(\theta_{k})\to 0$ as $k\to\infty$ ) but also almost surely to a local minimizer . For convex functionals with globally Lipschitz gradient, both algorithms converges at linear rate $f(\theta_{k})-f(\theta_{\star})=\mathcal{O}(1/(sk))$ , where $\theta_{\star}$ is a minimal point of $f$ . These results give important guarantees on the convergence of each method.

For small and constant learning rate $s$ , gradient descent (1.2) (resp. proximal point algorithm (1.3)) corresponds to the forward (resp. backward) Euler’s discretization of the gradient flow system

under the time scaling $t=ks$ . The stable equilibria of this continuous system are given by the the set of strict (local) minima of the loss function $f$ and if the level sets of $f$ are bounded ( $f$ coercive for example), then its trajectories asymptotically converge to a critical point in the sense that $\nabla f(\theta(t))\to 0$ as $t\to 0$ . Moreover for convex functions, a linear rate of convergence $f(\theta(t))-f(\theta_{\star})=\mathcal{O}(1/t)$ holds, which is analogue to those obtained for both gradient descent and proximal point algorithm.

The study of the continuous dynamical system is very useful. The well-behaved convergence properties of the gradient flow (1.4) allows for different discretizations and optimization algorithms . It, furthermore, provides valuable intuition to prove convergence of discrete systems: for example, continuous Lyapunov functional can be often adapted to the discrete counterparts.

The rate of convergence for gradient descent is not optimal and depending on the class of functions $f$ belongs to, more efficient algorithms can be designed . For smooth convex or strongly convex functions, Nesterov introduced an accelerated gradient algorithm which was proven to be optimal (a lower bound matching an upper bound is provided)

However, the key mechanism for acceleration is not well understood and have many interpretations . A particular interesting interpretation of acceleration is through the lens of a second order differential equation of the form

where $t\mapsto a(t)$ is a smooth, positive and decreasing function of time, having possibly a pole at zero. Even if this singularity has important implications for the choice of the initial velocity $\psi_{0}$ , we are more interested by the long term behavior of the solution to (1.7) and hence at $\lim_{t\to\infty}a(t)$ . This system is called dissipative because its energy $E(t)=\frac{1}{2}||\dot{\theta}||^{2}+f(\theta)$ decreases over time. Most accelerated optimization algorithms can be seen as the numerical integration of equation (1.7). For the Heavy Ball method, the function $a$ is constant and is called the damping parameter . In , conditions on the rate of decay of $a$ and its limit are given in order for the trajectories of (1.7) to converge to a critical point of $f$ . This analysis highlights situations where (1.7) are fit (or not) for optimization. Intuitively, if $a$ decays too fast to zero (like $1/t^{2}$ ) the system will oscillate and won’t converge to a critical point. The case $a(t)=3/t$ was studied more specifically in and the authors draw interesting connections between (1.7) and Nesterov’s algorithm (1.5). The convergence rates obtained are $\mathcal{O}(1/(sk^{2}))$ and $\mathcal{O}(1/t^{2})$ respectively, which match with the discrete algorithms by using the time identification $t=\sqrt{s}k$ . Extension of this work are proposed in in which the authors studied acceleration from a different continuous equation having theoretically exponential rate of convergence. However, a naïve discretization looses the nice properties of this continuous system and current work consists on finding a better one preserving the symplectic structure of the continuous flow .

By nature, first order adaptive algorithms have iterates that are non-linear functions of the gradient of the objective function. The analysis of convergence is therefore more complex, potentially because the rate of convergence might depend on the function itself. The first known algorithm Adagrad consists on multiplying the gradient by a diagonal preconditioning matrix, depending on previous squared gradients. The key property to prove the convergence of this algorithm is that the elements of the preconditioning matrix are positive and non-decreasing . Later on, two new adaptive algorithms RMSprop and Adam were proposed. The preconditioning matrix is an exponential moving average of the previous squared gradients. As a consequence, it is no longer non-decreasing. The proof of convergence, relying on this assumption and given in the form of a regret bound in , is therefore not correct . A new algorithm Amsgrad proposed in consists on modifying the preconditioning updates to recover this property. While converging, this algorithm looses the essence of the Adam’s algorithm. Adam is such a mysterious algorithm that many works have been devoted to understand its behaviour. Variants of Adam have been proposed as well as convergence analysis towards a critical point . However, conditions for convergence seem very restrictive and not easy to verify in practice. The existing proofs of convergence on the discrete system, moreover, feel rather technical at the cost of loosing the intuition.

Presentation of the model

We propose a continuous model to unify the analysis of gradient based algorithms. We first introduce notation on vector’s operations used in the paper. In section 2.2, we present a general system of differential equations as well as a possible discretization of it.

2. Presentation of the continuous time model

Throughout this paper we study the following dynamical system

When the system does not contain a momentum term, we also consider the alternative simpler system

whose analysis is similar (and simpler) to the first (see $\S\S$ 3.4), but can not be derived from the first. Unless stated otherwise, our presentation deals with the system (2.1), and it is later specialized to (2.2).

Either $p(t)\not\equiv 0$ , in which case we say that the system is adaptive;

Or $p(t)\equiv q(t)\equiv 0$ , in which case we say that the system is non-adaptive.

It seems reasonable to imagine that there are several different choices of functions $h$ , $r$ , $p$ and $q$ which yield a good optimization algorithm. We provide a list of interesting cases in the table 1 and 2 below. Each choice, furthermore, might be adapted to some different “class” of loss function, that is, loss functions satisfying some extra property (e.g. loss functions coming from a single, double or $N$ -layers in deep learning; convex functions; globally Lipschitz functions, etc). We allow, therefore, the functions $h$ , $r$ , $p$ and $q$ to be as general as possible, so that practitioners may test different combinations of coefficients, probably depending on the properties of the loss function. Some guidelines to chose these coefficients are provided $\S\S$ 6.4.

An adaptive system has a non-trivial dynamic for the memory term $v$ , which changes the rate in the dynamics of the main variable $\theta$ . In practice, this corresponds to an automatic change in the learning rate, controlled only by the history of the trajectory $\mathbf{x}(t)$ and the loss function $f$ . This justifies the name adaptive and supports the intuition that the efficiency of the algorithm depends on the properties of the loss functions $f$ (c.f. $\S\S$ 6.3). We also consider a momentum term $m$ , which allow us to accelerate the convergence of the algorithm depending on the choice of the coefficients $h$ and $r$ . Intuitively, the addition of the momentum $m$ implies the existence of a special energy functional for ODE (2.1) (see equation (2.4) below) which works as a “funnel” around minimum points of $f$ . The trajectories of the ODE can, therefore, be “accelerated” without the risk of diverging to $\infty$ , at the cost of an oscillatory behaviour (just as water in a funnel). In particular, this intuitively explain why it is possible to improve the rate of convergence of the Heavy ball in Nesterov. Based in these two intuitions, we can expect that the choice of $h$ and $r$ controls how fast these algorithms converge in general, while $p$ and $q$ may only slow down the algorithm in general, but accelerate it for certain “classes” of loss functions. This intuition turns out to be precise when dealing with convex functions, as we discuss in $\S\S$ 6.4.

Note that we do not suppose that $\nabla f$ is globally Lipschitz. Indeed, such an assumption is insufficient to guarantee that the ODE (2.1) is itself globally Lipschitz (e.g. $f(\theta)=\theta^{2}/2$ implies $\nabla f(\theta)^{2}=\theta^{2}$ ). The lack of this assumption constitutes an important technical difficulty in the remaining of the paper. Under the extra hypothesis that $f$ is coercive, nevertheless, we are able to replace this hypothesis effectively (see Lemma 2.1 below).

In order to establish a relation between the continuous and the optimization algorithms, we study the finite difference approximation of (2.1) by the forward Euler method

where $t_{k}=ks$ . We chose this method because it fits well with Adam discrete system. As we will see, the approximation error between the discrete system (2.3) and continuous system (2.1) tends to zero (with order one) when the learning rate goes to zero (see Theorem 2.4 for a precise result). However this choice of discretization is of course non-unique, and more efficient quadrature rules could lead to more accurate numerical integration . The connections between our model and the discrete optimization algorithms is summarized by tables 1 and 2, and the proof of these relations is postponed to Section 4.

3. An Energy functional of ODE (2.1) and a natural assumption

A crucial property in the study of ODE (2.1) is the existence of an energy functional, which is inspired from [9, Theorem 2.1]:

This functional plays a crucial role in the study of the convergence of ODE (2.1) in $\S$ 3. We start by computing the derivative of the energy functional:

and, it easily follows from the fact that $p(t),\,h(t),\,v_{i},\,\partial_{\theta_{i}}f(\theta)^{2}\geq 0$ , that:

This leads us to the following natural hypothesis, which is assumed almost everywhere in the paper:

In particular, if $f$ is coercive, then the curve $\theta(t)$ is bounded. Furthermore, if $f$ is a function bounded from below, say by $f_{\ast}$ , then:

The above result shows the importance of the energy functional and has important implications (together with Theorem 2.3 below). For example, if $f$ is coercive we guarantee that $\theta(t)$ is bounded, a necessary condition for the convergence of a solution of ODE (2.1).

Suppose that $p(t)\equiv q(t)\equiv 0$ , in which case inequality (2.5) is an equality, and consider the extreme opposite of assumption 2, that is

Then the differential equation (2.1) does not converge to a minimum because the energy functional (2.4) is always increasing.

4. Basic Properties of ODE (2.1)

In order to continue our analysis of the asymptotic behaviour of solutions of the ODE (2.1), it is essential to understand their domain of definition, and to clarify the difference between continuous and discrete systems. We start by providing two results in this direction, under the assumption that the initial $t_{0}$ is strictly positive. First of all, under mild assumptions, we are able to guarantee that all solutions of the differential equation are defined on the interval $[t_{0},\infty[$ .

The proof of this result is postponed to the the appendix $\S\S$ A. In the case that assumption 2 is satisfied, a very simple proof is provided in $\S\S$ A.1. A stronger version of this result (which includes the case $t_{0}=0$ ) is given in Theorem 5.2 below, but we postpone its discussion to $\S$ 5 in order to keep the initial presentation as simple as possible.

We now study the validity of the approximation given by the discrete system (2.3) of the differential equation (2.1), that is, we study (what is called in the literature) the “convergence rate” of the numerical approximation. In this work, we replace the name “convergence rate” by approximation rate in order to avoid a possible confusion with the convergence of the orbits of the system.

Let us fix the notation: consider a (final) time $T>t_{0}>0$ and an interval $[t_{0},T]$ (on which we will study the approximation of the solution of (2.1)). We recall that the step size is given by $s>0$ . Consider $K_{t,s}=[(t-t_{0})/s]$ , the integer part of $(t-t_{0})/s$ . We write $t_{k}=t_{0}+ks$ for any $k\in[\![0,K_{t,s}]\!]$ and $\pi=\left\{t_{0}<t_{1}<\cdots<t_{K}=T\right\}$ is a homogeneous partition of the interval $[t_{0},T]$ . Now, let $\mathbf{x}_{0}=(\theta_{0},m_{0},v_{0})$ be a fixed admissible initial condition and consider:

Roughly, we are interested in estimating the distance between $(\mathbf{x}_{k})$ and $(\widetilde{\mathbf{x}}_{k})$ in terms of the learning rate $s$ . In what follows, we prove that the approximation rate between the continuous and discrete dynamics is of order one. In particular, when the learning rate goes to zero, then the continuous and discrete dynamics tend to be equal. More precisely:

The case $p=0$ corresponds to Gradient descent, Heavy Ball and Nesterov and similar results hold.

Convergence analysis

In this section, we study the asymptotic behaviour of the solutions of (2.1). Our analysis is divided in the following three steps:

Topological convergence: Find sufficient conditions on the functions $f$ and $p,q,r,h$ in order for the solutions of equation (2.1) to converge to a critical value of $f$ , that is, $\nabla f(\theta(t))\to 0$ when $t\to\infty$ . In particular we do not require $f$ to be convex.

Avoiding local maximum and saddles: We want to strengthen the result of part (1) and give sufficient conditions so that the dynamics avoid local maximum and saddles and only converge to local minimum. In other words, fix $t_{0}>0$ and denote by $S_{t_{0}}$ the set of initial conditions $\mathbf{x}_{0}=(\theta_{0},m_{0},v_{0})$ such that the limit set of the associated solution $\theta(t)$ contains a critical point $\theta_{\star}$ which is not a local minimum. We give, in subsection 3.2, sufficient conditions for the set $S_{t_{0}}$ to have Lebesgue measure zero.

Rate of convergence: Under the convexity assumption, find the rate of convergence of $f$ to a local minimum.

In the remaining of this section, we give precise statements for all of the three steps, and we will make appropriate assumptions on the objective function. However, the following assumption is over-arches the three analysis:

The solution $\theta(t)$ of the ODE (2.1) is bounded.

This assumption is always automatically satisfied when $f$ is coercive, as it was remarked in Lemma 2.1. In general, it is possible to decide when an orbit $\mathbf{x}(t)$ satisfies assumption 3 by algebraic manipulations in terms of its initial condition $\mathbf{x}_{0}$ and the energy functional (2.4).

In this part, we make an additional assumption on the asymptotic behaviour of the coefficients, which is designed to simplify the proof while still covering Adam, and most adaptive algorithms:

Suppose that $\varepsilon>0$ . Consider the functions:

and suppose that these functions are $C^{1}$ in $[0,\infty)$ , $H(0)>0$ and $4R(0)>Q(0)$ .

Note that Assumption 4 is satisfied, essentially, when the coefficients of $h(t)$ and $r(t)$ do not converge to zero at infinity. Hence, it holds for Adam, AdaForm, and the Heavy ball differential equations, c.f. table 1. It also has the interesting feature of being almost completely independent from the functions $p(t)$ and $q(t)$ , a flexibility which should be explored when trying to design new algorithms. Under this assumption, we prove the convergence of the dynamics in the following sense:

Suppose that assumptions 1, 2, 3 and 4 are verified. Then $f(\theta(t))\to f_{\star}$ and $m(t)\to 0$ when $t\to\infty$ , where $f_{\star}$ is a critical value of $f$ . Furthermore, if either $Q(0)>0$ or $p(t)\equiv q(t)\equiv 0$ and $v_{0}=0$ , then $v(t)\to 0$ .

The proof of Theorem 3.1 is postponed in Appendix C. Our method is inspired by the work of Alvarez , based on the energy functional of the system (2.4). We use elementary topological techniques of qualitative theory of ODE’s (à la Poincaré-Bendixson), which are recalled in $\S\S$ C.1. At the one hand, this approach avoids most estimates and analytical arguments, which are typically necessary in this kind of study, and can be easily reproduced in other systems. As an immediate advantage, we do not need assumptions such as convexity of the loss function or globally Lipschitz properties of the differential equation. At the other hand, the assumption is not optimal. For example, it is not satisfied by Nesterov’s acceleration equation (4.12). We believe that the optimal threshold to guarantee convergence of ODE (2.1) should be given by an inequality in terms of poles of order at most one for the functions $H$ and $R$ . The idea is supported in which shows that the function $R$ can not be a polynomial function of order bigger than $1$ in the case of the dissipative system related to accelerated dynamics.

2. Avoiding local maximum and saddles

In this section, we make the following extra assumption:

A critical point $\theta_{\star}$ of $f$ is either a local-minimum or it satisfies the two following properties:

it is a strict saddle (following [15, Definition 1]), that is, there exists a strictly negative eigenvalue of the Hessian $\mathcal{H}_{f}(\theta_{\star})$ of $f$ at $\theta_{\star}$ .

is it an isolated critical point, that is, there is a neighbourhood $U$ around $\theta_{\star}$ that does not contain any other critical points.

Consider the set of initial conditions such that the limit set of the associated orbit contains a critical point which is not a local minimum

The main result of this subsection is the following:

Suppose that assumptions 1, 2, 3, 4 and 5 are satisfied. If either $Q(0)>0$ or $p(t)\equiv q(t)\equiv 0$ , then the set $S_{t_{0}}$ has Lebesgue measure zero for every $t_{0}>0$ .

It follows that, if $\mathbf{x}_{0}=(\theta_{0},m_{0},v_{0})$ is a random initial condition, then the solution $\mathbf{x}(t,t_{0},\mathbf{x}_{0})=(\theta(t),m(t),\theta(t))$ converges to a local minimum of $f$ with total probability. Similar results are proved for discrete systems having isolated critical points in , using essentially the same method as in here. More precisely, we use the theory of central-stable manifold (for vector-fields), which is recalled in $\S\S$ D.1. We have supplemented our analysis by treating the case of the usual gradient flow in $\S\S$ D.2, and we hope that this will help with the dissemination of the technique. We finish this section by providing a technical discussion on the Assumption 5:

Assumption 5 $(a)$ was introduced in and has crucial technical consequences. It allow us to use the center-stable manifold theory recalled in $\S\S$ D.1. Without this hypothesis, the singular points of the ODE (2.1) at infinite (see equation (C.1)) can be arbitrarily degenerated, and there is no general singularity theory to treat these points in dimension higher than three. In order to relax such a hypothesis, it is necessary develop specific singularity techniques for equation (2.1), and we intend to pursue this direction in a future paper.

Assumption 5 $(b)$ allow us to exclude pathological differences between local and global center-stable manifold theory (see example D.2). An alternative to this hypothesis, is to add a globally Lipschitz assumption onto the system (2.1), and to study the relation between the Lipschitz approximation and the Hessian of the loss function $f$ (which would allow us to use the strong global result [40, Ch. 1 Thm 1.1]). This is essentially what is done in , where the authors study the analogue problem in for a simpler ODE without assumption 5 $(b)$ ; more precisely, they crucially show that their system satisfies conditions that replace the globally Lipschitz assumption. We understand that a study in the generality of ODE (2.1) without condition 5 $(b)$ would demand the development of specific singularity techniques for equation (2.1), and we intend to pursue this direction in a mathematical paper.

3. Rate of convergence

The study of the rate of convergence of $f(\theta(t))$ to the minimum value $f(\theta_{\star})$ usually relies on a convexity assumption and a Lyapunov energy functional (see ). It is natural to assume in this section, therefore:

The expressions of $\mathcal{A}(t)$ , $\mathcal{B}(t)$ and $\mathcal{C}(t)$ are simple to compute for all the expressions in table 1, as we show in $\S$ 4. Note, furthermore, that these functions only depend on $h$ and $r$ (which re-enforce the heuristic that $p$ and $q$ can be chosen in a very flexible way). We are ready to introduce the energy functional used in this section:

We now need the following assumptions in order to control the behaviour of this functional:

$\lim_{t\to\infty}\int_{t_{0}}^{t}e^{-\int^{s}_{t_{0}}r(u)du}ds<+\infty$

Note that Assumption 7(a) is necessary for the function $\mathcal{B}(t)$ to be well-defined, and that imposes an important constraint in the asymptotic behaviour of the function $r(t)$ . More precisely, the limit $\lim_{t\to\infty}t^{1+\epsilon}r(t)$ must be zero for every $\epsilon>0$ , which implies that $r(t)$ has at most a pole of order $1$ at infinity. Assumption 7(b) provides the asymptotic control on the derivative of the energy functional (3.1), and should be compared with Assumption 2. Once again, note that it is independent of the function $p$ , and almost independent of $q$ . We are now ready to state the main theorem of this section:

It follows that the ODE (2.1) converges to the minimum point with rate of convergence of order at least:

From the previous theorem, we observe that the rate of convergence to the global minimum is determined by several factors

the choices of $h,r$ which in turns define the function $\mathcal{A}$ .

the choices of $p,q$ which degrade the rate of convergence but allow for more flexibility for the choice of $h,r$ in assumption 2.

It follows from the first inequality of Theorem 3.4 that the rate of convergence of ODE (2.1) depends in an essential way from the asymptotic behaviour of the term:

This has also been independently remarked in . In the later, the authors propose the algorithm AdaFom, whose coefficients are chosen in such a way that the sum of the terms (3.2) is telescopic. They can, therefore, be controlled in an easy way, which allows one to obtain convergence results only under a globally Lipschitz assumption. Note that our Theorem 3.4 recovers the expected (in the deterministic setting) rate of convergence of AdaFom, see Corollary 4.2.

In general, nevertheless, the term (3.2) can not be controlled in the same way. The second inequality of Theorem 3.4 controls the term (3.2) only in terms of the functions $h$ , $r$ and $q$ . We do not know if this control is optimal, but we derive some surprising features from it which we discuss in $\S\S$ 4.2, e.g. it supports practitioners usual choice of hyper-parameters for Adam.

4. Convergence analysis of ODE (2.2)

It is possible to reproduce Theorems 3.1, 3.2 and 3.4 in the case of ODE (2.2) following the same methods (but in an easier way) presented in the Appendixes C, D and E, provided that some changes are made in the energy functional and assumptions. More precisely, we consider the following energy functional:

It follows that Assumptions 2 and 4 are unnecessary! By the same reason, furthermore, Assumption 3 is always verified when $f$ is coercive, c.f. Lemma 2.1. This implies that Theorem 3.1 and 3.2 admit admit the exact same formulation for ODE (2.2) without Assumptions 2 and 4, that is:

Suppose that assumptions 1 and 3 are satisfied for ODE (2.2). Then $f(\theta(t))\to f_{\star}$ when $t\to\infty$ , where $f_{\star}$ is a critical value of $f$ . Furthermore, if either $Q(0)>0$ or $p(t)\equiv q(t)\equiv 0$ and $\omega_{0}=0$ , then $\omega(t)\to 0$ .

Suppose that assumptions 1, 3 and 5 are satisfied for ODE (2.2). If either $Q(0)>0$ or $p(t)\equiv q(t)\equiv 0$ , then the set $S_{t_{0}}$ has Lebesgue measure zero for every $t_{0}>0$ .

Finally, under the convexity assumption 6, we consider the energy functional:

It follows from direct computation, using the convexity assumption, that:

Now, under assumption 3, it is clear that $\frac{d}{dt}\mathcal{E}(t,\theta,\omega)$ is negative for $t>>t_{0}$ . We have, therefore, obtained the following version of Theorem 3.4:

and it follows that the ODE (2.1) converges to the minimum point with rate of convergence of order $\mathcal{O}(1/t)$ .

Convergence results: application to first order algorithms

In this section, we specify the choice of functions $h,p,q,r$ corresponding to different optimization methods and apply each convergence theorem to them. We start by a brief discussion on the assumptions which appear in this section, and we then move on to present differential equations and convergence results on: Adam, AdaFom, Heavy Ball, Nesterov, Adagrad and RMSProp. Our results on Adam are new and we give full details on their proof. Some of our results on AdaFom, Heavy Ball, Adagrad and RMSProp are, up to our knowledge, also new. Their proofs can be easily formalized by repeating the arguments used for Adam, and we present “sketch of proofs” in order to provide a guideline on the necessary changes. Finally, we also recover several known results in an easy manner. In particular, sharp estimates for the rate of convergence of Nesterov are given in Corollary 4.4 below.

In the convergence analysis, we recurrent make assumptions which were introduced in $\S$ 2 and 3. We briefly recall their meaning and situations where they are satisfied:

Assumption 2 is equivalent, for the models in this section, with the hypothesis that the loss function $f$ is $C^{2}$ .

Assumption 3 states that the trajectory $\theta(t)$ is bounded. We recall that there are some very practical situations where this assumption is always satisfied; for example in the case of coercive objective functions, see Lemma 2.1.

Assumption 5 gives a condition on the nature and the degeneracy of the critical points of the objective function. It is used in the study of saddle points and local maximum points of the loss functions, and it also appears in . Note that this assumption is satisfied for generic functions (e.g. Morse functions). See Remark 3.3 for further discussion.

2. Adam

where the two parameters for the moving average are given by

We rewrite the update for the parameters $\theta$ such that

As a consequence, the initial velocity is $\dot{\theta}_{0}=-\operatorname{sign}(\nabla f(\theta_{0}))$ .

Consider now the three parameter family of differential equations

where the coefficients in ODE (2.1) are given by

and $(\lambda,\alpha_{1},\alpha_{2})$ are positive real numbers. Note that both functions have a simple pole at $t=0$ and, therefore, satisfy assumption 8. Now, let us consider the associated discretization (2.3) with learning rate $s$ and a sub-family of discrete models parametrized by $(\beta_{1},\beta_{2})\in(0,1)\times(0,1)$ which are given by

which recovers Adam’s discrete system (4.2) (apart from small difference in the evaluation of $\mu$ ). Therefore, Adam is an Euler discretization of system (2.1) for the choice of function (4.2.1)-(4.4) and parameters (4.5).

2.2. Adam without rescaling differential equation

In the original formulation of the algorithm (as stated in (4.1)), the parameters $\mu$ and $\nu$ depends on the iterations $k$ to correct for the bias induced by the moving average. These coefficients can also be taken constant $\mu=\beta_{1}$ and $\nu=\beta_{2}$ , in which case we say that the algorithm is Adam without rescaling. In this case, it is easy to verify that the differential equation:

is the continuous counter-part of the algorithm when we consider the sub-family given by $\beta_{1}=(1-s/\alpha_{1})$ and $\beta_{2}=(1-s/\alpha_{2})$ .

2.3. Convergence of Adam

Suppose $\varepsilon>0$ and let assumptions 1 and 3 be satisfied for equation (4.3). Moreover, we assume

Then the following convergence results hold true

Topological convergence: $f(\theta(t))\to f_{\star}$ , $m(t)\to 0$ and $v(t)\to 0$ when $t\to\infty$ , where $f_{\star}$ is a critical value of $f$ .

Rate of convergence: Under the additional convexity assumption 6, there exists a constant $\mathcal{K}>0$ which depends on $f$ , $\theta_{0}$ and $v_{0}$ , so that:

The rate of convergence to this neighbourhood, furthermore, is of order $\mathcal{O}(1/t)$ .

Note that there is an apparent paradox between point $I$ and $III$ of the Corollary: the dynamics is convergent by $I$ , but the “fast” rate of convergence (of order $\mathcal{O}(1/t)$ ) can only be guaranteed to a neighbourhood. This is no paradox, nevertheless, because Adam might converge very slowly once it attains the neighbourhood given by point $III$ . In particular, the discrete version of Adam may not converge even in the deterministic case (see Proposition 6.1 below), and this is expected because of the rate of convergence in Theorem 2.4.

Point $III$ has two other surprising consequences. First, it justifies the usual choice of practitioners to take the hyper parameter $\beta_{2}$ as close to $1$ as possible, because:

while $\beta_{1}$ has a smaller direct impact, even if it is convenient to take it close to $1$ , because:

Second, the size of the neighbourhood of fast convergence (of order $\mathcal{O}(1/t)$ ) is controlled by a constant $\mathcal{K}$ which only depends on $f$ and the initial conditions. This indicates that the success of Adam for certain loss functions (e.g. loss functions in deep learning) might be associated to features on the “class” of loss functions considered.

We now turn to the proof of the Corollary:

The proof of (I) directly follows from Theorem 3.1 provided that assumptions 2 and 4 are satisfied. Hence, the proof simply consists on checking the validity of both assumptions under the condition that $3+\beta_{2}>4\beta_{1}$ . Let us recall that the coefficients for the Adam’s differential equations are given by

and $(\lambda,\alpha_{1},\alpha_{2})$ are positive real numbers and:

It is easy to check that assumptions 2 and 4 are satisfied if there exists a $t$ , large enough, such that

Taking the limit as $t$ goes to infinity in the above inequality gives

We conclude using the expressions of $\beta_{1}$ and $\beta_{2}$ .

The proof of (II) follows directly from Theorem 3.2 since assumptions 2 and 4 are satisfied under the condition $3+\beta_{2}>4\beta_{1}$ .

In order to prove (III), let us check the hypotheses of Theorem 3.4. We compute explicitly the functions

so, by direct computation via L’Hôpital’s rule:

and it easily follows that assumption 7 is verified. Finally, by using L’Hôpital’s rule, we get:

3. AdaFom

where the parameter for the moving average are given by (just as in Adam):

Consider now the two parameter family of differential equations

and $(\lambda,\alpha)$ are positive real numbers. Just as in the case of Adam, consider the associated discretization (2.3) with learning rate $s$ and a sub-family of discrete models parametrized by $\lambda=s$ and $\beta=e^{-\lambda/\alpha}$ . The reader may verify, following the same steps of the analysis of Adam that the discretization of this sub-family recovers AdaFom algorithm (4.7). We now turn to the convergence analysis :

Suppose $\varepsilon>0$ and let assumptions 1 and 3 be satisfied for equation (4.8). Then the following convergence results hold true

Topological convergence: $f(\theta(t))\to f_{\star}$ and $m(t)\to 0$ when $t\to\infty$ , where $f_{\star}$ is a critical value of $f$ .

Rate of convergence: Under the additional convexity assumption 6, $f(\theta(t))\to f(\theta_{\star})$ with the rate $\mathcal{O}(\ln(t)/t)$ .

Note that Theorem 3.2 can not be applied to ODE (4.8) in a direct way because $v(t)$ may not go to (in particular, in the notations of Theorem 3.2, $Q(0)=0$ ).

By the choice of functions $h(t)$ , $r(t)$ , $p(t)$ and $q(t)$ , it is easy to see that Assumptions 2 and 4 are always verified. Part $(I)$ is, therefore, direct consequences from Theorem 3.1. Next, the computation of $\mathcal{A}$ , $\mathcal{B}$ and $\mathcal{C}$ are independent on $p(t)$ and $q(t)$ , so they are analogous to the one’s obtained for Adam. It follows that assumption 7 is verified. Finally, by Theorem 3.4, the rate of convergence is controlled by the asymptotic behaviour of:

which can easily be verified to be of order $\mathcal{O}(\ln(t)/t)$ . ∎

4. Heavy Ball

We consider the Heavy ball second order differential equation

where $\gamma>0$ . By taking $\theta=x$ and $m=-\dot{x}$ (and $v\equiv 1$ ), we obtain the system (2.1) with

which corresponds to the classical Heavy ball methods with damping coefficient $\beta=1-s\gamma$ , momentum variable $n_{k}=sm_{k}$ and learning rate $\alpha=s^{2}$ . Implicit discretization has also been considered in .

From our analysis, we recover results given in for the discrete update rules (4.10):

Suppose that assumptions 1 and 3 are satisfied for equation (4.10). Then

Topological convergence: $f(\theta(t))\to f_{\star}$ and $\omega(t)\to\omega_{\infty}>0$ when $t\to\infty$ , where $f_{\star}$ is a critical value of $f$ .

Rate of convergence: Under the additional convexity assumption 6, $f(\theta(t))\to f(\theta_{\star})$ with the rate $\mathcal{O}(1/t)$ .

By the choice of functions $h(t)$ , $r(t)$ , $p(t)$ and $q(t)$ , it is easy to see that Assumptions 2 and 4 are always verified. Part $(I)$ and $(II)$ are, therefore, direct consequences from Theorems 3.1 and 3.2. Next, by direct computation, we get:

It follows that assumption 7 is verified. Finally, by Theorem 3.4, the rate of convergence is controlled by the asymptotic behaviour of $1/\mathcal{A}(t)$ , which is of order $\mathcal{O}(\ln(t)/t)$ . ∎

5. Nesterov

Following , we consider the Nesterov second order differential equation, parametrized by the constant $r>0$ ,

Similarly as in the Heavy Ball case, we define $\theta=x$ and $m=-\dot{x}$ and write the above equation as a system (2.1) with

In , the authors studied a slightly different forward Euler scheme and proved that the difference between the numerical scheme and the Nesterov algorithm goes to zero in the limit $s\to 0$ . This effectively replaces our Theorem (2.4), so that Nesterov differential equation can be assumed to be given by:

where $r>0$ . We are ready enunciate the main convergence result for Nesterov’s differential equation, which have been previously proved in .

Suppose that equation (4.12) satisfies assumptions 1, 3 and 6. Then $f(\theta)\to f(\theta_{\star})$ when $t\to\infty$ with rate of convergence:

The proof for $r\leq 3$ is given in subsection E.2. We assume in here that $r\geq 3$ . Recall that:

It easily follows that, whenever $r\geq 3$ , the inequalities of assumption 7 are verified, and the result follows from Theorem 3.4. ∎

6. Adagrad

with initial condition $\theta_{0}$ and $G_{0}=\nabla f(\theta_{0})^{2}$ . By setting $\alpha=s^{2}$ and $\omega_{k}=\alpha G_{k}$ , it is easy to conclude that the Adagrad’s update rule can be written as:

Suppose that assumptions 1 and 3 are satisfied for equation (4.16). Then

Topological convergence: $f(\theta(t))\to f_{\star}$ and $\omega(t)\to\omega_{\infty}>0$ when $t\to\infty$ , where $f_{\star}$ is a critical value of $f$ .

Rate of convergence: Under the additional convexity assumption 6, $f(\theta(t))\to f(\theta_{\star})$ with the rate $\mathcal{O}(1/t)$ .

7. RMSProp

The only difference between RMSProp and Adagrad is how the preconditioning matrix is computed. In RMSprop, it consists of an exponentially decaying moving average, rather than a sum of the previous gradients, which has many similarities to the update rule of Adam

In particular, following the same analysis as performed in $\S\S$ 4.2, we derive the differential equation:

Suppose that assumptions 1 and 3 are satisfied for equation (4.18). Then

Topological convergence: $f(\theta(t))\to f_{\star}$ and $\omega(t)\to 0$ when $t\to\infty$ , where $f_{\star}$ is a critical value of $f$ .

Rate of convergence: Under the additional convexity assumption 6, $f(\theta(t))\to f(\theta_{\star})$ with the rate $\mathcal{O}(1/t)$ .

In this section we improve Theorem 2.3, and we consider the case that $t_{0}=0$ . Recall that our functions $h(t)$ , $r(t)$ , $p(t)$ and $q(t)$ are allowed to have poles in the origin (c.f. 1) in order to capture the phenomena present in accelerated methods (c.f. Nesterov) and on rescaling due to bias correction (c.f. ADAM). This impose some technical difficulties, as illustrated by the following example:

Consider the differential equation, with $L\neq 0$ :

Now the solutions of this equation with initial condition at $t_{0}=1$ are given by

This allow us to make the following considerations:

Suppose $\theta_{0}\neq 0$ . Then there is no solution for the system.

We, therefore, need to demand extra assumptions on the coefficients of our model and the initial conditions in order to guarantee the existence and uniqueness of the solution at time $t=0$ :

The functions $h,r,p,q$ have a simple pole at $t=0$ .

If $h\in C^{1}([0,+\infty))$ (resp $p\in C^{1}([0,+\infty))$ ), then $r$ (resp. $q$ ) can have (at most) a simple pole at zero.

In any other cases, all functions are assumed to be $C^{1}$ on $[0,\infty)$ .

In cases $(1)$ and $(2)$ , furthermore, we demand the following two extra-conditions:

We assume that $p(t)\not\equiv 0$ and that there exists a small time $\hat{t}$ such that

Conditions (a) and (b) of Assumption 8 should be compared with the issues (a) and (b) raised in Example 5.1, respectively. In terms of the algorithms in table 1, the assumption is always verified for Heavy-Ball, Nesterov, AdaForm and ADAM without rescaling. For ADAM, nevertheless, it is necessary to add mild assumptions on the hyper-parameters such as $\beta_{1}\geq 0.21$ . We are now ready to enunciate our main result:

Suppose, furthermore, that assumption 8 is also satisfied. Then, there exists a unique global solution to equation (2.1) such that:

The proof is postponed in Appendix A, and it is technically much harder than the proof of Theorem 2.3 because of the pole at the origin.

Considerations and guidelines for adaptive algorithms

We make here empirical observations, discuss some limitations, and provide some extra guidelines for the use of adaptive algorithms, based on our previous analysis.

One strong limitation of Adam is the existence of discrete limit cycles in the sense that the algorithm produces oscillations that never damp out. If the discrete dynamics reaches such an equilibrium, the difference $f(\theta_{k})-f(\theta_{\star})$ can not converge arbitrarily close to zero with an increasing number of steps. However, it reaches a neighborhood of the critical point whose radius is determined by the learning rate $s$ . Decaying the learning rate is therefore necessary to obtain convergence of the dynamics Numerically, we found that Adam with $\beta_{1}>0$ suffers from the same phenomena but the limit cycles are more difficult to establish. We believe that the existence of such cycles depend on the local curvature of the function $f$ near the optimum.

Let $\beta_{1}=0$ and $f(\theta)=\theta^{2}/2$ . Then there exists a discrete limit cycle for (4.2).

Let us assume that there exists a $k$ such that $\theta_{k}=s/2$ and that $v_{k}=(s/2)^{2}$ , where $s$ is the learning rate. It easily follows from the update rules that

Therefore $\theta_{k+2}=-\frac{s}{2}+s=\theta_{k}$ and the system has entered a discrete equilibrium. ∎

We illustrate this behavior in Figure 1 on the strongly convex toy function $f(x,y)=x^{2}+y^{2}$ .

It is important to note that the value of the gap between $f(\theta_{k})$ and $f(\theta_{\star})$ depends on the learning rate. Choosing a smaller learning rate reduce the gap, but doesn’t remove it.

The second observation is related to the hyper-parameters of the optimizers and give important guidance on how to tune them. As observed in section 4.2, the parameters $\beta_{1}$ and $\beta_{2}$ are chosen as functions of the learning rate $s$ and parameters $\alpha_{1}$ and $\alpha_{2}$ . It is often the case in practice (in particular in stochastic optimization) to decay the learning rate during the training process. By doing so, the discrete dynamics is completely modified unless the $\beta$ ’s are adjusted to keep the parameters $\alpha_{i}$ constant. Therefore, once a particular choice of hyper-parameters seems promising, a decay in the learning rate should be accompanied by changing the hyper-parameters $\beta_{i}$ according to the formula (4.5), which we recall here

Indeed, by doing so the underlying dynamics is preserved. We illustrate this in plot b), Figure 2, where we compute the logarithm of the error between different trajectories

The reference dynamics: $\beta_{2}=0.99$ and $s=0.001$ . According to formula (4.5), $\alpha_{2}=-0.001/\log(0.99)\approx 0.0995$

Second dynamics: $\beta_{2}=0.99$ and $s=0.01$

Third dynamics: same learning rate as the second dynamics $s=0.01$ but we adjust the hyper-parameter: $\beta_{2}=\exp(-0.01/0.0995)=0.90438$ .

3. Convergence properties of Adam depend on the class of loss functions

The convergence analysis in the convex case (see Theorem 3.4) seem to indicate that Adam is a rather slow algorithm because quick convergence (in the order $o(1/t)$ ) is only guaranteed in a neighbourhood of the global minimum. Under a closer look, however, we note that the size of the neighbourhood might be very small depending on the class of loss functions because of its impact to the term (3.2) and the constant $\mathcal{K}$ . In particular, we believe that there are situations where Adam should perform consistently better than other algorithms, provided that the hyper-parameters are well-chosen; e.g. for flat loss functions (see figure 3). We intend to deepen this remark in forthcoming works.

4. On future algorithms

Existence and uniqueness of solutions for ODE (2.1) holds for almost arbitrary functions $h(t)$ , $r(t)$ , $p(t)$ and $q(t)$ . Our convergence analysis, nevertheless, imposes important restriction on the choice of these functions (see Assumptions 4, 5, 7), which we now discuss. From Theorem 3.4, the rate of convergence to the global minimum is at least given by the decay of the function

where $\mathcal{A}$ depends only on $h$ and $r$ . It is natural to seek for functions $h(t)$ , $r(t)$ , $p(t)$ and $q(t)$ such that this upper-bound decays faster to zero, while satisfying the conditions 2 and 7.

Note that in Theorem 3.4, we obtained tighter estimates on the rate of convergence depending on the history of the gradient and the variables $v$ and $\theta$ . This suggest that the efficiency of the algorithm does not only depend on the choice of the functions $h(t)$ , $r(t)$ , $p(t)$ and $q(t)$ but also on the path the dynamics is taking and therefore on properties of the loss.

On the flexibility of the coefficients $h(t)$ and $r(t)$ . We start by recalling that assumption 7(a) implies a strong constraint on the function $r(t)$ . Indeed, it is necessary that $\lim_{t\to\infty}r(t)\cdot t^{1+\epsilon}=0$ for all $\epsilon>0$ . This is a strong restriction, which have consequences to the choice of $h(t)$ . In order to illustrate this, let us consider two examples $r_{1}(t)=r_{1}$ and $r_{2}(t)=r_{2}/t$ with $r_{2}>1$ , which are the natural allowed examples. With these choices of functions, we seek for $h$ such that $\mathcal{A}$ has at least linear growth. In the notation of $\S\S$ 3.3, we have that:

This imposes strong restrictions on $h(t)$ , indeed:

It follows that the natural reasonable choices are one of the following three:

$r(t)=r$ , $h(t)=h$ (c.f. Heavy Ball, Adam and AdaFom). Under these conditions, one can expect a convergence rate of order at least $\mathcal{O}(1/t)$ and convergence guarantees (including avoidance of saddle points) even in the non-convex setting;

$r(t)=r/t$ , $h(t)=h$ (c.f Nesterov). Under these conditions, one can expect a fast convergence rate of order at least $\mathcal{O}(1/t^{2})$ , but we obtain fewer guarantees in the non-convex setting because of the convergence to zero of the function $r$ ;

$r(t)=r/t$ and $h(t)=h/t$ . Under these conditions, one can expect a convergence rate of order at least $\mathcal{O}(1/t)$ , but we obtain fewer guarantees in the non-convex setting;

Note that assumptions 2 and 7 impose further relations on the hyper-parameters $r$ and $h$ .

On the flexibility of the coefficients $p(t)$ and $q(t)$ . In sharp contrast with the previous analysis, the coefficients $p(t)$ and $q(t)$ require fewer restrictions. In particular, note that assumptions 2, 4 and 7 are almost independent on these coefficients (besides mild constraints on $q(t)$ ).

Our analysis leads to an interesting dilemma, nevertheless, which deserves further investigation. At the one hand, in Theorem 3.2 (which guarantee avoidance of saddle points) it is necessary to assume that $\lim_{t\to\infty}q(t)>0$ (c.f. Adam). On the other hand, in order to obtain faster convergence, Theorem 3.4 indicates that the function $q(t)$ should have a fast decay to zero to control the growth of $\int_{t_{0}}^{t}q(u)du$ . Indeed:

if $\lim_{t\to\infty}q(t)>0$ (c.f Adam), then the denominator $\int_{t_{0}}^{t}q(u)du$ has linear growth which degrades the rate of convergence of the algorithm.

if $\lim_{t\to\infty}t\cdot q(t)>0$ (c.f. AdaFom) then the denominator has logarithmic growth.

if $\lim_{t\to\infty}t^{1+\epsilon}\cdot q(t)=0$ for some $\epsilon>0$ , then there is no loss in the expected convergence rate.

It is, of course, interesting that an optimization algorithm avoids saddle point and converges as fast as possible. This is an intriguing point, which we feel that deserves further empirical investigation.

Final remarks. From the analysis outlined above, we feel that the combination of choices $1.ii$ (AdaFom), $1.iii$ and $2.ii$ are promising, at least from the perspective of our current analysis, and deserve further empirical investigation.

Moreover, we hope to explore further different choices of functions $p(t)$ and $q(t)$ as well as the design of hybrid algorithms, which are also supported by this analysis.

Conclusion and final discussion

The main objective of this work is to provide a theoretical framework to study adaptive algorithms. The proposed continuous dynamical system (2.1) is flexible enough to encompass commonly used adaptive algorithms (as we show in Section 4), but stays specific enough to allow simple proofs and guidelines. Our work shows that adaptive dynamics converge to a critical locus of the loss function but possibly at a slower rate than non-adaptive algorithms. Due to the nature of the adaptivity, the convergence rate is not just a non-increasing function of time but also depends on the gradient history. The performance of adaptive algorithms is linked to the trajectory taken by the dynamics and properties of the loss function. We also analyze how different choices of coefficients $h$ , $r$ , $p$ and $q$ impact on the convergence of the dynamics and we suggested several possible algorithms to be tested (see $\S\S$ 6.4). It supports our interest in linking specific choices of adaptive algorithms (more precisely, specific choices of the coefficients $h$ , $r$ , $p$ and $q$ ) with properties from the loss function. We intend to pursue this direction in future works.

The deterministic convergence analysis leads to natural conjectures on the convergence in the discrete and stochastic setting. In particular, we believe that the Lyapunov functional (3.1) can be adapted to the stochastic discrete framework . We note that, nevertheless, a precise correspondence between results valid for a continuous ODE and the stochastic discrete counterparts is far from being obvious. Indeed, recall that Adam and RMSprop are not always converging in the stochastic setting, even for a convex loss function . We expect, therefore, new restrictions on the coefficients $h(t)$ , $r(t)$ , $p(t)$ and $q(t)$ , as well as on the loss function and the learning rate. We believe that those conditions will be different compared to SGD.

Finally, the mathematical tools used in this work are theoretically simple, and we hope that our presentation helps to diffuse these techniques. Some beautiful mathematical open problems have naturally appeared (see, for example, the discussion in Remark 3.3) and we intend to work on them in future projects.

We would like to thank Daniel Panazzolo for a private communication concerning central-stable manifolds.

Appendix A Existence and uniqueness of solutions

In this section, we prove Theorems 2.3 and 5.2. We start by a simple case, in order to exemplify the main ideas of the proof, before we turn to the general more technical analysis.

which implies that $\dot{\theta}(t)$ is uniformly bounded. We conclude that:

It follows that: either $t_{\infty}=\infty$ , in which case we are done, or $t_{\infty}<\infty$ and $\theta(t)$ satisfies assumption 3. In order to conclude, it is enough to treat this last case:

Let $t_{0}>0$ and $\mathbf{x}_{0}=(\theta_{0},m_{0},v_{0})$ be fixed. Under assumptions 1 and 3, there exists an unique solution $\mathbf{x}(t)=(\theta(t),m(t),v(t))$ of (2.1) with initial condition $\mathbf{x}(t_{0})=\mathbf{x}_{0}$ , and which is defined for all $t$ in $[t_{0},\infty)$ . Furthermore, we have $v(t)\geq 0$ for all $t\in[t_{0},\infty)$ . Furthermore, denoting by $L_{g}=\sup\{\left\|\nabla f(\theta(t))\right\|;t\geq t_{0}\}$ , we get:

where we recall that $d$ stands for the dimension of the space. If we suppose that $r(t)\not\equiv 0$ and $q(t)\not\equiv 0$ , furthermore, then:

By assumption 1 and classical ODE’s, there exists a solution $\mathbf{x}(t)=(\theta(t),m(t),v(t))$ of system (2.1) with maximal interval of definition $[t_{0},T)$ and initial conditions $\mathbf{x}(t_{0})=\mathbf{x}_{0}=(\theta_{0},m_{0},v_{0})$ . Now, consider the functions:

which are increasing functions bigger than $1$ (for all $t\geq t_{0}$ ). We note that:

Next, under assumption 3, we can assume that $|\nabla f(\theta(t))|\leq L_{g}$ for some positive real number $L_{g}$ . It is now easy to get inequalities (A.3), which lead to:

and we easily conclude that $T=\infty$ . Finally, if $r(t)\not\equiv 0$ , we get by direct integration:

A similar computation holds whenever $q(t)\not\equiv 0$ , which concludes the Lemma. ∎

A.2. Gronwall’s Lemma

Let $T>0$ , $\lambda\in L^{1}(0,T)$ , $\lambda\geq 0$ almost everywhere and $C_{1},C_{2}\geq 0$ . Let $\varphi\in L^{1}(0,T)$ , $\varphi\geq 0$ almost everywhere, be such that $\lambda\varphi\in L^{1}(0,T)$ and

for almost every $t\in(0,T)$ . Then we have

for almost every $t\in[t_{0},t_{1}]$ , where $\beta$ is continuous. Then for all $t\in[t_{0},t_{1}]$ and all $0\leq\alpha<1$

for almost every $t\in[t_{0},t_{1}]$ , where $\beta,\gamma$ are continuous. Then for all $t\in[t_{0},t_{1}]$

and we conclude using the standard Gronwall Lemma applied to $\varphi^{1-\alpha}$ . ∎

A.3. A priori estimates and global solution

We now turn to more precise estimates of the functions $\mathbf{x}(t)=(\theta(t),m(t),v(t))$ in order to prove existence and uniqueness of the solution of (2.1). We start by controlling the derivative of $\theta$ :

Let $0<t_{0}<T<\infty$ be fixed and suppose that $p(t)\not\equiv 0$ . For any $s,t\in[t_{0},T]$ such that $s\leq t$ , we have that:

where $d$ is the dimension of the state space. Hence,

and we easily conclude from Gronwall’s Lemma. ∎

The above Lemma allow us to control $\theta(t)$ . The next Lemma replaces Assumption 3 in the existence proof:

Let $0<t_{0}<T<\infty$ be fixed. For any $s,t\in[t_{0},T]$ such that $s\leq t$ :

and, in particular, if $p(t)\not\equiv 0$ :

From the Cauchy-Schwarz inequality, we obtain

We apply Lemma A.3 to $\varphi(t)=\frac{1}{2}\left\|\theta(t)\right\|^{2}$ , $\beta(t)=\sqrt{2}\left\|\frac{m}{\sqrt{v+\varepsilon}}\right\|$ and $\alpha=1/2$ in order to get the first inequality. The second inequality follows from the first together with the estimate of Lemma A.5. ∎

We complete this section by improving the estimates on $m(t)$ and $v(t)$ which were given in Lemma A.1:

Let $0<t_{0}<T<\infty$ be fixed. For any $s,t\in[t_{0},T]$ such that $s\leq t$ :

We now just need to apply Lemma A.4 in order to conclude. ∎

Let $0<t_{0}<T<\infty$ be fixed. For any $s,t\in[t_{0},T]$ such that $s\leq t$ :

and the first inequality easily follows from the Gronwall lemma. In order to get the second inequality, note that

and we just need to apply Gronwall’s Lemma once again to conclude. ∎

We follow a standard argument in dynamical systems: we will approximate the solution of ODE (2.1) by a sequence of functions with good convergence properties. In this section, we limit ourselves to describing which is the sequence of functions, and we leave the “good convergence properties” for later on. Indeed, we consider the orbits $\mathbf{x}_{\delta}$ , for $\delta>0$ , which are solution to the equation

In order to show that this family of functions converge, we use Arzela-Ascoli Theorem. The next section is dedicated to proving that the hypothesis of Arzela-Ascoli are verified.

We prove in this section, that the family of functions $\mathbf{x}_{\delta}$ is equicontinuous and uniformly bounded, where $\mathbf{x}_{\delta}$ is the solution to (A.4). This allow us to apply Arzela-Ascoli in the end of this subsection in order to get the candidate for a solution of ODE (2.1).

The key result is the following proposition whose proof is left to subsection A.5.3

If assumptions 8 is satisfied, then there exists a positive constant $C_{2}(T)$ , independent of $\delta$ , such that for all $t,s\in[0,T]$

As a consequence of the previous Proposition, we can control the norm of the solution $\mathbf{x}_{\delta}$ ; this is done in terms of an special norm (instead of the usual one). More precisely, let us recall the notion of fractional Sobolev space. For a real number $0<\delta<1$ and $p\geq 1$ , we denote by $W^{\alpha,p}([0,T])$ the fractional Sobolev space of functions $u\in L^{p}(0,T)$ satisfying

If assumptions 8 is satisfied, there exists a positive constant $C_{3}(T)$ , independent of $\delta$ , such that

The proof is a direct consequence of Lemma A.9. Indeed

where the last inequality holds if and only is $\gamma<1$ . ∎

A.5.2. Identification of the limit and uniqueness of the solution

The convergence of the initial conditions are a direct consequence of the uniform convergence (which implies point-wise convergence at every point). Now fix $T>0$ ; it is clear that the ODE (A.4) converges uniformely to ODE (2.1) when $\delta\to 0$ in a neighbourhood of $t=T$ (indeed, for $\delta<<T$ , the two differential equations are equal). Since $x_{\delta}$ converges uniformly to $\hat{x}$ , we conclude that $\hat{x}$ is a solution of of ODE (2.1) in a neighbourhood of $t=T$ . Since $T>0$ was arbitrary, we conclude the result.

We proceed by contradiction. Assume there exist two solutions $\mathbf{x}=(\theta,m,v)$ and $\mathbf{y}=(\psi,n,w)$ to the system (2.1).

An easy computation shows that for all $0\leq t\leq T$ (because $v$ and $w$ are lower bounded, see Lemma A.8)

By continuity of the solution of equation (2.1) on $[0,T]$ , we know that there exists a constant $\widetilde{C}$ such that for all $s\leq t$

which are increasing functions bigger than $1$ (for all $t\geq\eta>0$ ). We note that:

It follows from Assumptions 8 and inequality (A.5), that for all $\eta\leq t\leq T$ ,

By continuity of the process $m$ and $n$ , the fact that $m_{0}=n_{0}$ and the continuity of $s\mapsto sh(s)$ on $[0,t]$ , we obtain by taking the limit when $\eta$ goes to zero that, apart from increasing $C_{1}$ ,

Similarly there is a constant $C_{2}$ such that

Hence, by combining all bounds there exists two constants, still denoted $C_{1}$ and $C_{2}$ such that

Since there exists a $t>0$ such that $C_{1}t$ and $C_{2}t$ are strictly smaller than $1$ , this inequality yields a contradiction. We conclude that the solution must be unique.

A.5.3. Proof of Proposition A.9

We start by a preliminary estimate, which extends Lemma A.5 to an uniform bound in terms of $\delta$ :

Suppose that $p(t)\not\equiv 0$ . There exists two constants $K_{1}$ and $K_{2}$ such that for all $t\in[0,T]$ and all $\delta>0$ sufficiently small

From Lemma A.5 and assumption 8 (which implies that $\delta h_{\delta}(\delta)$ , $\delta q_{\delta}(\delta)$ , $\delta r_{\delta}(\delta)$ and $\frac{h_{\delta}(\delta)}{p_{\delta}(\delta)}$ are bounded for $\delta<1$ ), there exits a constants $K_{1}\geq 0$ and $K_{2}\geq 0$ such that for every $\delta\leq 1$ and $t<\delta$ , we have:

Moreover from Lemma A.5 and assumption 8 (which implies that $q_{\delta}(u)-2r_{\delta}(u)<0$ , $h_{\delta}(t)/p_{\delta}(u)$ and $h_{\delta}(t)/r_{\delta}(u)$ are bounded for $\delta$ and $u$ small), there exits a constants $\widetilde{K}_{1}\geq 0$ and $\widetilde{K}_{2}\geq 0$ such that for every $\delta>0$ small enough and $t\geq\delta$ , we have:

Now, combining the two inequalities, (and apart from increasing $\widetilde{K}_{1}$ and $\widetilde{K}_{2}$ ) we get:

We are now ready to prove Proposition A.9. The proof uses the integral formulation and weighted space. First, we define the following norm for all $0<t\leq T$

We claim that there exists a constant $C(T)$ (independent of $\delta$ ) such that $N(t,\delta)\leq C(T)$ for all $t\in(0,T]$ . Note that Proposition A.9 immediately follows from the claim and the following inequality

For all $t\leq\delta$ , the functions $r_{\delta}$ and $q_{\delta}$ are constant and the equations for $m_{\delta}$ and $v_{\delta}$ , given by system (A.4), have the equivalent Duhamel formulation given by

From Lemma A.11, we know that $\left\|m_{\delta}(t)/\sqrt{v_{\delta}(t)+\varepsilon}\right\|^{2}$ is uniformly bounded with respect to $\delta$ . Moreover, $\left\|\theta_{\delta}(t)\right\|$ is uniformly bounded; indeed

Next, consider the first term which appears in $N(t,\delta)$ . From the Duhamel formulation (A.8), the triangle inequality and the fact that the initial condition $m_{0}=\nabla f(\theta_{0})\lim_{t\to 0^{+}}h(t)/r(t)$ , we obtain an upper bound of the form

We now show that each one of these terms are bounded uniformly in terms of $\delta$ .

The term $N_{1}$ is bounded because $f$ is $C^{2}$ (and therefore, the gradient is locally Lipschitz) and $\theta_{\delta}(t)$ is uniformly bounded by inequality (A.10); in particular, denote by $L$ the Lipschitz constant of $\nabla f$ in the compact set containing all solutions $\theta_{\delta}(t)$ for bounded $t$ . More precisely, by the Duhamel formula (A.8) and Lemma A.11

and we easily conclude that $N_{1}$ is uniformly bounded by assumption 8.

The term $N_{2}$ is bounded from the choice of the initial condition and the fact that $h(t)/r(t)$ is a $C^{1}$ function. More precisely

and we can easily conclude that $N_{2}$ is uniformely bounded by usual calculus and assumption 8.

The term $N_{3}$ is bounded in a similar way as $N_{1}$ using assumption 8, inequality (A.10) and Lemma A.11. More precisely:

Gathering all bounds, we easily conclude that there exists a constant $C_{1}$ such that, for every $\delta\leq 1$ :

From a similar argument, we obtain that there exists a constant $C_{2}$ such that, for every $\delta\leq 1$ :

We conclude that there exists a constant $C$ such that $N(t,\delta)<C$ for every $t\leq\delta$ and $\delta\leq 1$ .

The proof uses the same arguments as in the case of $t\leq\delta$ using the appropriate integral formulation and Lemma A.11. We omit the details here.

Appendix B Convergence of the Euler discretization

Just as in the previous section, we need to obtain estimates and bounds for the discrete system in order to compare it with its continuous counterparts. In other words, we need to re-do subsection A.3, but for the discrete system (2.3).

where we used the notation $h_{i}=h(t_{i})$ (and similarly for the other functions). Let assume that the learning rate $s$ satisfies $sr_{1}<1$ and $sq_{1}<1$ . Hence, the numerical scheme preserves the strict positivity of $v$ i.e if we assume $v_{0}>0$ , then for all $k=0,\cdots,K$ , $v_{k}>0$ .

The proof follows directly by induction and the iterative formula for $m$ and $v$ given in (2.3). Indeed, if $k=0$ , then the above formula recovers (2.3). So, suppose by induction that this formula is true for $k^{\prime}<k$ . We have that:

and the same also holds for $v_{k+1}$ , proving the assertion. ∎

Let assume that the learning rate $s$ satisfies $sr_{1}<1$ , $sq_{1}<1$ and that $p(t)\not\equiv 0$ . For all $k=0,\cdots,K-1$ , the following bound holds

From the discrete updates for $m_{k+1}$ and $v_{k+1}$ given by equation (2.3), we easily observe that the following identity holds true

Thus, dividing both side of the previous equality by $v_{j,k+1}v_{j,k}$ , we obtain for all $k\geq 0$

We consider two different cases: if $(1-sr_{k+1})^{2}-(1-sq_{k+1})\leq 0$ then,

On the other hand, if $(1-sr_{k+1})^{2}-(1-sq_{k+1})\geq 0$ , then from the update rule for $v_{k+1}$ , given by equation (2.3), and the fact that $v_{k+1}\geq(1-sq_{k+1})v_{k}$ , we get

Combining the upper bounds obtained in the two cases

We now prove that the previous upper bound is bounded by a constant depending only on the final time $T$ . For all $x\in(0,1)$ , we know that $-x\geq\log(1-x)\geq-x/(1-x)$ . Therefore

From assumption 8, the functions $q$ and $r$ are non-increasing. Then for all $t\in[t_{i},t_{i+1}]$ , we have $q_{i}\geq q(t)\geq q_{i+1}$ and similarly for $r$ . Integrating over $[t_{i},t_{i+1}]$ and summing from zero to $K-1$ gives

Similarly integrating over $[t_{i},t_{i+1}]$ and summing from one to $K-1$ gives

Finally, from the previous estimates, we obtain a moment bound on $\theta$ . This estimate is important since the bound only depends on the final time $T$ and the norm of the initial solution but not on the norm of $\theta$ itself.

Then, from Proposition B.2, if $p(t)\not\equiv 0$ we conclude that there exists a constant $C(T)>0$ , such that for all $s$

Using the previous estimates given by Propositions B.1, B.2 and B.3 and the fact that $f$ is $C^{2}$ , we obtain the following bounds for the solution of the numerical scheme (2.3).

We are ready to prove the main Theorem of this section.

B.2. Proof of Theorem 2.4

we get by definition of $E^{m}_{k}$ and from the triangle inequality

Similar reasoning holds for $v$ and there exist two constants $N_{1}$ and $N_{2}$ independent of $s$ , but depending on $T$ and the initial condition, such that

It remains to consider the discretization error for $\theta$ . We have

The discretization error will be expressed in terms of $m(u)/\sqrt{v(u)+\varepsilon}-m_{k-1}/\sqrt{v_{k-1}+\varepsilon}$ which we decomposed as follow

From Propositions A.5 and B.2, we deduce that there exists three constants $C_{1},C_{2},C_{3}$ , depending on $\varepsilon,T$ and the initial data but not on $s$ , such that

Applying this formula recursively, we obtain for all $k=0,\cdots,K$

Appendix C Proof of Theorem 3.1

The proof follows from a topological study of the vector-field associated to the ODE (2.1) together with the control given by the Lyapunov energy (2.4) studied below. We first introduce the basic notions from qualitative theory of ODE’s in $\S\S$ C.1 necessary for this work, before turning to the main proof in $\S\S$ C.3 (specialists may go directly to the proof). The $\S\S$ C.2 concerns the simple example of the gradient flow.

We start introducing some of the concepts of qualitative theory of ODE’s used in this work. Consider the Cauchy problem:

An autonomous system can be described by a vector-field. In this case, we have:

Now, let us assume that a solution $\mathbf{y}(\tau)$ of the differential equation is defined for every $\tau\in[t_{0},\infty[$ (for the case of ODE (2.1), this is proved in Theorem 5.2). In order to study the asymptotic behaviour of $\mathbf{y}(\tau)$ when $\tau\to\infty$ , we consider its topological limit, that is:

In particular, note that if $\mathbf{y}(\tau)\to p$ when $\tau\to\infty$ , then $\overline{\mathbf{y}([\tau,\infty))}=\mathbf{y}([\alpha,\infty))\cup\{p\}$ for every $\alpha>\tau_{0}$ , so that $\omega(\mathbf{y}(\tau))=\{p\}$ . Classical Poincaré-Bendixson theory provides a topological description of the set $\omega(\mathbf{y}(\tau))$ . There are three crucial Properties used in this work:

$\omega(\mathbf{y}(\tau))$ is either empty, or an union of orbits of $\partial$ ;

The $\omega$ -limit of the orbits in $\omega(\mathbf{y}(\tau))$ must be contained in $\omega(\mathbf{y}(\tau))$ , that is, $\omega(\omega(\mathbf{y}(\tau)))\subset\omega(\mathbf{y}(\tau))$ ;

If $\mathbf{y}(\tau)$ is bounded (that is, its image is bounded) then $\omega(\mathbf{y}(\tau))$ is non-empty compact and connected set.

In what follows we use these three properties in order to study the asymptotic behaviour of the orbits of ODE (2.1).

C.2. Simplified example: Study of the gradient flow

which is an autonomous differential equation. In this case, the analogue of the Lyapunov energy (2.4) is given by:

We are now ready to turn to the Poincaré-Bendixson approach of this problem. Since the system is autonomous, we can consider the vector-field associated to the gradient system, which is given by:

We now study the asymptotic behaviour of the the trajectory via its topological limit $\omega(\theta(t))$ , under the assumption that $\theta(t)$ is bounded c.f assumption 3. It follows from property C.1(c) that $\omega(\theta(t))$ is non-empty, and from property C.1(a) that it is the union of orbits. Now, since the function $E(\theta(t))$ is continuous, monotone and bounded (because $\theta(t)$ is bounded), we get that the limit

exists. It follows from property C.1(b), furthermore, that $\omega(\theta(t))\subset(E(\theta)=E_{\infty})$ , and this can only happen if the derivative of $E$ along any orbit in $\omega(\theta(t))$ are zero. By the derivative expression of the derivative of $E$ , we get that $\omega(\theta(t))\subset(\nabla f(\theta)=0)$ . We easily conclude that $f(\theta(t))$ converges to a critical value of $f$ .

C.3. Proof of Theorem 3.1

and note that the above vector-field is kept autonomous. We recall that the time of the associated differential equation is now denoted by $\tau$ (that is, a solution of this vector field is a curve $\mathbf{y}(\tau)=(\theta(\tau),m(\tau),v(\tau),s(\tau))$ such that $\dot{\mathbf{y}}(\tau)=\partial(\mathbf{y}(\tau))$ ). We now fix an orbit

with initial conditions $\mathbf{y}(\tau_{0})=(\theta(\tau_{0}),m(\tau_{0}),v(\tau_{0}),1/\tau_{0})$ . By the Lemma A.1, we know that $\mathbf{y}(\tau)$ is bounded and $v(\tau)>0$ for all $\tau\in[\tau_{0},\infty)$ . Denote by $\omega(\mathbf{y}(\tau))$ the topological limit of $\mathbf{y}(\tau)$ . By assumption 3 we know that $\theta(\tau)$ is bounded, and by Lemma A.1 we conclude that $m(\tau)$ and $v(\tau)$ are also bounded. It follows from property C.1(c) that $\omega(\mathbf{y}(\tau))$ is non-empty, and from property C.1(a) that it is the union of orbits of $\partial$ . Furthermore, from the expression $s=1/t$ (and the fact that the solutions are defined for all $\tau\in[\tau_{0},\infty)$ , which implies that $t$ takes all values in $[\tau_{0},\infty)$ ) we know that $\omega(\mathbf{y}(\tau))\subset(s=0)$ .

We now consider the energy functional $E$ given in (2.4) but in this new coordinate system. More precisely, consider the functional:

which by assumption 4 is everywhere well-defined (because $H(0)>0$ ). It follows from direct computation that:

which is everywhere non-positive by assumption 2 and 4. Now, since $E(\mathbf{x}(\tau))$ is bounded from below (because $E$ is continuous and $\mathbf{y}(\tau)$ is bounded), we conclude that the limit:

exists. In particular, it follows from property C.1(b) that $\omega(\mathbf{y}(\tau))\subset(\widetilde{E}(\mathbf{y})=\widetilde{E}_{\infty})$ . This implies that $\omega(\mathbf{y}(\tau))$ must be contained in the set of zero derivative of $\widetilde{E}(\mathbf{y})$ . By assumption 4 this implies that $\omega(\mathbf{y}(\tau))\subset(m=0)$ . Now note that:

and since $H(0)\neq 0$ (assumption 4), we conclude that $\omega(\mathbf{y}(\tau))\subset(\nabla f(\theta)=0)$ . Finally, if either $P(s)\equiv Q(s)\equiv 0$ or $Q(0)>0$ , by the expression:

we conclude that $\omega(\mathbf{x}(\tau))\subset(v=0)$ . We conclude easily.

Appendix D Proof of Theorem 3.2

The proof follows from central-stable manifold theory applied to the singular points of the vector-field (C.1). We first recall the version of the central-stable manifold Theorem in $\S\S$ D.1 necessary for this work, before proving Theorem 3.2 in $\S\S$ D.3 (specialists may go directly to the proof). The $\S\S$ D.2 concerns the simpler example of the gradient flow.

In order to explain the notion of the central-stable manifold, let us consider the following-example:

Note that the only singular point (that is, a point where the solution is constant in time) of this system is $(x,y)=(0,0)$ . Now, consider the Taylor expansion of this system at the origin:

and note that the linearisation of the system at the origin has one eigenvalue equal to $1$ (in particular, positive) and another equal to . As we will see, the existence of an eigenvalue which is positive guarantee’s the existence of an unstable manifold, which is crucial in order to prove the existence of a proper sub-manifold $\Sigma$ , the so-called central-stable manifold, which contains all orbits which converge to $(0,0)$ .

More precisely, a solution of the ODE with $t_{0}=0$ and initial condition $(x_{0},y_{0})\neq(0,0)$ is given by:

There may exist a proper sub-manifold, nevertheless, where the unstable effect of the unstable manifold does not influence that dynamics. In this example it is given by $\Sigma=(y=0)$ and we call it the central-stable manifold. Note that every solution $(x(t),y(t))$ which converges to $(0,0)$ (or, more generally, that does not diverge) are contained in $\Sigma$ (but $\Sigma$ may contain solutions which diverge). The strong regularity of the central stable manifold in this example is an accident due to the expression of the differential equation; a more intriguing example is given by the Euler equation $(x-y)\partial_{x}+y^{2}\partial_{y}$ , but we won’t enter this discussion in here.

We are ready to present the main result of this section, which formalizes the above considerations for general differential equations. The following result is a local version of [40, Ch. 1 Thm 4.2], by using the cut-off technique given in [40, Ch. 1 Lem. 3.1]; c.f. [40, Ch. 1, Thm 1.1 and 3.2].

The manifold $\Sigma$ is invariant by the differential equation everywhere over $U$ ;

The manifold contains the origin and has dimension at most $n-1$ ;

If $\mathbf{x}_{0}\in U\setminus\Sigma$ , then there exists $\widetilde{t}_{0}>t_{0}$ such that $\mathbf{x}(\widetilde{t}_{0})\notin U$ , where $\mathbf{x}(t)$ denotes the solution of the differential equation with initial condition $\mathbf{x}(t_{0})=\mathbf{x}_{0}$ .

Global versions of this result do exist, but they demand globally Lipschitz assumptions, and a control of the Lipschitz constant in terms of the linearisation $A$ of the system, see [40, Ch. 1, Thm 1.1]. We finish this section with an example to illustrate why, in general, the result is only local.

and let us consider the orbits of the differential equation whose limit set contain the singularity $(\pi,0)$ . Note the following contrast:

Local: consider a (very) small neighborhood $U$ of $(\pi,0)$ , then the only solutions which contain $(\pi,0)$ in their limit set have initial conditions in:

and all other solutions “leave” $U$ in finite time.

Global: Every solution $(x(t),y(t))$ with initial condition

converges to the set $\{-\pi\leq x\leq\pi,\,y^{2}=\sin(x)^{2}\}$ , which contains the singular point $(\pi,0)$ .

It follows that there are global convergence behaviours which are not controlled by the locally defined central-stable manifold.

D.2. Simplified example: Study of the gradient flow

Consider the vector-field associated to the gradient descent (see $\S\S$ C.2):

and note that the singular points are the critical values of $f$ , that is,

We also know from $\S\S$ C.2 that if $\theta(t)$ is a bounded solution, then $f(\theta(t))\to f_{\ast}$ converges to a critical value of $f$ because $\omega(\theta(t))\subset(\nabla f(\theta)=0)=Sing(\partial)$ . Now, let us consider:

By the assumption 5(b), the set $B$ is discrete and, therefore, a countable union of isolated points. So, let us consider the set:

It follows from Property C.1(3) that $\omega(\theta(t))$ is a connected set, so that:

We now inquire about the size (in terms of measure theory) of $S_{0}$ . In order to study this, let us start with a local analysis for a fixed $\theta_{\ast}\in B$ . Consider the linearisation of the vector-field $\partial$ at a singular point $\theta_{\ast}\in B$ , which is given by

where $\mathcal{H}_{f}$ is the Hessian of $f$ . By the assumption 5(a), the linearisation $A$ has one strictly positive eigenvalue, so we may use the central-manifold theory. More precisely, by Theorem D.1, there exists a neighbourhood $U_{\theta_{\ast}}$ of $\theta_{\ast}$ , and a proper sub-manifold $\Sigma_{\theta_{\ast}}\subset U_{\theta_{\ast}}$ such that every orbit in $U_{\ast}$ which converges to $\theta_{\ast}$ must be contained in $\Sigma_{\theta_{\ast}}$ . Note that the Lebesgue measure of $\Sigma_{\theta_{\ast}}$ is zero. Consider the set $\Sigma$ given by the union of all orbits with initial conditions in $\Sigma_{\theta_{\star}}$ , for every $\theta_{\star}\in B$ . Since $B$ is a countable set, we conclude that the Lebesgue measure of $\Sigma$ must be zero (since each $\Sigma_{\theta_{\ast}}$ has Lesbeague measure zero). Now, since $\theta_{\ast}$ is an isolated singularity of $\partial$ and the $\omega$ -limit of an arbitrary orbit $\theta(t)$ with initial condition in $S_{0}$ is connected, we conclude that if $\omega(\theta(t))=\theta_{\ast}$ , then $\omega(t)\subset\Sigma_{\theta_{\star}}$ for $t>>t_{0}$ . It easily follows that $S_{0}\subset\Sigma$ , and we conclude that $S_{0}$ has measure zero.

D.3. Proof of Theorem 3.2

Recall the vector field $\partial$ defined in (C.1), which describes the ODE (2.1). We consider the set:

is a countable union of isolated points, all of each have the form $\mathbf{y}_{\ast}=(\theta_{\star},0,0,0)$ , where $\theta_{\star}\in B$ . We now consider the set:

It follows from Property C.1(3) that $\omega(\mathbf{y}(\tau))$ is a connected set, so that:

We now make a local argument valid for each singular point in $C$ in order to show that $S$ is locally a sub-manifold; indeed, fix $\mathbf{y}_{\ast}\in S$ . Consider the linearization of $\partial$ at the singular point $\mathbf{y}_{\ast}=(\theta_{\star},0,0,0)$ , which is the $3d+1$ square matrix:

where $Id$ denotes the Identity of a $d$ -square matrix, and $\mathcal{H}_{f}(\theta_{\star})$ is the Hessian of $f$ at $\theta_{\star}$ . It follows from direct computation that the eigenvalues $\lambda$ of this matrix are: with order $1$ , $-Q(0)$ with order $d$ and the solutions of the quadratic equations:

where $\{\eta_{1},\ldots,\eta_{d}\}$ are the eigenvalues of $\mathcal{H}_{f}(\theta_{\star})$ . By assumption 5, we can suppose without loss of generality that $\eta_{1}<0$ , and we easily conclude by equation (D.1) that there exists one strictly positive eigenvalue $\lambda$ of $Jac(\partial)(\mathbf{y}_{\ast})$ . By Theorem D.1, there exists an open neighbourhood $U_{\mathbf{y}_{\ast}}$ of $\mathbf{y}_{\ast}$ and a $C^{1}$ manifold $\Sigma_{\mathbf{y}_{\ast}}\subset U_{\mathbf{y}_{\ast}}$ such that every orbit $\mathbf{y}(\tau)$ with initial condition in $U_{\mathbf{y}_{\ast}}\setminus\Sigma_{\mathbf{y}_{\ast}}$ , leaves $U_{\mathbf{y}_{\ast}}$ in finite time. Note that the Lebesgue measure of $\Sigma_{\mathbf{y}_{\ast}}$ is zero. Consider the set $\Sigma$ given by the union of all orbits with initial conditions in $\Sigma_{\mathbf{y}_{\star}}$ , for every $\mathbf{y}_{\star}\in C$ . Since $C$ is a countable set, we conclude that the Lebesgue measure of $\Sigma$ must be zero (since each $\Sigma_{\mathbf{y}_{\ast}}$ has Lesbeague measure zero). Now, since $\mathbf{y}_{\ast}$ is an isolated singularity of $\partial$ and the $\omega$ -limit of an arbitrary orbit $\mathbf{y}(\tau)$ with initial condition in $S$ is connected, we conclude that if $\omega(\mathbf{y}(\tau))=\mathbf{y}_{\ast}$ , then $\mathbf{y}(\tau)\subset\Sigma_{\mathbf{y}_{\star}}$ for $\tau>>\tau_{0}$ . It easily follows that $S\subset\Sigma$ , and we conclude that $S$ has measure zero.

Appendix E Exploring a (generalized) Lyapunov

In what follows we prove Theorem 3.4 in $\S\S$ E.1, and we extend the study in $\S\S$ E.2. In this section we do not introduce the techniques used, since they are standard in computer science.

Let $\theta_{\star}$ be a minimum point of $f$ (which exists by the convexity Assumption 6). We recall that we consider the energy functional (3.1) given by:

This functional is used as a Lyapunov function to prove convergence to a neighborhood of the global minimum. We first compute its time derivative and we find conditions on the functions $\mathcal{B}$ and $\mathcal{C}$ , as well as the coefficients $h,p,q,r$ , so that $\frac{d}{dt}\mathcal{E}$ is bounded. The conditions must also guarantee that $\mathcal{E}$ is positive. From the convexity assumption on the objective function $f$ , we get

Next, we derive each term of $\mathcal{E}_{2}$ .

By adding all of the above computations, we get that:

if all the following sufficient conditions are satisfied

It is now easy to see that equations (E.3), (E.4) and (E.5) are equivalent to the choices of $\mathcal{A}$ , $\mathcal{B}$ and $\mathcal{C}$ chosen in subsection 3.3, and that assumption 7 implies inequalities (E.6) and (E.7) are satisfied. It is now immediate from the Fundamental Theorem of calculus (and the fact that $\mathcal{E}_{2}\geq 0$ ) that:

which proves the first part of the Theorem. Next, under assumption 3, Lemma A.1 implies that there exists a finite constant:

Now, from the expression of ODE (2.1) in $v$ we get:

The second inequality now easily follows from the fact that $v(t)$ is bounded by Lemma A.1.

E.2. Proof of Corollary 4.4

We may also consider the slightly more general energy functional:

where $\mathcal{D}(t)$ is a positive function. If we assume that $\mathcal{D}(t)$ is bounded, we are able to follow the same reasoning of the previous section. In this case, we need to add the sufficient condition $\mathcal{D}(t)^{\prime}\leq 0$ , and equality (E.5) and inequality (E.7) are now given by:

while the equations for $\mathcal{A}(t)$ and $\mathcal{C}(t)$ are unchanged. Since $\mathcal{D}(t)$ has negative derivative, in general, this computation can not lead to a stronger convergence rate than the one obtained in the previous section. Nevertheless, it does allow one to obtain convergence rates for parameters which are inaccessible in the previous section. Indeed, using this more general energy functional, we prove a convergence result for Nesterov when $0<r<3$ :

Let $t_{0}=1$ and $\mathcal{D}(t)=t^{-\alpha}$ for some positive $\alpha$ which satisfies $2>\alpha>1-r$ . Then:

In other words, it is enough to consider $\alpha=2-2r/3$ for every $0<r<3$ . This implies that $f(\theta(t))\to f_{\star}$ with rate of convergence:

Introduction

Presentation of the model

2. Presentation of the continuous time model

3. An Energy functional of ODE (2.1) and a natural assumption

4. Basic Properties of ODE (2.1)

Convergence analysis

2. Avoiding local maximum and saddles

3. Rate of convergence

4. Convergence analysis of ODE (2.2)

Convergence results: application to first order algorithms

2. Adam

2.2. Adam without rescaling differential equation

2.3. Convergence of Adam

3. AdaFom

4. Heavy Ball

5. Nesterov

6. Adagrad

7. RMSProp

Considerations and guidelines for adaptive algorithms

3. Convergence properties of Adam depend on the class of loss functions

4. On future algorithms

Conclusion and final discussion

Appendix A Existence and uniqueness of solutions

A.2. Gronwall’s Lemma

A.3. A priori estimates and global solution

A.5.2. Identification of the limit and uniqueness of the solution

A.5.3. Proof of Proposition A.9

Appendix B Convergence of the Euler discretization

B.2. Proof of Theorem 2.4

Appendix C Proof of Theorem 3.1

C.2. Simplified example: Study of the gradient flow

C.3. Proof of Theorem 3.1

Appendix D Proof of Theorem 3.2

D.2. Simplified example: Study of the gradient flow

D.3. Proof of Theorem 3.2

Appendix E Exploring a (generalized) Lyapunov

E.2. Proof of Corollary 4.4

References