First-order methods almost always avoid saddle points: the case of vanishing step-sizes

Ioannis Panageas, Georgios Piliouras, Xiao Wang

Introduction

Non-convex optimization has been studied extensively the last years and has been one of the main focuses of Machine Learning community. The reason behind the interest of ML community is that in many applications of interest, one has to deal with the optimization of a non-convex landscape. One of the key obstacles of non-convex optimization is the saddle points (which can outnumber the local minima ) and avoiding them is a fundamental problem .

Recent progress has shown that under mild regularity assumptions on the objective function, first-order methods such as gradient descent can provably avoid the so-called strict saddle pointsThese are saddle points where the Hessian of the objective admits at least one direction of negative curvature. Such property has been shown to hold in a wide range of objective functions, see and references therein..

In particular, a unified theoretical framework is established in to analyze the asymptotic behavior of first-order optimization algorithms such as gradient descent, mirror descent, proximal point, coordinate descent and manifold descent. It is shown that under random initialization, the aforementioned methods avoid strict saddle points almost surely. The proof exploits a powerful theorem from the dynamical systems literature, the so-called Stable-manifold theorem (see supplementary material for a statement of this theorem). For example, given a C2C^{2} (twice continuously differentiable function) ff with LL-Lipschitz gradient, gradient descent method

avoids strict saddle points almost surely, under the assumption that the stepsize is constant and 0<α<1L0<\alpha<\frac{1}{L}. The crux of the proof in is the use Stable-manifold theorem for the time-homogeneousThis means that gg does not depend on time. dynamical system xk+1=g(xk)\mathbf{x}_{k+1}=g(\mathbf{x}_{k}). Stable-manifold theorem implies that the dynamical system gg avoids its unstable fixed points and with the fact that the unstable fixed points of the dynamical system gg coincide with the strict saddles of ff the claim follows.

In many applications/algorithms however the stepsize is adaptive or vanishing/diminishing (meaning limkαk=0,\lim_{k}\alpha_{k}=0, e.g., αk=1k\alpha_{k}=\frac{1}{k} or 1k\frac{1}{\sqrt{k}}). Such applications include stochastic gradient descent (see for analysis of SGD for convex functions), urn models and stochastic approximation , gradient descent , online learning algorithms like multiplicative weights update (which is an instantiation of Mirror Descent with entropy regularizer). It is also important to note that the choice of the stepsize is really crucial in the aforementioned applications as changing the stepsize can change the convergence properties (transition from convergence to oscillations/chaos ) and/or the rate of convergence .

The proof in does not carry over when the stepsize depends on time step, because the Stable-manifold theorem is not applicable. Yet it remains an open question whether the results of hold for vanishing step-sizes and the authors stated in as an open question. This work resolves this question in the affirmative. Our main result is stated below informally.

Gradient Descent, Mirror Descent, Proximal point and Manifold descent, with vanishing step-size αk\alpha_{k} of order Ω(1k)\Omega\left(\frac{1}{k}\right) avoid the set of strict saddle points (isolated and non-isolated) almost surely under random initialization.

The paper is organized as follows: In Section 2 we give important definitions for the rest of the paper, in Section 3 we provide intuition and technical overview of our results, in Section 4 we show a new Stable-manifold theorem that is applicable to a class of time non-homogeneous dynamical systems and finally in Section 5 we show how this new manifold theorem can be applied to Gradient descent, Mirror Descent, Proximal point and Manifold Descent.

Notation.

Preliminaries

In this section we provide all necessary definitions that will be needed for the rest of the paper.

We call a dynamical system xk+1=g(xk)x_{k+1}=g(x_{k}) as time homogeneous since gg does not on step kk. Furthermore, we call a dynamical system xk+1=g(k,xk)x_{k+1}=g(k,x_{k}) time non-homogeneous as gg depends on kk.

A point x\mathbf{x}^{*} is a critical point of ff if f(x)=0\nabla f(\mathbf{x}^{*})=0.

A critical point is a local minimum if there is a neighborhood UU around x\mathbf{x}^{*} such that f(x)f(x)f(\mathbf{x}^{*})\leq f(\mathbf{x}) for all xU\mathbf{x}\in U, and a local maximum if f(x)f(x)f(\mathbf{x}^{*})\geq f(\mathbf{x}).

A critical point is a saddle point if for all neighborhoods UU around x\mathbf{x}^{*}, there are x,yU\mathbf{x},\mathbf{y}\in U such that f(x)f(x)f(y)f(\mathbf{x})\leq f(\mathbf{x}^{*})\leq f(\mathbf{y}).

A critical point x\mathbf{x}^{*} is isolated if there is a neighborhood UU around x\mathbf{x}^{*}, and x\mathbf{x}^{*} is the only critical point in UU.

This paper will focus on saddle points that have directions of strictly negative curvature, that is the concept of strict saddle.

A critical point x\mathbf{x}^{*} of ff is a strict saddle if λmin(2f(x))<0\lambda_{\min}(\nabla^{2}f(\mathbf{x}^{*}))<0 (minimum eigenvalue of the Hessian computed at the critical point is negative).

Let X\mathcal{X}^{*} be the set of strict saddle points of function ff and we follow the Definition 2 of for the global stable set of X\mathcal{X}^{*}.

Given a dynamical system (e.g., gradient descent xk+1=xkαkf(xk)\mathbf{x}_{k+1}=\mathbf{x}_{k}-\alpha_{k}\nabla f(\mathbf{x}_{k}))

the global stable set Ws(X)W^{s}(\mathcal{X}^{*}) of X\mathcal{X}^{*} is the set of initial conditions where the sequence xk\mathbf{x}_{k} converges to a strict saddle. This is defined as:

Moreover, z\mathbf{z} is called a fixed point of the system (1) if z=g(k,z)\mathbf{z}=g(k,\mathbf{z}) for all natural numbers kk.

Intuition and Overview

In this section we will illustrate why gradient descent and related first-order methods do not converge to saddle points, even for time varying/vanishing step-sizes αk\alpha_{k} of order Ω(1k)\Omega\left(\frac{1}{k}\right). Consider the case of a quadratic, f(x)=12xTAxf(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}A\mathbf{x} where A=diag(λ1,...,λd)A=\mathop{\mathbf{diag}}(\lambda_{1},...,\lambda_{d}) is a d×dd\times d, non-singular, diagonal matrix with at least a negative eigenvalue. Let λ1,...,λj\lambda_{1},...,\lambda_{j} be the positive eigenvalues of AA (the first jj) and λj+1,...,λd\lambda_{j+1},...,\lambda_{d} be the non-positive ones. It is clear that x=0\mathbf{x}^{*}=\mathbf{0} is the unique critical point of function ff and the Hessian 2f\nabla^{2}f is AA everywhere (and hence at the critical point). Moreover, it is clear that x\mathbf{x}^{*} is a strict saddle point (not a local minimum).

Gradient descent with step-size αk\alpha_{k} (it holds that αk0\alpha_{k}\geq 0 for all kk and limkαk=0\lim_{k\to\infty}\alpha_{k}=0) has the following form:

Assuming that x0\mathbf{x}_{0} is the starting point, then it holds that xk+1=(t=0k(IαktA))x0\mathbf{x}_{k+1}=\left(\prod_{t=0}^{k}(I-\alpha_{k-t}A)\right)\mathbf{x}_{0}. We conclude that

We examine when it is true that limkxk=x\lim_{k\to\infty}\mathbf{x}_{k}=\mathbf{x}^{*}. It is clear that t=0(1λαt)=et=0ln(1λαt),\prod_{t=0}^{\infty}(1-\lambda\alpha_{t})=e^{\sum_{t=0}^{\infty}\ln(1-\lambda\alpha_{t})}, and has the same convergence properties as

For λ>0\lambda>0, the term (3) converges to zero if and only if t=0αt=+\sum_{t=0}^{\infty}\alpha_{t}=+\infty which is true if αt\alpha_{t} is Ω(1t)\Omega\left(\frac{1}{t}\right). Moreover, for λ=0\lambda=0 it holds that the term (3) remains a constant (independently of the choice of stepsize αk\alpha_{k}) and for λ<0\lambda<0 it holds that the term (3) diverges for αt\alpha_{t} to be Ω(1t)\Omega\left(\frac{1}{t}\right). Therefore, for αk\alpha_{k} being Ω(1k)\Omega\left(\frac{1}{k}\right) we conclude that limkxk=0\lim_{k\to\infty}x_{k}=\mathbf{0} whenever the initial point x0\mathbf{x}_{0} satisfies x0i=0x_{0}^{i}=0 (ii-th coordinate of x0\mathbf{x}_{0}) for λi0\lambda_{i}\geq 0.

1italic-ϵO\left(\frac{1}{k^{1+\epsilon}}\right)). In the case αk\alpha_{k} is a sequence of stepsizes that converges to zero with a rate 1k1+ϵ\frac{1}{k^{1+\epsilon}} for any ϵ>0\epsilon>0 (example 1k2,12k\frac{1}{k^{2}},\frac{1}{2^{k}} etc), then it holds that t=0αk\sum_{t=0}^{\infty}\alpha_{k} converges and hence in our example above we conclude that limkxk\lim_{k\to\infty}\mathbf{x}_{k} exists, i.e., xk\mathbf{x}_{k} converges but not necessarily to a critical point.

The challenging part of proving our main result is the lack of generic theory for time non-homogeneous dynamical systems, either for continuous or discrete time. Moreover, as far as gradient descent, mirror descent, etc are concerned, the corresponding dynamical system that needs to be analyzed is more complicated when the objective function is not quadratic and the analysis of previous subsection does not apply.

Suppose we are given a function ff that is C2C^{2}, and 0\mathbf{0} is a saddle point of ff. The Taylor expansion of the gradient descent in a neighborhood of 0\mathbf{0} is as follows:

where η(k,0)=0\eta(k,\mathbf{0})=\mathbf{0} and η(k,x)\eta(k,\mathbf{x}) is of order o(x)o(\left\|\mathbf{x}\right\|) around 0\mathbf{0} for all naturals kk.

Due to the error term η(k,xk)\eta(k,\mathbf{x}_{k}), the approach for quadratic functions does not imply the existence of the stable manifold. Inspired by the proof of Stable-manifold theorem for time homogeneous ODEs, we prove a Stable-manifold theorem for discrete time non-homogeneous dynamical system (4). In words, we prove the existence of a manifold WsW^{s} that is not of full dimension (it has the same dimension as EsE^{s}, where EsE^{s} denotes the subspace that is spanned by the eigenvectors with corresponding positive eigenvalues of matrix 2f(0)\nabla^{2}f(\mathbf{0})).

To show this, we derive the expression of (2) for the general function ff to be:

Stable Manifold Theorem for Time Non-homogeneous Dynamical Systems

We start this section by showing the main technical result of this paper. This is a new stable manifold theorem that works for time non-homogeneous dynamical systems and is used to prove our main result (Theorem 1.1) for Gradient Descent, Mirror Descent, Proximal Point and Manifold Descent. The proof of this theorem exploits the structure of the aforementioned first-order methods as dynamical systems.

Let HH be a d×dd\times d real diagonal matrix with at least one negative eigenvalue, i.e. H=diag{λ1,...,λd}H=diag\{\lambda_{1},...,\lambda_{d}\} with λ1λ2...λs>0λs+1...λd\lambda_{1}\geq\lambda_{2}\geq...\lambda_{s}>0\geq\lambda_{s+1}\geq...\geq\lambda_{d} and assume λd<0\lambda_{d}<0. Let η(k,x)\eta(k,\mathbf{x}) be a continuously differentiable function such that η(k,0)=0\eta(k,\mathbf{0})=\mathbf{0} and for each ϵ>0\epsilon>0, there exists a neighborhood of 0\mathbf{0} in which it holds

We can generalize Theorem 4.1 to the case where matrix HH is diagonalizable and for any fixed point x\mathbf{x}^{*} (instead of 0\mathbf{0}, using a shifting argument). The statement is given below.

Dxg(k,x)=IαkG,D_{\mathbf{x}}g(k,\mathbf{x}^{*})=I-\alpha_{k}G, GG real diagonalizable with at least one negative eigenvalue;

Since GG is diagonalizable, there exists invertible matrix QQ such that G=Q1HQG=Q^{-1}HQ, hence QGQ1=H,QGQ^{-1}=H, where H=diag{λ1,...,λd}H=diag\{\lambda_{1},...,\lambda_{d}\} (i.e., HH is a diagonal matrix with entries λ1,...,λd\lambda_{1},...,\lambda_{d}). Consider the map z=φ(x)=Q(xx)\mathbf{z}=\varphi(\mathbf{x})=Q(\mathbf{x}-\mathbf{x}^{*}). φ\varphi induces a new dynamical system in terms of z\mathbf{z} as follows:

Multiplying by QQ from the left on both sides, we have

and the inverse transformation is given by

Applications

In this section, we apply Theorem 4.1 (or its corollary 4.2) to the four of the most commonly used first-order methods and we prove that each one of them avoids strict saddle points even with vanishing stepsize αk\alpha_{k} of order Ω(1k)\Omega\left(\frac{1}{k}\right).

is a time non-homogeneous dynamical system.

We need to verify that the Taylor expansion of g(k,x)g(k,\mathbf{x}) at x\mathbf{x}^{*} satisfies the conditions of Corollary 4.2. Condition 1 is obvious since the Hessian 2f(x)\nabla^{2}f(\mathbf{x}^{*}) is diagonalizable and has at least one negative eigenvalue. It suffice to verify condition 2. Consider the Taylor expansion of g(k,x)g(k,\mathbf{x}) in a neighborhood UU of x\mathbf{x}^{*}:

and then the differential of θ(k,x)\theta(k,\mathbf{x}) with respect to x\mathbf{x} is

From the Fundamental Theorem of Calculus and chain rule for multivariable functions, we have

the verification completes. By Corollary 4.2 and Corollary 4.3, we conclude that the stable set of strict saddle points has Lebesgue measure zero. ∎

2 Mirror Descent

where h(x)=argmaxzMz,xΦ(z)h(\mathbf{x})=\text{argmax}_{\mathbf{z}\in M}\langle\mathbf{z},\mathbf{x}\rangle-\Phi(\mathbf{z}).

We say that Φ\Phi is a mirror map if it satisfies the following properties.

RΦ\nabla_{R}\Phi diverges on the relative boundary of MM, that is limxMRΦ(x)=\lim_{x\rightarrow\partial M}||\nabla_{R}\Phi(x)||=\infty.

The fact that g(k,x)g(k,\mathbf{x}) satisfies the condition 1 of Corollary 4.2 follows from the proof of Proposition 10, , i.e. R2Φ(x)1R2f(x)\nabla^{2}_{R}\Phi(\mathbf{x}^{*})^{-1}\nabla^{2}_{R}f(\mathbf{x}^{*}) is similar to a symmetric linear operator (so diagonalizable) with at least one negative eigenvalue. Next, we verify that g(k,x)g(k,\mathbf{x}) satisfies the condition 2 of Corollary 4.2. From the Taylor expansion, we have

By the continuity of R2f\nabla^{2}_{R}f and R2Φ(x)1\nabla^{2}_{R}\Phi(\mathbf{x})^{-1}, the same argument as the proof of Theorem 5.1 implies that the condition 2 of Corollary 4.2 is satisfied. Combing Corollary 4.2 and Corollary 4.3, we conclude that the stable set of saddle points has Lebesgue measure zero. ∎

3 Proximal Point

The proximal point algorithm is given by the iteration

Different from the other First-order methods, the results is not a direct consequence of Corollary 4.3, but instead, we need to apply part of the proof of Theorem 4.1. It is still necessary to verify that the Taylor expansion of g(k,x)g(k,\mathbf{x}) at x\mathbf{x}^{*} satisfies condition 1 and 2 of Corollary 4.2. From the proof of Proposition 3, , g(k,x)+αkf(g(k,x))=xg(k,\mathbf{x})+\alpha_{k}\nabla f(g(k,\mathbf{x}))=\mathbf{x}. By implicit differentiation, Dg(k,x)+αk2f(g(k,x))Dg(k,x)=IDg(k,\mathbf{x})+\alpha_{k}\nabla^{2}f(g(k,\mathbf{x}))Dg(k,\mathbf{x})=I, and

At saddle point x\mathbf{x}^{*}, Dg(k,x)=(I+αk2f(g(k,x)))1Dg(k,\mathbf{x}^{*})=(I+\alpha_{k}\nabla^{2}f(g(k,\mathbf{x}^{*})))^{-1} that is diagonalizable since 2f(x)\nabla^{2}f(\mathbf{x}^{*}) is diagonalizable. Suppose under the matrix QQ, Q2f(x)Q1=HQ\nabla^{2}f(\mathbf{x}^{*})Q^{-1}=H that is diagonal. Then

where λi\lambda_{i} are the eigenvalues of HH. Notice that 11+αkλi=1αkλi1+αkλi\frac{1}{1+\alpha_{k}\lambda_{i}}=1-\frac{\alpha_{k}\lambda_{i}}{1+\alpha_{k}\lambda_{i}}, the stable-unstable decomposition in the proof of Corollary 4.1 holds. Furthermore, since αkΩ(1k)\alpha_{k}\in\Omega\left(\frac{1}{k}\right), αkλi1+αkλi\frac{\alpha_{k}\lambda_{i}}{1+\alpha_{k}\lambda_{i}} is also of Ω(1k)\Omega\left(\frac{1}{k}\right). To see this, we can assume αkλi1k1=1kkk1\alpha_{k}\lambda_{i}\geq\frac{1}{k-1}=\frac{1}{k}\cdot\frac{k}{k-1}, and then k1kαkλi1k\frac{k-1}{k}\alpha_{k}\lambda_{i}\geq\frac{1}{k} or (11k)αkλi1k\left(1-\frac{1}{k}\right)\alpha_{k}\lambda_{i}\geq\frac{1}{k}, and thus αkλi1k(1+αkλi)\alpha_{k}\lambda_{i}\geq\frac{1}{k}(1+\alpha_{k}\lambda_{i}), implying that αkλi1+αkλi1k\frac{\alpha_{k}\lambda_{i}}{1+\alpha_{k}\lambda_{i}}\geq\frac{1}{k}. So the proof for Lemma C.1 and Lemma C.2 holds for the existence of stable manifold of proximal point algorithm if condition 2 of Corollary 4.2 is satisfied. To verify this, we consider the Taylor expansion of g(k,x)g(k,\mathbf{x}) at x\mathbf{x}^{*} has the form of

Since ff is of C2C^{2}, g(k,x)g(k,\mathbf{x}) and 2f(x)\nabla^{2}f(\mathbf{x}) are continuous, and then θ(k,x)θ(k,y)αkϵxy\left\|\theta(k,\mathbf{x})-\theta(k,\mathbf{y})\right\|\leq\alpha_{k}\epsilon\left\|\mathbf{x}-\mathbf{y}\right\| follows from the same argument as the proof of Theorem 5.1. So the verification completes and by Corollary 4.2 and Corollary 4.3, we conclude that the stable set of strict saddle points is of Lebesgue measure zero. ∎

4 Manifold Gradient Descent

According to the proof of Proposition 8, , for vTxM\mathbf{v}\in T_{\mathbf{x}^{*}}M,

Let xxTxM\mathbf{x}-\mathbf{x}^{*}\in T_{\mathbf{x}^{*}}M, the Taylor expansion in the tangent space can be written as

Using equation 4, , PTxMD(PTxMfˉ)(x)=R2f(x)P_{T_{\mathbf{x}}M}D(P_{T_{\mathbf{x}}M}\nabla\bar{f})(\mathbf{x})=\nabla^{2}_{R}f(\mathbf{x}), which is exactly the Riemannian Hessian, and thus it is diagonalizable. So this verifies the condition 1 of Corollary 4.2. Furthermore, the Taylor expansion gives

The continuity of 2f\nabla^{2}f implies that for each ϵ>0\epsilon>0, there exist neighborhood of x\mathbf{x}^{*}, such that Dxθ(k,x)ϵ\left\|D_{\mathbf{x}}\theta(k,\mathbf{x})\right\|\leq\epsilon. Apply the argument in the proof of Theorem 5.1 (Fundamental Theorem of Calculus and Cauchy-Schwarz inequality), we can conclude that condition 2 of Corollary 4.2 is satisfied. then combing with Corollary 4.3, we have that the stable set of strict saddle points has measure (induced by metric RR) zero. ∎

Let xX\mathbf{x}^{*}\in\mathcal{X}^{*}, then f(x)=0\nabla f(\mathbf{x}^{*})=0, and g(k,x)=xg(k,\mathbf{x}^{*})=\mathbf{x}^{*}. To show that x\mathbf{x}^{*} is unstable, consider the differential

Since at x\mathbf{x}^{*}, f(x)=0\nabla f(\mathbf{x}^{*})=0, i.e. fxj=0\frac{\partial f}{\partial x_{j}}=0 for all jj, we have

Recall that (Rij)=(Rij)1\left(R^{ij}\right)=\left(R_{ij}\right)^{-1}, and as it is shown in , by the similar transformation under (Rij)12\left(R_{ij}\right)^{\frac{1}{2}}, we have

that is a symmetric matrix, so it is diagonalizable. And thus, Dx((Rij)f(x))D_{\mathbf{x}}\left(\left(R^{ij}\right)\cdot\nabla f(\mathbf{x}^{*})\right) is diagonalizable and has the same eigenvalue with (2fxixj)\left(\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}\right), meaning that it has negative eigenvalues. So the Riemmanian Hessian at x\mathbf{x}^{*} as at least one negative eigenvalue. The Taylor expansion of g(k,x)g(k,\mathbf{x}) at x\mathbf{x}^{*} is

By the continuity of (2fxixj(x))\left(\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}(\mathbf{x})\right), the same argument as the verification of condition 2 in the proof of Theorem 5.1 implies that θ(k,x)\theta(k,\mathbf{x}) satisfies the condition 2 of Corollary 4.2. Combining with Corollary 4.3, we conclude that the stable set of strict saddle points has measure (induced by metric RR) zero. ∎

Conclusion

In this paper, we generalize the results of for the case of vanishing stepsizes. We showed that if the stepsize αk\alpha_{k} converges to zero with order Ω(1k)\Omega\left(\frac{1}{k}\right), then gradient descent, mirror descent, proximal point and manifold descent still avoid strict saddles. We believe that this is an important result that was missing from the literature since in practice vanishing or adaptive stepsizes are commonly used. Our main result boils down to the proof of a Stable-manifold theorem 4.1 that works for time non-homogeneous dynamical systems and might be of independent interest. We leave as an open question the case of Block Coordinate Descent (as it also appears in ).

References

Appendix A Statement of theorems used and for completion

Let (X,d)(X,d) be a complete metric space, then each contraction map T:XXT:X\rightarrow X has unique fixed point.

Let xx^{*} be a fixed point for the CrC^{r} local diffeomorphism g:XXg:\mathcal{X}\to\mathcal{X}. Suppose that E=EsEuE=E_{s}\oplus E_{u}, where EsE_{s} is the span of the eigenvectors corresponding to eigenvalues of magnitude less than or equal to one of Dg(x)Dg(x^{*}), and EuE_{u} is the span of the eigenvectors corresponding to eigenvalues of magnitude greater than one of Dg(x)Dg(x^{*})Jacobian of function gg.. Then there exists a CrC^{r} embedded disk WloccsW^{cs}_{loc} of dimension dim(Es)dim(E^{s}) that is tangent to EsE_{s} at xx^{*} called the local stable center manifold. Moreover, there exists a neighborhood BB of xx^{*}, such that g(Wloccs)BWloccsg(W^{cs}_{loc})\cap B\subset W^{cs}_{loc}, and k=0gk(B)Wloccs\cap_{k=0}^{\infty}g^{-k}(B)\subset W^{cs}_{loc}.

Appendix B Lyapunov-Perron Method

whose linear approximation at 0\mathbf{0} is

By a change of coordinate system, AA is assumed to be decomposed to stable-unstable blocks respectively. Consider the operator TT defined as follows:

where x0\mathbf{x}_{0} is the initial point, U(t)U(t) and V(t)V(t) are integral operators from the block decomposition of AA. The stable manifold is the fixed point of TT following from the Banach fixed point theorem.

Appendix C Proof of Theorem 4.1

Denote A(m,n)=(IαmH)...(IαnH)A\left(m,n\right)=\left(I-\alpha_{m}H\right)...\left(I-\alpha_{n}H\right) for mnm\geq n, and A(m,n)=IA\left(m,n\right)=I if m<nm<n. Then the dynamical system can be written as

Since HH is diagonal, the matrix A(m,n)A\left(m,n\right) has the form of

where BkB_{k} and CkC_{k} are diagonal as well and corresponding to stable and unstable subspaces of IαkHI-\alpha_{k}H at . Using the same notation of denoting A(m,n)A\left(m,n\right), we denote

Let vv be a vector, we denote v+v^{+} the stable component of vv and vv^{-} the unstable component of vv. Then the solution (43) can be written in terms of stable and unstable components as

If xk+10x_{k+1}\rightarrow 0 as kk\rightarrow\infty, then xk+10x_{k+1}^{-}\rightarrow 0 as kk\rightarrow\infty. So we let kk\rightarrow\infty, the following limit must holds:

and then by taking limit as kk\rightarrow\infty,

where C(m,n)1C(m,n)^{-1} denotes the inverse of C(m,n)C(m,n). So the initial condition x0x_{0}, if written as a column vector, has the form of

Written as a column vecto, the solution of the dynamical system is of the form of

Plugging the equation (44) back to the above expression, we have

Use spectrum norm \left\|\cdot\right\| for matrices, we have

Denote λ\lambda the least negative eigenvalue, then the spectrum norm of C(k+1+i,k+1)1C(k+1+i,k+1)^{-1} is

Since the sequence αi\alpha_{i} is chosen to be small, we have

Using the inequality 1+xex1+x\leq e^{x}, we have

Since by assumption, αjΩ(1jp)\alpha_{j}\in\Omega\left(\frac{1}{j^{p}}\right) and λ<0\lambda<0, so 1αjλ1-\alpha_{j}\lambda is positive and bounded, i.e. 1<1αjλ<c1<1-\alpha_{j}\lambda<c. And then the following inequalities hold:

Multiplying by the negative number λ\lambda, we have

Combining with the inequality 82, we obtain

By definition of definite integral, we notice that

which implies that kexp(t1p)dt\int_{k}^{\infty}\exp(-t^{1-p})dt converges, so does the series 89. Since the incomplete Gamma function Γ(s,x)\Gamma(s,x) has the property

let s=11ps=\frac{1}{1-p} and x=k1px=k^{1-p} so that xs1=(k1p)11p1=kpx^{s-1}=(k^{1-p})^{\frac{1}{1-p}-1}=k^{p}, we have that

as kk\rightarrow\infty. This implies that RkR_{k} is bounded as kk\rightarrow\infty. ∎

Since B(k,i+1)B(k,i+1) is diagonal, denote λ\lambda the least positive eigenvalue of HH, by definition of B(k,i+1)B(k,i+1), we have that

Consider the difference between Sk+1S_{k+1} and SkS_{k}, we have

If Sk=1λS_{k}=\frac{1}{\lambda}, then Sk=Sk+11λS_{k}=S_{k+1}\equiv\frac{1}{\lambda}.

If Sk<1λS_{k}<\frac{1}{\lambda}, or equivalently 1λSk>01-\lambda S_{k}>0, then Sk+1Sk>0S_{k+1}-S_{k}>0, and SkS_{k} increases until Sk>1λS_{k}>\frac{1}{\lambda}.

So SkS_{k} decreases or increases to 1λ\frac{1}{\lambda} (meaning that SkS_{k} is bounded), or SkS_{k} oscillates around 1λ\frac{1}{\lambda}. Suppose that Sk<1λS_{k}<\frac{1}{\lambda} and Sk+1>1λS_{k+1}>\frac{1}{\lambda}, we have that

when kk is large so that αk<1λ\alpha_{k}<\frac{1}{\lambda}. Then in conclusion, SkS_{k} is bounded, and the proof completes. ∎

Denote (Tx)k+1+(Tx)_{k+1}^{+} and (Tx)k+1(Tx)_{k+1}^{-} the stable and unstable component of (Tx)k+1(Tx)_{k+1} respectively. We prove limk(Tx)k+1+=0\lim_{k\rightarrow\infty}(Tx)_{k+1}^{+}=0 and limk(Tx)k+1=0\lim_{k\rightarrow\infty}(Tx)_{k+1}^{-}=0 separately. 1. limk(Tx)k+1+=0\lim_{k\rightarrow\infty}(Tx)_{k+1}^{+}=0: According to the definition of TT in 57,

Since B(k,0)0\left\|B(k,0)\right\|\rightarrow 0 as kk\rightarrow\infty, it is enough to show that

as kk\rightarrow\infty. From the Lipschitz condition on η\eta, we have that

From the proof of Lemma C.2, we know that SkS_{k} is bounded, and thus Sk+1Sk0\left|S_{k+1}-S_{k}\right|\rightarrow 0 as kk\rightarrow\infty. Similar to proof of C.2, we have the following observation:

If Sk+1Sk>0S_{k+1}-S_{k}>0, then xk+1λSk>0\left|x_{k+1}\right|-\lambda S_{k}>0, or Sk<xkλS_{k}<\frac{\left|x_{k}\right|}{\lambda};

If Sk+1Sk<0S_{k+1}-S_{k}<0, then xk+1λSk<0\left|x_{k+1}\right|-\lambda S_{k}<0, or Sk>xkλS_{k}>\frac{\left|x_{k}\right|}{\lambda};

If Sk+1Sk=0S_{k+1}-S_{k}=0, then Sk=constantS_{k}=\text{constant}.

decreasing but Sk>xkλS_{k}>\frac{\left\|x_{k}\right\|}{\lambda},

oscillating around xkλ\frac{\left\|x_{k}\right\|}{\lambda}.

If SkS_{k} is of case 1, then limkSk\lim_{k}S_{k} exists. Suppose that this limit is positive, but since we have αk=\sum\alpha_{k}=\infty and

implying that SkS_{k}\rightarrow\infty. So we conclude that limkSk=0\lim_{k}S_{k}=0, contradicting to the fact that limkSk\lim_{k}S_{k} exists. So the limkSk\lim_{k}S_{k} must be if SkS_{k} is of case 1. If SkS_{k} is of case 2, then immediately lim infSk=0\liminf S_{k}=0. Suppose that lim supSk>0\limsup S_{k}>0. Since SkS_{k} decreases whenever Sk>xkλS_{k}>\frac{\left\|x_{k}\right\|}{\lambda} and SkS_{k} increases whenever Sk<xkλS_{k}<\frac{\left\|x_{k}\right\|}{\lambda}, we can find a subsequence SkmS_{k_{m}}, with Skm1<SkmS_{k_{m}-1}<S_{k_{m}}, converging to lim supSk\limsup S_{k} as mm\rightarrow\infty. But this is impossible since Skm1<xkλS_{k_{m}-1}<\frac{\left\|x_{k}\right\|}{\lambda} and then Skm10S_{k_{m}-1}\rightarrow 0 as mm\rightarrow\infty, which means limmSkm1Skm\lim_{m}\left|S_{k_{m}-1}-S_{k_{m}}\right| is positive, contradicting to the fact that limkSkSk+1=0\lim_{k}\left|S_{k}-S_{k+1}\right|=0. And thus, we have lim supkSk=0\limsup_{k}S_{k}=0, meaning that limkSk=0\lim_{k}S_{k}=0. So we conclude that either in case 1 or 2, the limit limkSk=0\lim_{k}S_{k}=0, which completes the proof of part 1. 2. limk(Tx)k+1=0\lim_{k\rightarrow\infty}(Tx)_{k+1}^{-}=0: According to the equation 57,

And from the Lipschitz condition of on η\eta, we have that

Since {xn}\{x_{n}\} converges to 0 as nn\rightarrow\infty, supn>kxn0\sup_{n>k}\left|x_{n}\right|\rightarrow 0 as kk\rightarrow\infty. And this completes the proof of part 2. ∎

There exists a real number δ>0\delta>0 such that the operator TT given by equation 57

From Lemma C.1 and Lemma C.2, we know that in equation (78),

Since B(k,0)B(k,0) is on the stable subspace and whose norm is calculated by

Then we can choose small positive ϵ\epsilon so that

and by the choice of ϵ\epsilon, we know that K<1K<1. Let δ>0\delta>0 be the radius corresponding to ϵ\epsilon so that the Lipschitz condition is satisfied. Combining 78, 103 and 104, we have that

Since above kk is taken arbitrarily, we conclude that

If we consider the sequence xx as a sequence of functions with the initial condition as the variable, the general term xnx_{n} is written as xn(x0)x_{n}(x_{0}), then the equation (106) is written as

This means that if the some initial condition x0x_{0} goes to 0 through the discrete time process {xn(x0)}\{x_{n}(x_{0})\}, its stable and unstable component must satisfy following relation:

and the equation x0=Φ(x0+,x0)x_{0}^{-}=\Phi(x_{0}^{+},x_{0}^{-}) defines an implicit function x0=φ(x0+)x_{0}^{-}=\varphi(x_{0}^{+}) by the uniqueness of Banach fixed point. Next we will show that φ\varphi is differentiable with respect to x0+x_{0}^{+}. Since it is enough to show the function Φ(x0+,x0)\Phi(x_{0}^{+},x_{0}^{-}) is differentiable with respect to x0+x_{0}^{+}. And it is enough to show that each xn(x0)x_{n}(x_{0}), if considered as a function of initial condition x0x_{0}, is differentiable with respect to x0+x_{0}^{+}.

The solution xn(x0+,x0)x_{n}(x_{0}^{+},x_{0}^{-}) is of C1C^{1} with respect to x0+x_{0}^{+} provided η(n,x)\eta(n,x) is of C1C^{1}.

Proof. It is equivalent to show that xnx0,j\frac{\partial x_{n}}{\partial x_{0,j}}, j=1,..,dj=1,..,d, where dd is the dimension of stable vector space, exist and are continuous for small x0\left|x_{0}\right|. Let P+P^{+} and PP^{-} be the projection operators to the stable and unstable subspaces respectively, then the solution (with initial condition x0x_{0}) of the dynamical system can be written as

Let hh be a scalar and eje_{j} be the jjth standard basis. Denote

Plugging above identity to 107, we can compute the difference quotient q(k+1,x0,h)=xk+1(x0+hej)xk+1(x0)hq(k+1,x_{0},h)=\frac{x_{k+1}(x_{0}+he_{j})-x_{k+1}(x_{0})}{h}

Since for the solution x(n,x0)x(n,x_{0}), x(n,x0)0x(n,x_{0})\rightarrow 0 as nn\rightarrow\infty, and η(n,x)η(n,xˉ)ϵxxˉ\left\|\eta(n,x)-\eta(n,\bar{x})\right\|\leq\epsilon\left\|x-\bar{x}\right\|, we have that

Given δ>0\delta>0, h\left|h\right| can be chosen small that by the mean value theorem and the continuity of DηD\eta, we have

where xnx_{n}^{\prime} is a point on the line segment joining xn(x0+hej)x_{n}(x_{0}+he_{j}) and xn(x0)x_{n}(x_{0}). Since

so q(n,x0,h)\left\|q(n,x_{0},h)\right\| is bounded, denoted as

And then ΔnδM\left\|\Delta_{n}\right\|\leq\delta M. Define the operator as

Notice that the part of infinite sum converges to 0 as kk\rightarrow\infty, one can choose kk large enough so that the norm of the infinite sum to be small, and then we have for any small ϵ>0\epsilon^{\prime}>0, the supqψ\sup\left\|q-\psi\right\| satisfies

Where KK^{\prime\prime} is the bound from that Δi0|\Delta_{i}|\rightarrow 0 as ii\rightarrow\infty. One choose neighborhood small enough so that Kϵd<12K^{\prime}\epsilon d<\frac{1}{2} and then we have

Since η(k,xk)0\left\|\eta^{-}(k,x_{k})\right\|\rightarrow 0 as kk\rightarrow\infty due to αk\alpha_{k}, and Ck+1xk\left\|C_{k+1}x_{k}^{-}\right\|\rightarrow\infty as kk\rightarrow\infty by assumption that xkx_{k}^{-} is bounded away from . But this contradicts to the assumption xk+1x_{k+1}^{-} is bounded in UU. The proof completes. ∎