The Power of Normalization: Faster Evasion of Saddle Points

Kfir Y. Levy

Introduction

Owing to its efficiency, simplicity and intuitive interpretation, Gradient Descent (GD) and its numerous variants are often the method of choice in large-scale optimization tasks, including neural network optimization Bengio (2009), ranking Burges et al. (2005), matrix completion Jain et al. (2010), and reinforcement learning Sutton et al. (1999). Normalized Gradient Descent (NGD) is a less popular alternation of GD, which enjoys the same efficiency and simplicity.

Exploring the limitations of applying GD to non-convex tasks is an important an active line of research. Several phenomena have been found to prevent GD from attaining satisfactory results. Among them are local-minima, and saddle points. Local minima might entrap GD, preventing it from reaching a possibly better solution. Gradients vanish near saddle points, which causes GD to stall. This saddle phenomenon was recently studied in the context of deep learning, where it was established both empirically and theoretically that saddles are prevalent in such tasks, and cause GD to decelerate Saad and Solla (1995); Saxe et al. (2015); Dauphin et al. (2014); Choromanska et al. (2015). Several empirical studies suggest that arriving at local minima is satisfactory for deep learning tasks: in Dauphin et al. (2014), it is asserted that all local minima are of approximately the same quality as the global one. Additionally, in Choromanska et al. (2015) it is argued that global minima may cause overfitting, while local minima are likely to yields sufficient generalization. Very recently, evading saddles was found to be crucial in other non-convex applications, among them are complete dictionary recovery Sun et al. (2015), phase retrieval Sun et al. (2016), and Matrix completion Ge et al. (2016).

Our experimental study demonstrates the benefits of using Saddle-NGD for the setting of online tensor decomposition. The tensor decomposition optimization problem contains both local minima and saddle points with the interesting property that any local minimum is also a global minimum.

Non-convex optimization problems are in general NP-hard and thus most literature on iterative optimization methods focuses on local guarantees. The most natural guarantee to expect GD in this context is to approach a stationary point, namely a point where gradients vanish. This kind of guarantees is provided in Allen-Zhu and Hazan (2016); Ghadimi and Lan (2013), which focus on the stochastic setting. Approaching local minima while evading saddle points is much more intricate than ensuring stationary points. In Dauphin et al. (2014); Sun et al. (2016), it is demonstrated how to avoid saddle points through a trust region approach which utilizes second order derivatives. A perturbed version of GD was explored in Ge et al. (2015), showing both theoretically and empirically that it efficiently evades saddles. Very recently, it was shown in Agarwal et al. (2016) how to use second order methods in order to rapidly approach local minima. Nevertheless, this approach is only practical in settings where Hessian-vector products can be done efficiently.

NGD is well known to guarantee convergence in the offline settings of convex and quasi-convex optimization Nesterov (1984); Kiwiel (2001) . Recently, a stochastic version of NGD was explored in the context of stochastic quasi-convex optimization, Hazan et al. (2015).

Setting and Assumptions

We focus on the optimization of strict-saddle functions, a family of multi-modal functions which encompasses objectives that acquire saddle points. Interestingly, Ge et al. (2015) have found that tensor decomposition problems acquire this strict-saddle property.

λmin(2f(x))γ\lambda_{min}(\nabla^{2}f(x))\leq-\gamma

There exists a local minimum xx^{*} such that xxr\|x-x^{*}\|\leq r, and the function f(x)f(x) restricted to a 2r2r neighbourhood of xx^{*} is α\alpha-strongly-convex.

We also assume that ff is β\beta-smooth, meaning:

Lastly, we assume that ff has ρ\rho-Lipschitz Hessians:

Finally, we recall the definition of strong-convexity:

Saddle-NGD algorithm and Main Results

In Algorithm 1 we present Saddle-NGD, an algorithm that is adapted to handle strict-saddle functions. The main difference from GD is the use of the direction of the gradients rather than the gradients themselves. Note that most updates are noiseless, yet once every N0N_{0} rounds we add zero mean gaussian noise θnt,where ntN(0,Id)\theta n_{t},\text{where }n_{t}\sim\mathcal{N}(0,I_{d}), here IdI_{d} is the dd-dimensional identity matrix. The noisy updates ensure that once we arrive near a saddle, the direction with most negative eigenvalue will be sufficiently large. This in turn implies a fast evasion of the saddle.

Following is the main theorem of this paper:

In fact, since strict-saddle functions are strongly-convex in a 2r2r radius around local minima, our ultimate goal should be arriving rr-close to such minima. Then we could use the well known convex optimization machinery to rapidly converge. The following corollary formalizes the number of rounds required to reach rr-close to some local minimum:

In the Stochastic Optimization setting, an exact gradient for the objective function is unavailable, and we may only access noisy (unbiased) gradient estimates. This setting captures some of the most important tasks in machine learning, and is therefore the subject of an extensive study.

Our Saddle-NGD algorithm for offline optimization of strict-saddle functions could be extended to the stochastic setting. This could be done by using minibatches in order to calculate the gradient, i.e. instead of gt=f(xt)g_{t}=\nabla f(x_{t}) appearing in Algorithm 1, we use:

here {G(xt,ζi)}i\{\mathcal{G}(x_{t},\zeta_{i})\}_{i} are independent and unbiased estimates of f(xt)\nabla f(x_{t}). Except for this alternation, the stochastic version of Saddle-NGD is similar to the one presented in Algorithm 1. We have found that in order to ensure convergence in the stochastic case, a minibatch b=poly(1/η)b=\text{poly}(1/\eta) is required. This dependence implies that in the stochastic setting Saddle-NGD obtains no better guarantees than noisy-GD. We therefore omit the proof for the stochastic setting.

Nevertheless, we have found that in practice, the stochastic version of Saddle-NGD demonstrates a superior performance compared to noisy-GD. Even when we employ a moderate minibatch size. This is illustrated in Section 6 where Saddle-NGD is applied to the task of online tensor decomposition.

Analysis Overview

Our analysis of Algorithm 1 divides according to the three scenarios defined by the strict-saddle property. Here we present the main statements regarding the guarantees of the algorithm for each scenario, and provide a proof sketch of Theorem 3. For ease of notation we assume that we reach at each scenario at t=0t=0.

In case that the gradientNote the use of the notation gt=f(xt)g_{t}=\nabla f(x_{t}) here and in the rest of the paper. is sufficiently large, the following ensures us to improve by value within one step:

Suppose that g0βη\|g_{0}\|\geq\beta\sqrt{\eta}, and we use the Saddle-NGD Algorithm, then:

The next lemma ensures that once we arrive close enough to a local minimum we will remain in its proximity.

The next statement describes the improvement attained by Saddle-NGD near saddle points.

Theorem 3 is based on the above three lemmas. The full proof of the Theorem appears in Appendix A. Next we provide a short sketch of the proof.

Analysis

Here we prove the main statements regrading the three scenarios defined by the strict-saddle property. Section 5.1 analyses the scenario of large gradients (Lemma 5), Section 5.2 analyses the local-minimum scenario (Lemma 6), and Section 5.3 analyses the case of saddle points (Lemma 7). For brevity we do not always provide full proofs, which are deferred to appendix.

We will prove assuming a noisy update, i.e. x1=x0ηg^0+θn0x_{1}=x_{0}-\eta\hat{g}_{0}+\theta n_{0} and n0N(0,Id)n_{0}\sim\mathcal{N}(0,I_{d}) (the noiseless update case is similar). By the update rule:

2 Local Minimum

For brevity we will not prove Lemma 6, but rather state and prove a simpler lemma assuming all updates are noiseless; the proof of Lemma 6 regarding the general case appears in Appendix B.

Suppose that x0x_{0} is close to a local minimum xx^{*}, i.e., x0xr\|x_{0}-x^{*}\|\leq r, and g0βην\|g_{0}\|\leq\beta\sqrt{\eta}\leq\nu. Then the following holds for any t0t\geq 0:

Due to the local strong-strong convexity of ff around xx^{*}, we know that x0x1αg0βαη\|x_{0}-x^{*}\|\leq\frac{1}{\alpha}\|g_{0}\|\leq\frac{\beta}{\alpha}\sqrt{\eta}. In order to be consistent with the definition of strict-saddle property we choose η\eta such that βαηr\frac{\beta}{\alpha}\sqrt{\eta}\leq r.

The proof requires the following lemma regarding strongly-convex functions:

We are now ready to prove Lemma 8 by induction, assuming all updates are noiseless, i.e. xt+1=xtηg^tx_{t+1}=x_{t}-\eta\hat{g}_{t}. Note that the case t=0t=0 naturally holds, next we discuss the case where t1t\geq 1. First assume that xtxβαη\|x_{t}-x^{*}\|\geq\frac{\beta}{\alpha}\eta, the noiseless Saddle-NGD update rule implies:

here the first inequality uses Lemma 9, the second inequality uses gtβxtx\|g_{t}\|\leq\beta\|x_{t}-x^{*}\| which follows from smoothness, and the third inequality uses xtxβαη\|x_{t}-x^{*}\|\geq\frac{\beta}{\alpha}\eta.

For the case where xtxβαη\|x_{t}-x^{*}\|\leq\frac{\beta}{\alpha}\eta similarly to the above we can conclude that:

we use g^t(xtx)0\hat{g}_{t}^{\top}(x_{t}-x^{*})\geq 0, which follows by the local strong-convexity around xx^{*}. ∎

3 Saddle Points

We first provide some intuition regarding the benefits of using NGD rather than GD in escaping saddles. Later we present a short proof sketch of Lemma 7.

Lemma 7 states the decrease in function values attained by the Saddle-NGD near saddle points. Intuitively, Saddle-NGD implicitly performs an approximate power method over the Hessian matrix H0:=2f(x0)H_{0}:=\nabla^{2}f(x_{0}). Since the gradients near saddle points tend to be small the use of normalized gradients rather than the gradients themselves yields a faster improvement. Consider the minimization of a pure saddle function: F(x1,x2)=x12x22F(x_{1},x_{2})=x_{1}^{2}-x_{2}^{2}. As can be seen in Figure 1, the gradients are almost vanishing around the saddle point (0,0)(0,0). Conversely, the normalized gradients (Figure 1) are of a constant magnitude. This intuitively suggets that using NGD instead of (noisy) GD yields a faster escape of the saddle. Figure 1 compares between NGD and (noisy) GD for the pure saddle function; both methods are initialized in the proximity of (0,0)(0,0), and employ a learning rate of η=0.01\eta=0.01. As expected, NGD attains a much faster initial improvement. Later, when the gradients are sufficiently large, GD prevails. Since our goal is the optimization of a general family of functions where saddles behave like pure saddles only locally, we are mostly concerned about the initial local improvement in the case of a pure saddles. This renders NGD more appropriate than GD for our goal.

Let x0x_{0} be a point such that f(x0)βη\|\nabla f(x_{0})\|\leq\beta\sqrt{\eta}, and also let λmin(2f(x0))γ\lambda_{\min}(\nabla^{2}f(x_{0}))\leq-\gamma. Letting H0=2f(x0)H_{0}=\nabla^{2}f(x_{0}), it can be shown that the Saddle-NGD update rule implies:

The noise injections utilized by Saddle-NGD ensure that the additive term, μt\mu_{t}, does not interfere with the increase of the components related to the negative eigenvectors. Since μt\mu_{t} is bounded, a careful choice of the noise magnitude ensures that there is a sufficiently large initial component in these directions. ∎

Analysis

Before proving Lemma 7 we introduce some notation and establish several lemmas regarding the dynamics of the Saddle-NGD update rule near a saddle point.

Let {ei}i=1d\{e_{i}\}_{i=1}^{d} be an orthonormal eigen-basis of H0H_{0} with eigenvalues {λi}i=1d\{\lambda_{i}\}_{i=1}^{d}:

Also assume without loss of generality that e1e_{1} is the direction with the most negative eigenvector, i.e. λ1=γ\lambda_{1}=-\gamma. We represent each point in the eigen-basis of H0H_{0}, i.e.,

Denoting gt(i)=eigt,  nt(i)=eintg_{t}^{(i)}=e_{i}^{\top}g_{t},\;n_{t}^{(i)}=e_{i}^{\top}n_{t}; the NGD update rule: xt+1=xtηgtgt+θntx_{t+1}=x_{t}-\eta\frac{g_{t}}{\|g_{t}\|}+\theta n_{t}, translates coordinate-wise as follows:

The above enables to relate the gradient of the original function to the gradient of the approximation:

the above uses the ρ\rho-Lipschitzness of the Hessian.

Next we show that whenever gt2βη\|g_{t}\|\geq 2\beta\sqrt{\eta} then we have improved by value.

Suppose we are in a saddle, and g0βη\|g_{0}\|\leq\beta\sqrt{\eta} then for the first tt such that gt2βη\|g_{t}\|\geq 2\beta\sqrt{\eta}, the following holds w.p.1ξ\geq 1-\xi

The Lemma follows directly by Lemmas 14, 15. ∎

In Appendix C we provide the complete proofs for the statements that appear in this section.

Experiments

In many challenging machine learning tasks, first and second order moments are not sufficient in order to extract the underlying parameters of the problem; and higher order moments are required. Such problems include Gaussian Mixture Models (GMMs), Independent Component Analysis (ICA), Latent Dirichlet Allocation (LDA), and more. Tensors of order greater than 22 may capture such high order statistics, and their decomposition enables to reveal the underlying parameters of the problem (see Anandkumar et al. (2014) with references therein). Thus, tensor decomposition methods have been extensively studied over the years Harshman (1970); Kolda (2001); Anandkumar et al. (2014). While most studies of tensor decomposition methods have focused on the offline setting, Ge et al. (2015) recently proposed a new setting of online tensor decomposition which is more appropriate for big data tasks.

Tensor decomposition is an intriguing multi-modal optimization task, which provably acquires many saddle points. Interestingly, every local minimum is also a global minimum for this task. Thus we decided to focus our experimental study on this task.

In what follows, we briefly review tensors and the online decompositions task. Then we present our experimental results comparing our method to noisy GD, which was proposed in Ge et al. (2015).

and we call {ai}i=1d\{a_{i}\}_{i=1}^{d} the components of TT. Given TT that has an orthogonal decomposition, the offline decomposition problem is to find its components (which are unique up to permutations and sign flips).

Ge et al. (2015) suggest to calculate the components of a decomposable tensor TT by minimizing the following strict-saddle objective:

In many machine learning tasks, data often arrives sequentially (sampled from some unknown distribution). When the dimensionality of the problem is large, it is desirable to store only small batches of the data. In many such tasks where tensor decomposition is required, data samples {zi}i\{z_{i}\}_{i} arrive from some unknown distribution D\mathcal{D}. And we aim at decomposing a tensor TT, which is often an expectation over mulitilinear operators q(z)q(z), i.e. T=EzD[q(z)]T=\mathbf{E}_{z\sim\mathcal{D}}[q(z)]. Using the linearity of multilinear maps, and the objective appearing in Equation (5), we can formulate such problems as follows:

where ϕz(u) = ijq(z)(ui,ui,uj,uj)\phi_{z}(u)~{}=~{}\sum_{i\neq j}q(z)(u_{i},u_{i},u_{j},u_{j}).

We adopt the online tensor decomposition setting for ICA suggested above, and present experimental results comparing our method to the noisy GD method of Ge et al. (2015). We take d=10d=10, and our performance measure is the reconstruction error defined as:

In our experiments, both methods use a minibatch of size 500500 to calculate gradient estimates. Moreover, both methods employ the following learning rate rule:

In Figure 2 we present our resultsNote that we have repeated each experiment 1010 times. Figure 2 presents the reconstruction error averaged over these 10 runs, as well as error bars, for three values of initial learning rates η0{0.01,0.05,0.1}\eta_{0}\in\{0.01,0.05,0.1\}. As can be seen in all three cases, noisy-GD obtains a faster initial improvement. Yet, at some point the error obtained by Saddle-NGD decreases sharply, and eventually Saddle-NGD outperforms noisy-GD. We found that this behaviour persisted when we employed learning rate rules other than (7).This behaviour also persisted when have employed several noise injection magnitudes to both algorithms. Note that for η0{0.05,0.1}\eta_{0}\in\{0.05,0.1\} Saddle-NGD outperforms noisy-GD within 400\approx 400 iterations, for η=0.01\eta=0.01 this only occurs after 2104\approx 2\cdot 10^{4} iterations.

Discussion

We have demonstrated both empirically and theoretically the benefits of using normalized gradients rather than gradients for an intriguing family of non-convex objectives. It is natural to ask what are the limits of the achievable rates for evading saddles. Concretely, we ask wether we can do better than Saddle-NGD using only first order information.

Acknowledgement

I would like to thank Elad Hazan for many useful discussions during the early stages of this work.

References

Appendix A Proof of Theorem 3

Let us define the following sets with respects to the three scenarios of the strict saddle property:

Define the following sequence of stopping times with τ0=0\tau_{0}=0, and

The second inequality holds by the following Corollary of Lemma 7

Now define a series of events {Ei}i\{E_{i}\}_{i} as follows Ei={ji:xτjA3}E_{i}=\{\exists j\leq i:x_{\tau_{j}}\in\mathcal{A}_{3}\}. Note that Ei1EiE_{i-1}\subset E_{i} and therefore P(Ei1)P(Ei)P(E_{i-1})\leq P(E_{i}). Now consider f(wτi+1)1Eif(w_{\tau_{i+1}})1_{E_{i}}

Summing the above equation over ii we conclude that:

Appendix B Proof of Lemma 6 (Local Minimum)

Here we prove Lemma 6 regarding the local minimum for the general case which includes the noisy updates. In Section B.1 we prove Lemma 9 which is used during the proof of Lemmas 6,8.

Let us first bound the increase in the square distance to xx^{*} due to the first noisy update.

Let t2t\geq 2, and suppose xt1xβαη\|x_{t-1}-x^{*}\|\geq\frac{\beta}{\alpha}\sqrt{\eta}, the Saddle-NGD update rule implies:

here in the first inequality we use Lemma 9, the second inequality uses gtβxtx\|g_{t}\|\leq\beta\|x_{t}-x^{*}\| which follows from smoothness, and the last inequality uses xt1xβαη\|x_{t-1}-x^{*}\|\geq\frac{\beta}{\alpha}\sqrt{\eta}.

Combining Equations (B),(B) it follows that within the total of Θ(η1/2)\Theta(\eta^{-1/2}) rounds of noiseless updates the distance to the local minimum must decrease to βαη\frac{\beta}{\alpha}\sqrt{\eta}. The rest of the proof goes along the same lines as the proof of Lemma 8. ∎

Writing the strong-convexity inequality for FF at both x,xx,x^{*}, and using F(x)(xx)0\nabla F(x^{*})^{\top}(x-x^{*})\geq 0, which follows by the optimality of xx^{*}, we have:

summing the above equations the lemma follows. ∎

Appendix C Omitted Proofs from Section 5.3 (Saddle Analysis)

Denote Δt(i)=eiΔt\Delta_{t}^{(i)}=e_{i}^{\top}\Delta_{t}, then coordinate-wise the NGD rule translates to

First part: First we show that the following always applies:

We will prove Equation (18) by showing the following to hold for all t[T]t\in[T]:

We will now prove Equation (19) by induction. The base case t=0t=0 clearly holds. Now assume by the induction assumption that it holds for t[T]t\in[T], then we divide into two cases:

Case 1: Suppose that T1+ρηT22λiT\leq 1+\frac{\rho\eta T^{2}}{2\lambda_{i}}. Since αt(i)\alpha_{t}^{(i)} can not change by more than η\eta in each round then the following holds:

thus the induction hypothesis of Equation (19) holds in this case.

Case 2: Suppose that T1+ρηT22λiT\geq 1+\frac{\rho\eta T^{2}}{2\lambda_{i}}. In this case our induction hypothesis asserts:

If αt(i)max{η,α0(i)}+ρη2T22λi|\alpha_{t}^{(i)}|\geq\max\{\eta,|\alpha_{0}^{(i)}|\}+\frac{\rho\eta^{2}T^{2}}{2\lambda_{i}} then using Equation (17), we conclude that:

thus by Equation (C.1), the induction hypothesis holds.

If αt(i)max{η,α0(i)}+ρη2T22λi|\alpha_{t}^{(i)}|\leq\max\{\eta,|\alpha_{0}^{(i)}|\}+\frac{\rho\eta^{2}T^{2}}{2\lambda_{i}}, then since each coordinate does not change more than η\eta in each iterartion then we have:

and again, the induction hypothesis holds.

Second part: Here we show that whenever λiα0(i)(ρ/2)η2T2\lambda_{i}|\alpha_{0}^{(i)}|\geq(\rho/2)\eta^{2}T^{2} then the following applies:

We will now show by induction that Equation (21) holds for any t[T]t\in[T]. The base case for t=0t=0 clearly holds. Now assume by the Induction Hypothesis that αt(i)max{η,α0(i)}+η|\alpha_{t}^{(i)}|\leq\max\{\eta,|\alpha_{0}^{(i)}|\}+\eta. We now divide into two cases:

Case 1: Suppose that αt(i)max{η,α0(i)}|\alpha_{t}^{(i)}|\leq\max\{\eta,|\alpha_{0}^{(i)}|\}. Since each coordinate does not change by more than η\eta in each round, then:

Case 2: Suppose that αt(i)max{η,α0(i)}|\alpha_{t}^{(i)}|\geq\max\{\eta,|\alpha_{0}^{(i)}|\}. Recall that t[T]t\in[T], and Δt(i)(ρ/2)η2T2|\Delta_{t}^{(i)}|\leq(\rho/2)\eta^{2}T^{2}, thus:

In this case, a similar analysis to the one appearing in Equation (C.1), shows:

Combining Equations (18),(21), establishes the Lemma. ∎

C.2 Proof of Lemma 11

Denote Δt(i)=eiΔt\Delta_{t}^{(i)}=e_{i}^{\top}\Delta_{t}, then coordinate-wise the NGD rule translates to

First part: First we show that the following always applies for some cc\in:

Equation (23) follows since by the Saddle-NGD update rule the magnitude of the query points can not decrease by more than ηT\eta T over TT rounds.

Second part: Here we show that the following applies whenever λiα0(i)>(ρ/2)η2T2\lambda_{i}|\alpha_{0}^{(i)}|>(\rho/2)\eta^{2}T^{2}:

We will prove Equation (24), by induction. Clearly the base case t=0t=0 holds. Now assume that αt(i)α0(i)>(ρ/2λi)η2T2|\alpha_{t}^{(i)}|\geq|\alpha_{0}^{(i)}|>(\rho/2\lambda_{i})\eta^{2}T^{2}. Since gt(i)=λiαt(i)+Δt(i)g_{t}^{(i)}=\lambda_{i}\alpha_{t}^{(i)}+\Delta_{t}^{(i)}, and Δt(i)(ρ/2)η2T2|\Delta_{t}^{(i)}|\leq(\rho/2)\eta^{2}T^{2}, then we necessarily have:

where we used λi0\lambda_{i}\leq 0. Hence,

and the induction hypothesis holds. Combining Equations (23), (24), proves the lemma. ∎

C.3 Proof of Corollary 12

where the first inequality uses Lemma 10, and the last inequality uses λiα0(i)=g0(i)βη\lambda_{i}|\alpha_{0}^{(i)}|=|g_{0}^{(i)}|\leq\beta\sqrt{\eta}, and β=maxiλi\beta=\max_{i}|\lambda_{i}|.

Low g0(i)|g_{0}^{(i)}|: Suppose that g0(i):=λiα0(i)(ρ/2)η2T2|g_{0}^{(i)}|:=\lambda_{i}|\alpha_{0}^{(i)}|\leq(\rho/2)\eta^{2}T^{2}. Denote B=min{1+ρηT22λi,T}B=\min\{1+\frac{\rho\eta T^{2}}{2\lambda_{i}},T\}, and note that this expression is maximized when λi=ρηT22(T1)\lambda_{i}=\frac{\rho\eta T^{2}}{2(T-1)}.

Combining Equations (C.3),(C.3), establishes the lemma. ∎

C.4 Proof of Corollary 13

Using Equation (27) we divide into two cases: Case 1: Suppose that λiα0(i)>(ρ/2)η2T2\lambda_{i}|\alpha_{0}^{(i)}|>(\rho/2)\eta^{2}T^{2}, then by Equation (27) we have αt(i)α0(i)|\alpha_{t}^{(i)}|\geq|\alpha_{0}^{(i)}| and therefore

Case 2: Suppose that λiα0(i)(ρ/2)η2T2\lambda_{i}|\alpha_{0}^{(i)}|\leq(\rho/2)\eta^{2}T^{2}, using Equation (27) we get:

C.5 Proof of Lemma 14

Following is the key relation that enables us to prove the lemma:

where we denote μt=01[H(xt1+s(xtxt1))H0]ds(xtxt1)\mu_{t}=\int_{0}^{1}[H(x_{t-1}+s(x_{t}-x_{t-1}))-H_{0}]ds(x_{t}-x_{t-1}).

Using the Saddle-NGD update rule inside Equation (C.5) we obtain:

Let us look at the gradient component in the most negative direction e1e_{1}:

Next we will show that taking g1(1),N0|g_{1}^{(1)}|,N_{0} that fulfill the following two conditions imply that the magnitude of the gradient rises beyond 2βη2\beta\sqrt{\eta} within less than N0N_{0} rounds:

Assume by contradiction that the gradient does not rise beyond 2βη2\beta\sqrt{\eta} for g1(1),N0|g_{1}^{(1)}|,N_{0} that fulfill the above conditions. We will now show by induction that the following holds for any t{1,,N01}t\in\{1,\ldots,N_{0}-1\}:

The above clearly holds for t=1t=1. Assume it holds for tt, and we will now show that is holds for t+1t+1. By Equation (30) we have:

here the second inequality uses gt2βη\|g_{t}\|\leq 2\beta\sqrt{\eta}, which is our contradiction assumption. The third inequality holds since tN0t\leq N_{0}, and by the induction hypothesis gt(1)g1(1)|g_{t}^{(1)}|\geq|g_{1}^{(1)}|, combining these with Equation (31) implies that

Thus the induction hypothesis holds for any t{1,,N01}t\in\{1,\ldots,N_{0}-1\}. The induction hypothesis implies that within less than 4βγηlog(2βηg1(1))\frac{4\beta}{\gamma\sqrt{\eta}}\log\left(\frac{2\beta\sqrt{\eta}}{|g_{1}^{(1)}|}\right) rounds the magnitude of the gradient rises beyond 2βη2\beta\sqrt{\eta}, combining this with the condition of Equation (32) contradicts our assumption that the gradient does not rise beyond 2βη2\beta\sqrt{\eta} within less that N0N_{0}. We therefore conclude that the gradient rises beyond 2βη2\beta\sqrt{\eta} within less than N0N_{0} rounds, for g1(1),N0|g_{1}^{(1)}|,N_{0} that fulfill conditions (31),(32).

C.6 Proof of Lemma 15

Given xx, there always exists x[x0,x]x^{\prime}\in[x_{0},x] such that the following holds:

where Hx=2f(x)H_{x^{\prime}}=\nabla^{2}f(x^{\prime}). Using the above equation, and the Lipschitzness of the Hessian, we may bound the difference between the original function and its quadratic approximation around x0x_{0} as follows: