"Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions

Yair Carmon, Oliver Hinder, John C. Duchi, Aaron Sidford

Introduction

Nesterov’s seminal 1983 accelerated gradient method has inspired substantial development of first-order methods for large-scale convex optimization. In recent years, machine learning and statistics have seen a shift toward large scale non-convex problems, including methods for matrix completion Koren et al. (2009), phase retrieval Candès et al. (2015); Wang et al. (2016), dictionary learning Mairal et al. (2008), and neural network training LeCun et al. (2015). In practice, techniques from accelerated gradient methods—namely, momentum—can have substantial benefits for stochastic gradient methods, for example, in training neural networks Rumelhart et al. (1986); Kingma and Ba (2015). Yet little of the rich theory of acceleration for convex optimization is known to transfer into non-convex optimization.

Optimization becomes more difficult without convexity, as gradients no longer provide global information about the function. Even determining if a stationary point is a local minimum is (generally) NP-hard Murty and Kabadi (1987); Nesterov (2000). It is, however, possible to leverage non-convexity to improve objectives in smooth optimization: moving in directions of negative curvature can guarantee function value reduction. We explore the interplay between negative curvature, smoothness, and acceleration techniques, showing how an understanding of the three simultaneously yields a method that provably accelerates convergence of gradient descent for a broad class of non-convex functions.

We consider the unconstrained minimization problem

Approximating the global minimum of $f$ to $\epsilon$ -accuracy is generally intractable, requiring time exponential in $d\log\frac{1}{\epsilon}$ (Nemirovski and Yudin, 1983, §1.6). Instead, we seek a point $x$ that is $\epsilon$ -approximately stationary, that is,

Finding stationary points is a canonical problem in nonlinear optimization Nocedal and Wright (2006), and while saddle points and local maxima are stationary, excepting pathological cases, descent methods that converge to a stationary point converge to a local minimum (Lee et al., 2016; Nemirovski, 1999, §3.2.2).

If we assume $f$ is convex, gradient descent satisfies the bound (2) after $O(\epsilon^{-1})$ gradient evaluations, and AGD improves this rate to $O(\epsilon^{-1/2}\log\frac{1}{\epsilon})$ Nesterov (2012). Without convexity, gradient descent is significantly worse, having worst-case complexity $\Theta(\epsilon^{-2})$ Cartis et al. (2010). More sophisticated gradient-based methods, including nonlinear conjugate gradient Hager and Zhang (2006) and L-BFGS Liu and Nocedal (1989) provide excellent practical performance, but their global convergence guarantees are no better than $O(\epsilon^{-2})$ . Our work Carmon et al. (2016) and, independently, Agarwal et al. (2016), break this $O(\epsilon^{-2})$ barrier, obtaining the rate $O(\epsilon^{-7/4}\log\frac{d}{\epsilon})$ . Before we discuss this line of work in Section 1.3, we overview our contributions.

2 Our contributions

Underpinning our results is the observation that when we run Nesterov’s accelerated gradient descent (AGD) on any smooth function $f$ , one of two outcomes must follow:

AGD behaves as though $f$ was $\sigma$ -strongly convex, satisfying inequality (2) in $O(\sigma^{-1/2}\log\frac{1}{\epsilon})$ iterations.

There exist points $u,v$ in the AGD trajectory that prove $f$ is “guilty” of not being $\sigma$ -strongly convex,

The intuition behind these observations is that if inequality (3) never holds during the iterations of AGD, then $f$ “looks” strongly convex, and the convergence (a) follows. In Section 2 we make this observation precise, presenting an algorithm to monitor AGD and quickly find the witness pair $u,v$ satisfying (3) whenever AGD progresses more slowly than it does on strongly convex functions. We believe there is potential to apply this strategy beyond AGD, extending additional convex gradient methods to non-convex settings.

In Section 3 we propose a method that iteratively applies our monitored AGD algorithm to $f$ augmented by a proximal regularizer. We show that both outcomes (a) and (b) above imply progress minimizing $f$ , where in case (b) we make explicit use of the negative curvature that AGD exposes. These progress guarantees translate to an overall first-order oracle complexity of $O(\epsilon^{-7/4}\log\frac{1}{\epsilon})$ , a strict improvement over the $O(\epsilon^{-2})$ rate of gradient descent. In Section 5 we report preliminary experimental results, showing a basic implementation of our method outperforms gradient descent but not nonlinear conjugate gradient.

As we show in Section 4, assuming Lipschitz continuous third derivatives instead of Lipschitz continuous Hessian allows us to increase the step size we take when exploiting negative curvature, making more function progress. Consequently, the complexity of our method improves to $O(\epsilon^{-5/3}\log\frac{1}{\epsilon})$ . While the analysis of the third-order setting is more complex, the method remains essentially unchanged. In particular, we still use only first-order information, never computing higher-order derivatives.

3 Related work

Nesterov and Polyak (2006) show that cubic regularization of Newton’s method finds a point that satisfies the stationarity condition (2) in $O(\epsilon^{-3/2})$ evaluations of the Hessian. Given sufficiently accurate arithmetic operations, a Lipschitz continuous Hessian is approximable to arbitrary precision using finite gradient differences, and obtaining a full Hessian requires $O(d)$ gradient evaluations. A direct implementation of the Nesterov-Polyak method with a first-order oracle therefore has gradient evaluation complexity $O(\epsilon^{-3/2}d)$ , improving on gradient descent only if $d\ll\epsilon^{-1/2}$ , which may fail in high-dimensions.

In two recent papers, we (Carmon et al., 2016) and (independently) Agarwal et al. obtain better rates for first-order methods. Agarwal et al. (2016) propose a careful implementation of the Nesterov-Polyak method, using accelerated methods for fast approximate matrix inversion. In our earlier work, we employ a combination of (regularized) accelerated gradient descent and the Lanczos method. Both find a point that satisfies the bound (2) with probability at least $1-\delta$ using $O\left(\epsilon^{-7/4}\log\frac{d}{\delta\epsilon}\right)$ gradient and Hessian-vector product evaluations.

The primary conceptual difference between our approach and those of Carmon et al. and Agarwal et al. is that we perform no eigenvector search: we automatically find directions of negative curvature whenever AGD proves $f$ “guilty” of non-convexity. Qualitatively, this shows that explicit second orders information is unnecessary to improve upon gradient descent for stationary point computation. Quantitatively, this leads to the following improvements:

Our result is dimension-free and deterministic, with complexity independent of the ratio $d/\delta$ , compared to the $\log\frac{d}{\delta}$ dependence of previous works. This is significant, as $\log\frac{d}{\delta}$ may be comparable to $\epsilon^{-1/4}/\log\frac{1}{\epsilon}$ , making it unclear whether the previous guarantees are actually better than those of gradient descent.

Our method uses only gradient evaluations, and does not require Hessian-vector products. In practice, Hessian-vector products may be difficult to implement and more expensive to compute than gradients.

Under third-order smoothness assumptions we improve our method to achieve $O(\epsilon^{-5/3}\log\frac{1}{\epsilon})$ rate. It is unclear how to extend previous approaches to obtain similar guarantees.

In distinction from the methods of Carmon et al. (2016) and Agarwal et al. (2016), our method provides no guarantees on positive definiteness of $\nabla^{2}f(x)$ ; if initialized at a saddle point it will terminate immediately. However, as we further explain in Section C, we may combine our method with a fast eigenvector search to recover the approximate positive definiteness guarantee $\nabla^{2}f(x)\succeq-\sqrt{\epsilon}I$ , even improving it to $\nabla^{2}f(x)\succeq-{\epsilon}^{2/3}I$ using third-order smoothness, but at the cost of reintroducing randomization, Hessian-vector products and a $\log\frac{d}{\delta}$ complexity term.

4 Preliminaries and notation

Algorithm components

We begin our development by presenting the two building blocks of our result: a monitored variation of AGD (Section 2.1) and a negative curvature descent step (Section 2.2) that we use when the monitored version of AGD certifies non-convexity. In Section 3, we combine these components to obtain an accelerated method for non-convex functions.

The main component in our approach is Alg. 1, AGD-until-proven-guilty. We take as input an $L$ -smooth function $f$ , conjectured to be $\sigma$ -strongly convex, and optimize it with Nesterov’s accelerated gradient descent method for strongly convex functions (lines 3 and 4). At every iteration, the method invokes Certify-progress to test whether the optimization is progressing as it should for strongly convex functions, and in particular that the gradient norm is decreasing exponentially quickly (line 5). If the test fails, Find-witness-pair produces points $u,v$ proving that $f$ violates $\sigma$ -strong convexity. Otherwise, we proceed until we find a point $y$ such that $\left\|{\nabla f(y)}\right\|\leq\varepsilon$ .

The efficacy of our method is based on the following guarantee on the performance of AGD.

where $\kappa=\frac{L}{\sigma}$ and $\psi(w)=f(y_{0})-f(w)+\frac{\sigma}{2}\left\|{w-y_{0}}\right\|^{2}$ .

Proposition 1 is essentially a restatement of established results (Nesterov, 2004; Bubeck, 2014), where we take care to phrase the requirements on $f$ in terms of local inequalities, rather than a global strong convexity assumption. For completeness, we provide a proof of Proposition 1 in Section A.1.

With Proposition 1 in hand, we summarize the guarantees of Alg. 1 as follows.

where $v=x_{j}$ for some $0\leq j<t$ and $u=y_{j}$ or $u=w_{t}$ (defined on line 5 of AGD-until-proven-guilty). Moreover,

The bound (7) is clear for $t=1$ . For $t>1$ , the algorithm has not terminated at iteration $t-1$ , and so we know that neither the condition in line 9 of AGD-until-proven-guilty nor the condition in line 5 of Certify-progress held at iteration $t-1$ . Thus

which gives the bound (7) when rearranged.

since $\psi(w_{t})=\psi(y_{0})=0$ . Since this contradicts the progress bound (6), we obtain the certificate (8) by the contrapositive of Proposition 1: condition (5) must not hold for some $0\leq s<t$ , implying Find-witness-pair will return for some $j\leq s$ .

Similarly, if $w_{t}=z_{t}=y_{t}-\frac{1}{L}\nabla f(y_{t})$ then by line 5 of Certify-progress we must have

Since $f$ is $L$ -smooth we have the standard progress guarantee (c.f. Nesterov (2004) §1.2.3) $f(y_{t})-f(z_{t})\geq\frac{1}{2L}\left\|{\nabla f(y_{t})}\right\|^{2}$ , again contradicting inequality (6).

To see that the bound (9) holds, note that $f(y_{s})\leq f(y_{0})$ for $s=0,\ldots,t-1$ since condition 2 of Certify-progress did not hold. If $u=y_{j}$ for some $0\leq j<t$ then $f(u)\leq f(y_{0})$ holds trivially. Alternatively, if $u_{t}=w_{t}=z_{t}$ then condition 2 did not hold at time $t$ as well, so we have $f(y_{t})\leq f(y_{0})$ and also $f(u)=f(z_{t})\leq f(y_{t})-\frac{1}{2L}\left\|{\nabla f(y_{t})}\right\|^{2}$ as noted above; therefore $f(z_{t})\leq f(y_{0})$ . ∎

Before continuing, we make two remarks about implementation of Alg. 1.

As stated, the algorithm requires evaluation of two function gradients per iteration (at $x_{t}$ and $y_{t}$ ). Corollary 1 holds essentially unchanged if we execute line 9 of AGD-until-proven-guilty and lines 3-5 of Certify-progress only once every $\tau$ iterations, where $\tau$ is some fixed number (say 10). This reduces the number of gradient evaluations to $1+\frac{1}{\tau}$ per iteration.

Direct implementation would require $O(d\cdot t)$ memory to store the sequences $y_{0}^{t}$ , $x_{0}^{t}$ and $\nabla f(x_{0}^{t})$ for later use by Find-witness-pair. Alternatively, Find-witness-pair can regenerate these sequences from their recursive definition while iterating over $j$ , reducing the memory requirement to $O(d)$ and increasing the number of gradient and function evaluations by at most a factor of 2.

In addition, while our emphasis is on applying AGD-until-proven-guilty to non-convex problems, the algorithm has implications for convex optimization. For example, we rarely know the strong convexity parameter $\sigma$ of a given function $f$ ; to remedy this, O’Donoghue and Candès (2015) propose adaptive restart schemes. Instead, one may repeatedly apply AGD-until-proven-guilty and use the witnesses to update $\sigma$ .

2 Using negative curvature

The second component of our approach is exploitation of negative curvature to decrease function values; in Section 3 we use AGD-until-proven-guilty to generate $u,v$ such that

a nontrivial violation of convexity (where $\alpha>0$ is a parameter we control using a proximal term). By taking an appropriately sized step from $u$ in the direction $\pm(u-v)$ , Alg. 2 can substantially lower the function value near $u$ whenever the convexity violation (10) holds. The following basic lemma shows this essential progress guarantee.

We proceed in two parts; in the first part, we show that $f$ has negative curvature of at least $\alpha/2$ in the direction $\delta=(u-v)/\left\|{u-v}\right\|$ at the point $u$ . We then show that this negative curvature guarantees a gradient step with magnitude $\eta$ produces the required progress in function value.

Defining $\delta=(u-v)/\left\|{u-v}\right\|$ , we obtain

where inequality $(i)$ follows by assumption (10). Let $\tau^{*}=\inf_{\theta\in[0,\left\|{u-v}\right\|]}\delta^{T}\nabla^{2}f(v+\theta\delta)\delta$ . Using that $\int_{0}^{t}\tau d\tau=t^{2}/2$ , we substitute for the integrand to find that $\tau^{*}<-\alpha$ , and using the Lipschitz continuity of $\nabla^{2}f$ yields

where we have used that $\left\|{u-v}\right\|\leq\frac{\alpha}{2L_{2}}$ by assumption.

Using again the Lipschitz continuity of $\nabla^{2}f$ , we apply (4) with $x_{0}=u,\theta_{0}=0$ and $\theta=\pm\eta$ to obtain

The first order term must be negative for one of $u_{+}$ and $u_{-}$ . Therefore, using inequality (12) and $\eta\leq\alpha/L_{2}$ , we have $f(z)=\min\{f(u_{+}),f(u_{-})\}\leq f(u)-\frac{\alpha\eta^{2}}{12}$ as desired. ∎

Accelerating non-convex optimization

We now combine the accelerated convergence guarantee of Corollary 1 and the non-convex progress guarantee of Lemma 1 to form Guarded-non-convex-AGD. The idea for the algorithm is as follows. Consider iterate $k-1$ , denoted $p_{k-1}$ . We create a proximal function $\hat{f}$ by adding the proximal term $\alpha\left\|{x-p_{k-1}}\right\|^{2}$ to $f$ . Applying AGD-until-proven-guilty to $\hat{f}$ yields the sequences $x_{0},\ldots,x_{t}$ , $y_{0},\ldots,y_{t}$ and possibly a non-convexity witnessing pair $u,v$ (line 3). If $u,v$ are not available, we set $p_{k}=y_{t}$ and continue to the next iteration. Otherwise, by Corollary 1, $u$ and $v$ certify that $\hat{f}$ is not $\alpha$ strongly convex, and therefore that $f$ has negative curvature. Exploit-NC-pair then leverages this negative curvature, obtaining a point $b^{(2)}$ . The next iterate $p_{k}$ is the best out of $y_{0},\ldots,y_{t},u$ and $b^{(2)}$ in terms of function value.

The following central lemma provides a progress guarantee for each of the iterations of Alg. 3.

where the inequality is Eq. (14). Therefore, we conclude that inequality (10) must hold. To apply Lemma 1, we must control the distance between $u$ and $v$ . We present the following lemma, which shows how $b^{(1)}$ acts as an insurance policy against $\left\|{u-v}\right\|$ growing too large.

Deferring the proof of Lemma 3 briefly, we show how it yields Lemma 2. We set $\tau:=\frac{\alpha}{8L_{2}}$ and consider two cases. First, if

then we are done, since $f(p_{k})\leq f(b^{(1)})$ . In the converse case, if $f(b^{(1)})\geq f(y_{0})-\alpha\tau^{2}$ , Lemma 3 implies $\left\|{u-v}\right\|\leq 4\tau\leq\frac{\alpha}{2L_{2}}$ . Therefore, we can exploit the negative curvature in $f$ , as Lemma 1 guarantees (with $\eta=\frac{\alpha}{L_{2}}$ ),

Proof of Lemma 3. We begin by noting that $\hat{f}(y_{i})\leq\hat{f}(y_{0})=f(y_{0})$ for $i=1,\ldots,t-1$ by Corollary 1. Using $f(y_{i})\geq f(b^{(1)})\geq f(y_{0})-\alpha\tau^{2}$ we therefore have

which implies $\left\|{y_{i}-y_{0}}\right\|\leq\tau$ . By Corollary 1 we also have $\hat{f}(u)\leq\hat{f}(y_{0})$ , so we similarly obtain $\left\|{u-y_{0}}\right\|\leq\tau$ . Using $x_{i}=(1+\omega)y_{i}-\omega y_{i-1}$ , where $\omega=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\in(0,1)$ , we have by the triangle inequality that

for every $i=1,\ldots,t-1$ . Finally, since $v=x_{j}$ for some $0\leq j\leq t-1$ , this gives the last inequality of the lemma, as $\left\|{u-v}\right\|\leq\left\|{u-y_{0}}\right\|+\left\|{x_{j}-y_{0}}\right\|\leq 4\tau$ . ∎

Lemma 2 shows we can accelerate gradient descent in a non-convex setting. Indeed, ignoring all problem-dependent constants, setting $\alpha=\sqrt{\epsilon}$ in the bound (13) shows that we make $\Omega(\epsilon^{3/2})$ progress at every iteration of Guarded-non-convex-AGD, and consequently the number of iterations is bounded by $O(\epsilon^{-3/2})$ . Arguing that calls to AGD-until-proven-guilty each require $O(\epsilon^{-1/4}\log\frac{1}{\epsilon})$ gradient computations yields the following complexity guarantee.

then Guarded-non-convex-AGD( $f$ , $p_{0}$ , $L_{1}$ , $\epsilon$ , $\alpha$ , $\frac{\alpha}{L_{2}}$ ) finds a point $p_{K}$ such that $\left\|{\nabla{f(p_{K}})}\right\|\leq\epsilon$ with at most

We bound two quantities: the number of calls to AGD-until-proven-guilty, which we denote by $K$ , and the maximum number of steps AGD-until-proven-guilty performs when it is called, which we denote by $T$ . The overall number gradient evaluations is $2KT$ , as we compute at most $2T$ gradients per iterations (at the points $x_{0},\ldots,x_{t-1}$ and $y_{1},\ldots,y_{t}$ ).

The upper bound on $K$ is immediate from Lemma 2, as by telescoping the progress guarantee (13) we obtain

where the final inequality follows by substituting our choice (15), of $\alpha$ . We conclude that

Therefore, substituting $\varepsilon=\epsilon/10$ , $L=L_{1}+2\alpha$ and $L=\alpha=2\sqrt{L_{2}\epsilon}$ into the guarantee (7) of Corollary 1,

We use $\epsilon\leq\min\{\Delta_{f}^{2/3}L_{2}^{1/3},L_{1}^{2}/(12L_{2})\}$ to simplify the bounds on $K$ and $T$ . Using $1\leq\Delta_{f}L_{2}^{1/2}\epsilon^{-3/2}$ simplifies the bound (17) to

Applying $1\leq L_{1}/(8\sqrt{L_{2}\epsilon})$ in the bound (18) gives

We conclude this section with two brief remarks.

The conditions on $\epsilon$ guarantee that the bound (16) is non-trivial. If $\epsilon\geq L_{1}^{2}/L_{2}$ , then gradient descent achieves better guarantees. Indeed, with step-size $1/L_{1}$ , gradient descent satisfies $\left\|{\nabla f(x)}\right\|\leq\epsilon$ within at most $\frac{2L_{1}\Delta_{f}}{\epsilon^{2}}$ iterations (cf. Nesterov, 2004, Eq. 1.2.13). Substituting $\epsilon\geq L_{1}^{2}/L_{2}$ therefore yields

Alternatively, if $\epsilon>10^{2/3}\Delta_{f}^{2/3}L_{2}^{1/3}$ then we have by inequality (17) that $K<2$ , so Alg. 3 halts after at most a single iteration. Nevertheless, the bounds (17) and (18) hold for any $\epsilon\geq 0$ .

While we state Theorem 1 in terms of gradient evaluation count, a similar bound holds for function evaluations as well. Indeed, inspection of our method reveals that each iteration of Alg. 3 evaluates the function and not the gradient at at most the three points $u,u_{+}$ and $u_{-}$ ; both complexity measures are therefore of the same order.

Incorporating third-order smoothness

In this section, we show that when third-order derivatives are Lipschitz continuous, we can improve the convergence rate of Alg. 3 by modifying two of its subroutines. In Section 4.1 we introduce a modified version of Exploit-NC-pair that can decrease function values further using third-order smoothness. In Section 4.2 we change Find-best-iterate to provide a guarantee that $f(v)$ is never too large. We combine these two results in Section 4.3 and present our improved complexity bounds.

Our first observation is that third-order smoothness allows us to take larger steps and make greater progress when exploiting negative curvature, as the next lemma formalizes.

the last inequality using $h(0)=f(u)$ , $h^{\prime\prime}(0)=-\frac{\alpha}{2}$ and $\eta^{2}\leq\frac{3\alpha}{L_{3}}$ . That $f(u+\bar{\eta}\delta)=h(\bar{\eta})$ gives the result. ∎

Comparing Lemma 4 to the second part of the proof of Lemma 1, we see that second-order smoothness with optimal $\eta$ guarantees $\alpha^{3}/(12L_{2}^{2})$ function decrease, while third-order smoothness guarantees a $3\alpha^{2}/(8L_{3})$ decrease. Recalling Theorem 1, where $\alpha$ scales as a power of $\epsilon$ , this is evidently a significant improvement. Additionally, this benefit is essentially free: there is no increase in computational cost and no access to higher order derivatives. Examining the proof, we see that the result is rooted in the anti-symmetry of the odd-order terms in the Taylor expansion. This rules out extending this idea to higher orders of smoothness, as they contain symmetric fourth order terms.

Extending this insight to the setting of Lemma 1 is complicated by the fact that, at relevant scales of $\left\|{u-v}\right\|$ , it is no longer possible to guarantee that there is negative curvature at either $u$ or $v$ . Nevertheless, we are able to show that a small modification of Exploit-NC-pair achieves the required progress.

We prove Lemma 5 in Section B.1 in the supplementary material; it is essentially a more technical version of the proof of Lemma 4, where we address the asymmetry of condition (10) by taking steps of different sizes from $u$ and $v$ .

2 Bounding the function values of the iterates using cubic interpolation

An important difference between Lemmas 1 and 5 is that the former guarantees lower objective value than $f(u)$ , while the latter only improves $\max\{f(v),f(u)\}$ . We invoke these lemmas for $v=x_{j}$ for some $x_{j}$ produced by AGD-until-proven-guilty, but Corollary 1 only bounds the function value at $y_{j}$ and $w$ ; $f(x_{j})$ might be much larger than $f(y_{0})$ , rendering the progress guaranteed by Lemma 5 useless. Fortunately, we are able show that whenever this happens, there must be a point on the line that connects $x_{j},y_{j}$ and $y_{j-1}$ for which the function value is much lower than $f(y_{0})$ . We take advantage of this fact in Alg. 5, where we modify Find-best-iterate to consider additional points, so that whenever the iterate it finds is not much better than $y_{0}$ , then $f(x_{j})$ is guaranteed to be close to $f(y_{0})$ . We formalize this claim in the following lemma, which we prove in Section B.2.

3 An improved rate of convergence

With our algorithmic and analytic upgrades established, we are ready to state the enhanced performance guarantees for Guarded-non-convex-AGD, where from here on we assume that Exploit-NC-pair3 and Find-best-iterate3 subsume Exploit-NC-pair and Find-best-iterate, respectively.

The proof of Lemma 7 is essentially identical to the proof of Lemma 2, where we replace Lemma 1 with Lemmas 5 and 6 and set $\tau=\sqrt{\alpha/(32L_{3})}$ . For completeness, we give a full proof in Section B.3. The gradient evaluation complexity guarantee for third-order smoothness then follows precisely as in our proof of Theorem 1; see Sec. B.4 for a proof of the following

Guarded-non-convex-AGD( $f$ , $p_{0}$ , $L_{1}$ , $\epsilon$ , $\alpha$ , $\sqrt{\frac{2\alpha}{L_{3}}}$ ) finds a point $p_{K}$ such that $\left\|{\nabla{f(p_{K}})}\right\|\leq\epsilon$ and requires at most

While achieving the guarantees that Theorem 2 provides requires some modification of our algorithms, these do not come at the expense of the convergence guarantees of Theorem 1 when we have only second order smoothness. That is, the results of Theorem 1 remain valid even with the algorithmic modifications of this section, and Alg. 3 transitions between smoothness regimes by varying the scaling of $\alpha$ and $\eta$ with $\epsilon$ .

Preliminary experiments

The primary purpose of this paper is to demonstrate the feasibility of acceleration for non-convex problems using only first-order information. Given the long history of development of careful schemes for non-linear optimization, it is unrealistic to expect a simple implementation of the momentum-based Algorithm 3 to outperform state-of-the-art methods such as non-linear conjugate gradients and L-BFGS. It is important, however, to understand the degree of non-convexity in problems we encounter in practice, and to investigate the efficacy of the negative curvature detection-and-exploitation scheme we propose.

Toward this end, we present two experiments: (1) fitting a non-linear regression model and (2) training a small neural network. In these experiments we compare a basic implementation of Alg. 3 with a number baseline optimization methods: gradient descent (GD), non-linear conjugate gradients (NCG) Hager and Zhang (2006), Accelerated Gradient Descent (AGD) with adaptive restart O’Donoghue and Candès (2015) (RAGD), and a crippled version of Alg. 3 without negative curvature exploitation (C-Alg. 3). We compare the algorithms on the number of gradient steps, but note that the number of oracle queries per step varies between methods. We provide implementation details in Section D.1.

For our first experiment, we study robust linear regression with the smooth biweight loss Beaton and Tukey (1974),

The function $\phi$ is a robust modification of the quadratic loss; it is approximately quadratic for small errors, but insensitive to larger errors. For 1,000 independent experiments, we randomly generate problem data to create a highly non-convex problem as follows. We set $d=30$ and $m=60$ , and we draw $a_{i}\stackrel{{\scriptstyle\rm iid}}{{\sim}}\mathcal{N}(0,I_{d})$ . We generate $b$ as follows. We first draw a “ground truth” vector $z\sim\mathcal{N}(0,4I_{d})$ . We then set $b=Az+3\nu_{1}+\nu_{2}$ , where $\nu_{1}\sim\mathcal{N}(0,I_{m})$ and the elements of $\nu_{2}$ are i.i.d. Bernoulli $(0.3)$ . These parameters make the problem substantially non-convex.

In Figure 1 we plot aggregate convergence time statistics, as well as gradient norm and function value trajectories for a single representative problem instance. The figure shows that gradient descent and C-Alg. 3 (which does not exploit curvature) converge more slowly than the other methods. When C-Alg. 3 stalls it is detecting negative curvature, which implies the stalling occurs around saddle points. When negative curvature exploitation is enabled, Alg. 3 is faster than RAGD, but slower than NCG. In this highly non-convex problem, different methods often converge to local minima with (sometimes significantly) different function values. However, each method found the “best” local minimum in a similar fraction of the generated instances, so there does not appear to be a significant difference in the ability of the methods to find “good” local minima in this problem ensemble.

For the second experiment we fit a neural network modelOur approach in its current form is inapplicable to training neural networks of modern scale, as it requires computation of exact gradients. comprising three fully-connected hidden layers containing $20$ , $10$ and $5$ units, respectively, on the MNIST handwritten digits dataset LeCun et al. (1998) (see Section D.2). Figure 2 shows a substantial performance gap between gradient descent and the other methods, including Alg. 3. However, this is not due to negative curvature exploitation; in fact, Alg. 3 never detects negative curvature in this problem, implying AGD never stalls. Moreover, RAGD never restarts. This suggests that the loss function $f$ is “effectively convex” in large portions of the training trajectory, consistent with the empirical observations of Goodfellow et al. (2015); this phenomenon may merit further investigation.

We conclude that our approach can augment AGD in the presence of negative curvature, but that more work is necessary to make it competitive with established methods such as non-linear conjugate gradients. For example, adaptive schemes for setting $\alpha,\eta$ and $L_{1}$ must be developed. However, the success of our method may depend on whether AGD stalls at all in real applications of non-convex optimization.

Acknowledgment

OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research and NSF-CAREER award 1553086. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship.

References

Supplementary material

The proof is closely based on the proof of Theorem 3.18 of Bubeck (2014), which itself is based on the estimate sequence technique of Nesterov (2004). We modify the proof slightly to avoid arguments that depend on the global minimum of $f$ . This enables using inequalities (5) to prove the result, instead of $\sigma$ -strong convexity of the function $f$ .

We define $\sigma$ -strongly convex quadratic functions $\Phi_{s}$ by induction as

Using (5) with $u=w$ , straightforward induction shows that

We now prove (26) by induction. Note that it is true at $s=0$ since $x_{0}=y_{0}$ is the global minimizer of $\Phi_{0}$ . We have,

where inequality (a) follows from the definition $y_{s+1}=x_{s}-\frac{1}{L}\nabla f(x_{s})$ and the $L$ -smoothness of $f$ , inequality (b) is the induction hypothesis and inequality (c) is assumption (5) with $u=y_{s}$ .

Past this point the proof is identical to the proof of Theorem 3.18 of Bubeck (2014), but we continue for sake of completeness.

To complete the induction argument we now need to show that:

Note that $\nabla^{2}\Phi_{s}=\sigma I_{n}$ (immediate by induction) and therefore

Since by definition $\Phi_{s+1}(v_{s+1})=0$ , we have

Using (24), evaluating evaluating $\Phi_{s+1}$ at $x_{s}$ gives,

Examining this equation, it is seen that $v_{s}-x_{s}=\sqrt{\kappa}(x_{s}-y_{s})$ implies (27) and therefore concludes the proof of Proposition 1. We establish the relation $v_{s}-x_{s}=\sqrt{\kappa}(x_{s}-y_{s})$ by induction,

where the first equality comes from (28), the second from the induction hypothesis, the third from the definition of $y_{s+1}$ and the last one from the definition of $x_{s+1}$ . ∎

Appendix B Proofs from Section 4

We begin by proving the following normalized version of Lemma 5.

for some $A\geq 0$ . Then for any $\rho\geq 4$

where $\rho^{\prime}=\sqrt{\rho(\rho+2)}-2$ .

where the equality follows from the definition $(2+\rho^{\prime})^{2}=\rho(2+\rho)$ . We lower bound $\rho^{\prime}(2+\rho^{\prime})$ as

The fact that either (35) or (36) must hold implies (31).

With the auxiliary Lemma 8, we prove Lemma 5.

Additionally, since $f$ has $L_{3}$ -Lipschitz third order derivatives, $h^{\prime\prime\prime}$ is $\frac{1}{16}L_{3}\left\|{u-v}\right\|^{4}:=L$ Lipschitz continuous, so we may apply Lemma 8 at $\rho=2\eta/\left\|{u-v}\right\|\geq 4$ . Letting $\delta=(u-v)/\left\|{u-v}\right\|$ , we note that $h(1-\rho)=f(v-\eta\delta)$ . Similarly, for $2+\rho^{\prime}=\sqrt{\rho(2+\rho)}$ we have $h(1+\rho^{\prime})=f(u+\eta^{\prime}\delta)$ with $\eta^{\prime}$ given in line 2 of Exploit-NC-pair3. The result is now immediate from (31), as

where in the last transition we have used $\eta\leq\sqrt{\frac{2\alpha}{L_{3}}}$ . ∎

B.2 Proof of Lemma 6

We first state and prove a normalized version of the central argument in the proof of Lemma 6

Let $0\leq j<t$ be such that $v=x_{j}$ (such $j$ always exists by Corollary 1). If $j=0$ then $x_{j}=y_{0}$ and the result is trivial, so we assume $j\geq 1$ . Let

where $0<\omega<1$ is defined in line 1 of AGD-until-proven-guilty, and we have used the guarantee $\max\{f(y_{j-1}),f(y_{j})\}\leq f(y_{0})$ from Corollary 1. Moreover, by the Lipschitz continuity of the third derivatives of $f$ , $h^{\prime\prime\prime}$ is $L_{3}\left\|{y_{j}-y_{j-1}}\right\|^{4}$ -Lipschitz continuous. Therefore, we can apply Lemma 9 with $A=C=0$ and $B=D=\alpha\tau^{2}$ at $\theta=\omega$ and obtain

To complete the proof, we note that Lemma 3 guarantees $\left\|{y_{j}-y_{j-1}}\right\|\leq\left\|{y_{j}-y_{0}}\right\|+\left\|{y_{j-1}-y_{0}}\right\|\leq 2\tau$ and therefore

where we have used $\tau^{2}\leq\alpha/(16L_{3})$ . ∎

B.3 Proof of Lemma 7

Therefore, we can use Lemma 5 (with $\eta$ as defined above) to show that

By Corollary 1, $f(u)\leq\hat{f}(u)\leq\hat{f}(y_{0})=f(p_{k-1})$ . Moreover, since $f(b^{(1)})\geq f(y_{0})-\alpha\tau^{2}$ and $\tau=\sqrt{\frac{\alpha}{32L_{3}}}$ , we may apply Lemma 6 to obtain

B.4 Proof of Theorem 2

The proof proceeds exactly like the proof of Theorem 1. As argued there, the number of gradient evaluations is at most $2KT$ , where $K$ is number of iterations of Guarded-non-convex-AGD and $T$ is the maximum amount of steps performed in any call to AGD-until-proven-guilty.

We derive the upper bound on $K$ directly from Lemma 7, as by telescoping (13) we obtain

where the last transition follows from substituting (22), our choice of $\alpha$ . We therefore conclude that

where $\log_{+}(\cdot)$ is shorthand for $\max\{0,\log(\cdot)\}$ .

Finally, we use $\epsilon^{2/3}\leq\min\{\Delta_{f}^{1/2}L_{3}^{1/6},L_{1}/(8L_{3}^{1/3})\}$ to simplify the bounds on $K$ and $T$ . Using $1\leq\Delta_{f}L_{3}^{1/3}\epsilon^{-4/3}$ reduces (17) to

Applying $1\leq L_{1}/(8{L_{3}^{1/3}\epsilon^{2/3}})$ on (18) gives

Appendix C Adding a second-order guarantee

In this section, we sketch how to obtain simultaneous guarantees on the gradient and minimum eigenvalue of the Hessian. We use the $\widetilde{O}(\cdot)$ notation to hide logarithmic dependence on $\epsilon$ , Lipschitz constants $\Delta_{f},L_{1},L_{2},L_{3}$ and a high probability confidence parameter $\delta\in(0,1)$ , as well as lower order polynomial terms in $\epsilon^{-1}$ .

Using approximate eigenvector computation, we can efficiently generate a direction of negative curvature, unless the Hessian is almost positive semi-definite. More explicitly, there exist methods of the form Approx-Eig( $f$ , $x$ , $L_{1}$ , $\alpha$ , $\delta$ ) that require $\widetilde{O}(\sqrt{L_{1}/\alpha}\log d)$ Hessian-vector products to produce a unit vector $v$ such that whenever $\nabla^{2}f(x)\succeq-\alpha I$ , with probability at least $1-\delta$ we have $v^{T}\nabla^{2}f(x)v\leq-\alpha/2$ , e.g. the Lanczos method (see additional discussion in (Carmon et al., 2016, §2.2)). Whenever a unit vector $v$ satisfying $v^{T}\nabla^{2}f(x)v\leq-\alpha/2$ is available, we can use it to make function progress. If $\nabla^{2}f$ is $L_{2}$ -Lipschitz continuous then by Lemma 1 $f(x\pm\frac{\alpha}{L_{2}}v)<f(x)-\frac{\alpha^{3}}{12L_{2}^{2}}$ where by $f(x\pm z)$ we mean $\min\{f(x+z),f(x-z)\}$ . If instead $f$ has $L_{3}$ -Lipschitz continuous third-order derivatives then by Lemma 4, $f(x\pm\sqrt{\frac{2\alpha}{L_{3}}}v)<f(x)-\frac{\alpha^{2}}{4L_{3}}$ .

We can combine Approx-Eig with Algorithm 3 that finds a point with a small gradient as follows:

As discussed above, under third order smoothness , $\eta=\sqrt{{2\alpha}/{L_{3}}}$ guarantees that the step (41c) makes at least $\alpha^{2}/(4L_{3})$ function progress whenever $v_{k}^{T}\nabla^{2}f(\hat{z}_{k})v_{k}\leq-\alpha/2$ . Therefore the above iteration can run at most $\widetilde{O}(\Delta_{f}L_{3}/\alpha^{2})$ times before $v_{k}^{T}\nabla^{2}f(\hat{z}_{k})v_{k}\geq-\alpha/2$ is satisfied. Whenever $v_{k}^{T}\nabla^{2}f(\hat{z}_{k})v_{k}\geq-\alpha/2$ , with probability $1-\delta^{\prime}\cdot k$ we have the Hessian guarantee $\nabla^{2}f(\hat{z}_{k})\succeq-\alpha I$ . Moreover, $\left\|{\nabla f(\hat{z}_{k})}\right\|\leq\epsilon$ always holds. Thus, by setting $\alpha=L_{3}^{1/3}\epsilon^{2/3}$ we obtain the required second order stationarity guarantee upon termination of the iterations (41).

It remains to bound the computational cost of the method, with $\alpha=L_{3}^{1/3}\epsilon^{2/3}$ . The total number of Hessian-vector products required by Approx-Eig is,

Moreover, it is readily seen from the proof of Theorem 2 that every evaluation of (41a) requires at most

gradient and function evaluations. By telescoping the first term and multiplying the second by $\widetilde{O}(\Delta_{f}L_{3}/\alpha^{2})$ , we guarantee $\left\|{\nabla f(x)}\right\|\leq\epsilon$ and $\nabla^{2}f(x)\succeq-L_{3}^{1/3}\epsilon^{2/3}I$ in at most $\widetilde{O}(\Delta_{f}L_{1}^{1/2}L_{3}^{1/6}\epsilon^{-5/3}\log d)$ function, gradient and Hessian-vector product evaluations.

The argument above is the same as the one used to prove Theorem 4.3 of Carmon et al. (2016), but our improved guarantees under third order smoothness allows us get a better $\epsilon$ dependence for the complexity and lower bound on the Hessian in that regime. If instead we use the second order smoothness setting, we recover exactly the guarantees of Carmon et al. (2016); Agarwal et al. (2016), namely $\left\|{\nabla f(x)}\right\|\leq\epsilon$ and $\nabla^{2}f(x)\succeq-L_{2}^{1/2}\epsilon^{1/2}I$ in at most $\widetilde{O}(\Delta_{f}L_{1}^{1/2}L_{2}^{1/4}\epsilon^{-7/4}\log d)$ function, gradient and Hessian-vector product evaluations.

Finally, we remark that the above analysis would still apply if in (41a) we replace Guarded-non-convex-AGD with any method with a run -time guarantee of the form (42). The resulting method will guarantee whatever the original method does, and also $\nabla^{2}f(x)\succeq-\alpha I$ . In particular, if the first method guarantees a small gradient, the combined method guarantees convergence to second-order stationary points.

Appendix D Experiment details

Both gradient descent and AGD are based on gradients steps of the form

In practice $L_{1}$ is often unknown and non-uniform, and therefore needs to be estimated adaptively. A common approach is backtracking line search, which we use for conjugate gradient. However, combining line search with AGD without invalidating its performance guarantees would involve non-trivial modification of the proposed method. Therefore, for the rest of the methods we keep an estimate of $L_{1}$ , and double it whenever the gradient steps fails to make sufficient progress. That is, whenever

we set $L_{1}\leftarrow 2L_{1}$ and try again. In all experiments we start with $L_{1}=1$ , which underestimates the actual smoothness of $f$ by 2-3 orders of magnitude. We call our scheme for setting $L_{1}$ semi-adaptive, since we only increase $L_{1}$ , and therefore do not adapt to situations where the function becomes more smooth as optimization progresses. Thus, we avoid painstaking tuning of $L_{1}$ while preserving the ‘fixed step-size’ nature of our approach, as $L_{1}$ is only doubled a small number of times.

We implement Guarded-non-convex-AGD with the following modifications, indented to make it more practical without substantially compromising its theoretical properties.

We use the semi-adaptive scheme described above to set $L$ . Specifically, whenever the gradient steps in lines 3 and 3 of AGD-until-proven-guilty and Certify-progress respectively fail, we double $L$ until it succeeds, terminate AGD-until-proven-guilty and multiply $L_{1}$ by the same factor.

We make the input parameters for AGD-until-proven-guilty dynamic. In particular, we set $\epsilon^{\prime}=\left\|{\nabla f(p_{k-1})}\right\|/10$ and use $\alpha=\sigma=C_{1}\left\|{\nabla f(p_{k-1})}\right\|^{2/3}$ , where $C_{1}$ is a hyper-parameter. We use the same value of $\alpha$ to construct $\hat{f}$ . This makes our implementation independent on the final desired accuracy $\epsilon$ .

Since this inequality is a clear convexity violation, we return $w_{t}=y_{t}$ whenever it holds. We find that this substantially increases our method’s capability of detecting negative curvature; most of the non-convexity detection in the first experiment is due to this check.

for the $2t$ points of the form $v=x_{j}$ and $u=y_{j}$ or $u=w_{t}$ , with $0\leq j<t$ , where here we use the original $f$ rather than $\hat{f}$ given to AGD-until-proven-guilty. We discard all pairs with $\alpha_{v,u}<0$ (no evidence of negative curvature), and select the 5 pairs with highest value of $\alpha_{v,u}$ . For each selected pair $v,u$ , we exploit negative curvature by testing all the points of the form $\{z\pm\eta\delta\}$ with $\delta=(u-v)/\left\|{u-v}\right\|$ , $z\in\{v,u\}$ and $\eta$ in a grid of 10 points log-uniformly spaced between $0.01\left\|{u-v}\right\|$ and $100(\left\|{u}\right\|+\left\|{v}\right\|)$ .

The hyper-parameter $C_{1}$ was tuned separately for each experiment by searching on a small grid. For the regression experiment the tuning was performed on different problem instances (different seeds) than the ones reported in Fig. 1. For the neural network training problem the tuning was performed on a subsample of 10% one reported in Fig. 2. The specific parameters used were $C_{1}=0.01$ for regression and $C_{1}=0.1$ for neural network training.

This method is identical to the one described above, except that at every iteration $p_{k}$ is set to $b^{(1)}$ produced by Find-best-iterate3 (i.e. the output of negative curvature exploitation is never used). We used the same hyper-parameters described above.

Gradient descent descent is simply (43), with $y_{t+1}=x_{t+1}$ , where the semi-adaptive scheme is used to set $L_{1}$ .

We use the accelerated gradient descent scheme of Beck and Teboulle (2009) with $\omega_{t}=t/(t+3)$ . We use the restart scheme given by O’Donoghue and Candès (2015) where if $f(y_{t})>f(y_{t-1})$ then we restart the algorithm from the point $y_{t}$ . For the gradient steps we use the same semi-adaptive procedure described above and also restart the algorithm whenever the $L_{1}$ estimate changes (restarts performed for this reason are not shown in Fig. 1 and 2).

The method is given by the following recursion Polak and Ribière (1969),

where $\delta_{0}=0$ and $\eta_{t}$ is found via backtracking line search, as follows. If $\delta^{T}\nabla f(x_{t})\geq 0$ we set $\delta_{t}=-\nabla f(x_{t})$ (truncating the recursion). We set $\eta_{t}=2\eta_{t-1}$ and then check whether

holds. If it does we keep the value of $\eta_{t}$ , and if it does not we set $\eta_{t}=\eta_{t}/2$ and repeat. The key difference from the semi-adaptive scheme used for the rest of the methods is the initialization $\eta_{t}=2\eta_{t-1}$ , that allows the step size to grow. Performing line search is crucial for conjugate gradient to succeed, as otherwise it cannot produce approximately conjugate directions. If instead we use the semi-adaptive step size scheme, performance becomes very similar to that of gradient descent.

In the figures, the x-axis is set to the number of steps performed by the methods. We do this because it enables a one-to-one comparison between the steps of the restarted AGD and Algorithm 3. However, Algorithm 3 requires twice the number of gradient evaluations per step of the other algorithms. Furthermore, the number of function evaluations of Algorithm 3 increases substantially when we exploit negative curvature, due to our naive grid search procedure. Nonetheless, we believe it is possible to derive a variation of our approach that performs only one gradient computation per step, and yet maintains similar performance (see remark after Corollary 1, and that effective negative curvature exploitation can be carried out with only few function evaluations, using a line search.

While the rest of the methods tested require one gradient evaluation per step, the required number of function evaluations differs. GD requires only one function evaluation per step, while RAGD evaluates $f$ twice per step (at $x_{t}$ and $y_{t}$ ); the number of additional function evaluations due to the semi-adaptive scheme is negligible. NCG is expected to require more function evaluations due to its use of a backtracking line search. In the first experiment, NCG required 2 function evaluations per step on average, indicating that its $L_{1}$ estimate was stable for long durations. Alg. 3 required 5.3 function evaluations per step (on average over the 1,000 problem instances, with standard deviation 0.5), putting the amortized cost of our crude negative curvature exploitation scheme at 3.3 function evaluations per step.

D.2 Neural network training

The function $f$ is the average cross-entropy loss of 10-way prediction of class labels from input features. The prediction if formed by applying softmax on the output of a neural network with three hidden layers of 20, 10 and 5 units and $\tanh$ activations. To obtain data features we perform the following preprocessing, where the training examples are treated as $28^{2}$ dimensional vectors. First, each example is separately normalized to zero mean and unit variance. Then, the $28^{2}\times 28^{2}$ data covariance matrix is formed, and a projection to the 10 principle components is found via eigen-decomposition. The projection is then applied to the training set, and then each of the 10 resulting features is normalized to have zero mean and unit variance across the training set. The resulting model has $d=545$ parameters and underfits the 60,000 examples training set. We randomly initialize the weights according the well-known scaling proposed by Glorot and Bengio (2010). We repeated the experiment for 10 different initializations of the weights, and all results were consistent with those reported in Fig. 2.