Accelerated Methods for Non-Convex Optimization

Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford

Introduction

In this paper, we consider the optimization problem

The simplest method for obtaining a guarantee of the form (2) is gradient descent (GD). It is well-known that if GD begins from a point x1x_{1}, then for any Δff(x1)infxf(x)\Delta_{f}\geq f(x_{1})-\inf_{x}f(x) and ϵ>0\epsilon>0, it finds a point satisfying the bound (2) in O(ΔfL1ϵ2)O(\Delta_{f}L_{1}\epsilon^{-2}) iterations. If one additionally assumes that the function ff is convex, substantially more is possible: GD then requires only O(RL1ϵ1)O(RL_{1}\epsilon^{-1}) iterations, where RR is an upper bound on the distance between x1x_{1} and the set of minimizers of ff. Moreover, in the same note , Nesterov also shows that acceleration and regularization techniques can reduce the iteration complexity to O~(RL1ϵ1/2)\widetilde{O}(\sqrt{RL_{1}}\epsilon^{-1/2}).The notation O~\widetilde{O} hides logarithmic factors. See Definition 5.

In this paper, we ask a natural question: using only gradient information, is it possible to improve on the ϵ2\epsilon^{-2} iteration complexity of gradient descent in terms of number of gradient calculations? We answer the question in the affirmative, providing an algorithm that requires at most

gradient and Hessian-vector product evaluations to find an xx such that f(x)ϵ\left\|{\nabla f(x)}\right\|\leq\epsilon. For a summary of our results in relation to other work, see Table 1.

Another advantage of the cubic-regularized Newton method is that it provides a second-order guarantee of the form 2f(x)ϵI\nabla^{2}f(x)\succeq-\sqrt{\epsilon}I, thus giving a rate of convergence to points with zero gradient and positive semi-definite Hessian. Such second-order stationary points are finer approximations of local minima compared to first-order stationary points (with zero gradient). Our approach also provides this guarantee, and is therefore an example of a first-order method that converges to a second-order stationary point in time polynomial in the desired accuracy and with logarithmic dependence on the problem dimension. A notable consequence of this approach is that for strict saddle functions —with only non-degenerate stationary points—our approach converges linearly to local minimizers. We discuss this result in detail in Section 5.

In the optimization and machine learning literature, there has been substantial recent work on the convergence properties of optimization methods for non-convex problems. One line of work investigates the types of local optima to which gradient-like methods converge, as well as convergence rates. In this vein, under certain reasonable assumptions (related to geometric properties of saddle points), Ge et al. show that stochastic gradient descent (SGD) converges to second-order local optima (stationary points with positive semidefinite Hessian), while Lee et al. show that GD generically converges to second-order local optima. Anandkumar and Ge extend these ideas, showing how to find a point that approximately satisfies the third-order necessary conditions for local optimality in polynomial time. While these papers used second-order smoothness assumptions to ensure convergence to stronger local minima than the simple stationary condition (2), they do not improve rates of convergence to stationarity.

A second line of work focuses on improving the slow convergence rates of SGD to stationary points (typically O(ϵ4)O(\epsilon^{-4}) stochastic gradient evaluations are sufficient ) under appropriate structural conditions on ff. One natural condition—common in the statistics and machine learning literature—is that ff is the sum of nn smooth non-convex functions. Indeed, the work of Reddi et al. and Allen-Zhu and Hazan achieves a rate of convergence O(ϵ2)O(\epsilon^{-2}) for such problems without performing the nn gradient evaluations (one per function) that standard gradient descent requires in each iteration. These analyses extend variance-reduction techniques that apply to incremental convex optimization problems . Nonetheless, they do not improve on the O(ϵ2)O(\epsilon^{-2}) iteration complexity of GD.

Additionally, a number of researchers apply accelerated gradient methods to non-convex optimization problems, though we know no theoretical guarantees giving improved performance over standard gradient descent methods. Ghadimi and Lan show how to modify Nesterov’s accelerated gradient descent method so that it enjoys the same convergence guarantees as gradient descent on non-convex optimization problems, while maintaining the accelerated (optimal) first-order convergence rates for convex problems. Li and Lin develop an accelerated method for non-convex optimization and show empirically that on (non-convex) sparse logistic regression test problems their methods outperform other methods, including gradient descent.

While the subproblem that appears in the cubic-regularized Newton method is expensive to solve exactly, it is possible to consider methods in which such subproblems are solved only approximately by a low complexity Hessian-free procedure. A number of researchers investigate this approach, including Cartis et al. and Bianconcini et al. . These works exhibit strong empirical results, but their analyses do not improve on the O(ϵ2)O(\epsilon^{-2}) evaluation complexity of gradient descent. Recently, Hazan and Koren and Ho-Nguyen and Kılınc-Karzan have shown how to solve the related quadratic non-convex trust-region problem using accelerated first-order methods; both these papers use accelerated eigenvector computations as a primitive, similar to our approach. It is therefore natural to ask whether acceleration can give faster convergence to stationary points of general non-convex functions, a question we answer in the affirmative.

Concurrently to and independently of this paper,A preprint of the current paper appears on the arXiv . Agarwal et al. also answer this question affirmatively. They develop a method that uses fast approximate matrix inversion as a primitive to solve cubic-regularized Newton-type steps , and applying additional acceleration techniques they show how to find stationary points of general smooth non-convex objectives. Though the technical approach is somewhat different, their convergence rates to ϵ\epsilon-stationary points are identical to ours. They also specialize their technique to problems of the finite sum form f=1ni=1nfif=\frac{1}{n}\sum_{i=1}^{n}f_{i}, showing that additional improvements in terms of nn are achievable.

2 Our approach

Our method is in the spirit of the techniques that underly accelerated gradient descent (AGD). While Nesterov’s 1983 development of acceleration schemes may appear mysterious at first, there are multiple interpretations of AGD as the careful combination of different routines for function minimization. The estimate sequence ideas of Nesterov and proximal point proofs show how to view accelerated gradient descent as a trade-off between building function lower bounds and directly making function progress. Bubeck et al. develop an AGD algorithm with a geometric interpretation based on shrinking spheres, while the work of Allen-Zhu and Orecchia shows that AGD may be viewed as a coupling between mirror descent and gradient descent; this perspective highlights how to trade each method’s advantages in different scenarios to achieve faster—accelerated—running time.

We follow a similar template of leveraging two competing techniques for making progress on computing a stationary point, but we deviate from standard analyses involving acceleration in our coupling of the algorithms. The first technique we apply is fairly well known. If the problem is locally non-convex, the Hessian must have a negative eigenvalue. In this case, under the assumption that the Hessian is Lipschitz continuous, moving in the direction of the corresponding eigenvector must make progress on the objective. Nesterov and Polyak (and more broadly, the literature on cubic regularization) use this implicitly, while other researchers use this more explicitly to escape from saddle points.

The second technique is the crux of our approach. While L1L_{1}-Lipschitz continuity of f\nabla f ensures that the smallest eigenvalue of the Hessian is at least L1-L_{1}, we show that any stronger bound—any deviation from this “worst possible” negative curvature—allows us to improve upon gradient descent. We show that if the smallest eigenvalue is at least γ-\gamma, which we call γ-\gamma-strong convexity, we can apply proximal point techniques and accelerated gradient descent to a carefully constructed regularized problem to obtain a faster running time. Our procedure proceeds by approximately minimizing a sequence of specially constructed such functions. This procedure is of independent interest since it can be applied in a standalone manner whenever the function is known to be globally almost convex.

By combining these procedures, we achieve our result. We run an accelerated (single) eigenvector routine—also known as principle components analysis (PCA)—to estimate the eigenvector corresponding to the smallest eigenvalue of the Hessian. Depending on the estimated eigenvalue we either move along the approximate eigenvector or apply accelerated gradient descent to a regularized sub-problem, where we carefully construct the regularization based on this smallest eigenvalue. Trading between these two cases gives our improved running time. We remark that an improvement over gradient descent is obtainable even if we use a simpler (non-accelerated) method for estimating eigenvectors, such as the power method. That said, an accelerated gradient descent subroutine for the regularized sub-problems we solve appears to be crucial to achieving faster convergence rates than gradient descent.

The remainder of the paper is structured as follows. Section 2 introduces the notation and existing results on which our approach is based. Section 3.1 introduces our method for accelerating gradient descent on “almost convex” functions, while Section 3.2 presents and explains our “negative curvature descent” subroutine. Section 4 joins the two building blocks to obtain our main result, while in Section 5, we show how our results give linear convergence to local minima for strict-saddle functions.

Notation and standard results

We assume throughout without further mention that ff is L1L_{1}-smooth, has L2L_{2}-Lipschitz continuous Hessian, and has optimality gap Δf<\Delta_{f}<\infty at the initial search point, generally denoted z1z_{1}.

The next definition is atypical, because we allow the strong convexity parameter σ1\sigma_{1} to be negative. Of course, if σ1<0\sigma_{1}<0 the function may be non-convex, but we can use σ1\sigma_{1} to bound the extent to which the function is non-convex, similar to the ideas of lower C2C^{2}-functions in variational analysis . As we show in Lemma 3.1 this “almost convexity” allows improvements in runtime over gradient descent.

The next three results are standard but useful lemmas using the definitions above.

Let ff be L1L_{1}-smooth and μ\mu-strongly convex. Then for all xx the minimizer xx^{*} of ff satisfies

Lemma 2.1 guarantees any L1L_{1}-smooth function is (L1)(-L_{1})-strongly convex. A key idea in this paper is that if a function is (σ1)(-\sigma_{1})-strongly convex with σ10\sigma_{1}\geq 0, standard convex proximal methods are still applicable, provided the regularization is sufficiently large. The following trivial lemma, stated for later reference, captures this idea.

Throughout this paper, we use a fully non-asymptotic big-O notation to be clear about the convergence rates of the algorithms we analyze and to avoid confusion involving relative values of problem-dependent constants (such as d,L1,L2d,L_{1},L_{2}).

Throughout, we take S[0,)6\mathcal{S}\subset[0,\infty)^{6} to be the set of tuples (ϵ,δ,L1,L2,Δf,d)(\epsilon,\delta,L_{1},L_{2},\Delta_{f},d); sometimes we require the tuples to meet certain assumptions that we specify. The notation O~()\widetilde{O}(\cdot) hides logarithmic factors in problem parameters: we say that M1=O~(M2)M_{1}=\widetilde{O}(M_{2}) if M1=O(M2log(1+L1+L2+Δf+d+1/δ+1/ϵ))M_{1}=O(M_{2}\log(1+L_{1}+L_{2}+\Delta_{f}+d+1/\delta+1/\epsilon)).

Because we focus on gradient-based procedures, we measure the running time of our algorithms in terms of gradient operations, each of which we assume takes a (problem-dependent) amount of time Tgrad\mathsf{T}_{\rm grad}. The following assumption specifies this more precisely.

The following operations take O(Tgrad)O(\mathsf{T}_{\rm grad}) time:

Any arithmetic operation (addition, subtraction or multiplication) of two vectors of dimension at most dd.

Based on Assumption A, we call an algorithm Hessian free if its basic operations take time at most O(Tgrad)O(\mathsf{T}_{\rm grad}).

for some small h>0h>0. By Lemma 2.2, we immediately have

which allows sufficiently precise calculation by taking hh small. We assume infinite precision arithmetic in this paper: see discussion in Section 2.2.

In a number of concrete cases, Hessians have structure that allows efficient computation of the product v2f(x)vv\mapsto\nabla^{2}f(x)v. For example, in neural networks, one may compute 2f(x)v\nabla^{2}f(x)v using a back-propagation-like technique at the cost of at most two gradient evaluations . \clubsuit

With the basic lemmas and definitions in place, we now recapitulate some of the classical development of accelerated methods. First, the following pseudo-code gives Nesterov’s classical accelerated gradient descent method for strongly convex functions .

The method Accelerated-gradient-descent enjoys the following essentially standard guarantee when initialized at any iterate z1z_{1} satisfying f(z1)infxf(x)Δff(z_{1})-\inf_{x}f(x)\leq\Delta_{f}.

Proof. Let zz^{*} be the minimizer of ff. If ϵ24L12Δf/σ1\epsilon^{2}\geq 4L_{1}^{2}\Delta_{f}/\sigma_{1}, then

where inequality (i)(i) follows from smoothness of ff (Def. 1), inequality (ii)(ii) by the strong convexity of ff (Lemma 2.4), and inequality (iii)(iii) by the definition of Δf\Delta_{f}. Thus the iteration ends at j=1j=1.

For smaller ϵ\epsilon, we let κ=L1/σ11\kappa=L_{1}/\sigma_{1}\geq 1 denote the condition number for the problem. Then Nesterov [31, Theorem 2.2.2] shows that for k>1k>1

Taking any k1+κlog4L1κΔfϵ2k\geq 1+\sqrt{\kappa}\log\frac{4L_{1}\kappa\Delta_{f}}{\epsilon^{2}} yields

Noting that f(x)22L1(f(x)f(z))\left\|{\nabla f(x)}\right\|^{2}\leq 2L_{1}(f(x)-f(z^{*})) by Lemma 2.3, we obtain our result. ∎

2 Building block 2: fast eigenvector computation

The final building block we use is accelerated approximate leading eigenvector computation. We consider two types of approximate eigenvectors. By a relative ε\varepsilon-approximate leading eigenvector of a positive semidefinite (PSD) matrix HH, we mean a vector vv such that v=1\left\|{v}\right\|=1 and vTHv(1ε)λmax(H)v^{T}Hv\geq(1-\varepsilon)\lambda_{\max}(H); similarly, an additive ε\varepsilon-approximate leading eigenvector of HH satisfies v=1\left\|{v}\right\|=1 and vTHvλmax(H)εv^{T}Hv\geq\lambda_{\max}(H)-\varepsilon. A number of methods compute such (approximate) leading eigenvectors, including the Lanczos method . For concreteness, we state one lemma here, where in the lemma we let Thess\mathsf{T}_{\rm hess} denote the larger of the times required to compute the matrix-vector product HvHv or to add two vectors.

Notably, the Lanczos method [23, Theorem 3.2] achieves this complexity guarantee. While Lemma 2.6 relies on infinite precision arithmetic (the stability of the Lanczos method is an active area of research ), shift-and-invert preconditioning also achieves the convergence guarantee to within poly-logarithmic factors in bounded precision arithmetic. This procedure reduces computing the top eigenvector of the matrix HH to solving a sequence of linear systems, and using fast gradient descent to solve the linear systems guarantees the running time in Lemma 2.6. (See Section 8 of for the reduction and Theorem 4.1 of for another statement of the result.) For simplicity—because we do not focus on such precision issues—we use Lemma 2.6 in the sequel.

For later use, we include a corollary of Lemma 2.6 in application to finding minimum eigenvectors of the Hessian 2f(x)\nabla^{2}f(x) using matrix-vector multiplies. Recalling that ff is L1L_{1}-smooth, we know that the matrix M:=L1I2f(x)M:=L_{1}I-\nabla^{2}f(x) is PSD, and its eigenvalues are {L1Iλi}i=1d[0,2L1]\{L_{1}I-\lambda_{i}\}_{i=1}^{d}\subset[0,2L_{1}], where λi\lambda_{i} is the iith eigenvalue of 2f(x)\nabla^{2}f(x). The procedure referenced in Lemma 2.6 (Lanczos or another accelerated method) applied to the matrix MM thus, with probability at least 1δ1-\delta, provides a vector v^\widehat{v} with v^=1\|{\widehat{v}}\|=1 such that

in time O(Tgradε12logdδ)O(\mathsf{T}_{\rm grad}\varepsilon^{-\frac{1}{2}}\log\frac{d}{\delta}). Rearranging this, we have

and substituting ϵ/(2L1)\epsilon/(2L_{1}) for ε\varepsilon yields the following summarizing corollary.

Two structured non-convex problems

With our preliminary results established, in this section we turn to two methods that form the core of our approach. Roughly, our overall algorithm will be to alternate between finding directions of negative curvature of ff and solving structured sub-problems that are nearly convex, meaning that the smallest eigenvalue of the Hessian has a lower bound γ-\gamma, γ>0\gamma>0, where γL1\gamma\ll L_{1}. We turn to each of these pieces in turn.

The second main component of our general accelerated method is a procedure for finding stationary points of smooth non-convex functions that are not too non-convex. By not too non-convex, we mean γ\gamma-almost convexity, as in Def. 4, that is, that

where γ0\gamma\geq 0. The next procedure applies to such almost convex functions, and builds off of a suggestion of Nesterov to use regularization coupled with accelerated gradient descent to improve convergence guarantees for finding a stationary point of ff. The idea, as per Lemma 2.4, is to add a regularizing term of the form γzz02\gamma\|{z-z_{0}}\|^{2} to make the γ\gamma-almost convex function ff become γ\gamma-strongly convex. As we describe in the sequel, we solve a sequence j=1,2,j=1,2,\ldots of such proximal sub-problems

quickly using accelerated gradient descent. Whenever γL1\gamma\ll L_{1}, the regularized model gjg_{j} of ff has better fidelity to ff than the model f(z)+L12zzj2f(z)+\frac{L_{1}}{2}\left\|{z-z_{j}}\right\|^{2} (which is essentially what gradient descent attempts to minimize), allowing us to make greater progress in finding stationary points of ff. We now present the Almost-convex-AGD procedure.

Recalling the definition Δff(z1)infxf(x)\Delta_{f}\geq f(z_{1})-\inf_{x}f(x), we have the following convergence guarantee.

Before providing the proof, we remark that the runtime guarantee (5) is an improvement over the convergence guarantees of standard gradient descent—which scale as O(TgradΔfL1ϵ2)O(\mathsf{T}_{\rm grad}\Delta_{f}L_{1}\epsilon^{-2})—whenever γL1\gamma\ll L_{1}.

Proof. Because ff is σ1-\sigma_{1}-strongly convex and γσ1\gamma\geq\sigma_{1}, Lemma 2.4 guarantees that gjg_{j} is γ\gamma-strongly convex. This strong convexity also guarantees that gjg_{j} has a unique minimizer, which we denote zjz^{*}_{j}.

Thus we have gj(zj+1)gj(zj)g_{j}(z_{j+1})\leq g_{j}(z_{j}) and

Equation (6) shows that to bound the number of iterations of the algorithm it suffices to lower bound the differences zj+1zj\|z_{j+1}-z_{j}\|. Using the condition gj(zj+1)ϵγ50(L1+2γ)110ϵ\|\nabla g_{j}(z_{j+1})\|\leq\epsilon\sqrt{\frac{\gamma}{50(L_{1}+2\gamma)}}\leq\frac{1}{10}\epsilon, we have

where the inequality is a consequence of the triangle inequality. By our termination criterion, we know that if j+1<jj+1<j_{*} then f(zj+1)ϵ\|\nabla f(z_{j+1})\|\geq\epsilon and therefore zj+1zj9ϵ20γϵγ5\|z_{j+1}-z_{j}\|\geq\frac{9\epsilon}{20\gamma}\geq\frac{\epsilon}{\gamma\sqrt{5}}. Substituting this bound into (6) yields

Note that the method calls Accelerated-gradient-descent (Line 7) with accuracy parameter ϵ=ϵγ/(50(L1+2γ))\epsilon^{\prime}=\epsilon\sqrt{\gamma/(50(L_{1}+2\gamma))}; using γL1\gamma\leq L_{1} we may apply Lemma 2.5 to bound the running time of each call by

The method Almost-convex-AGD performs at most jj^{*} iterations (Eq. (7)), and combining the preceding display with this iteration bound yields the running time (5).

All that remains is to prove the progress bound (4). By application of the triangle inequality and Jensen’s inequality, we have

Combing this bound with the earlier progress guarantee (6) yields f(z1)f(zj)γjzjz12f(z_{1})-f(z_{j^{*}})\geq\frac{\gamma}{j^{*}}\|z_{j}^{*}-z_{1}\|^{2}, and since by (7) either j1j^{*}\leq 1 or j10γϵ2[f(z1)f(zj)]j^{*}\leq 10\frac{\gamma}{\epsilon^{2}}[f(z_{1})-f(z_{j^{*}})] the result follows. ∎

2 Exploiting negative curvature

Our first sub-routine either declares the problem locally “almost convex” or finds a direction of ff that has negative curvature, meaning a direction vv such that vT2f(x)v<0v^{T}\nabla^{2}f(x)v<0. The idea to make progress on ff by moving in directions of descent on the Hessian is of course well-known, and relies on the fact that if at a point zz the function ff is “very” non-convex, i.e. λmin(2f(z))α/2\lambda_{\min}(\nabla^{2}f(z))\leq-\alpha/2 for some α>0\alpha>0, then we can reduce the objective significantly (by a constant fraction of L22α3L_{2}^{-2}\alpha^{3} at least) by taking a step in a direction of negative curvature. Conversely, if λmin(2f(z))α/2\lambda_{\min}(\nabla^{2}f(z))\geq-\alpha/2, the function ff is “almost convex” in a neighborhood of zz, suggesting that gradient-like methods on ff directly should be effective. With this in mind, we present the routine Negative-curvature-descent, which, given a function ff, initial point z1z_{1}, and a few additional tolerance parameters, returns a vector zz decreasing ff substantially by moving in Hessian-based directions.

Furthermore, each iteration requires time at most

Proof. Assume that the method has not terminated at iteration kk. Let

denote the step size used at iteration kk, so that zk+1=zkηkvkz_{k+1}=z_{k}-\eta_{k}v_{k} as in Line 5. By the L2L_{2}-Lipschitz continuity of the Hessian, we have

Noting that ηkvkTf(zk)0\eta_{k}v_{k}^{T}\nabla f(z_{k})\geq 0 by construction, we rearrange the preceding inequality to obtain

where inequality (i)(i) uses that vkT2f(zk)vk>α/2|v_{k}^{T}\nabla^{2}f(z_{k})v_{k}|>\alpha/2, as the stopping criterion has not been met. Telescoping the above equation for k=1,2,,j1k=1,2,\ldots,j-1, we conclude that at the final iteration

We turn to inequality (9). Recall the definition of δ=δ1+12L22Δf/α3\delta^{\prime}=\frac{\delta}{1+12L_{2}^{2}\Delta_{f}/\alpha^{3}}, which certainly satisfies δδ/j\delta^{\prime}\leq\delta/j if jj is the final iteration of the algorithm (as the bound (8) is deterministic). Now, at the last iteration, we have by definition of the final iterate that vjT2f(zj)vjα2v_{j}^{T}\nabla^{2}f(z_{j})v_{j}\geq-\frac{\alpha}{2}, and thus, if vjv_{j} is an additive α/2\alpha/2-approximate smallest eigenvector, we have λmin(2f(zj))vjT2f(zj)vjα\lambda_{\min}(\nabla^{2}f(z_{j}))\geq v_{j}^{T}\nabla^{2}f(z_{j})v_{j}-\alpha. Applying a union bound, the probability that the approximate eigenvector method fails to return an α/2\alpha/2-approximate eigenvector in any iteration is bounded by δjδ\delta^{\prime}j\leq\delta, giving the result.

Finally, equation (10) is immediate by Corollary 2.7. ∎

An accelerated gradient method for non-convex optimization

Now that we have provided the two subroutines Negative-curvature-descent and Almost-convex-AGD, which (respectively) find directions of negative curvature and solve nearly convex problems, we combine them carefully to provide an accelerated gradient method for smooth non-convex optimization. The idea behind our Accelerated-non-convex-method is as follows. At the beginning of each iteration kk we use Negative-curvature-descent to make progress until we reach a point x^k\widehat{x}_{k} where the function is almost convex (Def. 4) in a neighborhood of the current iterate. For a parameter α0\alpha\geq 0, we define the convex penalty

where [t]+=max{t,0}\left[{t}\right]_{+}=\max\{t,0\}. We then modify the function f(x)f(x) by adding the penalty ρα\rho_{\alpha} and defining

The function fk(x)f_{k}(x) is globally almost convex, as we show in Lemma 4.1 to come, so that the method Almost-convex-AGD applied to the function fk(x)f_{k}(x) quickly reduces the objective ff. We trade between curvature minimization and accelerated gradient using the parameter α\alpha in the definition (11) of ρ\rho, which governs acceptable levels of non-convexity. By carefully choosing α\alpha, the combined method has convergence rate O~(ϵ7/4)\widetilde{O}(\epsilon^{-7/4}), which we we prove in Theorem 4.3.

Before coming to the theorem giving a formal guarantee for Accelerated-non-convex-method, we provide two technical lemmas showing that the internal subroutines are well-behaved. The first lemma confirms that the regularization technique (11) transforms a locally almost convex function into a globally almost convex function (Def. 4), so we can efficiently apply Almost-convex-AGD to it.

Proof. It is clear that ρ=ρα\rho=\rho_{\alpha} is convex, as it is an increasing convex function of a positive argument [9, Chapter 3.2]. We claim that ρ\rho is 4L14L_{1}-smooth. Indeed, the gradient

is continuous by inspection and differentiable except at x=αL2\left\|{x}\right\|=\frac{\alpha}{L_{2}}. For x<α/L2\|x\|<\alpha/L_{2}, we have 2ρ(x)=0\nabla^{2}\rho(x)=0, and for x>α/L2\|x\|>\alpha/L_{2} we have

which satisfies 02ρ(x)4L1I0\preceq\nabla^{2}\rho(x)\preceq 4L_{1}I for all xx. As ρ(x)\nabla\rho(x) is continuous, we conclude that ρ\rho is 4L14L_{1}-smooth. The L1L_{1}-smoothness of ff then implies that the sum f(x)+ρ(xx0)f(x)+\rho(x-x_{0}) is 5L15L_{1} smooth.

To argue almost convexity of f+ρf+\rho, we show that 2f(x)+2ρ(xx0)3αI\nabla^{2}f(x)+\nabla^{2}\rho(x-x_{0})\succeq-3\alpha I almost everywhere, which is equivalent to Definition 4 when the gradient is continuous. For xx0<2α/L2\|x-x_{0}\|<2\alpha/L_{2}, we have by Lipschitz continuity of 2f\nabla^{2}f that

which implies the result because ρ\rho is convex. For xx0>2α/L2\|x-x_{0}\|>2\alpha/L_{2}, inspection of the Hessian (12) shows that 2ρ(xx0)L1I\nabla^{2}\rho(x-x_{0})\succeq L_{1}I. Since 2f(x)L1I\nabla^{2}f(x)\succeq-L_{1}I almost everywhere by the L1L_{1}-smoothness of ff, we conclude that 2f(x)+2ρ(xx0)0\nabla^{2}f(x)+\nabla^{2}\rho(x-x_{0})\succeq 0 whenever 2f(x)\nabla^{2}f(x) exists. ∎

The next lemma provides a high probability guarantee on the correctness and number of iterations of Accelerated-non-convex-method. (There is randomness in the eigenvector computation subroutine invoked within Negative-curvature-descent.) As always, we let Δff(x1)infxf(x)\Delta_{f}\geq f(x_{1})-\inf_{x}f(x).

Let ff be L1L_{1}-smooth with L2L_{2}-Lipschitz continuous Hessian, ϵ>0\epsilon>0, δ(0,1)\delta\in(0,1), and α[0,L1]\alpha\in[0,L_{1}]. Then with probability at least 1δ1-\delta, the method Accelerated-non-convex-method(x1x_{1}, ff, ϵ\epsilon, L1L_{1}, L2L_{2}, α\alpha, Δf\Delta_{f}, δ\delta) terminates after tt iterations with f(x^t)ϵ\|\nabla f(\widehat{x}_{t})\|\leq\epsilon, where tt satisfies

Further, λmin(2f(x^k))2α\lambda_{\min}(\nabla^{2}f(\widehat{x}_{k}))\geq-2\alpha for all ktk\leq t.

Proof. Before beginning the proof proper, we provide a quick bound on the size of the difference between iterates x^k\widehat{x}_{k} and x^k1\widehat{x}_{k-1}, which will imply progress in function values across iterations of Alg. 1. In each iteration that the convergence criterion f(x^k)ϵ\left\|{\nabla f(\widehat{x}_{k})}\right\|\leq\epsilon is not met—that is, whenever f(x^k)>ϵ\left\|{\nabla f(\widehat{x}_{k})}\right\|>\epsilon—we have that

In inequality (i)(i) we used the triangle inequality and definition of fk1=f+ρ(xk1)f_{k-1}=f+\rho(\cdot-x_{k-1}) and inequality (ii)(ii) used that the call to Almost-convex-AGD returns x^k\widehat{x}_{k} with fk1(xk)ϵ/2\left\|{\nabla f_{k-1}(x_{k})}\right\|\leq\epsilon/2. Rearranging yields

where the equality is implied because ϵ>0\epsilon>0.

Now we consider two cases, the first the simpler case that α=L1\alpha=L_{1} is large enough that we never search for negative curvature, and the second that α<L1\alpha<L_{1} so that we find directions of negative curvature in the method.

In this case, we have that α=L1\alpha=L_{1}, so that xk=x^kx_{k}=\widehat{x}_{k} for all iterations kk (Line 7 of the algorithm). Assume that at iteration kk that the algorithm has not terminated, so f(x^k)ϵ\|\nabla f(\widehat{x}_{k})\|\geq\epsilon. Then inequality (14) gives ϵ4L1<x^kx^k1\frac{\epsilon}{4L_{1}}<\|\widehat{x}_{k}-\widehat{x}_{k-1}\|. By Lemma 4.1 we know that fkf_{k} is 3L13L_{1}-almost convex (Def. 4) and 5L15L_{1}-smooth; therefore we may apply Lemma 3.1 with γ=3α=3L1\gamma=3\alpha=3L_{1} to lower bound the progress of the call to Almost-convex-AGD in Line 13 of Alg. 1 to obtain

Telescoping this display, we have for any iteration ss at which the algorithm has not terminated that

which yields the second case of the bound (13). The inequality 2f(x^j)2αI\nabla^{2}f(\widehat{x}_{j})\succeq-2\alpha I holds trivially because ff is L1L_{1}-smooth.

Case 2: small α𝛼\alpha

In this case, we assume that α<L1\alpha<L_{1}. Let K=1+Δf(12L22α3+10L2αϵ)K=\lceil 1+\Delta_{f}(\frac{12L_{2}^{2}}{\alpha^{3}}+\frac{\sqrt{10}L_{2}}{\alpha\epsilon})\rceil and δ=δK\delta^{\prime\prime}=\frac{\delta}{K} as in line 2 of Alg. 1. By Lemma 3.2 and a union bound, with probability at least 1δ1-\delta, for all kKk\leq K the matrix inequality 2f(x^k)2αI\nabla^{2}f(\widehat{x}_{k})\succeq-2\alpha I holds, so that we perform our subsequent analysis (for kKk\leq K) conditional on this event without appealing to any randomness.

Equation (14) implies that at iteration 1<kK1<k\leq K exactly one of following three cases is true:

The termination criterion f(x^k)ϵ\|\nabla f(\widehat{x}_{k})\|\leq\epsilon holds and Alg. 1 terminates.

Negative-curvature-descent (Line 5) constructs x^kxk\widehat{x}_{k}\neq x_{k}, and (i) fails.

Neither (i) nor (ii) holds, and x^kx^k1α/L2\|\widehat{x}_{k}-\widehat{x}_{k-1}\|\geq\alpha/L_{2}.

We claim that in case (ii) or (iii), we have

Deferring the proof of claim (16), we note that it immediately gives a quick proof of the result. Assume, in order to obtain a contradiction that after KK iterations the algorithm has not terminated it follows that:

Substituting for K=1+Δf(12L22α3+10L2αϵ)K=\lceil 1+\Delta_{f}(\frac{12L_{2}^{2}}{\alpha^{3}}+\frac{\sqrt{10}L_{2}}{\alpha\epsilon})\rceil as in line 2 yields a contradiction and therefore the algorithm terminates after at most KK iterations which is the first case of the bound (13).

Let us now prove the claim (16). First, assume case (ii). Then Negative-curvature-descent requires at least two iterations, so Lemma 3.2 implies

Combining this with the fact that f(xk)f(x^k1)f(x_{k})\leq f(\widehat{x}_{k-1}) by the progress bound (4) in Lemma 3.1 (Almost-convex-AGD decreases function values), we have

Let us now consider the case that (iii) holds. By Lemma 4.1 we know that the constructed function fkf_{k} is 3α3\alpha-almost convex (Def. 4) and 5L15L_{1}-smooth, therefore we may apply Lemma 3.1 with γ=3α\gamma=3\alpha to lower bound the progress of the entire inner loop of Alg. 1 by

2 Main result

With Lemmas 4.1 and 4.2 in hand, we may finally present our main result.

and δ(0,1)\delta\in(0,1). Then with probability at least 1δ1-\delta, Accelerated-non-convex-method(x1x_{1}, ff, ϵ\epsilon, L1L_{1}, L2L_{2}, α\alpha, Δf\Delta_{f}, δ\delta) returns a point xx that satisfies

where τ=1+1/ϵ+1/δ+d+L1+L2+Δf\tau=1+1/\epsilon+1/\delta+d+L_{1}+L_{2}+\Delta_{f}.

Proof. We split this proof into two cases: (I) small alpha when α<L1\alpha<L_{1} and hence α=ϵ2Δf\alpha=\frac{\epsilon^{2}}{\Delta_{f}} or L2ϵ\sqrt{L_{2}\epsilon}, this is the non-trivial case requiring solution to a reasonably small accuracy; and (II) when α=L1\alpha=L_{1}, when the algorithm is roughly equivalent to gradient descent (and ϵ\epsilon is large enough that we do not require substantial accuracy).

We proceed in two steps. First, we bound the number of eigenvector calculations that Negative-curvature-descent performs by providing a progress guarantee for each of them using Lemma 3.2 and arguing that making too much progress is impossible. After this, we perform a similar calculation for the total number of gradient calculations throughout calls to Almost-convex-AGD, this time applying Lemma 3.1.

We begin by bounding the number of eigenvector calculations. When α<L1\alpha<L_{1}, its definition implies ϵmin{L12/L2,Δf1/2L11/2}\epsilon\leq\min\{L_{1}^{2}/L_{2},\Delta_{f}^{1/2}L_{1}^{1/2}\}. Let jkj_{k}^{*} be the total number of times the method Negative-curvature-descent invokes the eigenvector computation subroutine (Line 4 of Negative-curvature-descent) during iteration kk of the method Accelerated-non-convex-method, let kk^{*} denote the total number of iterations of Accelerated-non-convex-method, and define q:=k=1kjkq:=\sum_{k=1}^{k^{*}}j_{k}^{*} as the total number of eigenvector computations. Then by telescoping the bound (8) in Lemma 3.2 and using that f(xk)f(x^k1)f(x_{k})\leq f(\widehat{x}_{k-1}) by the progress bound (4) in Lemma 3.1 (Almost-convex-AGD decreases function values), we have

Substituting the bound on kk^{*} that Lemma 4.2 supplies, we see that with probability at least 1δ1-\delta,

where inequality (i)(i) follows by our construction that αϵ1/2L21/2\alpha\geq\epsilon^{1/2}L_{2}^{1/2}. Inequality (17) thus provides a bound on the total number of fast eigenvector calculations we require.

We use the bound (17) to bound the total cost of calls to Negative-curvature-descent. The tolerated failure probability δ\delta^{\prime\prime} defined in line 2 satisfies

so that log1δ=O(logτ)\log\frac{1}{\delta^{\prime\prime}}=O(\log\tau). By Lemma 3.2, Eq. (10), the cost of each iteration during Negative-curvature-descent is, using max{ϵ2Δf1,ϵL2}=α<L1\max\{\epsilon^{2}\Delta_{f}^{-1},\sqrt{\epsilon L_{2}}\}=\alpha<L_{1}, at most

Multiplying this time complexity by qq as bounded in expression (17) gives that the total cost of the calls to Negative-curvature-descent is

We now compute the total cost of calling Almost-convex-AGD. Using the time bound (5) of Lemma 3.1, the cost of calling Almost-convex-AGD in iteration kk with almost convexity parameter γ=3α\gamma=3\alpha is bounded by the sum of

We separately bound the total computational cost of each of the terms (19).

Using the bound k1+Δf16L21/2ϵ3/2k^{*}\leq 1+\Delta_{f}\frac{16L_{2}^{1/2}}{\epsilon^{3/2}} as in expression (17) for the total number of iterations of Alg. 1, we see that the first of the time bounds (19) yields identical total cost to the eigenvector computations (18), because γ12=O(α12)=O(1/ϵL2)\gamma^{-\frac{1}{2}}=O(\alpha^{-\frac{1}{2}})=O(1/\sqrt{\epsilon L_{2}}). Thus we consider the second term in expression (19). Using the fact that [fk(xk)fk(xk+1)][f(xk)f(xk+1)][f_{k}(x_{k})-f_{k}(x_{k+1})]\leq[f(x_{k})-f(x_{k+1})] by definition of xk+1x_{k+1} and the method Almost-convex-AGD, we telescope to find

Noting that by assumption that the almost convexity parameter γ=3α\gamma=3\alpha, we have γ/3=αL21/4ϵ1/4+ϵΔf1/2\sqrt{\gamma/3}=\sqrt{\alpha}\leq L_{2}^{1/4}\epsilon^{1/4}+\epsilon\Delta_{f}^{-1/2}, telescoping the second term of the bound (19) on the cost of Almost-convex-AGD immediately gives the total computational cost bound

over all calls of Almost-convex-AGD. This is evidently our desired result that the total computational cost when α<L1\alpha<L_{1} is (18).

Case II: Large α𝛼\alpha

When α=L1\alpha=L_{1}, the algorithm becomes roughly equivalent to gradient descent, because Negative-curvature-descent is not required, so that we need only bound the total computational cost of calls to Almost-convex-AGD. The bound (19) on the computational effort of each such call again applies, and noting that L1=α=3γL_{1}=\alpha=3\gamma in this case, we replace the bounds (19) with the two terms

As in Case I, we may telescope the second time bound to obtain total computational effort O(Tgrad(1+ΔfL1ϵ2)logτ)O(\mathsf{T}_{\rm grad}(1+\Delta_{f}\frac{L_{1}}{\epsilon^{2}})\log\tau), while applying the iteration bound (13) of Lemma 4.2 to the first term similarly yields the bound O(Tgrad(1+ΔfL1ϵ2)logτ)O(\mathsf{T}_{\rm grad}(1+\Delta_{f}\frac{L_{1}}{\epsilon^{2}})\log\tau) on the total computational cost. To conclude the proof we observe that α=L1\alpha=L_{1} implies L1max{(ϵL2)1/2,ϵ2/Δf}L_{1}\leq\max\{(\epsilon L_{2})^{1/2},\epsilon^{2}/\Delta_{f}\} and that L1max{(ϵL2)1/4L11/2,ϵ2/Δf}L_{1}\leq\max\{(\epsilon L_{2})^{1/4}L_{1}^{1/2},\epsilon^{2}/\Delta_{f}\}. Therefore,

We provide a bit of discussion to help explicate this result. Much of the complication in the statement of Theorem 4.3 is a consequence of our desire for generality in possible parameter values. In common settings in which points reasonably close to stationarity are desired—when the accuracy ϵ\epsilon is small enough—we may simplify the theorem substantially, as the following corollary demonstrates.

Let the conditions of Theorem 4.3 hold, and in addition assume that ϵΔf2/L2\epsilon\leq\sqrt{\Delta_{f}^{2}/L_{2}}. Then the total computational cost of Alg. 1 is at most

Conversely, it appears that accelerated gradient descent is more central to our approach. Indeed, the term involving γL1\sqrt{\gamma L_{1}} in the bound (13) is important, as it allows us to carefully trade “almost” convexity γ\gamma with accuracy ϵ\epsilon to achieve fast rates of convergence. Replacing the accelerated gradient descent method with gradient descent in Almost-convex-AGD eliminates the possibility for such optimal trading. Of course, our procedure would still produce output with the second order guarantee 2f(x)2ϵ1/2L21/2Id×d\nabla^{2}f(x)\succeq-2\epsilon^{1/2}L_{2}^{1/2}I_{d\times d}.

Accelerated (linear) convergence to local minimizers of strict-saddle functions

In this section, we show how to apply Accelerated-non-convex-method and Theorem 4.3 to find local minimizers for generic non-pathological smooth optimization problems with linear rates of convergence. Of course, it is in general NP-hard to even check if a point is a local minimizer of a smooth nonconvex optimization problem , so we require a few additional assumptions in this case. In general, second-order stationary points need not be local minima; consequently, we consider strict-saddle functions, which are functions such that all eigenvalues of the Hessian are non-zero at all critical points, so that second-order stationary points are indeed local minima. Such structural assumptions have been important in recent work on first-order methods for general smooth minimization , and in a sense “random” functions generally satisfy these conditions (cf. the discussion of Morse functions in the book ). To make our discussion formal, consider the following quantitative definition.

Some definitions of strict-saddle include a radius RR bounding the distance between any point xx satisfying f(x)ε\left\|{\nabla f(x)}\right\|\leq\varepsilon and λmin(2f(x))σ+\lambda_{\min}(\nabla^{2}f(x))\geq\sigma_{+} and a local minimizer x+x^{+} of ff, and they assume that ff is σ+\sigma_{+}-strongly convex in a ball of radius 2R2R around any local minimizer. Our assumption on the Lipschitz continuity of 2f\nabla^{2}f obviates the need for such conditions, allowing the following simplified definition.

With this definition in mind, we present Algorithm 2, which leverages Algorithm 1 to obtain linear convergence (in the desired accuracy ϵ\epsilon) to a local minimizer of strict-saddle functions. The algorithm proceeds in two phases, first finding a region of strong convexity, and in the second phase solving a regularized version of resulting locally convex problem in this region. That the first phase of Alg. 2 terminates in a neighborhood of a local optimum of ff, where ff is convex in this neighborhood, is an immediate consequence of the strict-saddle property coupled with the gradient and Hessian bounds of Theorem 4.3. We can then apply (accelerated) gradient descent to quickly find the local optimum, which we describe rigorously in the following theorem.

where τ=1+L1/σ1+1/δ+d+L2+Δf\tau^{\prime}=1+L_{1}/\sigma_{1}+1/\delta+d+L_{2}+\Delta_{f}. When ϵσ1216L2\epsilon\leq\frac{\sigma_{1}^{2}}{16L_{2}}, with the same probability there exists a local minimizer x+x^{\star}_{+} of ff such that

Proof. The result in the low accuracy regime in which ϵ>σ1216L2\epsilon>\frac{\sigma_{1}^{2}}{16L_{2}} is immediate by Theorem 4.3, and we therefore focus on the case that ϵσ1216L2\epsilon\leq\frac{\sigma_{1}^{2}}{16L_{2}}. We perform our analysis conditional on the event, which holds with probability at least 1δ1-\delta, that the guarantees of Theorem 4.3 hold. That is, that x+x_{+} generated in Line 5 satisfies

where τ\tau^{\prime} is as in the theorem statement. Equality (i)(i) is a consequence of the inequalities 1L1/σ11\leq\sqrt{L_{1}/\sigma_{1}} and 1+a+a2=O(1+a2)1+a+a^{2}=O(1+a^{2}) for a0a\geq 0.

In conjunction with Definition 7, the bounds (21) imply that 2f(x+)σ1I\nabla^{2}f(x_{+})\succeq\sigma_{1}I. Recalling Lemma 4.1, and the bound (12) from its proof, a trivial calculation involving the Lipschitz continuity of 2f\nabla^{2}f shows that f+(x)=f(x)+L1[xx+σ1/4L2]+2f_{+}(x)=f(x)+L_{1}\left[{\left\|{x-x_{+}}\right\|-\sigma_{1}/4L_{2}}\right]_{+}^{2} is σ1/2\sigma_{1}/2-strongly convex. Additionally, we have immediately that f+f_{+} is 5L15L_{1}-smooth.

Let x+x^{\star}_{+} be the unique global minimizer of f+f_{+}. By the strong convexity of f+f_{+}, we may bound the distance between x+x_{+} and x+x^{\star}_{+} (recall Lemma 2.3) by

where final inequality is immediate from the gradient bound (21). By construction, f+=ff_{+}=f on the ball {x:xx+σ1/4L2}\{x:\left\|{x-x_{+}}\right\|\leq\sigma_{1}/4L_{2}\}, and as x+x^{\star}_{+} belongs to the interior of this ball, it is a local minimizer of ff.

Let xx be the point produced by the call to Accelerated-gradient-descent. By Lemma 2.5, xx satisfies f+(x)ϵ\left\|{\nabla f_{+}(x)}\right\|\leq\epsilon and is computed in time

The strong convexity of f+f_{+} once more (Lemma 2.3) implies

which gives the distance bound in expression (20). Combining xx+σ18L2\left\|{x-x^{\star}_{+}}\right\|\leq\frac{\sigma_{1}}{8L_{2}} and x+x+σ18L2\left\|{x_{+}-x^{\star}_{+}}\right\|\leq\frac{\sigma_{1}}{8L_{2}}, we have that xx+σ14L2\left\|{x-x_{+}}\right\|\leq\frac{\sigma_{1}}{4L_{2}}, and therefore f(x)=f+(x)f(x)=f_{+}(x) and f(x)=f+(x)ϵ\left\|{\nabla f(x)}\right\|=\left\|{\nabla f_{+}(x)}\right\|\leq\epsilon. The functional bound (20) then follows from the L1L_{1}-smoothness of ff and that f(x+)=0\nabla f(x^{\star}_{+})=0, as

The running time guarantee follows by summing T1T_{1} and T2T_{2} above. ∎

Acknowledgment

OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship. JCD was partially supported by the National Science Foundation award NSF-CAREER-1553086

References