"Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions

Yair Carmon, Oliver Hinder, John C. Duchi, Aaron Sidford

Introduction

Nesterov’s seminal 1983 accelerated gradient method has inspired substantial development of first-order methods for large-scale convex optimization. In recent years, machine learning and statistics have seen a shift toward large scale non-convex problems, including methods for matrix completion Koren et al. (2009), phase retrieval Candès et al. (2015); Wang et al. (2016), dictionary learning Mairal et al. (2008), and neural network training LeCun et al. (2015). In practice, techniques from accelerated gradient methods—namely, momentum—can have substantial benefits for stochastic gradient methods, for example, in training neural networks Rumelhart et al. (1986); Kingma and Ba (2015). Yet little of the rich theory of acceleration for convex optimization is known to transfer into non-convex optimization.

Optimization becomes more difficult without convexity, as gradients no longer provide global information about the function. Even determining if a stationary point is a local minimum is (generally) NP-hard Murty and Kabadi (1987); Nesterov (2000). It is, however, possible to leverage non-convexity to improve objectives in smooth optimization: moving in directions of negative curvature can guarantee function value reduction. We explore the interplay between negative curvature, smoothness, and acceleration techniques, showing how an understanding of the three simultaneously yields a method that provably accelerates convergence of gradient descent for a broad class of non-convex functions.

We consider the unconstrained minimization problem

Approximating the global minimum of ff to ϵ\epsilon-accuracy is generally intractable, requiring time exponential in dlog1ϵd\log\frac{1}{\epsilon} (Nemirovski and Yudin, 1983, §1.6). Instead, we seek a point xx that is ϵ\epsilon-approximately stationary, that is,

Finding stationary points is a canonical problem in nonlinear optimization Nocedal and Wright (2006), and while saddle points and local maxima are stationary, excepting pathological cases, descent methods that converge to a stationary point converge to a local minimum (Lee et al., 2016; Nemirovski, 1999, §3.2.2).

If we assume ff is convex, gradient descent satisfies the bound (2) after O(ϵ1)O(\epsilon^{-1}) gradient evaluations, and AGD improves this rate to O(ϵ1/2log1ϵ)O(\epsilon^{-1/2}\log\frac{1}{\epsilon}) Nesterov (2012). Without convexity, gradient descent is significantly worse, having worst-case complexity Θ(ϵ2)\Theta(\epsilon^{-2}) Cartis et al. (2010). More sophisticated gradient-based methods, including nonlinear conjugate gradient Hager and Zhang (2006) and L-BFGS Liu and Nocedal (1989) provide excellent practical performance, but their global convergence guarantees are no better than O(ϵ2)O(\epsilon^{-2}). Our work Carmon et al. (2016) and, independently, Agarwal et al. (2016), break this O(ϵ2)O(\epsilon^{-2}) barrier, obtaining the rate O(ϵ7/4logdϵ)O(\epsilon^{-7/4}\log\frac{d}{\epsilon}). Before we discuss this line of work in Section 1.3, we overview our contributions.

2 Our contributions

Underpinning our results is the observation that when we run Nesterov’s accelerated gradient descent (AGD) on any smooth function ff, one of two outcomes must follow:

AGD behaves as though ff was σ\sigma-strongly convex, satisfying inequality (2) in O(σ1/2log1ϵ)O(\sigma^{-1/2}\log\frac{1}{\epsilon}) iterations.

There exist points u,vu,v in the AGD trajectory that prove ff is “guilty” of not being σ\sigma-strongly convex,

The intuition behind these observations is that if inequality (3) never holds during the iterations of AGD, then ff “looks” strongly convex, and the convergence (a) follows. In Section 2 we make this observation precise, presenting an algorithm to monitor AGD and quickly find the witness pair u,vu,v satisfying (3) whenever AGD progresses more slowly than it does on strongly convex functions. We believe there is potential to apply this strategy beyond AGD, extending additional convex gradient methods to non-convex settings.

In Section 3 we propose a method that iteratively applies our monitored AGD algorithm to ff augmented by a proximal regularizer. We show that both outcomes (a) and (b) above imply progress minimizing ff, where in case (b) we make explicit use of the negative curvature that AGD exposes. These progress guarantees translate to an overall first-order oracle complexity of O(ϵ7/4log1ϵ)O(\epsilon^{-7/4}\log\frac{1}{\epsilon}), a strict improvement over the O(ϵ2)O(\epsilon^{-2}) rate of gradient descent. In Section 5 we report preliminary experimental results, showing a basic implementation of our method outperforms gradient descent but not nonlinear conjugate gradient.

As we show in Section 4, assuming Lipschitz continuous third derivatives instead of Lipschitz continuous Hessian allows us to increase the step size we take when exploiting negative curvature, making more function progress. Consequently, the complexity of our method improves to O(ϵ5/3log1ϵ)O(\epsilon^{-5/3}\log\frac{1}{\epsilon}). While the analysis of the third-order setting is more complex, the method remains essentially unchanged. In particular, we still use only first-order information, never computing higher-order derivatives.

3 Related work

Nesterov and Polyak (2006) show that cubic regularization of Newton’s method finds a point that satisfies the stationarity condition (2) in O(ϵ3/2)O(\epsilon^{-3/2}) evaluations of the Hessian. Given sufficiently accurate arithmetic operations, a Lipschitz continuous Hessian is approximable to arbitrary precision using finite gradient differences, and obtaining a full Hessian requires O(d)O(d) gradient evaluations. A direct implementation of the Nesterov-Polyak method with a first-order oracle therefore has gradient evaluation complexity O(ϵ3/2d)O(\epsilon^{-3/2}d), improving on gradient descent only if dϵ1/2d\ll\epsilon^{-1/2}, which may fail in high-dimensions.

In two recent papers, we (Carmon et al., 2016) and (independently) Agarwal et al. obtain better rates for first-order methods. Agarwal et al. (2016) propose a careful implementation of the Nesterov-Polyak method, using accelerated methods for fast approximate matrix inversion. In our earlier work, we employ a combination of (regularized) accelerated gradient descent and the Lanczos method. Both find a point that satisfies the bound (2) with probability at least 1δ1-\delta using O(ϵ7/4logdδϵ)O\left(\epsilon^{-7/4}\log\frac{d}{\delta\epsilon}\right) gradient and Hessian-vector product evaluations.

The primary conceptual difference between our approach and those of Carmon et al. and Agarwal et al. is that we perform no eigenvector search: we automatically find directions of negative curvature whenever AGD proves ff “guilty” of non-convexity. Qualitatively, this shows that explicit second orders information is unnecessary to improve upon gradient descent for stationary point computation. Quantitatively, this leads to the following improvements:

Our result is dimension-free and deterministic, with complexity independent of the ratio d/δd/\delta, compared to the logdδ\log\frac{d}{\delta} dependence of previous works. This is significant, as logdδ\log\frac{d}{\delta} may be comparable to ϵ1/4/log1ϵ\epsilon^{-1/4}/\log\frac{1}{\epsilon}, making it unclear whether the previous guarantees are actually better than those of gradient descent.

Our method uses only gradient evaluations, and does not require Hessian-vector products. In practice, Hessian-vector products may be difficult to implement and more expensive to compute than gradients.

Under third-order smoothness assumptions we improve our method to achieve O(ϵ5/3log1ϵ)O(\epsilon^{-5/3}\log\frac{1}{\epsilon}) rate. It is unclear how to extend previous approaches to obtain similar guarantees.

In distinction from the methods of Carmon et al. (2016) and Agarwal et al. (2016), our method provides no guarantees on positive definiteness of 2f(x)\nabla^{2}f(x); if initialized at a saddle point it will terminate immediately. However, as we further explain in Section C, we may combine our method with a fast eigenvector search to recover the approximate positive definiteness guarantee 2f(x)ϵI\nabla^{2}f(x)\succeq-\sqrt{\epsilon}I, even improving it to 2f(x)ϵ2/3I\nabla^{2}f(x)\succeq-{\epsilon}^{2/3}I using third-order smoothness, but at the cost of reintroducing randomization, Hessian-vector products and a logdδ\log\frac{d}{\delta} complexity term.

4 Preliminaries and notation

Algorithm components

We begin our development by presenting the two building blocks of our result: a monitored variation of AGD (Section 2.1) and a negative curvature descent step (Section 2.2) that we use when the monitored version of AGD certifies non-convexity. In Section 3, we combine these components to obtain an accelerated method for non-convex functions.

The main component in our approach is Alg. 1, AGD-until-proven-guilty. We take as input an LL-smooth function ff, conjectured to be σ\sigma-strongly convex, and optimize it with Nesterov’s accelerated gradient descent method for strongly convex functions (lines 3 and 4). At every iteration, the method invokes Certify-progress to test whether the optimization is progressing as it should for strongly convex functions, and in particular that the gradient norm is decreasing exponentially quickly (line 5). If the test fails, Find-witness-pair produces points u,vu,v proving that ff violates σ\sigma-strong convexity. Otherwise, we proceed until we find a point yy such that f(y)ε\left\|{\nabla f(y)}\right\|\leq\varepsilon.

The efficacy of our method is based on the following guarantee on the performance of AGD.

where κ=Lσ\kappa=\frac{L}{\sigma} and ψ(w)=f(y0)f(w)+σ2wy02\psi(w)=f(y_{0})-f(w)+\frac{\sigma}{2}\left\|{w-y_{0}}\right\|^{2}.

Proposition 1 is essentially a restatement of established results (Nesterov, 2004; Bubeck, 2014), where we take care to phrase the requirements on ff in terms of local inequalities, rather than a global strong convexity assumption. For completeness, we provide a proof of Proposition 1 in Section A.1.

With Proposition 1 in hand, we summarize the guarantees of Alg. 1 as follows.

where v=xjv=x_{j} for some 0j<t0\leq j<t and u=yju=y_{j} or u=wtu=w_{t} (defined on line 5 of AGD-until-proven-guilty). Moreover,

The bound (7) is clear for t=1t=1. For t>1t>1, the algorithm has not terminated at iteration t1t-1, and so we know that neither the condition in line 9 of AGD-until-proven-guilty nor the condition in line 5 of Certify-progress held at iteration t1t-1. Thus

which gives the bound (7) when rearranged.

since ψ(wt)=ψ(y0)=0\psi(w_{t})=\psi(y_{0})=0. Since this contradicts the progress bound (6), we obtain the certificate (8) by the contrapositive of Proposition 1: condition (5) must not hold for some 0s<t0\leq s<t, implying Find-witness-pair will return for some jsj\leq s.

Similarly, if wt=zt=yt1Lf(yt)w_{t}=z_{t}=y_{t}-\frac{1}{L}\nabla f(y_{t}) then by line 5 of Certify-progress we must have

Since ff is LL-smooth we have the standard progress guarantee (c.f. Nesterov (2004) §1.2.3) f(yt)f(zt)12Lf(yt)2f(y_{t})-f(z_{t})\geq\frac{1}{2L}\left\|{\nabla f(y_{t})}\right\|^{2}, again contradicting inequality (6).

To see that the bound (9) holds, note that f(ys)f(y0)f(y_{s})\leq f(y_{0}) for s=0,,t1s=0,\ldots,t-1 since condition 2 of Certify-progress did not hold. If u=yju=y_{j} for some 0j<t0\leq j<t then f(u)f(y0)f(u)\leq f(y_{0}) holds trivially. Alternatively, if ut=wt=ztu_{t}=w_{t}=z_{t} then condition 2 did not hold at time tt as well, so we have f(yt)f(y0)f(y_{t})\leq f(y_{0}) and also f(u)=f(zt)f(yt)12Lf(yt)2f(u)=f(z_{t})\leq f(y_{t})-\frac{1}{2L}\left\|{\nabla f(y_{t})}\right\|^{2} as noted above; therefore f(zt)f(y0)f(z_{t})\leq f(y_{0}). ∎

Before continuing, we make two remarks about implementation of Alg. 1.

As stated, the algorithm requires evaluation of two function gradients per iteration (at xtx_{t} and yty_{t}). Corollary 1 holds essentially unchanged if we execute line 9 of AGD-until-proven-guilty and lines 3-5 of Certify-progress only once every τ\tau iterations, where τ\tau is some fixed number (say 10). This reduces the number of gradient evaluations to 1+1τ1+\frac{1}{\tau} per iteration.

Direct implementation would require O(dt)O(d\cdot t) memory to store the sequences y0ty_{0}^{t}, x0tx_{0}^{t} and f(x0t)\nabla f(x_{0}^{t}) for later use by Find-witness-pair. Alternatively, Find-witness-pair can regenerate these sequences from their recursive definition while iterating over jj, reducing the memory requirement to O(d)O(d) and increasing the number of gradient and function evaluations by at most a factor of 2.

In addition, while our emphasis is on applying AGD-until-proven-guilty to non-convex problems, the algorithm has implications for convex optimization. For example, we rarely know the strong convexity parameter σ\sigma of a given function ff; to remedy this, O’Donoghue and Candès (2015) propose adaptive restart schemes. Instead, one may repeatedly apply AGD-until-proven-guilty and use the witnesses to update σ\sigma.

2 Using negative curvature

The second component of our approach is exploitation of negative curvature to decrease function values; in Section 3 we use AGD-until-proven-guilty to generate u,vu,v such that

a nontrivial violation of convexity (where α>0\alpha>0 is a parameter we control using a proximal term). By taking an appropriately sized step from uu in the direction ±(uv)\pm(u-v), Alg. 2 can substantially lower the function value near uu whenever the convexity violation (10) holds. The following basic lemma shows this essential progress guarantee.

We proceed in two parts; in the first part, we show that ff has negative curvature of at least α/2\alpha/2 in the direction δ=(uv)/uv\delta=(u-v)/\left\|{u-v}\right\| at the point uu. We then show that this negative curvature guarantees a gradient step with magnitude η\eta produces the required progress in function value.

Defining δ=(uv)/uv\delta=(u-v)/\left\|{u-v}\right\|, we obtain

where inequality (i)(i) follows by assumption (10). Let τ=infθ[0,uv]δT2f(v+θδ)δ\tau^{*}=\inf_{\theta\in[0,\left\|{u-v}\right\|]}\delta^{T}\nabla^{2}f(v+\theta\delta)\delta. Using that 0tτdτ=t2/2\int_{0}^{t}\tau d\tau=t^{2}/2, we substitute for the integrand to find that τ<α\tau^{*}<-\alpha, and using the Lipschitz continuity of 2f\nabla^{2}f yields

where we have used that uvα2L2\left\|{u-v}\right\|\leq\frac{\alpha}{2L_{2}} by assumption.

Using again the Lipschitz continuity of 2f\nabla^{2}f, we apply (4) with x0=u,θ0=0x_{0}=u,\theta_{0}=0 and θ=±η\theta=\pm\eta to obtain

The first order term must be negative for one of u+u_{+} and uu_{-}. Therefore, using inequality (12) and ηα/L2\eta\leq\alpha/L_{2}, we have f(z)=min{f(u+),f(u)}f(u)αη212f(z)=\min\{f(u_{+}),f(u_{-})\}\leq f(u)-\frac{\alpha\eta^{2}}{12} as desired. ∎

Accelerating non-convex optimization

We now combine the accelerated convergence guarantee of Corollary 1 and the non-convex progress guarantee of Lemma 1 to form Guarded-non-convex-AGD. The idea for the algorithm is as follows. Consider iterate k1k-1, denoted pk1p_{k-1}. We create a proximal function f^\hat{f} by adding the proximal term αxpk12\alpha\left\|{x-p_{k-1}}\right\|^{2} to ff. Applying AGD-until-proven-guilty to f^\hat{f} yields the sequences x0,,xtx_{0},\ldots,x_{t}, y0,,yty_{0},\ldots,y_{t} and possibly a non-convexity witnessing pair u,vu,v (line 3). If u,vu,v are not available, we set pk=ytp_{k}=y_{t} and continue to the next iteration. Otherwise, by Corollary 1, uu and vv certify that f^\hat{f} is not α\alpha strongly convex, and therefore that ff has negative curvature. Exploit-NC-pair then leverages this negative curvature, obtaining a point b(2)b^{(2)}. The next iterate pkp_{k} is the best out of y0,,yt,uy_{0},\ldots,y_{t},u and b(2)b^{(2)} in terms of function value.

The following central lemma provides a progress guarantee for each of the iterations of Alg. 3.

where the inequality is Eq. (14). Therefore, we conclude that inequality (10) must hold. To apply Lemma 1, we must control the distance between uu and vv. We present the following lemma, which shows how b(1)b^{(1)} acts as an insurance policy against uv\left\|{u-v}\right\| growing too large.

Deferring the proof of Lemma 3 briefly, we show how it yields Lemma 2. We set τ:=α8L2\tau:=\frac{\alpha}{8L_{2}} and consider two cases. First, if

then we are done, since f(pk)f(b(1))f(p_{k})\leq f(b^{(1)}). In the converse case, if f(b(1))f(y0)ατ2f(b^{(1)})\geq f(y_{0})-\alpha\tau^{2}, Lemma 3 implies uv4τα2L2\left\|{u-v}\right\|\leq 4\tau\leq\frac{\alpha}{2L_{2}}. Therefore, we can exploit the negative curvature in ff, as Lemma 1 guarantees (with η=αL2\eta=\frac{\alpha}{L_{2}}),

Proof of Lemma 3. We begin by noting that f^(yi)f^(y0)=f(y0)\hat{f}(y_{i})\leq\hat{f}(y_{0})=f(y_{0}) for i=1,,t1i=1,\ldots,t-1 by Corollary 1. Using f(yi)f(b(1))f(y0)ατ2f(y_{i})\geq f(b^{(1)})\geq f(y_{0})-\alpha\tau^{2} we therefore have

which implies yiy0τ\left\|{y_{i}-y_{0}}\right\|\leq\tau. By Corollary 1 we also have f^(u)f^(y0)\hat{f}(u)\leq\hat{f}(y_{0}), so we similarly obtain uy0τ\left\|{u-y_{0}}\right\|\leq\tau. Using xi=(1+ω)yiωyi1x_{i}=(1+\omega)y_{i}-\omega y_{i-1}, where ω=κ1κ+1(0,1)\omega=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\in(0,1), we have by the triangle inequality that

for every i=1,,t1i=1,\ldots,t-1. Finally, since v=xjv=x_{j} for some 0jt10\leq j\leq t-1, this gives the last inequality of the lemma, as uvuy0+xjy04τ\left\|{u-v}\right\|\leq\left\|{u-y_{0}}\right\|+\left\|{x_{j}-y_{0}}\right\|\leq 4\tau. ∎

Lemma 2 shows we can accelerate gradient descent in a non-convex setting. Indeed, ignoring all problem-dependent constants, setting α=ϵ\alpha=\sqrt{\epsilon} in the bound (13) shows that we make Ω(ϵ3/2)\Omega(\epsilon^{3/2}) progress at every iteration of Guarded-non-convex-AGD, and consequently the number of iterations is bounded by O(ϵ3/2)O(\epsilon^{-3/2}). Arguing that calls to AGD-until-proven-guilty each require O(ϵ1/4log1ϵ)O(\epsilon^{-1/4}\log\frac{1}{\epsilon}) gradient computations yields the following complexity guarantee.

then Guarded-non-convex-AGD(ff, p0p_{0}, L1L_{1}, ϵ\epsilon, α\alpha, αL2\frac{\alpha}{L_{2}}) finds a point pKp_{K} such that f(pK)ϵ\left\|{\nabla{f(p_{K}})}\right\|\leq\epsilon with at most

We bound two quantities: the number of calls to AGD-until-proven-guilty, which we denote by KK, and the maximum number of steps AGD-until-proven-guilty performs when it is called, which we denote by TT. The overall number gradient evaluations is 2KT2KT, as we compute at most 2T2T gradients per iterations (at the points x0,,xt1x_{0},\ldots,x_{t-1} and y1,,yty_{1},\ldots,y_{t}).

The upper bound on KK is immediate from Lemma 2, as by telescoping the progress guarantee (13) we obtain

where the final inequality follows by substituting our choice (15), of α\alpha. We conclude that

Therefore, substituting ε=ϵ/10\varepsilon=\epsilon/10, L=L1+2αL=L_{1}+2\alpha and L=α=2L2ϵL=\alpha=2\sqrt{L_{2}\epsilon} into the guarantee (7) of Corollary 1,

We use ϵmin{Δf2/3L21/3,L12/(12L2)}\epsilon\leq\min\{\Delta_{f}^{2/3}L_{2}^{1/3},L_{1}^{2}/(12L_{2})\} to simplify the bounds on KK and TT. Using 1ΔfL21/2ϵ3/21\leq\Delta_{f}L_{2}^{1/2}\epsilon^{-3/2} simplifies the bound (17) to

Applying 1L1/(8L2ϵ)1\leq L_{1}/(8\sqrt{L_{2}\epsilon}) in the bound (18) gives

We conclude this section with two brief remarks.

The conditions on ϵ\epsilon guarantee that the bound (16) is non-trivial. If ϵL12/L2\epsilon\geq L_{1}^{2}/L_{2}, then gradient descent achieves better guarantees. Indeed, with step-size 1/L11/L_{1}, gradient descent satisfies f(x)ϵ\left\|{\nabla f(x)}\right\|\leq\epsilon within at most 2L1Δfϵ2\frac{2L_{1}\Delta_{f}}{\epsilon^{2}} iterations (cf. Nesterov, 2004, Eq. 1.2.13). Substituting ϵL12/L2\epsilon\geq L_{1}^{2}/L_{2} therefore yields

Alternatively, if ϵ>102/3Δf2/3L21/3\epsilon>10^{2/3}\Delta_{f}^{2/3}L_{2}^{1/3} then we have by inequality (17) that K<2K<2, so Alg. 3 halts after at most a single iteration. Nevertheless, the bounds (17) and (18) hold for any ϵ0\epsilon\geq 0.

While we state Theorem 1 in terms of gradient evaluation count, a similar bound holds for function evaluations as well. Indeed, inspection of our method reveals that each iteration of Alg. 3 evaluates the function and not the gradient at at most the three points u,u+u,u_{+} and uu_{-}; both complexity measures are therefore of the same order.

Incorporating third-order smoothness

In this section, we show that when third-order derivatives are Lipschitz continuous, we can improve the convergence rate of Alg. 3 by modifying two of its subroutines. In Section 4.1 we introduce a modified version of Exploit-NC-pair that can decrease function values further using third-order smoothness. In Section 4.2 we change Find-best-iterate to provide a guarantee that f(v)f(v) is never too large. We combine these two results in Section 4.3 and present our improved complexity bounds.

Our first observation is that third-order smoothness allows us to take larger steps and make greater progress when exploiting negative curvature, as the next lemma formalizes.

the last inequality using h(0)=f(u)h(0)=f(u), h(0)=α2h^{\prime\prime}(0)=-\frac{\alpha}{2} and η23αL3\eta^{2}\leq\frac{3\alpha}{L_{3}}. That f(u+ηˉδ)=h(ηˉ)f(u+\bar{\eta}\delta)=h(\bar{\eta}) gives the result. ∎

Comparing Lemma 4 to the second part of the proof of Lemma 1, we see that second-order smoothness with optimal η\eta guarantees α3/(12L22)\alpha^{3}/(12L_{2}^{2}) function decrease, while third-order smoothness guarantees a 3α2/(8L3)3\alpha^{2}/(8L_{3}) decrease. Recalling Theorem 1, where α\alpha scales as a power of ϵ\epsilon, this is evidently a significant improvement. Additionally, this benefit is essentially free: there is no increase in computational cost and no access to higher order derivatives. Examining the proof, we see that the result is rooted in the anti-symmetry of the odd-order terms in the Taylor expansion. This rules out extending this idea to higher orders of smoothness, as they contain symmetric fourth order terms.

Extending this insight to the setting of Lemma 1 is complicated by the fact that, at relevant scales of uv\left\|{u-v}\right\|, it is no longer possible to guarantee that there is negative curvature at either uu or vv. Nevertheless, we are able to show that a small modification of Exploit-NC-pair achieves the required progress.

We prove Lemma 5 in Section B.1 in the supplementary material; it is essentially a more technical version of the proof of Lemma 4, where we address the asymmetry of condition (10) by taking steps of different sizes from uu and vv.

2 Bounding the function values of the iterates using cubic interpolation

An important difference between Lemmas 1 and 5 is that the former guarantees lower objective value than f(u)f(u), while the latter only improves max{f(v),f(u)}\max\{f(v),f(u)\}. We invoke these lemmas for v=xjv=x_{j} for some xjx_{j} produced by AGD-until-proven-guilty, but Corollary 1 only bounds the function value at yjy_{j} and ww; f(xj)f(x_{j}) might be much larger than f(y0)f(y_{0}), rendering the progress guaranteed by Lemma 5 useless. Fortunately, we are able show that whenever this happens, there must be a point on the line that connects xj,yjx_{j},y_{j} and yj1y_{j-1} for which the function value is much lower than f(y0)f(y_{0}). We take advantage of this fact in Alg. 5, where we modify Find-best-iterate to consider additional points, so that whenever the iterate it finds is not much better than y0y_{0}, then f(xj)f(x_{j}) is guaranteed to be close to f(y0)f(y_{0}). We formalize this claim in the following lemma, which we prove in Section B.2.

3 An improved rate of convergence

With our algorithmic and analytic upgrades established, we are ready to state the enhanced performance guarantees for Guarded-non-convex-AGD, where from here on we assume that Exploit-NC-pair3 and Find-best-iterate3 subsume Exploit-NC-pair and Find-best-iterate, respectively.

The proof of Lemma 7 is essentially identical to the proof of Lemma 2, where we replace Lemma 1 with Lemmas 5 and 6 and set τ=α/(32L3)\tau=\sqrt{\alpha/(32L_{3})}. For completeness, we give a full proof in Section B.3. The gradient evaluation complexity guarantee for third-order smoothness then follows precisely as in our proof of Theorem 1; see Sec. B.4 for a proof of the following

Guarded-non-convex-AGD(ff, p0p_{0}, L1L_{1}, ϵ\epsilon, α\alpha, 2αL3\sqrt{\frac{2\alpha}{L_{3}}}) finds a point pKp_{K} such that f(pK)ϵ\left\|{\nabla{f(p_{K}})}\right\|\leq\epsilon and requires at most

While achieving the guarantees that Theorem 2 provides requires some modification of our algorithms, these do not come at the expense of the convergence guarantees of Theorem 1 when we have only second order smoothness. That is, the results of Theorem 1 remain valid even with the algorithmic modifications of this section, and Alg. 3 transitions between smoothness regimes by varying the scaling of α\alpha and η\eta with ϵ\epsilon.

Preliminary experiments

The primary purpose of this paper is to demonstrate the feasibility of acceleration for non-convex problems using only first-order information. Given the long history of development of careful schemes for non-linear optimization, it is unrealistic to expect a simple implementation of the momentum-based Algorithm 3 to outperform state-of-the-art methods such as non-linear conjugate gradients and L-BFGS. It is important, however, to understand the degree of non-convexity in problems we encounter in practice, and to investigate the efficacy of the negative curvature detection-and-exploitation scheme we propose.

Toward this end, we present two experiments: (1) fitting a non-linear regression model and (2) training a small neural network. In these experiments we compare a basic implementation of Alg. 3 with a number baseline optimization methods: gradient descent (GD), non-linear conjugate gradients (NCG) Hager and Zhang (2006), Accelerated Gradient Descent (AGD) with adaptive restart O’Donoghue and Candès (2015) (RAGD), and a crippled version of Alg. 3 without negative curvature exploitation (C-Alg. 3). We compare the algorithms on the number of gradient steps, but note that the number of oracle queries per step varies between methods. We provide implementation details in Section D.1.

For our first experiment, we study robust linear regression with the smooth biweight loss Beaton and Tukey (1974),

The function ϕ\phi is a robust modification of the quadratic loss; it is approximately quadratic for small errors, but insensitive to larger errors. For 1,000 independent experiments, we randomly generate problem data to create a highly non-convex problem as follows. We set d=30d=30 and m=60m=60, and we draw aiiidN(0,Id)a_{i}\stackrel{{\scriptstyle\rm iid}}{{\sim}}\mathcal{N}(0,I_{d}). We generate bb as follows. We first draw a “ground truth” vector zN(0,4Id)z\sim\mathcal{N}(0,4I_{d}). We then set b=Az+3ν1+ν2b=Az+3\nu_{1}+\nu_{2}, where ν1N(0,Im)\nu_{1}\sim\mathcal{N}(0,I_{m}) and the elements of ν2\nu_{2} are i.i.d. Bernoulli(0.3)(0.3). These parameters make the problem substantially non-convex.

In Figure 1 we plot aggregate convergence time statistics, as well as gradient norm and function value trajectories for a single representative problem instance. The figure shows that gradient descent and C-Alg. 3 (which does not exploit curvature) converge more slowly than the other methods. When C-Alg. 3 stalls it is detecting negative curvature, which implies the stalling occurs around saddle points. When negative curvature exploitation is enabled, Alg. 3 is faster than RAGD, but slower than NCG. In this highly non-convex problem, different methods often converge to local minima with (sometimes significantly) different function values. However, each method found the “best” local minimum in a similar fraction of the generated instances, so there does not appear to be a significant difference in the ability of the methods to find “good” local minima in this problem ensemble.

For the second experiment we fit a neural network modelOur approach in its current form is inapplicable to training neural networks of modern scale, as it requires computation of exact gradients. comprising three fully-connected hidden layers containing 2020, 1010 and 55 units, respectively, on the MNIST handwritten digits dataset LeCun et al. (1998) (see Section D.2). Figure 2 shows a substantial performance gap between gradient descent and the other methods, including Alg. 3. However, this is not due to negative curvature exploitation; in fact, Alg. 3 never detects negative curvature in this problem, implying AGD never stalls. Moreover, RAGD never restarts. This suggests that the loss function ff is “effectively convex” in large portions of the training trajectory, consistent with the empirical observations of Goodfellow et al. (2015); this phenomenon may merit further investigation.

We conclude that our approach can augment AGD in the presence of negative curvature, but that more work is necessary to make it competitive with established methods such as non-linear conjugate gradients. For example, adaptive schemes for setting α,η\alpha,\eta and L1L_{1} must be developed. However, the success of our method may depend on whether AGD stalls at all in real applications of non-convex optimization.

Acknowledgment

OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research and NSF-CAREER award 1553086. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship.

References

Supplementary material

The proof is closely based on the proof of Theorem 3.18 of Bubeck (2014), which itself is based on the estimate sequence technique of Nesterov (2004). We modify the proof slightly to avoid arguments that depend on the global minimum of ff. This enables using inequalities (5) to prove the result, instead of σ\sigma-strong convexity of the function ff.

We define σ\sigma-strongly convex quadratic functions Φs\Phi_{s} by induction as

Using (5) with u=wu=w, straightforward induction shows that

We now prove (26) by induction. Note that it is true at s=0s=0 since x0=y0x_{0}=y_{0} is the global minimizer of Φ0\Phi_{0}. We have,

where inequality (a) follows from the definition ys+1=xs1Lf(xs)y_{s+1}=x_{s}-\frac{1}{L}\nabla f(x_{s}) and the LL-smoothness of ff, inequality (b) is the induction hypothesis and inequality (c) is assumption (5) with u=ysu=y_{s}.

Past this point the proof is identical to the proof of Theorem 3.18 of Bubeck (2014), but we continue for sake of completeness.

To complete the induction argument we now need to show that:

Note that 2Φs=σIn\nabla^{2}\Phi_{s}=\sigma I_{n} (immediate by induction) and therefore

Since by definition Φs+1(vs+1)=0\Phi_{s+1}(v_{s+1})=0, we have

Using (24), evaluating evaluating Φs+1\Phi_{s+1} at xsx_{s} gives,

Examining this equation, it is seen that vsxs=κ(xsys)v_{s}-x_{s}=\sqrt{\kappa}(x_{s}-y_{s}) implies (27) and therefore concludes the proof of Proposition 1. We establish the relation vsxs=κ(xsys)v_{s}-x_{s}=\sqrt{\kappa}(x_{s}-y_{s}) by induction,

where the first equality comes from (28), the second from the induction hypothesis, the third from the definition of ys+1y_{s+1} and the last one from the definition of xs+1x_{s+1}. ∎

Appendix B Proofs from Section 4

We begin by proving the following normalized version of Lemma 5.

for some A0A\geq 0. Then for any ρ4\rho\geq 4

where ρ=ρ(ρ+2)2\rho^{\prime}=\sqrt{\rho(\rho+2)}-2.

where the equality follows from the definition (2+ρ)2=ρ(2+ρ)(2+\rho^{\prime})^{2}=\rho(2+\rho). We lower bound ρ(2+ρ)\rho^{\prime}(2+\rho^{\prime}) as

The fact that either (35) or (36) must hold implies (31).

With the auxiliary Lemma 8, we prove Lemma 5.

Additionally, since ff has L3L_{3}-Lipschitz third order derivatives, hh^{\prime\prime\prime} is 116L3uv4:=L\frac{1}{16}L_{3}\left\|{u-v}\right\|^{4}:=L Lipschitz continuous, so we may apply Lemma 8 at ρ=2η/uv4\rho=2\eta/\left\|{u-v}\right\|\geq 4. Letting δ=(uv)/uv\delta=(u-v)/\left\|{u-v}\right\|, we note that h(1ρ)=f(vηδ)h(1-\rho)=f(v-\eta\delta). Similarly, for 2+ρ=ρ(2+ρ)2+\rho^{\prime}=\sqrt{\rho(2+\rho)} we have h(1+ρ)=f(u+ηδ)h(1+\rho^{\prime})=f(u+\eta^{\prime}\delta) with η\eta^{\prime} given in line 2 of Exploit-NC-pair3. The result is now immediate from (31), as

where in the last transition we have used η2αL3\eta\leq\sqrt{\frac{2\alpha}{L_{3}}}. ∎

B.2 Proof of Lemma 6

We first state and prove a normalized version of the central argument in the proof of Lemma 6

Let 0j<t0\leq j<t be such that v=xjv=x_{j} (such jj always exists by Corollary 1). If j=0j=0 then xj=y0x_{j}=y_{0} and the result is trivial, so we assume j1j\geq 1. Let

where 0<ω<10<\omega<1 is defined in line 1 of AGD-until-proven-guilty, and we have used the guarantee max{f(yj1),f(yj)}f(y0)\max\{f(y_{j-1}),f(y_{j})\}\leq f(y_{0}) from Corollary 1. Moreover, by the Lipschitz continuity of the third derivatives of ff, hh^{\prime\prime\prime} is L3yjyj14L_{3}\left\|{y_{j}-y_{j-1}}\right\|^{4}-Lipschitz continuous. Therefore, we can apply Lemma 9 with A=C=0A=C=0 and B=D=ατ2B=D=\alpha\tau^{2} at θ=ω\theta=\omega and obtain

To complete the proof, we note that Lemma 3 guarantees yjyj1yjy0+yj1y02τ\left\|{y_{j}-y_{j-1}}\right\|\leq\left\|{y_{j}-y_{0}}\right\|+\left\|{y_{j-1}-y_{0}}\right\|\leq 2\tau and therefore

where we have used τ2α/(16L3)\tau^{2}\leq\alpha/(16L_{3}). ∎

B.3 Proof of Lemma 7

Therefore, we can use Lemma 5 (with η\eta as defined above) to show that

By Corollary 1, f(u)f^(u)f^(y0)=f(pk1)f(u)\leq\hat{f}(u)\leq\hat{f}(y_{0})=f(p_{k-1}). Moreover, since f(b(1))f(y0)ατ2f(b^{(1)})\geq f(y_{0})-\alpha\tau^{2} and τ=α32L3\tau=\sqrt{\frac{\alpha}{32L_{3}}}, we may apply Lemma 6 to obtain

B.4 Proof of Theorem 2

The proof proceeds exactly like the proof of Theorem 1. As argued there, the number of gradient evaluations is at most 2KT2KT, where KK is number of iterations of Guarded-non-convex-AGD and TT is the maximum amount of steps performed in any call to AGD-until-proven-guilty.

We derive the upper bound on KK directly from Lemma 7, as by telescoping (13) we obtain

where the last transition follows from substituting (22), our choice of α\alpha. We therefore conclude that

where log+()\log_{+}(\cdot) is shorthand for max{0,log()}\max\{0,\log(\cdot)\}.

Finally, we use ϵ2/3min{Δf1/2L31/6,L1/(8L31/3)}\epsilon^{2/3}\leq\min\{\Delta_{f}^{1/2}L_{3}^{1/6},L_{1}/(8L_{3}^{1/3})\} to simplify the bounds on KK and TT. Using 1ΔfL31/3ϵ4/31\leq\Delta_{f}L_{3}^{1/3}\epsilon^{-4/3} reduces (17) to

Applying 1L1/(8L31/3ϵ2/3)1\leq L_{1}/(8{L_{3}^{1/3}\epsilon^{2/3}}) on (18) gives

Appendix C Adding a second-order guarantee

In this section, we sketch how to obtain simultaneous guarantees on the gradient and minimum eigenvalue of the Hessian. We use the O~()\widetilde{O}(\cdot) notation to hide logarithmic dependence on ϵ\epsilon, Lipschitz constants Δf,L1,L2,L3\Delta_{f},L_{1},L_{2},L_{3} and a high probability confidence parameter δ(0,1)\delta\in(0,1), as well as lower order polynomial terms in ϵ1\epsilon^{-1}.

Using approximate eigenvector computation, we can efficiently generate a direction of negative curvature, unless the Hessian is almost positive semi-definite. More explicitly, there exist methods of the form Approx-Eig(ff, xx, L1L_{1}, α\alpha, δ\delta) that require O~(L1/αlogd)\widetilde{O}(\sqrt{L_{1}/\alpha}\log d) Hessian-vector products to produce a unit vector vv such that whenever 2f(x)αI\nabla^{2}f(x)\succeq-\alpha I, with probability at least 1δ1-\delta we have vT2f(x)vα/2v^{T}\nabla^{2}f(x)v\leq-\alpha/2, e.g. the Lanczos method (see additional discussion in (Carmon et al., 2016, §2.2)). Whenever a unit vector vv satisfying vT2f(x)vα/2v^{T}\nabla^{2}f(x)v\leq-\alpha/2 is available, we can use it to make function progress. If 2f\nabla^{2}f is L2L_{2}-Lipschitz continuous then by Lemma 1 f(x±αL2v)<f(x)α312L22f(x\pm\frac{\alpha}{L_{2}}v)<f(x)-\frac{\alpha^{3}}{12L_{2}^{2}} where by f(x±z)f(x\pm z) we mean min{f(x+z),f(xz)}\min\{f(x+z),f(x-z)\}. If instead ff has L3L_{3}-Lipschitz continuous third-order derivatives then by Lemma 4, f(x±2αL3v)<f(x)α24L3f(x\pm\sqrt{\frac{2\alpha}{L_{3}}}v)<f(x)-\frac{\alpha^{2}}{4L_{3}}.

We can combine Approx-Eig with Algorithm 3 that finds a point with a small gradient as follows:

As discussed above, under third order smoothness , η=2α/L3\eta=\sqrt{{2\alpha}/{L_{3}}} guarantees that the step (41c) makes at least α2/(4L3)\alpha^{2}/(4L_{3}) function progress whenever vkT2f(z^k)vkα/2v_{k}^{T}\nabla^{2}f(\hat{z}_{k})v_{k}\leq-\alpha/2. Therefore the above iteration can run at most O~(ΔfL3/α2)\widetilde{O}(\Delta_{f}L_{3}/\alpha^{2}) times before vkT2f(z^k)vkα/2v_{k}^{T}\nabla^{2}f(\hat{z}_{k})v_{k}\geq-\alpha/2 is satisfied. Whenever vkT2f(z^k)vkα/2v_{k}^{T}\nabla^{2}f(\hat{z}_{k})v_{k}\geq-\alpha/2, with probability 1δk1-\delta^{\prime}\cdot k we have the Hessian guarantee 2f(z^k)αI\nabla^{2}f(\hat{z}_{k})\succeq-\alpha I. Moreover, f(z^k)ϵ\left\|{\nabla f(\hat{z}_{k})}\right\|\leq\epsilon always holds. Thus, by setting α=L31/3ϵ2/3\alpha=L_{3}^{1/3}\epsilon^{2/3} we obtain the required second order stationarity guarantee upon termination of the iterations (41).

It remains to bound the computational cost of the method, with α=L31/3ϵ2/3\alpha=L_{3}^{1/3}\epsilon^{2/3}. The total number of Hessian-vector products required by Approx-Eig is,

Moreover, it is readily seen from the proof of Theorem 2 that every evaluation of (41a) requires at most

gradient and function evaluations. By telescoping the first term and multiplying the second by O~(ΔfL3/α2)\widetilde{O}(\Delta_{f}L_{3}/\alpha^{2}), we guarantee f(x)ϵ\left\|{\nabla f(x)}\right\|\leq\epsilon and 2f(x)L31/3ϵ2/3I\nabla^{2}f(x)\succeq-L_{3}^{1/3}\epsilon^{2/3}I in at most O~(ΔfL11/2L31/6ϵ5/3logd)\widetilde{O}(\Delta_{f}L_{1}^{1/2}L_{3}^{1/6}\epsilon^{-5/3}\log d) function, gradient and Hessian-vector product evaluations.

The argument above is the same as the one used to prove Theorem 4.3 of Carmon et al. (2016), but our improved guarantees under third order smoothness allows us get a better ϵ\epsilon dependence for the complexity and lower bound on the Hessian in that regime. If instead we use the second order smoothness setting, we recover exactly the guarantees of Carmon et al. (2016); Agarwal et al. (2016), namely f(x)ϵ\left\|{\nabla f(x)}\right\|\leq\epsilon and 2f(x)L21/2ϵ1/2I\nabla^{2}f(x)\succeq-L_{2}^{1/2}\epsilon^{1/2}I in at most O~(ΔfL11/2L21/4ϵ7/4logd)\widetilde{O}(\Delta_{f}L_{1}^{1/2}L_{2}^{1/4}\epsilon^{-7/4}\log d) function, gradient and Hessian-vector product evaluations.

Finally, we remark that the above analysis would still apply if in (41a) we replace Guarded-non-convex-AGD with any method with a run -time guarantee of the form (42). The resulting method will guarantee whatever the original method does, and also 2f(x)αI\nabla^{2}f(x)\succeq-\alpha I. In particular, if the first method guarantees a small gradient, the combined method guarantees convergence to second-order stationary points.

Appendix D Experiment details

Both gradient descent and AGD are based on gradients steps of the form

In practice L1L_{1} is often unknown and non-uniform, and therefore needs to be estimated adaptively. A common approach is backtracking line search, which we use for conjugate gradient. However, combining line search with AGD without invalidating its performance guarantees would involve non-trivial modification of the proposed method. Therefore, for the rest of the methods we keep an estimate of L1L_{1}, and double it whenever the gradient steps fails to make sufficient progress. That is, whenever

we set L12L1L_{1}\leftarrow 2L_{1} and try again. In all experiments we start with L1=1L_{1}=1, which underestimates the actual smoothness of ff by 2-3 orders of magnitude. We call our scheme for setting L1L_{1} semi-adaptive, since we only increase L1L_{1}, and therefore do not adapt to situations where the function becomes more smooth as optimization progresses. Thus, we avoid painstaking tuning of L1L_{1} while preserving the ‘fixed step-size’ nature of our approach, as L1L_{1} is only doubled a small number of times.

We implement Guarded-non-convex-AGD with the following modifications, indented to make it more practical without substantially compromising its theoretical properties.

We use the semi-adaptive scheme described above to set LL. Specifically, whenever the gradient steps in lines 3 and 3 of AGD-until-proven-guilty and Certify-progress respectively fail, we double LL until it succeeds, terminate AGD-until-proven-guilty and multiply L1L_{1} by the same factor.

We make the input parameters for AGD-until-proven-guilty dynamic. In particular, we set ϵ=f(pk1)/10\epsilon^{\prime}=\left\|{\nabla f(p_{k-1})}\right\|/10 and use α=σ=C1f(pk1)2/3\alpha=\sigma=C_{1}\left\|{\nabla f(p_{k-1})}\right\|^{2/3}, where C1C_{1} is a hyper-parameter. We use the same value of α\alpha to construct f^\hat{f}. This makes our implementation independent on the final desired accuracy ϵ\epsilon.

Since this inequality is a clear convexity violation, we return wt=ytw_{t}=y_{t} whenever it holds. We find that this substantially increases our method’s capability of detecting negative curvature; most of the non-convexity detection in the first experiment is due to this check.

for the 2t2t points of the form v=xjv=x_{j} and u=yju=y_{j} or u=wtu=w_{t}, with 0j<t0\leq j<t, where here we use the original ff rather than f^\hat{f} given to AGD-until-proven-guilty. We discard all pairs with αv,u<0\alpha_{v,u}<0 (no evidence of negative curvature), and select the 5 pairs with highest value of αv,u\alpha_{v,u}. For each selected pair v,uv,u, we exploit negative curvature by testing all the points of the form {z±ηδ}\{z\pm\eta\delta\} with δ=(uv)/uv\delta=(u-v)/\left\|{u-v}\right\|, z{v,u}z\in\{v,u\} and η\eta in a grid of 10 points log-uniformly spaced between 0.01uv0.01\left\|{u-v}\right\| and 100(u+v)100(\left\|{u}\right\|+\left\|{v}\right\|).

The hyper-parameter C1C_{1} was tuned separately for each experiment by searching on a small grid. For the regression experiment the tuning was performed on different problem instances (different seeds) than the ones reported in Fig. 1. For the neural network training problem the tuning was performed on a subsample of 10% one reported in Fig. 2. The specific parameters used were C1=0.01C_{1}=0.01 for regression and C1=0.1C_{1}=0.1 for neural network training.

This method is identical to the one described above, except that at every iteration pkp_{k} is set to b(1)b^{(1)} produced by Find-best-iterate3 (i.e. the output of negative curvature exploitation is never used). We used the same hyper-parameters described above.

Gradient descent descent is simply (43), with yt+1=xt+1y_{t+1}=x_{t+1}, where the semi-adaptive scheme is used to set L1L_{1}.

We use the accelerated gradient descent scheme of Beck and Teboulle (2009) with ωt=t/(t+3)\omega_{t}=t/(t+3). We use the restart scheme given by O’Donoghue and Candès (2015) where if f(yt)>f(yt1)f(y_{t})>f(y_{t-1}) then we restart the algorithm from the point yty_{t}. For the gradient steps we use the same semi-adaptive procedure described above and also restart the algorithm whenever the L1L_{1} estimate changes (restarts performed for this reason are not shown in Fig. 1 and 2).

The method is given by the following recursion Polak and Ribière (1969),

where δ0=0\delta_{0}=0 and ηt\eta_{t} is found via backtracking line search, as follows. If δTf(xt)0\delta^{T}\nabla f(x_{t})\geq 0 we set δt=f(xt)\delta_{t}=-\nabla f(x_{t}) (truncating the recursion). We set ηt=2ηt1\eta_{t}=2\eta_{t-1} and then check whether

holds. If it does we keep the value of ηt\eta_{t}, and if it does not we set ηt=ηt/2\eta_{t}=\eta_{t}/2 and repeat. The key difference from the semi-adaptive scheme used for the rest of the methods is the initialization ηt=2ηt1\eta_{t}=2\eta_{t-1}, that allows the step size to grow. Performing line search is crucial for conjugate gradient to succeed, as otherwise it cannot produce approximately conjugate directions. If instead we use the semi-adaptive step size scheme, performance becomes very similar to that of gradient descent.

In the figures, the x-axis is set to the number of steps performed by the methods. We do this because it enables a one-to-one comparison between the steps of the restarted AGD and Algorithm 3. However, Algorithm 3 requires twice the number of gradient evaluations per step of the other algorithms. Furthermore, the number of function evaluations of Algorithm 3 increases substantially when we exploit negative curvature, due to our naive grid search procedure. Nonetheless, we believe it is possible to derive a variation of our approach that performs only one gradient computation per step, and yet maintains similar performance (see remark after Corollary 1, and that effective negative curvature exploitation can be carried out with only few function evaluations, using a line search.

While the rest of the methods tested require one gradient evaluation per step, the required number of function evaluations differs. GD requires only one function evaluation per step, while RAGD evaluates ff twice per step (at xtx_{t} and yty_{t}); the number of additional function evaluations due to the semi-adaptive scheme is negligible. NCG is expected to require more function evaluations due to its use of a backtracking line search. In the first experiment, NCG required 2 function evaluations per step on average, indicating that its L1L_{1} estimate was stable for long durations. Alg. 3 required 5.3 function evaluations per step (on average over the 1,000 problem instances, with standard deviation 0.5), putting the amortized cost of our crude negative curvature exploitation scheme at 3.3 function evaluations per step.

D.2 Neural network training

The function ff is the average cross-entropy loss of 10-way prediction of class labels from input features. The prediction if formed by applying softmax on the output of a neural network with three hidden layers of 20, 10 and 5 units and tanh\tanh activations. To obtain data features we perform the following preprocessing, where the training examples are treated as 28228^{2} dimensional vectors. First, each example is separately normalized to zero mean and unit variance. Then, the 282×28228^{2}\times 28^{2} data covariance matrix is formed, and a projection to the 10 principle components is found via eigen-decomposition. The projection is then applied to the training set, and then each of the 10 resulting features is normalized to have zero mean and unit variance across the training set. The resulting model has d=545d=545 parameters and underfits the 60,000 examples training set. We randomly initialize the weights according the well-known scaling proposed by Glorot and Bengio (2010). We repeated the experiment for 10 different initializations of the weights, and all results were consistent with those reported in Fig. 2.