Efficient approaches for escaping higher order saddle points in non-convex optimization

Anima Anandkumar, Rong Ge

Introduction

Recent trend in applied machine learning has been dominated by the use of large-scale non-convex optimization, e.g. deep learning. However, analyzing non-convex optimization in high dimensions is very challenging. Current theoretical results are mostly negative regarding the hardness of reaching the globally optimal solution.

Less attention is paid to the issue of reaching a locally optimal solution. In fact, even this is computationally hard in the worst case (Nie, 2015). The hardness arises due to diversity and ubiquity of critical points in high dimensions. In addition to local optima, the set of critical points also consists of saddle points, which possess directions along which the objective value improves. Since the objective function can be arbitrarily bad at these points, it is important to develop strategies to escape them, in order to reach a local optimum.

The problem of saddle points is compounded in high dimensions. Due to curse of dimensionality, the number of saddle points grows exponentially for many problems of interest, e.g. (Auer et al., 1996; Cartwright and Sturmfels, 2013; Auffinger et al., 2013). Ordinary gradient descent can be stuck in a saddle point for an arbitrarily long time before making progress. A few recent works have addressed this issue, either by incorporating second order Hessian information (Nesterov and Polyak, 2006) or through noisy stochastic gradient descent (Ge et al., 2015). These works however require the Hessian matrix at the saddle point to have a strictly negative eigenvalue, termed as the strict saddle condition. The time to escape the saddle point depends (polynomially) on this negative eigenvalue. Some structured problems such as complete dictionary learning, phase retrieval and orthogonal tensor decomposition possess this property (Sun et al., 2015).

On the other hand, for problems without the strict saddle property, the above techniques can converge to a saddle point, which is disguised as a local minimum when only first and second order information is used. We address this problem in this work, and extend the notion of second order optimality to higher order optimality conditions. We propose a new efficient algorithm that is guaranteed to converge to a third order local minimum, and show that it is NP-hard to find a fourth order local minimum.

Our results are relevant for a wide range of non-convex problems which possess degenerate critical points. At these points, the Hessian matrix is singular. Such points arise due to symmetries in the optimization problem, e.g., permutation symmetry in a multi-layer neural network. Singularities also arise in over-specified models, where the model capacity (such as the number of neurons in neural networks) exceeds the complexity of the target function. Here, certain neurons can be eliminated (i.e. have weights set to zero), and such critical points possess the so-called elimination singularity (Wei et al., 2008). Alternatively, two neurons can have the same weight, and this is known as overlap singularity (Wei et al., 2008). The Hessian matrix is singular at such critical points. This behavior is limited not just to neural networks, but has also been studied in overspecified Gaussian mixtures, radial basis function networks, ARMA models of time series (Amari et al., 2006; Wei et al., 2008), and student-teacher networks, also known as soft committee models (Saad and Solla, 1995; Inoue et al., 2003).

The current trend in practice is to incorporate overspecified models (Giles, 2001). Theoretically, bad local optima are guaranteed to disappear in neural networks under massive levels of overspecification (Safran and Shamir, 2015). On the other hand, as discussed above, the saddle point problem is compounded in these overspecified models. Empirically, the presence of singular saddle points is found to slow down learning substantially (Saad and Solla, 1995; Inoue et al., 2003; Amari et al., 2006; Wei et al., 2008). Intuitively, these singular saddle points are surrounded by plateaus or flat regions with a sub-optimal objective value. For these regions neither the gradient or Hessian information can lead to a direction that improves the function value. Therefore they can “fool” the (ordinary) first and second order algorithms and they may stuck there for long periods of time. Higher order derivatives are needed to classify the point as either a local optimum or a saddle point. In this work, we tackle this challenging problem of escaping such higher order saddle points.

We call a point xx a p\mboxthp^{{\mbox{\tiny th}}} order local minimum if for any nearby point yy f(x)f(y)o(xyp)f(x)-f(y)\leq o(\|x-y\|^{p}) (see Definition 1).

We give a necessary and sufficient condition for a point xx to be a third order local minimum (see Section 4). Similar conditions (for even higher order) have been discussed in previous works, however their algorithmic implications were not known. We design an algorithm that is guaranteed to find a third order local minimum.

(Informal) There is an algorithm that always converges to a third order local minimum (see Theorem 9). Also, in polynomial time the algorithm can find a point that is “similar” to a third order local minimum (see Theorem 8).

By “similar” we mean the point xx approximately satisfies the necessary and sufficient condition for third order local minimum (see Definition 4): the gradient f(x)\nabla f(x) is small, Hessian 2f(x)\nabla^{2}f(x) is almost positive semidefinite (p.s.d) and in every subspace where the Hessian is small, the norm of the third order derivatives is also small.

To the best of our knowledge this is the first algorithm that is guaranteed to converge to a third order local minimum. The algorithm alternates between a second order step (which we use cubic regularization(Nesterov and Polyak, 2006)) and a third order step. The third order step first identifies a “competitive subspace” where the third order derivative has a much larger norm than the second order. It then tries to find a good direction in this subspace to make improvement. For more details see Section 5.

We also show that it is NP-hard to find a fourth order local minimum:

(Informal) Even for a well-behaved function, it is NP-hard to find a fourth order local minimum (see Theorem 10).

2 Related Work

A popular approach to overcoming saddle points is to incorporate second order information. However, the popular second order approach of Newton’s method is not suitable since it converges to an arbitrary critical point, and does not distinguish between a local minimum and a saddle point. Directions along negative values of the Hessian matrix help in escaping the saddle point. A simple solution is then to use these directions, whenever gradient descent improvements are small (which signals the approach towards a critical point) (Frieze et al., 1996; Vempala and Xiao, 2011).

A more elegant framework is the so-called trust region method (Dauphin et al., 2014; Sun et al., 2015) which involves optimizing the second order Taylor’s approximation of the objective function in a local neighborhood of the current point. Intuitively, this objective “switches” smoothly between first order and second order updates. Nesterov and Polyak (2006) propose adding a cubic regularization term to this Taylor’s approximation. In a beautiful result, they show that in each step, this cubic regularized objective can be solved optimally due to hidden convexity and overall, the algorithm converges to a local optimum in bounded time. We give an overview of this algorithm in Section 3. Baes (2009) generalizes this idea to use higher order Taylor expansion, however the optimization problem is intractable even for third order Taylor expansion with quartic regularizer. Ge et al. (2015) recently showed that it is possible to escape saddle points using only first order information based on noisy stochastic gradient descent (SGD) in polynomial time in high dimensions. Lee et al. (2016) showed that even without adding noise, in the limit gradient descent converges to (second order) local minimum with random initialization. In many applications, these first-order algorithms are far cheaper than the computation of the Hessian eigenvectors. Nie (2015) propose using the hierarchy of semi-definite relaxations to compute all the local optima which satisfy first and second order necessary conditions based on semi-definite relaxations.

All the above works deal with local optimality based on second order conditions. When the Hessian matrix is singular and p.s.d., higher order derivatives are required to determine whether it is a local optimum or a saddle point. Higher order optimality conditions, both necessary and sufficient, have been characterized before, e.g. (Bernstein, 1984; Warga, 1986). But these conditions are not efficiently computable, and it is NP-hard to determine local optimality, given such information about higher order derivatives (Murty and Kabadi, 1987).

Preliminaries

In this section we first introduce the classifications of saddle points. Next, as we often work with third order derivatives, and we treat it as a order 3 tensor, we introduce the necessary notations for tensors.

For such smooth function f(x)f(x), we say xx is a critical point if f(x)=0\nabla f(x)=\vec{0}. Traditionally, critical points are classified into four cases according to the Hessian matrix:

(Local Minimum) All eigenvalues of 2f(x)\nabla^{2}f(x) are positive.

(Local Maximum) All eigenvalues of 2f(x)\nabla^{2}f(x) are negative.

(Strict saddle) 2f(x)\nabla^{2}f(x) has at least one positive and one negative eigenvalues.

(Degenerate) 2f(x)\nabla^{2}f(x) has either nonnegative or nonpositive eigenvalues, with some eigenvalues equal to 0.

As we shall see later in Section 3, for the first three cases second order algorithms can either find a direction to reduce the function value (in case of local maximum or strict saddle), or correct asserting that the current point is a local minimum. However, second order algorithms cannot handle degenerate saddle points.

Degeneracy of Hessian indicates the presence of a gutter structure, where a set of connected points all have the same value, and all are local minima, maxima or saddle points (Dauphin et al., 2014). See for example Figure 1 (c) (d).

If the Hessian at a critical point xx is p.s.d., even if it has 0 eigenvalues we can say the point is a second order local minimum: for any yy that is sufficiently close to xx, we have f(x)f(y)=o(xy2)f(x)-f(y)=o(\|x-y\|^{2}). That is, although there might be a vector yy that makes the function value decrease, the amount of decrease is a lower order term compared to xy2\|x-y\|^{2}. In this paper we consider higher order local minimum:

A critical point xx is a pp-th order local minimum, if there exists constants C,ϵ>0C,\epsilon>0 such that for every yy with yxϵ\|y-x\|\leq\epsilon,

Every critical point is a first order local minimum, and every point that satisfies the second order necessary condition (f(x)=0,2f(x)0\nabla f(x)=0,\nabla^{2}f(x)\succeq 0) is a second order local minimum.

2 Matrix and Tensor Notations

In this decomposition viv_{i}’s are orthonormal vectors, and λi\lambda_{i}’s are the eigenvalues of MM. We always assume λ1λ2λn\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{n}. We use λ1(M)\lambda_{1}(M) to denote its largest eigenvalue and λn(M)\lambda_{n}(M) to denote its smallest eigenvalue. By the property of symmetric matrices we also know M=max{λ1(M),λn(M)}\|M\|=\max\{|\lambda_{1}(M)|,|\lambda_{n}(M)|\}. We use MF\|M\|_{F} to denote the Frobenius norm of the matrix MF=i,j[n]Mi,j2\|M\|_{F}=\sqrt{\sum_{i,j\in[n]}M_{i,j}^{2}}.

The third order derivative is represented by a n×n×nn\times n\times n tensor TT. We use the following multilinear notation to simplify the notations of tensors:

The Frobenius norm of a tensor TT is defined similarly as matrices: TF=i,j,k[n]Ti,j,k2\|T\|_{F}=\sqrt{\sum_{i,j,k\in[n]}T_{i,j,k}^{2}}. The spectral norm (also called injective norm) of a tensor is defined as

We say a tensor is symmetric if Ti,j,k=Tπ(i,j,k)T_{i,j,k}=T_{\pi(i,j,k)} for any permutation of the indices. For symmetric tensors the spectral norm is also equal to T=maxu=1T(u,u,u)\|T\|=\max_{\|u\|=1}T(u,u,u). In both cases it is NP-hard to compute the spectral norm of a tensor(Hillar and Lim, 2013).

We will often need to project a tensor TT to a subspace P{\cal P}. Let PP be the projection matrix to the subspace PP, we use the notation \mboxProjPT\mbox{Proj}_{\cal P}T which denotes T(P,P,P)T(P,P,P). Intuitively, [T(P,P,P)]u,v,w=T(Pu,Pv,Pw)[T(P,P,P)]_{u,v,w}=T(Pu,Pv,Pw), that is, the projected tensor applied to vector u,v,wu,v,w is equivalent to the original tensor applied to the projection of u,v,wu,v,w.

Overview of Nestorov’s Cubic Regularization

In this section we review the guarantees of Nesterov’s Cubic Regularization algorithm(Nesterov and Polyak, 2006). We will use this algorithm as a key step later in Section 5, and prove analogous results for third order local minimum.

The algorithm requires the first two order derivatives exist and the following smoothness constraint:

At a point xx, the algorithm tries to find a nearby point zz that optimizes the degree two Taylor’s expansion: f(x)+f(x),zx+12(zx)(2f(x))(zx)f(x)+\langle\nabla f(x),z-x\rangle+\frac{1}{2}(z-x)^{\top}(\nabla^{2}f(x))(z-x), with the cubic distance R6zx3\frac{R}{6}\|z-x\|^{3} as a regularizer. See Algorithm 1 for one iteration of the algorithm. The final algorithm generates a sequence of points x(0),x(1),x(2),x^{(0)},x^{(1)},x^{(2)},\ldots where x(i+1)=\mboxCubicReg(x(i))x^{(i+1)}=\mbox{CubicReg}(x^{(i)}).

The optimization problem that Algorithm 1 tries to solve may seem difficult, as it has a cubic regularizer zx3\|z-x\|^{3}. However, Nesterov and Polyak (2006) showed that it is possible to solve this optimization problem in polynomial time.

For each point, define μ(z)\mu(z) to measure how close the point zz is to satisfying the second order optimality condition:

μ(z)=max{1Rf(z),23Rλn2f(z)}\mu(z)=\max\left\{\sqrt{\frac{1}{R}\|\nabla f(z)\|},-\frac{2}{3R}\lambda_{n}\nabla^{2}f(z)\right\}

When μ(z)=0\mu(z)=0 we know f(z)=0\nabla f(z)=0 and 2f(z)0\nabla^{2}f(z)\succeq 0, which satisfies the second order necessary conditions (and in fact implies that zz is a second order local minimum). When μ(z)\mu(z) is small we can say that the point zz approximately satisfies the second order optimality condition.

For one step of the algorithm the following guarantees can be provenAll of guarantees we stated here correspond to setting the regularizer RR to be exactly equal to the smoothness in Assumption 1.

(Nesterov and Polyak, 2006) Suppose z=\mboxCubicRegularize(x)z=\mbox{CubicRegularize}(x), then zxμ(z)\|z-x\|\geq\mu(z) and f(z)f(x)Rzx3/12f(z)\leq f(x)-R\|z-x\|^{3}/12.

Using Theorem 3, Nesterov and Polyak (2006) can get strong convergence results for the sequence x(0),x(1),x(2),x^{(0)},x^{(1)},x^{(2)},\ldots

(Nesterov and Polyak, 2006) If f(x)f(x) is bounded below by f(x)f(x^{*}), then limiμ(x(i))=0\lim_{i\to\infty}\mu(x^{(i)})=0, and for any t1t\geq 1 we have

This theorem shows that within first tt iterations, we can find a point that “looks similar” to a second order local minimum in the sense that gradient is small and Hessian does not have a negative eigenvalue with large absolute value. It is also possible to prove stronger guarantees for the limit points of the sequence:

(Nesterov and Polyak, 2006) If the level set L(x(0)):={xf(x)f(x(0))}{\cal L}(x^{(0)}):=\{x|f(x)\leq f(x^{(0)})\} is bounded, then the following limit exists

The set XX^{*} of the limit points of this sequence is non-empty. Moreover this is a connected set such that for any xXx\in X^{*} we have

Therefore the algorithm always converges to a set of points that are all second order local minima.

Third Order Necessary Condition

In this section we present a condition for a point to be a third order local minimum, and show that it is necessary and sufficient for a class of smooth functions. Proofs are deferred to Appendix A.1.

All the functions we consider satisfies the following natural smoothness conditions

Under this assumption, we state our conditions for a point to be a third order local minimum.

A point xx satisfy third-order necessary condition, if

For any uu that satisfy u(2f(x))u=0u^{\top}(\nabla^{2}f(x))u=0, [3f(x)](u,u,u)=0[\nabla^{3}f(x)](u,u,u)=0.

We first note that this condition can be verified in polynomial time.

Conditions in Definition 4 can be verified in polynomial time given the gradients f(x),2f(x)\nabla f(x),\nabla^{2}f(x) and 3f(x)\nabla^{3}f(x).

It is easy to check whether f(x)=0\nabla f(x)=0 and 2f(x)0\nabla^{2}f(x)\succeq 0. We can also use SVD to compute the subspace P{\cal P} such that u(2f(x))u=0u^{\top}(\nabla^{2}f(x))u=0 if and only if uPu\in{\cal P}.

Now we can compute the projection of 3f(x)\nabla^{3}f(x) in the subspace P{\cal P}, and we claim the third condition is violated if and only if the projection is nonzero.

Given a function ff that satisfies Assumption 2, a point xx is third order optimal if and only if it satisfies Condition 4.

Before proving the theorem, we first show a bound on f(y)f(y) and a Taylor’s expansion of ff at point xx.

The Lemma can be proved by integrating over the third order derivatives three times and bounding the differences. Details are deferred to Appendix A.1.

This lemmas allow us to ignore the fourth order term yx4\|y-x\|^{4} and focus on the order 3 Taylor expansion when yx\|y-x\| is small. To prove Theorem 6, intuitively, the “only if” direction (local minimum to necessary condition) is easy because if any condition in Definition 4 is violated, we can use that particular derivative to find a direction that improves the function value. For the “if” direction (necessary condition to third order local minimum), the main challenge is to balance the contribution we get from the positive part of the Hessian matrix and the third order derivatives. For details see Appendix A.1.

Algorithm for Finding Third Order Optimal Points

We design an algorithm that is guaranteed to converge to a third order local minimum. Throughout this section we assume both Assumptions 1 and 2 Note that we actually only cares about a level set L={xf(x)f(x(0))}{\cal L}=\{x|f(x)\leq f(x^{(0)})\}, as long as this set is bounded Assumptions 1 follows from Assumption 2.

The main intuition of the algorithm is similar to the proof of Theorem 6: the algorithm tries to make improvements using first, second or third order information. However, the nature of the third order condition makes it challenging for the algorithm to guarantee progress.

Consider a potential local minimum point xx. It is very easy to check whether f(x)0\nabla f(x)\neq 0 or λmin(2f(x))<0\lambda_{min}(\nabla^{2}f(x))<0, and to make progress using the corresponding directions. However, to verify Condition 3 in Definition 4, we need to do it in the right subspace.

The naïve guess is that we should take the eigensubspace of 2f(x)\nabla^{2}f(x) with eigenvalue at most 0. However, this is not correct because even if xx is a second order local minimum that does not satisfy the third order condition, it is still possible to have a sequence of x(i)x^{(i)}’s that converge to xx with 2f(x(i))\nabla^{2}f(x^{(i)}) all be strictly positive definite. Hence all the x(i)x^{(i)}’s appear to satisfy Condition 3 in Definition 4. We do not want to the algorithm to spend too much time around this point xx, so we need to identify a subspace that may have some positive eigenvalues. In order to make sure we can find a vector the contribution from third order term is larger than the second order term, we define competitive subspace below:

For any symmetric matrix MM, let its eigendecomposition be M=i=1nλiviviM=\sum_{i=1}^{n}\lambda_{i}v_{i}v_{i}^{\top} (where λi\lambda_{i}’s are eigenvalues and vi=1\|v_{i}\|=1), we use Sτ(M){\cal S}_{\tau}(M) to denote the span of eigenvectors with eigenvalue at most τ\tau. That is

For any Q>0Q>0, and any point zz, let the competitive subspace S(z){\cal S}(z) be the largest eigensubspace Sτ(2f(z)){\cal S}_{\tau}(\nabla^{2}f(z)), such that if we let CQ(z)C_{Q}(z) be the norm of the third order derivatives in this subspace

If no such subspace exists then let S(z){\cal S}(z) be empty and CQ(z)=0C_{Q}(z)=0.

Similar to μ(z)\mu(z) as in Definition 3, CQ(z)C_{Q}(z) can be viewed as how Condition 3 in Definition 4 is satisfied approximately. If both μ(z)\mu(z) and CQ(z)C_{Q}(z) are then the point zz satisfies third order necessary conditions.

Intuitively, competitive subspace is a subspace where the eigenvalues of the Hessian are small, but the Frobenius norm of the third order derivative is large. Therefore we are likely to make progress using the third order information. The parameters in Definition 6 are set so that if there is a unit vector uS(z)u\in{\cal S}(z) such that [3f(z)](u,u,u)\mboxProjS(z)3f(z)F/Q[\nabla^{3}f(z)](u,u,u)\geq\|\mbox{Proj}_{{\cal S}(z)}\nabla^{3}f(z)\|_{F}/Q (see Theorem 7), then we can find a new point where the sum of second, third and fourth order term can be bounded (see Lemma 2).

The competitive subspace in Definition 6 can be computed in polynomial time, see Algorithm 4. The main idea is that we can compute the eigendecomposition of the Hessian 2f(z)=i=1nλivivi\nabla^{2}f(z)=\sum_{i=1}^{n}\lambda_{i}v_{i}v_{i}^{\top}, and then there are only nn different subspaces (\mboxspan{vn},\mboxspan{vn1,vn},\mbox{span}\{v_{n}\},\mbox{span}\{v_{n-1},v_{n}\}, ,\mboxspan{v1,v2,vn}\ldots,\mbox{span}\{v_{1},v_{2},\ldots v_{n}\}). We can enumerate over all of them, and check for which subspaces the norm of the third order derivative is large.

Now we are ready to state the algorithm. The algorithm is a combination of the cubic regularization algorithm and a third order step that tries to use the third order derivative in order to improve the function value in the competitive subspace.

Suppose we have the following approximation guarantee for Algorithm 3

There is a universal constant BB such that the expected number of iterations of Algorithm 3 is at most 22, and the output of Approx is a unit vector uu that satisfies T(u,u,u)\mboxProjSTF/QT(u,u,u)\geq\|\mbox{Proj}_{\cal S}T\|_{F}/Q for Q=Bn1.5Q=Bn^{1.5}.

The proof of this theorem follows directly from anti-concentration (see Appendix A.2. Notice that there are other algorithms that can potentially give better approximation (lower value of QQ) which will improve the rate of our algorithm. However in this paper we do not try to optimize over dependencies over the dimension nn, that is left as an open problem.

By the choice of the parameters in the algorithm, we can get the following guarantee (which is analogous to Theorem 3):

If CQ(z)Q(24ϵ1L)1/3C_{Q}(z)\geq Q(24\epsilon_{1}L)^{1/3}, uu is a unit vector in S(z){\cal S}(z) and [3f(z)](u,u,u)\mboxProjS(z)3f(z)F/Q[\nabla^{3}f(z)](u,u,u)\geq\|\mbox{Proj}_{{\cal S}(z)}\nabla^{3}f(z)\|_{F}/Q. Let x=zCQ(z)/LQux^{\prime}=z-C_{Q}(z)/LQ\cdot u. then we have

Let ϵ=CQ(z)/LQ\epsilon=C_{Q}(z)/LQ, then by Lemma 1 we know

Here ϵ1=f(z)\epsilon_{1}=\|\nabla f(z)\|, and ϵ2CQ(z)212LQ2\epsilon_{2}\leq\frac{C_{Q}(z)^{2}}{12LQ^{2}} by the construction of the subspace.

By the choice of parameters, we know the terms ϵ1ϵ,ϵ2ϵ2/2,Lϵ4/24\epsilon_{1}\epsilon,\epsilon_{2}\epsilon^{2}/2,L\epsilon^{4}/24 are all bounded by ϵ3CQ(z)24Q\frac{\epsilon^{3}C_{Q}(z)}{24Q}, therefore

Using this Lemma, and Theorem 3 for cubic regularization, we can show that both progress measure goes to 0 as the number of steps increase (this is analogous to Theorem 4).

Suppose the algorithm starts at f(x0)f(x_{0}), and ff has global min at f(x)f(x^{*}). Then in one of the tt iterations we have

μ(z)(12(f(x0)f(x)Rt)1/3\mu(z)\leq\left(\frac{12(f(x_{0})-f(x^{*})}{Rt}\right)^{1/3}.

CQ(z)max{Q(24f(z)L)1/3,Q(24L3(f(x0)f(x))t)1/4}.C_{Q}(z)\leq\max\left\{Q(24\|\nabla f(z)\|L)^{1/3},Q\left(\frac{24L^{3}(f(x_{0})-f(x^{*}))}{t}\right)^{1/4}\right\}.

Recall μ(z)=max{1Rf(z),23Rλn2f(z)}\mu(z)=\max\left\{\sqrt{\frac{1}{R}\|\nabla f(z)\|},-\frac{2}{3R}\lambda_{n}\nabla^{2}f(z)\right\} is intuitively measuring how much first and second order progress the algorithm can make. The value CQ(z)C_{Q}(z) as defined in Definition 6 is a measure of how much third order progress the algorithm can make. The theorem shows both values goes to 0 as tt increases (note that even the first term Q(24f(z)L)1/3Q(24\|\nabla f(z)\|L)^{1/3} in the bound for CQ(z)C_{Q}(z) goes to 0 because the f(z)\|\nabla f(z)\| goes to 0).

By the guarantees of Theorem 3 and Lemma 2, we know the sequence of points x(0),z(0),,x(i),z(i),x^{(0)},z^{(0)},\ldots,x^{(i)},z^{(i)},\ldots has non-increasing function values. Also,

So there must be an iteration where f(x(i))f(x(i1))f(x0)f(x)tf(x^{(i)})-f(x^{(i-1)})\leq\frac{f(x_{0})-f(x^{*})}{t}.

If μ(z)>(12(f(x0)f(x)Rt)1/3\mu(z)>\left(\frac{12(f(x_{0})-f(x^{*})}{Rt}\right)^{1/3}, then Theorem 3 implies f(x(i1))f(z(i1))>f(x0)f(x)tf(x^{(i-1)})-f(z^{(i-1)})>\frac{f(x_{0})-f(x^{*})}{t}, which is impossible.

On the other hand if CQ(z)max{Q(24f(z)L)1/3,Q(24L3(f(x0)f(x))t)1/4}C_{Q}(z)\leq\max\left\{Q(24\|\nabla f(z)\|L)^{1/3},Q\left(\frac{24L^{3}(f(x_{0})-f(x^{*}))}{t}\right)^{1/4}\right\}, then the third order step makes progress, and we know f(z(i1))f(x(i))>f(x0)f(x)tf(z^{(i-1)})-f(x^{(i)})>\frac{f(x_{0})-f(x^{*})}{t}, which is again impossible.

We can also show that when tt goes to infinity the algorithm converges to a third order local minimum (similar to Theorem 5).

When tt goes to infinity, the values f(x(t))f(x^{(t)}) converge. If the level set L(f(x0))={xf(x)f(x0)}{\cal L}(f(x_{0}))=\{x|f(x)\leq f(x_{0})\} is compact, then the sequence of points x(t),z(t)x^{(t)},z^{(t)} has nonempty limit points, and every limit point xx satisfies the third order necessary conditions.

By Theorem 3 and Lemma 2, we know the function value is non-increasing, and it has a lowerbound f(x)f(x^{*}), so the value must converge.

The existence of limit points is guaranteed by the compactness of the level set. The only thing left to prove is that every limit point xx must satisfy the third order necessary conditions.

Notice that f(x(0))limtf(x(t)i=0Rμ(z(i))312+CQ(z(i))424L3Q4f(x^{(0)})-\lim_{t\to\infty}f(x^{(t)}\geq\sum_{i=0}^{\infty}\frac{R\mu(z^{(i)})^{3}}{12}+\frac{C_{Q}(z^{(i)})^{4}}{24L^{3}Q^{4}}, so limiμ(z(i)=0\lim_{i\to\infty}\mu(z^{(i)}=0 and limiCQ(z(i))=0\lim_{i\to\infty}C_{Q}(z^{(i)})=0. Also we know further limiz(i)x(i)=0\lim_{i\to\infty}\|z^{(i)}-x^{(i)}\|=0. Therefore wlog a limit point xx is also a limit point of sequence zz, and limif(z)=0\lim_{i\to\infty}\|\nabla f(z)\|=0. Also we know H=2f(x)H=\nabla^{2}f(x) is PSD, because otherwise points near xx will have nonzero μ(z(i)\mu(z^{(i)} and xx cannot be a limit point.

Now we only need to check the third order condition. Assume towards contradiction that third order condition is not true. The we know the Hessian has a subspace P{\cal P} with eigenvalues, and the third order derivative has norm at least ϵ\epsilon in this subspace. By matrix perturbation theory, when zz is very close to xx, P{\cal P} is very close to Sϵ(z){\cal S}_{\epsilon}(z) for ϵ0\epsilon\to 0, on the other hand the third order tensor also converges to 3f(x)\nabla^{3}f(x) (by Lipschitz condition), so Sϵ(z){\cal S}_{\epsilon}(z) will eventually be a competitive subspace and CQ(z)C_{Q}(z) is at least ϵ/2\epsilon/2 for all zz. However this is impossible as limiCQ(z(i))=0\lim_{i\to\infty}C_{Q}(z^{(i)})=0.

Note that not all third order local minimum can be the limit point for Algorithm 2. This is because if f(x)f(x) has very large third order derivatives but relatively smaller Hessian, even though the Hessian might be positive definite (so xx is in fact a local minimum), Algorithm 2 may still find a non-empty competitive subspace, and will be able to reduce the function value and escape from the saddle point. An example is for the function f(x)=x2100x3+x4f(x)=x^{2}-100x^{3}+x^{4}, x=0x=0 is a local minimum but the algorithm can escape from that and find the global minimum.

In the most general case it is hard to get a convergence rate for the algorithm because the function may have higher order local minima. However, if the function has nice properties then it is possible to prove polynomial rates of convergence.

We say a function is strict third order saddle, if there exists constants α,c1,c2,c3,c4>0\alpha,c_{1},c_{2},c_{3},c_{4}>0 such that for any point xx one of the following is true:

There is a local minimum xx^{*} such that xxc4\|x-x^{*}\|\leq c_{4} and the function is α\alpha-strongly convex restricted to the region {xxx2c4}\{x|\|x-x^{*}\|\leq 2c_{4}\}.

This is a generalization of the strict saddle functions defined in Ge et al. (2015). Even if a function has degenerate saddle points, it may still satisfy this condition.

When t\mboxpoly(n,L,R,Q,f(x0)f(x))max{(1/c1)1.5,(1/c2)3,(1/c3)4.5}t\geq\mbox{poly}(n,L,R,Q,f(x_{0})-f(x^{*}))\max\{(1/c_{1})^{1.5},(1/c_{2})^{3},(1/c_{3})^{4.5}\}, there must be a point z(i)z^{(i)} with iti\leq t that is in case 4 in Definition 7.

Therefore, when t\mboxpoly(n,L,R,Q,f(x0)f(x))max{(1/c1)1.5,(1/c2)3,(1/c3)4.5}t\geq\mbox{poly}(n,L,R,Q,f(x_{0})-f(x^{*}))\max\{(1/c_{1})^{1.5},(1/c_{2})^{3},(1/c_{3})^{4.5}\}, the point zz must satisfy

Therefore the first three cases in Definition 7 cannot happen and zz must be near a local minimum.

Hardness for Finding a fourth order Local Minimum

In this section we show it is hard to find a fourth order local minimum even if the function we consider is very well-behaved.

We say a function ff is well-behaved if it is infinite-order differentiable, and satisfies:

f(x)f(x) has a global minimizer at some point x1\|x\|\leq 1.

f(x)f(x) has bounded first 5 derivatives for x1\|x\|\leq 1.

For any direction x=1\|x\|=1, f(tx)f(tx) is increasing for t1t\geq 1.

It is NP-hard to find a fourth order local minimum of a function f(x)f(x), even if ff is guaranteed to be well-behaved.

The main idea of the proof comes from the fact that we cannot even verify the nonnegativeness of a degree 4 polynomial (hence there are cases where we cannot verify whether a point is a fourth order local minimum or not).

Nesterov (2000); Hillar and Lim (2013) It is NP-hard to tell whether a degree 4 homogeneous polynomial f(x)f(x) is nonnegative.

The NP hardness for nonnegativeness of degree 4 polynomial has been proved has been proved in several ways. In Nesterov (2000) the reduction is from the SUBSET SUM problem, which results in a polynomial that can have exponentially large coefficients and does not rule out FPTAS. However, the reduction in Hillar and Lim (2013) relies on the hardness of copositive matrices, which in turn depends on the hardness of INDEPENDENT SET(Dickinson and Gijben, 2014). This reduction gives a polynomial whose coefficients can be bounded by \mboxpoly(n)\mbox{poly}(n), and a polynomial gap that rules out FPTAS.

To prove Theorem 10 we only need to reduce the nonnegativeness problem in Theorem 11 to the problem of finding a fourth order local minimum. We can convert a degree 4 polynomial to a well behaved function by adding a degree 6 regularizer x6\|x\|^{6}. We shall show when the degree 4 polynomial is nonnegative the 0\vec{0} point is the only fourth order local minimum; when the degree 4 polynomial has negative directions then every fourth order local minimum must have negative function value. The details are deferred to Section A.3.

Conclusion

Complicated structures of saddle points are a major problem for optimization algorithms. In this paper we investigate the possibilities of using higher order derivatives in order to avoid degenerate saddle points. We give the first algorithm that is guaranteed to find a 3rd order local minimum, which can solve some problems caused by degenerate saddle points. However, we also show that the same ideas cannot be generalized to higher orders.

There are still many open problems related to degenerate saddle points and higher order optimization algorithms. Are there interesting class of functions that satisfies the strict 3rd order saddle property (Definition 7)? Can we design a 3rd order optimization algorithm for constrained optimization? We hope this paper inspires more research in these directions and eventually design efficient optimization algorithms whose performance do not suffer from degenerate saddle points.

References

Appendix A Omitted Proofs

The proof follows from integration from xx to yy repeatedly.

By the Lipschitz condition on third order derivative, we know

where h(u)=[0u(3f(x+v(yx))3f(x))dv](yx)h(u)=\left[\int_{0}^{u}(\nabla^{3}f(x+v(y-x))-\nabla^{3}f(x))dv\right](y-x), so h(u)FL2xy2\|h(u)\|_{F}\leq\frac{L}{2}\|x-y\|^{2}.

Now we use the integral for the gradient of ff:

Let g(t)=[0th(u)du](yx)g(t)=\left[\int_{0}^{t}h(u)du\right](y-x), by the bound on h(u)h(u) we know g(t)16xy3\|g(t)\|\leq\frac{1}{6}\|x-y\|^{3}. Finally, we have

The last term is bounded by yx01g(t)dtL24xy4\|y-x\|\int_{0}^{1}\|g(t)\|dt\leq\frac{L}{24}\|x-y\|^{4}.

Given a function ff that satisfies Assumption 2, a point xx is third order optimal if and only if it satisfies Condition 4.

(necessary condition \to third order minimal) By Lemma 1 we know

Now let α\alpha be the smallest nonzero eigenvalue of 2f(x)\nabla^{2}f(x). Let UU be nullspace of 2f(x)\nabla^{2}f(x) and VV be the orthogonal subspace. We break 3f(x)\nabla^{3}f(x) into two tensors G1G_{1} and G2G_{2}, where G1G_{1} is the projection to VVVV\otimes V\otimes V, VVUV\otimes V\otimes U (and its symmetries), and G2G_{2} is the projection to VUUV\otimes U\otimes U (and its symmetries). Note that 3f(x)=G1+G2\nabla^{3}f(x)=G_{1}+G_{2} because the projection on UUUU\otimes U\otimes U is 0 by the third condition. Let β\beta be the max injective norm of G1G_{1} and G2G_{2}.

Now we know for any uUu\in U and vVv\in V,

Now, if ϵ<β/α\epsilon<\beta/\alpha, because u2ϵ\|u\|_{2}\leq\epsilon it is easy to see the sum of first two terms is at least 13αv22\frac{1}{3}\alpha\|v\|_{2}^{2}. Now we can take the mininum of

The minimum is achieved when v=u2β/α\|v\|=\|u\|^{2}\beta/\alpha and the minimum value is u4β2/6α-\|u\|^{4}\beta^{2}/6\alpha. Therefore when u+vβ/α\|u+v\|\leq\beta/\alpha we have

(third order minimal\tonecessary condition) Assume towards contradiction that the necessary condition is not satisfied, but the point xx is third order local optimal.

If the necessary condition is not satisfied, then one of the three cases happens:

In the first case the gradient f(x)0\nabla f(x)\neq 0. In this case, if we let LL^{\prime} be an upperbound the operator norms of the second and third order derivative, then we know

When ϵf(x)1\epsilon\|\nabla f(x)\|\leq 1 and ϵ(2L/3+L/24)1/2\epsilon(2L^{\prime}/3+L/24)\leq 1/2, we have

Therefore the point cannot be a third order local minimum.

In the second case, f(x)=0\nabla f(x)=0, but λmin2f(x)<0\lambda_{min}\nabla^{2}f(x)<0. Let u=1\|u\|=1 be a unit vector such that u(2f(x))u=c<0u^{\top}(\nabla^{2}f(x))u=-c<0. Let LL^{\prime} be the injective norm of 3f(x)\nabla^{3}f(x), then

Therefore whenever ϵ<min{3c/L,3c/4L}\epsilon<\min\{\sqrt{3c/L},3c/4L^{\prime}\} we have f(x+ϵu)f(x)cϵ24f(x+\epsilon u)\leq f(x)-\frac{c\epsilon^{2}}{4}. The point xx cannot be a third order local minimum.

The third case is if f(x)=0\nabla f(x)=0, 2f(x)\nabla^{2}f(x) is positive semidefinite, but there is a direction u=1\|u\|=1 such that u(2f(x))u=0u^{\top}(\nabla^{2}f(x))u=0, but [3f(x)](u,u,u)0[\nabla^{3}f(x)](u,u,u)\neq 0. Without loss of generality we assume [3f(x)](u,u,u)=c>0[\nabla^{3}f(x)](u,u,u)=c>0 (if it is negative we take u-u), then

Therefore whenever ϵ<2c/L\epsilon<2c/L we have f(x+ϵu)f(x)cϵ3/12f(x+\epsilon u)\leq f(x)-c\epsilon^{3}/12 so xx cannot be a third order optimal.

A.2 Algorithm for Competitive Subspace, Proof of Theorem 7

There is a universal constant BB such that the expected number of iterations of Algorithm 3 is at most 22, and the output of Approx is a unit vector uu that satisfies T(u,u,u)\mboxProjSTF/QT(u,u,u)\geq\|\mbox{Proj}_{\cal S}T\|_{F}/Q for Q=Bn1.5Q=Bn^{1.5}.

We use the anti-concentration property for Gaussian random variables

In our case d=3d=3 and we can choose some universal constant ϵ\epsilon such that the probability of p(x)p(x) being small is bounded by 1/31/3. It is easy to check that the variance is lowerbounded by the Frobenius norm squared, so

On the other hand with high probability we know the norm of the Gaussian u^\hat{u} is at most 2n2\sqrt{n}. Therefore with probability at least 1/21/2, T(u^,u^,u^)ϵ\mboxProjSTF|T(\hat{u},\hat{u},\hat{u})|\geq\epsilon\|\mbox{Proj}_{\cal S}T\|_{F} and u^2n\|\hat{u}\|\leq 2\sqrt{n}, therefore T(u,u,u)ϵ8n1.5\mboxProjSTF|T(u,u,u)|\geq\frac{\epsilon}{8n^{1.5}}\|\mbox{Proj}_{\cal S}T\|_{F}. Choosing B=8/ϵB=8/\epsilon implies the theorem.

A.3 Proof of Theorem 10

It is NP-hard to find a fourth order local minimum of a function f(x)f(x), even if ff is guaranteed to be well-behaved.

We reduce the problem of verifying nonnegativenss of degree 4 polynomial to the problem of finding fourth order local minimum.

Now we define the function g(x)=f(x)+x6g(x)=f(x)+\|x\|^{6}. We first show that this function is well-behaved.

Next we show if f(x)f(x) is nonnegative, then 0\vec{0} is the unique fourth order local minimizer.

If f(x)f(x) is nonnegative, then 0\vec{0} is the unique fourth order local minimizer of g(x)g(x).

Suppose x0x\neq 0 is a local minimizer of g(x)g(x) of order at least 1. Let u=x/xu=x/\|x\|. We consider the function g(tu)=f(u)t4+t6g(tu)=f(u)t^{4}+t^{6}. Clearly the only first order local minimizer of g(tu)g(tu) is at t=0t=0. Therefore xx cannot be a first order local minimizer of g(x)g(x).

Finally, we show if f(x)f(x) has a negative direction, then all the local minimizer of g(x)g(x) must have negative value in ff.

If f(x)f(x) is negative for some xx, then if xx is a fourth order local minimum of g(x)g(x) then f(x)<0f(x)<0.

Suppose x0x\neq 0 is a fourth order local minimum of g(x)g(x). Then at least t=1t=1 should be a fourth order local minimum of g(tx)=f(x)t4+t6x6g(tx)=f(x)t^{4}+t^{6}\|x\|^{6}. This is only possible if f(x)<0f(x)<0.

On the other hand, for x=0x=0, suppose z=1\|z\|=1 is a direction where f(z)<0f(z)<0, then f(x)f(x+tz)=f(z)t4t6=Ω(t4)f(x)-f(x+tz)=f(z)t^{4}-t^{6}=\Omega(t^{4}), so x=0x=0 is not a fourth order local minimum.

The theorem follows immediately from the three claims.