Efficient approaches for escaping higher order saddle points in non-convex optimization

Anima Anandkumar, Rong Ge

Introduction

Recent trend in applied machine learning has been dominated by the use of large-scale non-convex optimization, e.g. deep learning. However, analyzing non-convex optimization in high dimensions is very challenging. Current theoretical results are mostly negative regarding the hardness of reaching the globally optimal solution.

Less attention is paid to the issue of reaching a locally optimal solution. In fact, even this is computationally hard in the worst case (Nie, 2015). The hardness arises due to diversity and ubiquity of critical points in high dimensions. In addition to local optima, the set of critical points also consists of saddle points, which possess directions along which the objective value improves. Since the objective function can be arbitrarily bad at these points, it is important to develop strategies to escape them, in order to reach a local optimum.

The problem of saddle points is compounded in high dimensions. Due to curse of dimensionality, the number of saddle points grows exponentially for many problems of interest, e.g. (Auer et al., 1996; Cartwright and Sturmfels, 2013; Auffinger et al., 2013). Ordinary gradient descent can be stuck in a saddle point for an arbitrarily long time before making progress. A few recent works have addressed this issue, either by incorporating second order Hessian information (Nesterov and Polyak, 2006) or through noisy stochastic gradient descent (Ge et al., 2015). These works however require the Hessian matrix at the saddle point to have a strictly negative eigenvalue, termed as the strict saddle condition. The time to escape the saddle point depends (polynomially) on this negative eigenvalue. Some structured problems such as complete dictionary learning, phase retrieval and orthogonal tensor decomposition possess this property (Sun et al., 2015).

On the other hand, for problems without the strict saddle property, the above techniques can converge to a saddle point, which is disguised as a local minimum when only first and second order information is used. We address this problem in this work, and extend the notion of second order optimality to higher order optimality conditions. We propose a new efficient algorithm that is guaranteed to converge to a third order local minimum, and show that it is NP-hard to find a fourth order local minimum.

Our results are relevant for a wide range of non-convex problems which possess degenerate critical points. At these points, the Hessian matrix is singular. Such points arise due to symmetries in the optimization problem, e.g., permutation symmetry in a multi-layer neural network. Singularities also arise in over-specified models, where the model capacity (such as the number of neurons in neural networks) exceeds the complexity of the target function. Here, certain neurons can be eliminated (i.e. have weights set to zero), and such critical points possess the so-called elimination singularity (Wei et al., 2008). Alternatively, two neurons can have the same weight, and this is known as overlap singularity (Wei et al., 2008). The Hessian matrix is singular at such critical points. This behavior is limited not just to neural networks, but has also been studied in overspecified Gaussian mixtures, radial basis function networks, ARMA models of time series (Amari et al., 2006; Wei et al., 2008), and student-teacher networks, also known as soft committee models (Saad and Solla, 1995; Inoue et al., 2003).

The current trend in practice is to incorporate overspecified models (Giles, 2001). Theoretically, bad local optima are guaranteed to disappear in neural networks under massive levels of overspecification (Safran and Shamir, 2015). On the other hand, as discussed above, the saddle point problem is compounded in these overspecified models. Empirically, the presence of singular saddle points is found to slow down learning substantially (Saad and Solla, 1995; Inoue et al., 2003; Amari et al., 2006; Wei et al., 2008). Intuitively, these singular saddle points are surrounded by plateaus or flat regions with a sub-optimal objective value. For these regions neither the gradient or Hessian information can lead to a direction that improves the function value. Therefore they can “fool” the (ordinary) first and second order algorithms and they may stuck there for long periods of time. Higher order derivatives are needed to classify the point as either a local optimum or a saddle point. In this work, we tackle this challenging problem of escaping such higher order saddle points.

We call a point $x$ a $p^{{\mbox{\tiny th}}}$ order local minimum if for any nearby point $y$ $f(x)-f(y)\leq o(\|x-y\|^{p})$ (see Definition 1).

We give a necessary and sufficient condition for a point $x$ to be a third order local minimum (see Section 4). Similar conditions (for even higher order) have been discussed in previous works, however their algorithmic implications were not known. We design an algorithm that is guaranteed to find a third order local minimum.

(Informal) There is an algorithm that always converges to a third order local minimum (see Theorem 9). Also, in polynomial time the algorithm can find a point that is “similar” to a third order local minimum (see Theorem 8).

By “similar” we mean the point $x$ approximately satisfies the necessary and sufficient condition for third order local minimum (see Definition 4): the gradient $\nabla f(x)$ is small, Hessian $\nabla^{2}f(x)$ is almost positive semidefinite (p.s.d) and in every subspace where the Hessian is small, the norm of the third order derivatives is also small.

To the best of our knowledge this is the first algorithm that is guaranteed to converge to a third order local minimum. The algorithm alternates between a second order step (which we use cubic regularization(Nesterov and Polyak, 2006)) and a third order step. The third order step first identifies a “competitive subspace” where the third order derivative has a much larger norm than the second order. It then tries to find a good direction in this subspace to make improvement. For more details see Section 5.

We also show that it is NP-hard to find a fourth order local minimum:

(Informal) Even for a well-behaved function, it is NP-hard to find a fourth order local minimum (see Theorem 10).

2 Related Work

A popular approach to overcoming saddle points is to incorporate second order information. However, the popular second order approach of Newton’s method is not suitable since it converges to an arbitrary critical point, and does not distinguish between a local minimum and a saddle point. Directions along negative values of the Hessian matrix help in escaping the saddle point. A simple solution is then to use these directions, whenever gradient descent improvements are small (which signals the approach towards a critical point) (Frieze et al., 1996; Vempala and Xiao, 2011).

A more elegant framework is the so-called trust region method (Dauphin et al., 2014; Sun et al., 2015) which involves optimizing the second order Taylor’s approximation of the objective function in a local neighborhood of the current point. Intuitively, this objective “switches” smoothly between first order and second order updates. Nesterov and Polyak (2006) propose adding a cubic regularization term to this Taylor’s approximation. In a beautiful result, they show that in each step, this cubic regularized objective can be solved optimally due to hidden convexity and overall, the algorithm converges to a local optimum in bounded time. We give an overview of this algorithm in Section 3. Baes (2009) generalizes this idea to use higher order Taylor expansion, however the optimization problem is intractable even for third order Taylor expansion with quartic regularizer. Ge et al. (2015) recently showed that it is possible to escape saddle points using only first order information based on noisy stochastic gradient descent (SGD) in polynomial time in high dimensions. Lee et al. (2016) showed that even without adding noise, in the limit gradient descent converges to (second order) local minimum with random initialization. In many applications, these first-order algorithms are far cheaper than the computation of the Hessian eigenvectors. Nie (2015) propose using the hierarchy of semi-definite relaxations to compute all the local optima which satisfy first and second order necessary conditions based on semi-definite relaxations.

All the above works deal with local optimality based on second order conditions. When the Hessian matrix is singular and p.s.d., higher order derivatives are required to determine whether it is a local optimum or a saddle point. Higher order optimality conditions, both necessary and sufficient, have been characterized before, e.g. (Bernstein, 1984; Warga, 1986). But these conditions are not efficiently computable, and it is NP-hard to determine local optimality, given such information about higher order derivatives (Murty and Kabadi, 1987).

Preliminaries

In this section we first introduce the classifications of saddle points. Next, as we often work with third order derivatives, and we treat it as a order 3 tensor, we introduce the necessary notations for tensors.

For such smooth function $f(x)$ , we say $x$ is a critical point if $\nabla f(x)=\vec{0}$ . Traditionally, critical points are classified into four cases according to the Hessian matrix:

(Local Minimum) All eigenvalues of $\nabla^{2}f(x)$ are positive.

(Local Maximum) All eigenvalues of $\nabla^{2}f(x)$ are negative.

(Strict saddle) $\nabla^{2}f(x)$ has at least one positive and one negative eigenvalues.

(Degenerate) $\nabla^{2}f(x)$ has either nonnegative or nonpositive eigenvalues, with some eigenvalues equal to 0.

As we shall see later in Section 3, for the first three cases second order algorithms can either find a direction to reduce the function value (in case of local maximum or strict saddle), or correct asserting that the current point is a local minimum. However, second order algorithms cannot handle degenerate saddle points.

Degeneracy of Hessian indicates the presence of a gutter structure, where a set of connected points all have the same value, and all are local minima, maxima or saddle points (Dauphin et al., 2014). See for example Figure 1 (c) (d).

If the Hessian at a critical point $x$ is p.s.d., even if it has 0 eigenvalues we can say the point is a second order local minimum: for any $y$ that is sufficiently close to $x$ , we have $f(x)-f(y)=o(\|x-y\|^{2})$ . That is, although there might be a vector $y$ that makes the function value decrease, the amount of decrease is a lower order term compared to $\|x-y\|^{2}$ . In this paper we consider higher order local minimum:

A critical point $x$ is a $p$ -th order local minimum, if there exists constants $C,\epsilon>0$ such that for every $y$ with $\|y-x\|\leq\epsilon$ ,

Every critical point is a first order local minimum, and every point that satisfies the second order necessary condition ( $\nabla f(x)=0,\nabla^{2}f(x)\succeq 0$ ) is a second order local minimum.

2 Matrix and Tensor Notations

In this decomposition $v_{i}$ ’s are orthonormal vectors, and $\lambda_{i}$ ’s are the eigenvalues of $M$ . We always assume $\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{n}$ . We use $\lambda_{1}(M)$ to denote its largest eigenvalue and $\lambda_{n}(M)$ to denote its smallest eigenvalue. By the property of symmetric matrices we also know $\|M\|=\max\{|\lambda_{1}(M)|,|\lambda_{n}(M)|\}$ . We use $\|M\|_{F}$ to denote the Frobenius norm of the matrix $\|M\|_{F}=\sqrt{\sum_{i,j\in[n]}M_{i,j}^{2}}$ .

The third order derivative is represented by a $n\times n\times n$ tensor $T$ . We use the following multilinear notation to simplify the notations of tensors:

The Frobenius norm of a tensor $T$ is defined similarly as matrices: $\|T\|_{F}=\sqrt{\sum_{i,j,k\in[n]}T_{i,j,k}^{2}}$ . The spectral norm (also called injective norm) of a tensor is defined as

We say a tensor is symmetric if $T_{i,j,k}=T_{\pi(i,j,k)}$ for any permutation of the indices. For symmetric tensors the spectral norm is also equal to $\|T\|=\max_{\|u\|=1}T(u,u,u)$ . In both cases it is NP-hard to compute the spectral norm of a tensor(Hillar and Lim, 2013).

We will often need to project a tensor $T$ to a subspace ${\cal P}$ . Let $P$ be the projection matrix to the subspace $P$ , we use the notation $\mbox{Proj}_{\cal P}T$ which denotes $T(P,P,P)$ . Intuitively, $[T(P,P,P)]_{u,v,w}=T(Pu,Pv,Pw)$ , that is, the projected tensor applied to vector $u,v,w$ is equivalent to the original tensor applied to the projection of $u,v,w$ .

Overview of Nestorov’s Cubic Regularization

In this section we review the guarantees of Nesterov’s Cubic Regularization algorithm(Nesterov and Polyak, 2006). We will use this algorithm as a key step later in Section 5, and prove analogous results for third order local minimum.

The algorithm requires the first two order derivatives exist and the following smoothness constraint:

At a point $x$ , the algorithm tries to find a nearby point $z$ that optimizes the degree two Taylor’s expansion: $f(x)+\langle\nabla f(x),z-x\rangle+\frac{1}{2}(z-x)^{\top}(\nabla^{2}f(x))(z-x)$ , with the cubic distance $\frac{R}{6}\|z-x\|^{3}$ as a regularizer. See Algorithm 1 for one iteration of the algorithm. The final algorithm generates a sequence of points $x^{(0)},x^{(1)},x^{(2)},\ldots$ where $x^{(i+1)}=\mbox{CubicReg}(x^{(i)})$ .

The optimization problem that Algorithm 1 tries to solve may seem difficult, as it has a cubic regularizer $\|z-x\|^{3}$ . However, Nesterov and Polyak (2006) showed that it is possible to solve this optimization problem in polynomial time.

For each point, define $\mu(z)$ to measure how close the point $z$ is to satisfying the second order optimality condition:

$\mu(z)=\max\left\{\sqrt{\frac{1}{R}\|\nabla f(z)\|},-\frac{2}{3R}\lambda_{n}\nabla^{2}f(z)\right\}$

When $\mu(z)=0$ we know $\nabla f(z)=0$ and $\nabla^{2}f(z)\succeq 0$ , which satisfies the second order necessary conditions (and in fact implies that $z$ is a second order local minimum). When $\mu(z)$ is small we can say that the point $z$ approximately satisfies the second order optimality condition.

For one step of the algorithm the following guarantees can be provenAll of guarantees we stated here correspond to setting the regularizer $R$ to be exactly equal to the smoothness in Assumption 1.

(Nesterov and Polyak, 2006) Suppose $z=\mbox{CubicRegularize}(x)$ , then $\|z-x\|\geq\mu(z)$ and $f(z)\leq f(x)-R\|z-x\|^{3}/12$ .

Using Theorem 3, Nesterov and Polyak (2006) can get strong convergence results for the sequence $x^{(0)},x^{(1)},x^{(2)},\ldots$

(Nesterov and Polyak, 2006) If $f(x)$ is bounded below by $f(x^{*})$ , then $\lim_{i\to\infty}\mu(x^{(i)})=0$ , and for any $t\geq 1$ we have

This theorem shows that within first $t$ iterations, we can find a point that “looks similar” to a second order local minimum in the sense that gradient is small and Hessian does not have a negative eigenvalue with large absolute value. It is also possible to prove stronger guarantees for the limit points of the sequence:

(Nesterov and Polyak, 2006) If the level set ${\cal L}(x^{(0)}):=\{x|f(x)\leq f(x^{(0)})\}$ is bounded, then the following limit exists

The set $X^{*}$ of the limit points of this sequence is non-empty. Moreover this is a connected set such that for any $x\in X^{*}$ we have

Therefore the algorithm always converges to a set of points that are all second order local minima.

Third Order Necessary Condition

In this section we present a condition for a point to be a third order local minimum, and show that it is necessary and sufficient for a class of smooth functions. Proofs are deferred to Appendix A.1.

All the functions we consider satisfies the following natural smoothness conditions

Under this assumption, we state our conditions for a point to be a third order local minimum.

A point $x$ satisfy third-order necessary condition, if

For any $u$ that satisfy $u^{\top}(\nabla^{2}f(x))u=0$ , $[\nabla^{3}f(x)](u,u,u)=0$ .

We first note that this condition can be verified in polynomial time.

Conditions in Definition 4 can be verified in polynomial time given the gradients $\nabla f(x),\nabla^{2}f(x)$ and $\nabla^{3}f(x)$ .

It is easy to check whether $\nabla f(x)=0$ and $\nabla^{2}f(x)\succeq 0$ . We can also use SVD to compute the subspace ${\cal P}$ such that $u^{\top}(\nabla^{2}f(x))u=0$ if and only if $u\in{\cal P}$ .

Now we can compute the projection of $\nabla^{3}f(x)$ in the subspace ${\cal P}$ , and we claim the third condition is violated if and only if the projection is nonzero.

Given a function $f$ that satisfies Assumption 2, a point $x$ is third order optimal if and only if it satisfies Condition 4.

Before proving the theorem, we first show a bound on $f(y)$ and a Taylor’s expansion of $f$ at point $x$ .

The Lemma can be proved by integrating over the third order derivatives three times and bounding the differences. Details are deferred to Appendix A.1.

This lemmas allow us to ignore the fourth order term $\|y-x\|^{4}$ and focus on the order 3 Taylor expansion when $\|y-x\|$ is small. To prove Theorem 6, intuitively, the “only if” direction (local minimum to necessary condition) is easy because if any condition in Definition 4 is violated, we can use that particular derivative to find a direction that improves the function value. For the “if” direction (necessary condition to third order local minimum), the main challenge is to balance the contribution we get from the positive part of the Hessian matrix and the third order derivatives. For details see Appendix A.1.

Algorithm for Finding Third Order Optimal Points

We design an algorithm that is guaranteed to converge to a third order local minimum. Throughout this section we assume both Assumptions 1 and 2 Note that we actually only cares about a level set ${\cal L}=\{x|f(x)\leq f(x^{(0)})\}$ , as long as this set is bounded Assumptions 1 follows from Assumption 2.

The main intuition of the algorithm is similar to the proof of Theorem 6: the algorithm tries to make improvements using first, second or third order information. However, the nature of the third order condition makes it challenging for the algorithm to guarantee progress.

Consider a potential local minimum point $x$ . It is very easy to check whether $\nabla f(x)\neq 0$ or $\lambda_{min}(\nabla^{2}f(x))<0$ , and to make progress using the corresponding directions. However, to verify Condition 3 in Definition 4, we need to do it in the right subspace.

The naïve guess is that we should take the eigensubspace of $\nabla^{2}f(x)$ with eigenvalue at most 0. However, this is not correct because even if $x$ is a second order local minimum that does not satisfy the third order condition, it is still possible to have a sequence of $x^{(i)}$ ’s that converge to $x$ with $\nabla^{2}f(x^{(i)})$ all be strictly positive definite. Hence all the $x^{(i)}$ ’s appear to satisfy Condition 3 in Definition 4. We do not want to the algorithm to spend too much time around this point $x$ , so we need to identify a subspace that may have some positive eigenvalues. In order to make sure we can find a vector the contribution from third order term is larger than the second order term, we define competitive subspace below:

For any symmetric matrix $M$ , let its eigendecomposition be $M=\sum_{i=1}^{n}\lambda_{i}v_{i}v_{i}^{\top}$ (where $\lambda_{i}$ ’s are eigenvalues and $\|v_{i}\|=1$ ), we use ${\cal S}_{\tau}(M)$ to denote the span of eigenvectors with eigenvalue at most $\tau$ . That is

For any $Q>0$ , and any point $z$ , let the competitive subspace ${\cal S}(z)$ be the largest eigensubspace ${\cal S}_{\tau}(\nabla^{2}f(z))$ , such that if we let $C_{Q}(z)$ be the norm of the third order derivatives in this subspace

If no such subspace exists then let ${\cal S}(z)$ be empty and $C_{Q}(z)=0$ .

Similar to $\mu(z)$ as in Definition 3, $C_{Q}(z)$ can be viewed as how Condition 3 in Definition 4 is satisfied approximately. If both $\mu(z)$ and $C_{Q}(z)$ are then the point $z$ satisfies third order necessary conditions.

Intuitively, competitive subspace is a subspace where the eigenvalues of the Hessian are small, but the Frobenius norm of the third order derivative is large. Therefore we are likely to make progress using the third order information. The parameters in Definition 6 are set so that if there is a unit vector $u\in{\cal S}(z)$ such that $[\nabla^{3}f(z)](u,u,u)\geq\|\mbox{Proj}_{{\cal S}(z)}\nabla^{3}f(z)\|_{F}/Q$ (see Theorem 7), then we can find a new point where the sum of second, third and fourth order term can be bounded (see Lemma 2).

The competitive subspace in Definition 6 can be computed in polynomial time, see Algorithm 4. The main idea is that we can compute the eigendecomposition of the Hessian $\nabla^{2}f(z)=\sum_{i=1}^{n}\lambda_{i}v_{i}v_{i}^{\top}$ , and then there are only $n$ different subspaces ( $\mbox{span}\{v_{n}\},\mbox{span}\{v_{n-1},v_{n}\},$ $\ldots,\mbox{span}\{v_{1},v_{2},\ldots v_{n}\}$ ). We can enumerate over all of them, and check for which subspaces the norm of the third order derivative is large.

Now we are ready to state the algorithm. The algorithm is a combination of the cubic regularization algorithm and a third order step that tries to use the third order derivative in order to improve the function value in the competitive subspace.

Suppose we have the following approximation guarantee for Algorithm 3

There is a universal constant $B$ such that the expected number of iterations of Algorithm 3 is at most $2$ , and the output of Approx is a unit vector $u$ that satisfies $T(u,u,u)\geq\|\mbox{Proj}_{\cal S}T\|_{F}/Q$ for $Q=Bn^{1.5}$ .

The proof of this theorem follows directly from anti-concentration (see Appendix A.2. Notice that there are other algorithms that can potentially give better approximation (lower value of $Q$ ) which will improve the rate of our algorithm. However in this paper we do not try to optimize over dependencies over the dimension $n$ , that is left as an open problem.

By the choice of the parameters in the algorithm, we can get the following guarantee (which is analogous to Theorem 3):

If $C_{Q}(z)\geq Q(24\epsilon_{1}L)^{1/3}$ , $u$ is a unit vector in ${\cal S}(z)$ and $[\nabla^{3}f(z)](u,u,u)\geq\|\mbox{Proj}_{{\cal S}(z)}\nabla^{3}f(z)\|_{F}/Q$ . Let $x^{\prime}=z-C_{Q}(z)/LQ\cdot u$ . then we have

Let $\epsilon=C_{Q}(z)/LQ$ , then by Lemma 1 we know

Here $\epsilon_{1}=\|\nabla f(z)\|$ , and $\epsilon_{2}\leq\frac{C_{Q}(z)^{2}}{12LQ^{2}}$ by the construction of the subspace.

By the choice of parameters, we know the terms $\epsilon_{1}\epsilon,\epsilon_{2}\epsilon^{2}/2,L\epsilon^{4}/24$ are all bounded by $\frac{\epsilon^{3}C_{Q}(z)}{24Q}$ , therefore

Using this Lemma, and Theorem 3 for cubic regularization, we can show that both progress measure goes to 0 as the number of steps increase (this is analogous to Theorem 4).

Suppose the algorithm starts at $f(x_{0})$ , and $f$ has global min at $f(x^{*})$ . Then in one of the $t$ iterations we have

$\mu(z)\leq\left(\frac{12(f(x_{0})-f(x^{*})}{Rt}\right)^{1/3}$ .

$C_{Q}(z)\leq\max\left\{Q(24\|\nabla f(z)\|L)^{1/3},Q\left(\frac{24L^{3}(f(x_{0})-f(x^{*}))}{t}\right)^{1/4}\right\}.$

Recall $\mu(z)=\max\left\{\sqrt{\frac{1}{R}\|\nabla f(z)\|},-\frac{2}{3R}\lambda_{n}\nabla^{2}f(z)\right\}$ is intuitively measuring how much first and second order progress the algorithm can make. The value $C_{Q}(z)$ as defined in Definition 6 is a measure of how much third order progress the algorithm can make. The theorem shows both values goes to 0 as $t$ increases (note that even the first term $Q(24\|\nabla f(z)\|L)^{1/3}$ in the bound for $C_{Q}(z)$ goes to 0 because the $\|\nabla f(z)\|$ goes to 0).

By the guarantees of Theorem 3 and Lemma 2, we know the sequence of points $x^{(0)},z^{(0)},\ldots,x^{(i)},z^{(i)},\ldots$ has non-increasing function values. Also,

So there must be an iteration where $f(x^{(i)})-f(x^{(i-1)})\leq\frac{f(x_{0})-f(x^{*})}{t}$ .

If $\mu(z)>\left(\frac{12(f(x_{0})-f(x^{*})}{Rt}\right)^{1/3}$ , then Theorem 3 implies $f(x^{(i-1)})-f(z^{(i-1)})>\frac{f(x_{0})-f(x^{*})}{t}$ , which is impossible.

On the other hand if $C_{Q}(z)\leq\max\left\{Q(24\|\nabla f(z)\|L)^{1/3},Q\left(\frac{24L^{3}(f(x_{0})-f(x^{*}))}{t}\right)^{1/4}\right\}$ , then the third order step makes progress, and we know $f(z^{(i-1)})-f(x^{(i)})>\frac{f(x_{0})-f(x^{*})}{t}$ , which is again impossible.

We can also show that when $t$ goes to infinity the algorithm converges to a third order local minimum (similar to Theorem 5).

When $t$ goes to infinity, the values $f(x^{(t)})$ converge. If the level set ${\cal L}(f(x_{0}))=\{x|f(x)\leq f(x_{0})\}$ is compact, then the sequence of points $x^{(t)},z^{(t)}$ has nonempty limit points, and every limit point $x$ satisfies the third order necessary conditions.

By Theorem 3 and Lemma 2, we know the function value is non-increasing, and it has a lowerbound $f(x^{*})$ , so the value must converge.

The existence of limit points is guaranteed by the compactness of the level set. The only thing left to prove is that every limit point $x$ must satisfy the third order necessary conditions.

Notice that $f(x^{(0)})-\lim_{t\to\infty}f(x^{(t)}\geq\sum_{i=0}^{\infty}\frac{R\mu(z^{(i)})^{3}}{12}+\frac{C_{Q}(z^{(i)})^{4}}{24L^{3}Q^{4}}$ , so $\lim_{i\to\infty}\mu(z^{(i)}=0$ and $\lim_{i\to\infty}C_{Q}(z^{(i)})=0$ . Also we know further $\lim_{i\to\infty}\|z^{(i)}-x^{(i)}\|=0$ . Therefore wlog a limit point $x$ is also a limit point of sequence $z$ , and $\lim_{i\to\infty}\|\nabla f(z)\|=0$ . Also we know $H=\nabla^{2}f(x)$ is PSD, because otherwise points near $x$ will have nonzero $\mu(z^{(i)}$ and $x$ cannot be a limit point.

Now we only need to check the third order condition. Assume towards contradiction that third order condition is not true. The we know the Hessian has a subspace ${\cal P}$ with eigenvalues, and the third order derivative has norm at least $\epsilon$ in this subspace. By matrix perturbation theory, when $z$ is very close to $x$ , ${\cal P}$ is very close to ${\cal S}_{\epsilon}(z)$ for $\epsilon\to 0$ , on the other hand the third order tensor also converges to $\nabla^{3}f(x)$ (by Lipschitz condition), so ${\cal S}_{\epsilon}(z)$ will eventually be a competitive subspace and $C_{Q}(z)$ is at least $\epsilon/2$ for all $z$ . However this is impossible as $\lim_{i\to\infty}C_{Q}(z^{(i)})=0$ .

Note that not all third order local minimum can be the limit point for Algorithm 2. This is because if $f(x)$ has very large third order derivatives but relatively smaller Hessian, even though the Hessian might be positive definite (so $x$ is in fact a local minimum), Algorithm 2 may still find a non-empty competitive subspace, and will be able to reduce the function value and escape from the saddle point. An example is for the function $f(x)=x^{2}-100x^{3}+x^{4}$ , $x=0$ is a local minimum but the algorithm can escape from that and find the global minimum.

In the most general case it is hard to get a convergence rate for the algorithm because the function may have higher order local minima. However, if the function has nice properties then it is possible to prove polynomial rates of convergence.

We say a function is strict third order saddle, if there exists constants $\alpha,c_{1},c_{2},c_{3},c_{4}>0$ such that for any point $x$ one of the following is true:

There is a local minimum $x^{*}$ such that $\|x-x^{*}\|\leq c_{4}$ and the function is $\alpha$ -strongly convex restricted to the region $\{x|\|x-x^{*}\|\leq 2c_{4}\}$ .

This is a generalization of the strict saddle functions defined in Ge et al. (2015). Even if a function has degenerate saddle points, it may still satisfy this condition.

When $t\geq\mbox{poly}(n,L,R,Q,f(x_{0})-f(x^{*}))\max\{(1/c_{1})^{1.5},(1/c_{2})^{3},(1/c_{3})^{4.5}\}$ , there must be a point $z^{(i)}$ with $i\leq t$ that is in case 4 in Definition 7.

Therefore, when $t\geq\mbox{poly}(n,L,R,Q,f(x_{0})-f(x^{*}))\max\{(1/c_{1})^{1.5},(1/c_{2})^{3},(1/c_{3})^{4.5}\}$ , the point $z$ must satisfy

Therefore the first three cases in Definition 7 cannot happen and $z$ must be near a local minimum.

Hardness for Finding a fourth order Local Minimum

In this section we show it is hard to find a fourth order local minimum even if the function we consider is very well-behaved.

We say a function $f$ is well-behaved if it is infinite-order differentiable, and satisfies:

$f(x)$ has a global minimizer at some point $\|x\|\leq 1$ .

$f(x)$ has bounded first 5 derivatives for $\|x\|\leq 1$ .

For any direction $\|x\|=1$ , $f(tx)$ is increasing for $t\geq 1$ .

It is NP-hard to find a fourth order local minimum of a function $f(x)$ , even if $f$ is guaranteed to be well-behaved.

The main idea of the proof comes from the fact that we cannot even verify the nonnegativeness of a degree 4 polynomial (hence there are cases where we cannot verify whether a point is a fourth order local minimum or not).

Nesterov (2000); Hillar and Lim (2013) It is NP-hard to tell whether a degree 4 homogeneous polynomial $f(x)$ is nonnegative.

The NP hardness for nonnegativeness of degree 4 polynomial has been proved has been proved in several ways. In Nesterov (2000) the reduction is from the SUBSET SUM problem, which results in a polynomial that can have exponentially large coefficients and does not rule out FPTAS. However, the reduction in Hillar and Lim (2013) relies on the hardness of copositive matrices, which in turn depends on the hardness of INDEPENDENT SET(Dickinson and Gijben, 2014). This reduction gives a polynomial whose coefficients can be bounded by $\mbox{poly}(n)$ , and a polynomial gap that rules out FPTAS.

To prove Theorem 10 we only need to reduce the nonnegativeness problem in Theorem 11 to the problem of finding a fourth order local minimum. We can convert a degree 4 polynomial to a well behaved function by adding a degree 6 regularizer $\|x\|^{6}$ . We shall show when the degree 4 polynomial is nonnegative the $\vec{0}$ point is the only fourth order local minimum; when the degree 4 polynomial has negative directions then every fourth order local minimum must have negative function value. The details are deferred to Section A.3.

Conclusion

Complicated structures of saddle points are a major problem for optimization algorithms. In this paper we investigate the possibilities of using higher order derivatives in order to avoid degenerate saddle points. We give the first algorithm that is guaranteed to find a 3rd order local minimum, which can solve some problems caused by degenerate saddle points. However, we also show that the same ideas cannot be generalized to higher orders.

There are still many open problems related to degenerate saddle points and higher order optimization algorithms. Are there interesting class of functions that satisfies the strict 3rd order saddle property (Definition 7)? Can we design a 3rd order optimization algorithm for constrained optimization? We hope this paper inspires more research in these directions and eventually design efficient optimization algorithms whose performance do not suffer from degenerate saddle points.

References

Appendix A Omitted Proofs

The proof follows from integration from $x$ to $y$ repeatedly.

By the Lipschitz condition on third order derivative, we know

where $h(u)=\left[\int_{0}^{u}(\nabla^{3}f(x+v(y-x))-\nabla^{3}f(x))dv\right](y-x)$ , so $\|h(u)\|_{F}\leq\frac{L}{2}\|x-y\|^{2}$ .

Now we use the integral for the gradient of $f$ :

Let $g(t)=\left[\int_{0}^{t}h(u)du\right](y-x)$ , by the bound on $h(u)$ we know $\|g(t)\|\leq\frac{1}{6}\|x-y\|^{3}$ . Finally, we have

The last term is bounded by $\|y-x\|\int_{0}^{1}\|g(t)\|dt\leq\frac{L}{24}\|x-y\|^{4}$ .

Given a function $f$ that satisfies Assumption 2, a point $x$ is third order optimal if and only if it satisfies Condition 4.

(necessary condition $\to$ third order minimal) By Lemma 1 we know

Now let $\alpha$ be the smallest nonzero eigenvalue of $\nabla^{2}f(x)$ . Let $U$ be nullspace of $\nabla^{2}f(x)$ and $V$ be the orthogonal subspace. We break $\nabla^{3}f(x)$ into two tensors $G_{1}$ and $G_{2}$ , where $G_{1}$ is the projection to $V\otimes V\otimes V$ , $V\otimes V\otimes U$ (and its symmetries), and $G_{2}$ is the projection to $V\otimes U\otimes U$ (and its symmetries). Note that $\nabla^{3}f(x)=G_{1}+G_{2}$ because the projection on $U\otimes U\otimes U$ is 0 by the third condition. Let $\beta$ be the max injective norm of $G_{1}$ and $G_{2}$ .

Now we know for any $u\in U$ and $v\in V$ ,

Now, if $\epsilon<\beta/\alpha$ , because $\|u\|_{2}\leq\epsilon$ it is easy to see the sum of first two terms is at least $\frac{1}{3}\alpha\|v\|_{2}^{2}$ . Now we can take the mininum of

The minimum is achieved when $\|v\|=\|u\|^{2}\beta/\alpha$ and the minimum value is $-\|u\|^{4}\beta^{2}/6\alpha$ . Therefore when $\|u+v\|\leq\beta/\alpha$ we have

(third order minimal $\to$ necessary condition) Assume towards contradiction that the necessary condition is not satisfied, but the point $x$ is third order local optimal.

If the necessary condition is not satisfied, then one of the three cases happens:

In the first case the gradient $\nabla f(x)\neq 0$ . In this case, if we let $L^{\prime}$ be an upperbound the operator norms of the second and third order derivative, then we know

When $\epsilon\|\nabla f(x)\|\leq 1$ and $\epsilon(2L^{\prime}/3+L/24)\leq 1/2$ , we have

Therefore the point cannot be a third order local minimum.

In the second case, $\nabla f(x)=0$ , but $\lambda_{min}\nabla^{2}f(x)<0$ . Let $\|u\|=1$ be a unit vector such that $u^{\top}(\nabla^{2}f(x))u=-c<0$ . Let $L^{\prime}$ be the injective norm of $\nabla^{3}f(x)$ , then

Therefore whenever $\epsilon<\min\{\sqrt{3c/L},3c/4L^{\prime}\}$ we have $f(x+\epsilon u)\leq f(x)-\frac{c\epsilon^{2}}{4}$ . The point $x$ cannot be a third order local minimum.

The third case is if $\nabla f(x)=0$ , $\nabla^{2}f(x)$ is positive semidefinite, but there is a direction $\|u\|=1$ such that $u^{\top}(\nabla^{2}f(x))u=0$ , but $[\nabla^{3}f(x)](u,u,u)\neq 0$ . Without loss of generality we assume $[\nabla^{3}f(x)](u,u,u)=c>0$ (if it is negative we take $-u$ ), then

Therefore whenever $\epsilon<2c/L$ we have $f(x+\epsilon u)\leq f(x)-c\epsilon^{3}/12$ so $x$ cannot be a third order optimal.

A.2 Algorithm for Competitive Subspace, Proof of Theorem 7

We use the anti-concentration property for Gaussian random variables

In our case $d=3$ and we can choose some universal constant $\epsilon$ such that the probability of $p(x)$ being small is bounded by $1/3$ . It is easy to check that the variance is lowerbounded by the Frobenius norm squared, so

On the other hand with high probability we know the norm of the Gaussian $\hat{u}$ is at most $2\sqrt{n}$ . Therefore with probability at least $1/2$ , $|T(\hat{u},\hat{u},\hat{u})|\geq\epsilon\|\mbox{Proj}_{\cal S}T\|_{F}$ and $\|\hat{u}\|\leq 2\sqrt{n}$ , therefore $|T(u,u,u)|\geq\frac{\epsilon}{8n^{1.5}}\|\mbox{Proj}_{\cal S}T\|_{F}$ . Choosing $B=8/\epsilon$ implies the theorem.

A.3 Proof of Theorem 10

It is NP-hard to find a fourth order local minimum of a function $f(x)$ , even if $f$ is guaranteed to be well-behaved.

We reduce the problem of verifying nonnegativenss of degree 4 polynomial to the problem of finding fourth order local minimum.

Now we define the function $g(x)=f(x)+\|x\|^{6}$ . We first show that this function is well-behaved.

Next we show if $f(x)$ is nonnegative, then $\vec{0}$ is the unique fourth order local minimizer.

If $f(x)$ is nonnegative, then $\vec{0}$ is the unique fourth order local minimizer of $g(x)$ .

Suppose $x\neq 0$ is a local minimizer of $g(x)$ of order at least 1. Let $u=x/\|x\|$ . We consider the function $g(tu)=f(u)t^{4}+t^{6}$ . Clearly the only first order local minimizer of $g(tu)$ is at $t=0$ . Therefore $x$ cannot be a first order local minimizer of $g(x)$ .

Finally, we show if $f(x)$ has a negative direction, then all the local minimizer of $g(x)$ must have negative value in $f$ .

If $f(x)$ is negative for some $x$ , then if $x$ is a fourth order local minimum of $g(x)$ then $f(x)<0$ .

Suppose $x\neq 0$ is a fourth order local minimum of $g(x)$ . Then at least $t=1$ should be a fourth order local minimum of $g(tx)=f(x)t^{4}+t^{6}\|x\|^{6}$ . This is only possible if $f(x)<0$ .

On the other hand, for $x=0$ , suppose $\|z\|=1$ is a direction where $f(z)<0$ , then $f(x)-f(x+tz)=f(z)t^{4}-t^{6}=\Omega(t^{4})$ , so $x=0$ is not a fourth order local minimum.

The theorem follows immediately from the three claims.