How to Escape Saddle Points Efficiently

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan

Introduction

where η>0\eta>0 is a step size. Gradient descent and its variants (e.g., stochastic gradient) are widely used in machine learning applications due to their favorable computational properties. This is notably true in the deep learning setting, where gradients can be computed efficiently via back-propagation (Rumelhart et al., 1988).

In non-convex settings, however, convergence to first-order stationary points is not satisfactory. For non-convex functions, first-order stationary points can be global minima, local minima, saddle points or even local maxima. Finding a global minimum can be hard, but fortunately, for many non-convex problems, it is sufficient to find a local minimum. Indeed, a line of recent results show that, in many problems of interest, either all local minima are global minima (e.g., in tensor decomposition (Ge et al., 2015), dictionary learning (Sun et al., 2016a), phase retrieval (Sun et al., 2016b), matrix sensing (Bhojanapalli et al., 2016; Park et al., 2016), matrix completion (Ge et al., 2016), and certain classes of deep neural networks (Kawaguchi, 2016)). Moreover, there are suggestions that in more general deep newtorks most of the local minima are as good as global minima (Choromanska et al., 2014).

On the other hand, saddle points (and local maxima) can correspond to highly suboptimal solutions in many problems (see, e.g., Jain et al., 2015; Sun et al., 2016b). Furthermore, Dauphin et al. (2014) argue that saddle points are ubiquitous in high-dimensional, non-convex optimization problems, and are thus the main bottleneck in training neural networks. Standard analysis of gradient descent cannot distinguish between saddle points and local minima, leaving open the possibility that gradient descent may get stuck at saddle points, either asymptotically or for a sufficiently long time so as to make training times for arriving at a local minimum infeasible. Ge et al. (2015) showed that by adding noise at each step, gradient descent can escape all saddle points in a polynomial number of iterations, provided that the objective function satisfies the strict saddle property (see Assumption A2). Lee et al. (2016) proved that under similar conditions, gradient descent with random initialization avoids saddle points even without adding noise. However, this result does not bound the number of steps needed to reach a local minimum.

Though these results establish that gradient descent can find local minima in a polynomial number of iterations, they are still far from being efficient. For instance, the number of iterations required in Ge et al. (2015) is at least Ω(d4)\Omega(d^{4}), where dd is the underlying dimension. This is significantly suboptimal compared to rates of convergence to first-order stationary points, where the iteration complexity is dimension-free. This motivates the following question: Can gradient descent escape saddle points and converge to local minima in a number of iterations that is (almost) dimension-free?

In order to answer this question formally, this paper investigates the complexity of finding ϵ\epsilon-second-order stationary points. For ρ\rho-Hessian Lipschitz functions (see Definition 5), these points are defined as (Nesterov and Polyak, 2006):

Under the assumption that all saddle points are strict (i.e., for any saddle point xs\mathbf{x}_{s}, λmin(2f(xs))<0\lambda_{\min}(\nabla^{2}f(\mathbf{x}_{s}))<0), all second-order stationary points (ϵ=0\epsilon=0) are local minima. Therefore, convergence to second-order stationary points is equivalent to convergence to local minima.

As many real learning problems present strong local geometric properties, similar to strong convexity in the global setting (see, e.g. Bhojanapalli et al., 2016; Sun and Luo, 2016; Zheng and Lafferty, 2016), it is important to note that our analysis naturally takes advantage of such local structure. We show that when local strong convexity is present, the ϵ\epsilon-dependence goes from a polynomial rate, 1/ϵ21/\epsilon^{2}, to linear convergence, log(1/ϵ)\log(1/\epsilon). As an example, we show that sharp global convergence rates can be obtained for matrix factorization as a direct consequence of our analysis.

This paper presents the first sharp analysis that shows that (perturbed) gradient descent finds an approximate second-order stationary point in at most polylog(d)polylog(d) iterations, thus escaping all saddle points efficiently. Our main technical contributions are as follows:

Under a strict-saddle condition (see Assumption A2), this convergence result directly applies for finding local minima. This means that gradient descent can escape all saddle points with only logarithmic overhead in runtime.

When the function has local structure, such as local strong convexity (see Assumption A3.a), the above results can be further improved to linear convergence. We give sharp rates that are comparable to previous problem-specific local analysis of gradient descent with smart initialization (see Section 1.2).

All the above results rely on a new characterization of the geometry around saddle points: points from where gradient descent gets stuck at a saddle point constitute a thin “band.” We develop novel techniques to bound the volume of this band. As a result, we can show that after a random perturbation the current point is very unlikely to be in the “band”; hence, efficient escape from the saddle point is possible (see Section 5).

2 Related Work

Over the past few years, there have been many problem-specific convergence results for non-convex optimization. One line of work requires a smart initialization algorithm to provide a coarse estimate lying inside a local neighborhood, from which popular local search algorithms enjoy fast local convergence (see, e.g., Netrapalli et al., 2013; Candes et al., 2015; Sun and Luo, 2016; Bhojanapalli et al., 2016). While there are not many results that show global convergence for non-convex problems, Jain et al. (2015) show that gradient descent yields global convergence rates for matrix square-root problems. Although these results give strong guarantees, the analyses are heavily tailored to specific problems, and it is unclear how to generalize them to a wider class of non-convex functions.

For general non-convex optimization, there are a few previous results on finding second-order stationary points. These results can be divided into the following three categories, where, for simplicity of presentation, we only highlight dependence on dimension dd and ϵ\epsilon, assuming that all other problem parameters are constant from the point of view of iteration complexity:

Hessian-based: Traditionally, only second-order optimization methods were known to converge to second-order stationary points. These algorithms rely on computing the Hessian to distinguish between first- and second-order stationary points. Nesterov and Polyak (2006) designed a cubic regularization algorithm which converges to an ϵ\epsilon-second-order stationary point in O(1/ϵ1.5)O(1/\epsilon^{1.5}) iterations. Trust region algorithms (Curtis et al., 2014) can also achieve the same performance if the parameters are chosen carefully. These algorithms typically require the computation of the inverse of the full Hessian per iteration, which can be very expensive.

Hessian-vector-product-based: A number of recent papers have explored the possibility of using only Hessian-vector products instead of full Hessian information in order to find second-order stationary points. These algorithms require a Hessian-vector product oracle: given a function ff, a point x\mathbf{x} and a direction u\mathbf{u}, the oracle returns 2f(x)u\nabla^{2}f(\mathbf{x})\cdot\mathbf{u}. Agarwal et al. (2016) and Carmon et al. (2016) presented accelerated algorithms that can find an ϵ\epsilon-second-order stationary point in O(logd/ϵ7/4)O(\log d/\epsilon^{7/4}) steps. Also, Carmon and Duchi (2016) showed by running gradient descent as a subroutine to solve the subproblem of cubic regularization (which requires Hessian-vector product oracle), it is possible to find an ϵ\epsilon-second-order stationary pointin O(logd/ϵ2)O(\log d/\epsilon^{2}) iterations. In many applications such an oracle can be implemented efficiently, in roughly the same complexity as the gradient oracle. Also, when the function has a Hessian Lipschitz property such an oracle can be approximated by differentiating the gradients at two very close points (although this may suffer from numerical issues, thus is seldom used in practice).

Gradient-based: Another recent line of work shows that it is possible to converge to a second-order stationary point without any use of the Hessian. These methods feature simple computation per iteration (only involving gradient operations), and are closest to the algorithms used in practice. Ge et al. (2015) showed that stochastic gradient descent could converge to a second-order stationary point in poly(d/ϵ)\text{poly}(d/\epsilon) iterations, with polynomial of order at least four. This was improved in Levy (2016) to O(d3poly(1/ϵ))O(d^{3}\cdot\text{poly}(1/\epsilon)) using normalized gradient descent. The current paper improves on both results by showing that perturbed gradient descent can actually find an ϵ\epsilon-second-order stationary point in O(polylog(d)/ϵ2)O(\text{polylog}(d)/\epsilon^{2}) steps, which matches the guarantee for converging to first-order stationary points up to polylog factors.

Preliminaries

In this section, we will first introduce our notation, and then present some definitions and existing results in optimization which will be used later.

2 Gradient Descent

A twice-differentiable function f()f(\cdot) is α\alpha-strongly convex if x, λmin(2f(x))α\forall\mathbf{x},~{}\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\geq\alpha

Such smoothness guarantees imply that the gradient can not change too rapidly, and strong convexity ensures that there is a unique stationary point (and hence a global minimum). Standard analysis using these two properties shows that gradient descent converges linearly to a global optimum x\mathbf{x}^{\star} (see e.g. (Bubeck et al., 2015)).

In a more general setting, we no longer have convexity, let alone strong convexity. Though global optima are difficult to achieve in such a setting, it is possible to analyze convergence to first-order stationary points.

For a differentiable function f()f(\cdot), we say that x\mathbf{x} is a first-order stationary point if f(x)=0\|{\nabla f(\mathbf{x})}\|=0; we also say x\mathbf{x} is an ϵ\epsilon-first-order stationary point if f(x)ϵ\|{\nabla f(\mathbf{x})}\|\leq\epsilon.

Note that the iteration complexity does not depend explicitly on intrinsic dimension; in the literature this is referred to as “dimension-free optimization.”

A first-order stationary point can be either a local minimum or a saddle point or a local maximum. For minimization problems, saddle points and local maxima are undesirable, and we abuse nomenclature to call both of them “saddle points” in this paper. The formal definition is as follows:

For a differentiable function f()f(\cdot), we say that x\mathbf{x} is a local minimum if x\mathbf{x} is a first-order stationary point, and there exists ϵ>0\epsilon>0 so that for any y\mathbf{y} in the ϵ\epsilon-neighborhood of x\mathbf{x}, we have f(x)f(y)f(\mathbf{x})\leq f(\mathbf{y}); we also say x\mathbf{x} is a saddle point if x\mathbf{x} is a first-order stationary point but not a local minimum. For a twice-differentiable function f()f(\cdot), we further say a saddle point x\mathbf{x} is strict (or non-degenerate) if λmin(2f(x))<0\lambda_{\min}(\nabla^{2}f(\mathbf{x}))<0.

For a twice-differentiable function f()f(\cdot), we know a saddle point x\mathbf{x} must satify λmin(2f(x))0\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq 0. Intuitively, for saddle point x\mathbf{x} to be strict, we simply rule out the undetermined case λmin(2f(x))=0\lambda_{\min}(\nabla^{2}f(\mathbf{x}))=0, where Hessian information alone is not enough to check whether x\mathbf{x} is a local minimum or saddle point. In most non-convex problems, saddle points are undesirable.

To escape from saddle points and find local minima in a general setting, we move both the assumptions and guarantees in Theorem 2 one order higher. In particular, we require the Hessian to be Lipschitz:

A twice-differentiable function f()f(\cdot) is ρ\rho-Hessian Lipschitz if:

That is, Hessian can not change dramatically in terms of spectral norm. We also generalize the definition of first-order stationary point to higher order:

For a ρ\rho-Hessian Lipschitz function f()f(\cdot), we say that x\mathbf{x} is a second-order stationary point if f(x)=0\|{\nabla f(\mathbf{x})}\|=0 and λmin(2f(x))0\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\geq 0; we also say x\mathbf{x} is ϵ\epsilon-second-order stationary point if:

Second-order stationary points are very important in non-convex optimization because when all saddle points are strict, all second-order stationary points are exactly local minima.

Note that the literature sometime defines ϵ\epsilon-second-order stationary point by two independent error terms; i.e., letting f(x)ϵg\|{\nabla f(\mathbf{x})}\|\leq\epsilon_{g} and λmin(2f(x))ϵH\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\geq-\epsilon_{H}. We instead follow the convention of Nesterov and Polyak (2006) by choosing ϵH=ρϵg\epsilon_{H}=\sqrt{\rho\epsilon_{g}} to reflect the natural relations between the gradient and the Hessian. This definition of ϵ\epsilon-second-order stationary point can also differ by reparametrization (and scaling), e.g. Nesterov and Polyak (2006) use ϵ=ϵ/ρ\epsilon^{\prime}=\sqrt{\epsilon/\rho}. We choose our parametrization so that the first requirement of ϵ\epsilon-second-order stationary point coincides with the requirement of ϵ\epsilon-first-order stationary point, for a fair comparison of our result with Theorem 2.

Main Result

In this section we show that it possible to modify gradient descent in a simple way so that the resulting algorithm will provably converge quickly to a second-order stationary point.

We first state the assumptions that we require.

The Hessian Lipschitz condition ensures that the function is well-behaved near a saddle point, and the small perturbation we add will suffice to allow the subsequent gradient updates to escape from the saddle point. More formally, we have:

We believe that the dependence on at least one logd\log d factor in the iteration complexity is unavoidable in the non-convex setting, as our result can be directly applied to the principal component analysis problem, for which the best known runtimes (for the power method or Lanczos method) incur a logd\log d factor. Establishing this formally is still an open question however.

If we are able to make stronger assumptions on the objective function we are able to strengthen our main result. This further analysis is presented in the next section.

In many real applications, objective functions further admit the property that all saddle points are strict (Ge et al., 2015; Sun et al., 2016a, b; Bhojanapalli et al., 2016; Ge et al., 2016). In this case, all second-order stationary points are local minima and hence convergence to second-order stationary points (Theorem 3) is equivalent to convergence to local minima.

To state this result formally, we introduce a robust version of the strict saddle property (cf. Ge et al., 2015):

Function f()f(\cdot) is (θ,γ,ζ)(\theta,\gamma,\zeta)-strict saddle. That is, for any x\mathbf{x}, at least one of following holds:

λmin(2f(x))γ\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\gamma.

x\mathbf{x} is ζ\zeta-close to X\mathcal{X}^{\star} — the set of local minima.

Note although Corollary 4 only explicitly asserts that the output will lie within some fixed radius ζ\zeta from a local minimum. In many real applications, we can further write ζ\zeta as a function ζ()\zeta(\cdot) of gradient threshold θ\theta, so that when θ\theta decreases, ζ(θ)\zeta(\theta) decreases linearly or polynomially depending on θ\theta. Meanwhile, parameter γ\gamma is always nondecreasing when θ\theta decreases due to the nature of this strict saddle definition. Therefore, in these cases, the above corollary further gives a convergence rate to a local minimum.

2 Functions with Strong Local Structure

The convergence rate in Theorem 3 is polynomial in ϵ\epsilon, which is similar to that of Theorem 2, but is worse than the rate of Theorem 1 because of the lack of strong convexity. Although global strong convexity does not hold in the non-convex setting that is our focus, in many machine learning problems the objective function may have a favorable local structure in the neighborhood of local minima (Ge et al., 2015; Sun et al., 2016a, b; Sun and Luo, 2016). Exploiting this property can lead to much faster convergence (linear convergence) to local minima. One such property that ensures such convergence is a local form of smoothness and strong convexity:

In a ζ\zeta-neighborhood of the set of local minima X\mathcal{X}^{\star}, the function f()f(\cdot) is α\alpha-strongly convex, and β\beta-smooth.

In a ζ\zeta-neighborhood of the set of local minima X\mathcal{X}^{\star}, the function f()f(\cdot) satisfies a (α,β)(\alpha,\beta)-regularity condition if for any x\mathbf{x} in this neighborhood:

Here PX()\mathcal{P}_{\mathcal{X}^{\star}}(\cdot) is the projection on to the set X\mathcal{X}^{\star}. Note (α,β)(\alpha,\beta)-regularity condition is more general and is directly implied by standard β\beta-smooth and α\alpha-strongly convex conditions. This regularity condition commonly appears in low-rank problems such as matrix sensing and matrix completion, and has been used in Bhojanapalli et al. (2016); Zheng and Lafferty (2016), where local minima form a connected set, and where the Hessian is strictly positive only with respect to directions pointing outside the set of local minima.

Gradient descent naturally exploits local structure very well. In Algorithm 3, we first run Algorithm 2 to output a point within the neighborhood of a local minimum, and then perform standard gradient descent with step size 1β\frac{1}{\beta}. We can then prove the following theorem:

Theorem 5 says that if strong local structure is present, the convergence rate can be boosted to linear convergence (log1ϵ\log\frac{1}{\epsilon}). In this theorem we see that sequence of iterations can be decomposed into two phases. In the first phase, perturbed gradient descent finds a ζ\zeta-neighborhood by Corollary 4. In the second phase, standard gradient descent takes us from ζ\zeta to ϵ\epsilon-close to a local minimum. Standard gradient descent and Assumption A3.a (or A3.b) make sure that the iterate never steps out of a ζ\zeta-neighborhood in this second phase, giving a result similar to Theorem 1 with linear convergence.

Finally, we note our choice of local conditions (Assumption A3.a and A3.b) are not special. The interested reader can refer to Karimi et al. (2016) for other relaxed and alternative notions of convexity, which can also be potentially combined with Assumptions \refas:smoothLipand\refas:strictsaddle\ref{as:smooth_Lip}and\ref{as:strict_saddle} to yield convergence results of a similar flavor as that of Theorem 5.

Example — Matrix Factorization

As a simple example to illustrate how to apply our general theorems to specific non-convex optimization problems, we consider a symmetric low-rank matrix factorization problem, based on the following objective function:

The following two lemmas show that the objective function in Eq. (2) satisfies the geometric assumptions A1, A2,and A3.b. Moreover, all local minima are global minima.

For any Γσ1\Gamma\geq\sigma^{\star}_{1}, the function f(U)f(\mathbf{U}) defined in Eq. (2) is 8Γ8\Gamma-smooth and 12Γ1/212\Gamma^{1/2}-Hessian Lipschitz, inside the region {UU2<Γ}\{\mathbf{U}|\|{\mathbf{U}}\|^{2}<\Gamma\}.

For function f(U)f(\mathbf{U}) defined in Eq.(2), all local minima are global minima. The set of global minima is X={VRRR=RR=I}\mathcal{X}^{\star}=\{\mathbf{V}^{\star}\mathbf{R}|\mathbf{R}\mathbf{R}^{\top}=\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}\}. Furthermore, f(U)f(\mathbf{U}) satisfies:

(124(σr)3/2,13σr,13(σr)1/2)(\frac{1}{24}(\sigma^{\star}_{r})^{3/2},\frac{1}{3}\sigma^{\star}_{r},\frac{1}{3}(\sigma^{\star}_{r})^{1/2})-strict saddle property.

(23σr,10σ1)(\frac{2}{3}\sigma^{\star}_{r},10\sigma^{\star}_{1})-regularity condition in 13(σr)1/2\frac{1}{3}(\sigma^{\star}_{r})^{1/2} neighborhood of X\mathcal{X}^{\star}.

One caveat is that since the objective function is actually a fourth-order polynomial with respect to U\mathbf{U}, the smoothness and Hessian Lipschitz parameters from Lemma 6 naturally depend on U\|{\mathbf{U}}\|. Fortunately, we can further show that gradient descent (even with perturbation) does not increase U\|{\mathbf{U}}\| beyond O(max{U0,(σ1)1/2})O(\max\{\|{\mathbf{U}_{0}}\|,(\sigma^{\star}_{1})^{1/2}\}). Then, applying Theorem 5 gives:

There exists an absolute constant cmaxc_{\max} such that the following holds. For the objective function in Eq. (2), for any δ>0\delta>0 and constant ccmaxc\leq c_{\max}, and for Γ1/2:=2max{U0,3(σ1)1/2}\Gamma^{1/2}\mathrel{\mathop{:}}=2\max\{\|{\mathbf{U}_{0}}\|,3(\sigma^{\star}_{1})^{1/2}\}, the output of PGDli(U0,8Γ,12Γ1/2,(σr)2108Γ1/2,c,δ,rΓ22,10σ1)\text{PGDli}(\mathbf{U}_{0},8\Gamma,12\Gamma^{1/2},\frac{(\sigma^{\star}_{r})^{2}}{108\Gamma^{1/2}},c,\delta,\frac{r\Gamma^{2}}{2},10\sigma^{\star}_{1}), will be ϵ\epsilon-close to the global minimum set X\mathcal{X}^{\star}, with probability 1δ1-\delta, after the following number of iterations:

Theorem 8 establishes global convergence of perturbed gradient descent from an arbitrary initial point U0\mathbf{U}_{0}, including exact saddle points. Suppose we initialize at U0=0\mathbf{U}_{0}=0, then our iteration complexity becomes:

Proof Sketch for Theorem 3

In this section we will present the key ideas underlying the main result of this paper (Theorem 3). We will first argue the correctness of Theorem 3 given two important intermediate lemmas. Then we turn to the main lemma, which establishes that gradient descent can escape from saddle points quickly. We present full proofs of all these results in Appendix A. Throughout this section, we use η,r,gthres,fthres\eta,r,g_{\text{thres}},f_{\text{thres}} and tthrest_{\text{thres}} as defined in Algorithm 2.

Recall that an ϵ\epsilon-second-order stationary point is a point with a small gradient, and where the Hessian does not have a significant negative eigenvalue. Suppose we are currently at an iterate xt\mathbf{x}_{t} that is not an ϵ\epsilon-second-order stationary point; i.e., it does not satisfy the above properties. There are two possibilities:

Gradient is large: f(xt)gthres\|{\nabla f(\mathbf{x}_{t})}\|\geq g_{\text{thres}}, or

Around saddle point: f(xt)gthres\|{\nabla f(\mathbf{x}_{t})}\|\leq g_{\text{thres}} and λmin(2f(xt))ρϵ\lambda_{\min}(\nabla^{2}f(\mathbf{x}_{t}))\leq-\sqrt{\rho\epsilon}.

The following two lemmas address these two cases respectively. They guarantee that perturbed gradient descent will decrease the function value in both scenarios.

(informal) Assume that f()f(\cdot) satisfies A1, If xt\mathbf{x}_{t} satisfies f(xt)gthres\|{\nabla f(\mathbf{x}_{t})}\|\leq g_{\text{thres}} and λmin(2f(xt))ρϵ\lambda_{\min}(\nabla^{2}f(\mathbf{x}_{t}))\leq-\sqrt{\rho\epsilon}, then adding one perturbation step followed by tthrest_{\text{thres}} steps of gradient descent, we have f(xt+tthres)f(xt)fthresf(\mathbf{x}_{t+t_{\text{thres}}})-f(\mathbf{x}_{t})\leq-f_{\text{thres}} with high probability.

We see that Algorithm 2 is designed so that Lemma 10 can be directly applied. According to these two lemmas, perturbed gradient descent will decrease the function value either in the case of a large gradient, or around strict saddle points. Computing the average decrease per step in function value yields the total iteration complexity. Since Algorithm 2 only terminate when the function value decreases too slowly, this guarantees that the output must be ϵ\epsilon-second-order stationary point (see Appendix A for formal proofs).

2 Main Lemma: Escaping from Saddle Points Quickly

The proof of Lemma 9 is straightforward and follows from traditional analysis. The key technical contribution of this paper is the proof of Lemma 10, which gives a new characterization of the geometry around saddle points.

The major challenge here is to bound the volume of this high-dimensional non-flat “pancake” shaped region Xstuck\mathcal{X}_{\text{stuck}}. A crude approximation of this “pancake” by a flat “disk” loses polynomial factors in the dimensionalilty, which gives a suboptimal rate. Our proof relies on the following crucial observation: Although we do not know the explicit form of the stuck region, we know it must be very “thin,” therefore it cannot have a large volume. The informal statement of the lemma is as follows:

Conclusion

This paper presents the first (nearly) dimension-free result for gradient descent in a general non-convex setting. We present a general convergence result and show how it can be further strengthened when combined with further structure such as strict saddle conditions and/or local regularity/convexity.

There are still many related open problems. First, in the presence of constraints, it is worthwhile to study whether gradient descent still admits similar sharp convergence results. Another important question is whether similar techniques can be applied to accelerated gradient descent. We hope that this result could serve as a first step towards a more general theory with strong, almost dimension free guarantees for non-convex optimization.

References

Appendix A Detailed Proof of Main Theorem

In this section, we give detailed proof for the main theorem. We will first state two key lemmas that show how the algorithm can make progress when the gradient is large or near a saddle point, and show how the main theorem follows from the two lemmas. Then we will focus on the novel technique in this paper: how to analyze gradient descent near saddle point.

In order to prove the main theorem, we need to show that the algorithm will not be stuck at any point that either has a large gradient or is near a saddle point. This idea is similar to previous works (e.g.[Ge et al., 2015]). We first state a standard Lemma that shows if the current gradient is large, then we make progress in function value.

By Assumption A1 and its property, we have:

The next lemma says that if we are “close to a saddle points”, i.e., we are at a point where the gradient is small, but the Hessian has a reasonably large negative eigenvalue. This is the main difficulty in the analysis. We show a perturbation followed by small number of standard gradient descent steps can also make the function value decrease with high probability.

The proof of this lemma is deferred to Section A.2. Using this Lemma, we can then prove the main Theorem.

In this proof, we will actually achieve some point satisfying following condition:

Since c1c\leq 1, χ1\chi\geq 1, we have cχ21\frac{\sqrt{c}}{\chi^{2}}\leq 1, which implies any x\mathbf{x} satisfy Eq.(3) is also a ϵ\epsilon-second-order stationary point.

Starting from x0\mathbf{x}_{0}, we know if x0\mathbf{x}_{0} does not satisfy Eq.(3), there are only two possibilities:

f(x0)>gthres\|{\nabla f(\mathbf{x}_{0})}\|>g_{\text{thres}}: In this case, Algorithm 2 will not add perturbation. By Lemma 12:

f(x0)gthres\|{\nabla f(\mathbf{x}_{0})}\|\leq g_{\text{thres}}: In this case, Algorithm 2 will add a perturbation of radius rr, and will perform gradient descent (without perturbations) for the next tthrest_{\text{thres}} steps. Algorithm 2 will then check termination condition. If the condition is not met, we must have:

This means on average every step decreases the function value by

By union bound, for all these perturbations, with high probability Lemma 13 is satisfied. As a result Algorithm 2 works correctly. The probability of that is at least

A.2 Main Lemma: Escaping from Saddle Points Quickly

Now we prove the main lemma (Lemma 13), which shows near a saddle point, a small perturbation followed by a small number of gradient descent steps will decrease the function value with high probability. This is the main step where we need new analysis, as the analysis previous works (e.g.[Ge et al., 2015]) do not work when the step size and perturbation do not depend polynomially in dimension dd.

However, we do not know the exact form of the function near the saddle point, so the escaping region does not have a clean analytic description. Explicitly computing its volume can be very difficult. Our proof rely on a crucial observation: although we do not know the shape of the stuck region, we know the “width” of it must be small, therefore it cannot have a large volume. We will formalize this intuition later in Lemma 15.

The proof of the main lemma requires carefully balancing between different quantities including function value, gradient, parameter space and number of iterations. For clarify, we define following scalar quantities, which serve as the “units” for function value, gradient, parameter space, and time (iterations). We will use these notations throughout the proof.

For simplicity of later proofs, we first restate Lemma 13 into a slightly more general form as follows. Lemma 13 is directly implied following lemma.

Now we will formalize the intuition that the “width” of stuck region is small.

By adding perturbation, in worst case we increase function value by:

The second last inequality is by the property of Gamma function that Γ(x+1)Γ(x+1/2)<x+12\frac{\Gamma(x+1)}{\Gamma(x+1/2)}<\sqrt{x+\frac{1}{2}} as long as x0x\geq 0. Therefore, with at least probability 1δ1-\delta, x0∉Xstuck\mathbf{x}_{0}\not\in\mathcal{X}_{\text{stuck}}. In this case, we have:

A.3 Bounding the Width of Stuck Region

In order to prove Lemma 15, we do it in two steps:

We first show if gradient descent from u0\mathbf{u}_{0} does not decrease function value, then all the iterates must lie within a small ball around u0\mathbf{u}_{0} (Lemma 16).

If gradient descent starting from a point u0\mathbf{u}_{0} stuck in a small ball around a saddle point, then gradient descent from w0\mathbf{w}_{0} (moving u0\mathbf{u}_{0} along e1\mathbf{e}_{1} direction for at least a certain distance), will decreases the function value (Lemma 17).

Now we are ready to state two lemmas formally:

Note the conclusion T<c^TT<\hat{c}\mathscr{T} in Lemma 17 equivalently means:

In this case, by Lemma 16, we know uT1O(S)\|{\mathbf{u}_{T^{\prime}-1}}\|\leq O(\mathscr{S}), and therefore

In this case, by Lemma 16, we know utO(S)\|{\mathbf{u}_{t}}\|\leq O(\mathscr{S}) for all tTt\leq T^{\star}. Define

By Lemma 17, we immediately have TTT^{\prime\prime}\leq T^{\star}. Apply same argument as in first case, we have for all T1cmaxTT\geq\frac{1}{c_{\max}}\mathscr{T} that f(wT)f(w0)f(wT)f(w0)2.5Ff(\mathbf{w}_{T})-f(\mathbf{w}_{0})\leq f(\mathbf{w}_{T^{\star}})-f(\mathbf{w}_{0})\leq-2.5\mathscr{F}. ∎

Next we finish the proof by proving Lemma 16 and Lemma 17.

A.3.1 Proof of Lemma 16

In Lemma 16, we hope to show if the function value did not decrease, then all the iterations must be constrained in a small ball. We do that by analyzing the dynamics of the iterations, and we decompose the dd-dimensional space into two subspaces: a subspace S\mathcal{S} which is the span of significantly negative eigenvectors of the Hessian and its orthogonal compliment.

We will now compute the projections of ut\mathbf{u}_{t} in different eigenspaces of H\mathcal{H}. Let S\mathcal{S} be the subspace spanned by all eigenvectors of H\mathcal{H} whose eigenvalue is less than γc^log(dκδ)-\frac{\gamma}{\hat{c}\log(\frac{d\kappa}{\delta})}. Sc\mathcal{S}^{c} denotes the subspace of remaining eigenvectors. Let αt\bm{\alpha}_{t} and βt\bm{\beta}_{t} denote the projections of ut\mathbf{u}_{t} onto S\mathcal{S} and Sc\mathcal{S}^{c} respectively i.e., αt=PSut\bm{\alpha}_{t}=\mathcal{P}_{\mathcal{S}}\mathbf{u}_{t}, and βt=PScut\bm{\beta}_{t}=\mathcal{P}_{\mathcal{S}^{c}}\mathbf{u}_{t}. We can decompose the update equations Eq.(5) into:

By definition of TT, we know for all t<Tt<T:

Combined with the fact ut2=αt2+βt2\|{\mathbf{u}_{t}}\|^{2}=\|{\bm{\alpha}_{t}}\|^{2}+\|{\bm{\beta}_{t}}\|^{2}, we have:

where last inequality is due to f(0)3G\|{\nabla f(0)}\|\leq 3\mathscr{G}. This gives:

Clearly Eq.(9) is true for t=0t=0 since u0=0\mathbf{u}_{0}=0. Suppose Eq.(9) is true for all τt\tau\leq t. We will now show that Eq.(9) holds for t+1<Tt+1<T. Note that by the definition of S\mathscr{S}, F\mathscr{F} and G\mathscr{G}, we only need to bound the last two terms of Eq.(8) i.e., βt+1\|{\bm{\beta}_{t+1}}\| and βt+1Hβt+1\bm{\beta}_{t+1}^{\top}\mathcal{H}\bm{\beta}_{t+1}.

By update function of βt\bm{\beta}_{t} (Eq.(7)), we have:

and the norm of δt\bm{\delta}_{t} is bounded as follows:

𝑡1\|{\bm{\beta}_{t+1}}\|: Combining Eq.(10), Eq.(11) and using the definiton of Sc\mathcal{S}^{c}, we have:

Since β0=0\|{\bm{\beta}_{0}}\|=0 and t+1Tt+1\leq T, by applying above relation recurrsively, we have:

The second last inequality is because Tc^TT\leq\hat{c}\mathscr{T} by definition, so that (1+ηγc^log(dκδ))T3(1+\frac{\eta\gamma}{\hat{c}\log(\frac{d\kappa}{\delta})})^{T}\leq 3.

Let the eigenvalues of H\mathcal{H} to be {λi}\{\lambda_{i}\}, then for any τ1,τ20\tau_{1},\tau_{2}\geq 0, we know the eigenvalues of (IηH)τ1H(IηH)τ2(\mathbf{I}-\eta\mathcal{H})^{\tau_{1}}\mathcal{H}(\mathbf{I}-\eta\mathcal{H})^{\tau_{2}} are {λi(1ηλi)τ1+τ2}\{\lambda_{i}(1-\eta\lambda_{i})^{\tau_{1}+\tau_{2}}\}. Let gt(λ):=λ(1ηλ)tg_{t}(\lambda)\mathrel{\mathop{:}}=\lambda(1-\eta\lambda)^{t}, and setting its derivative to zero, we obtain:

We see that λt=1(1+t)η\lambda_{t}^{\star}=\frac{1}{(1+t)\eta} is the unique maximizer, and gt(λ)g_{t}(\lambda) is monotonically increasing in (,λt](-\infty,\lambda_{t}^{\star}]. This gives:

The second last inequality is because by rearrange summation:

Finally, substitue Eq.(12) and Eq.(13) into Eq.(8), this gives:

This finishes the induction as well as the proof of the lemma. ∎

A.3.2 Proof of Lemma 17

In this Lemma we try to show if all the iterates from u0\mathbf{u}_{0} are constrained in a small ball, iterates from w0\mathbf{w}_{0} must be able to decrease the function value. In order to do that, we keep track of vector v\mathbf{v} which is the difference between u\mathbf{u} and w\mathbf{w}. Similar as before, we also decompose v\mathbf{v} into different eigenspaces. However, this time we only care about the projection of v\mathbf{v} on the direction e1\mathbf{e}_{1} and its orthognal subspace.

On the other hand, denote ψt\psi_{t} be the norm of vt\mathbf{v}_{t} projected onto e1\mathbf{e}_{1} direction, and φt\varphi_{t} be the norm of vt\mathbf{v}_{t} projected onto remaining subspace. Eq.(14) gives us:

where μ=ηρS(300c^+1)\mu=\eta\rho\mathscr{S}(300\hat{c}+1). We will now prove via induction that for all t<Tt<T:

By hypothesis of Lemma 17, we know φ0=0\varphi_{0}=0, thus the base case of induction holds. Assume Eq.(16) is true for τt\tau\leq t, For t+1Tt+1\leq T, we have:

From above inequalities, we see that we only need to show:

Now, we know φt4μtψtψt\varphi_{t}\leq 4\mu t\cdot\psi_{t}\leq\psi_{t}, this gives:

where the last step follows from μ=ηρS(300c^+1)cmax(300c^+1)γηlog1(dκδ)<γη22\mu=\eta\rho\mathscr{S}(300\hat{c}+1)\leq\sqrt{c_{\max}}(300\hat{c}+1)\gamma\eta\cdot\log^{-1}(\frac{d\kappa}{\delta})<\frac{\gamma\eta}{2\sqrt{2}}.

Finally, combining Eq.(15) and (17) we have for all t<Tt<T:

The last inequality is due to δ(0,dκe]\delta\in(0,\frac{d\kappa}{e}] we have log(dκδ)1\log(\frac{d\kappa}{\delta})\geq 1. By choosing constant c^\hat{c} to be large enough to satisfy 2+log(400c^)c^2+\log(400\hat{c})\leq\hat{c}, we will have T<c^TT<\hat{c}\mathscr{T}, which finishes the proof. ∎

Appendix B Improve Convergence by Local Structure

In this section, we show if the objective function has nice local structure (e.g. satisfies Assumptions A3.a or A3.b), then it is possible to combine our analysis with the local analysis in order to get very fast convergence to a local minimum.

By Corollary 4, we know x^\hat{\mathbf{x}} is already in the ζ\zeta-neighborhood of X\mathcal{X}^{\star}, where X\mathcal{X}^{\star} is the set of local minima. Therefore, to prove this theorem, we only need to show prove following two claims:

Suppose {xt}\{\mathbf{x}_{t}\} is the sequence of gradient descent starting from x0=x^\mathbf{x}_{0}=\hat{\mathbf{x}} with learning rate 1β\frac{1}{\beta}, then xt\mathbf{x}_{t} is always in the ζ\zeta-neighborhood of X\mathcal{X}^{\star}.

Local structure (assumption A3.a or A3.b) allows iterates to converge to points ϵ\epsilon-close to X\mathcal{X}^{\star} within O(βαlogζϵ)O(\frac{\beta}{\alpha}\log\frac{\zeta}{\epsilon}) iterations.

We will focus on Assumption A3.b (as we will later see Assumption A3.a is a special case of Assumption A3.b). Assume xt\mathbf{x}_{t} is in ζ\zeta-neighborhood of X\mathcal{X}^{\star}, by gradient updates and the definition of projection, we have:

The second last inequality is due to (α,β)(\alpha,\beta)-regularity condition. The last inequality is because of the choice η=1β\eta=\frac{1}{\beta}.

There are two consequences of this calculation. First, it shows xt+1PX(xt+1)2xtPX(xt)2\|{\mathbf{x}_{t+1}-\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{x}_{t+1})}\|^{2}\leq\|{\mathbf{x}_{t}-\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{x}_{t})}\|^{2}. As a result if xt\mathbf{x}_{t} in ζ\zeta-neighborhood of X\mathcal{X}^{\star}, xt+1\mathbf{x}_{t+1} is also in this ζ\zeta-neighborhood. Since x0\mathbf{x}_{0} is in the ζ\zeta-neighborhood by Corollary 4, by induction we know all later iterations are in the same neighborhood.

Now, since we know all the points xt\mathbf{x}_{t} are in the neighborhood, the equation also shows linear convergence rate (1αβ)(1-\frac{\alpha}{\beta}). The initial distance is bounded by x0PX(x0)ζ\|{\mathbf{x}_{0}-\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{x}_{0})}\|\leq\zeta, therefore to converge to points ϵ\epsilon-close to X\mathcal{X}^{\star}, we only need the following number of iterations:

This finishes the proof under Assumption A3.b.

Finally, we argue assumption A3.a implies A3.b. First, notice that if a function is locally strongly convex, then its local minima are isolated: for any two points x,xX\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{X}^{\star}, the local region Bx(ζ)B_{\mathbf{x}}(\zeta) and Bx(ζ)B_{\mathbf{x}^{\prime}}(\zeta) must be disjoint (otherwise function f(x)f(\mathbf{x}) is strongly convex in connected domain Bx(ζ)Bx(ζ)B_{\mathbf{x}}(\zeta)\cup B_{\mathbf{x}^{\prime}}(\zeta) but has two distinct local minima, which is impossible). Therefore, W.L.O.G, it suffices to consider one perticular disjoint region, with unique local minimum we denote as x\mathbf{x}^{\star}, clearly, for all xBx(ζ)\mathbf{x}\in B_{\mathbf{x}^{\star}}(\zeta) we have PX(x)=x\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{x})=\mathbf{x}^{\star}.

On the other hand, for any x\mathbf{x} in this ζ\zeta-neighborhood, we already proved x1βf(x)\mathbf{x}-\frac{1}{\beta}\nabla f(\mathbf{x}) also in this ζ\zeta-neighborhood. By β\beta-smoothness, we also have:

Combining Eq.(18) and Eq.(19), and using the fact f(x)f(x1βf(x))f(\mathbf{x}^{\star})\leq f(\mathbf{x}-\frac{1}{\beta}\nabla f(\mathbf{x})), we get:

Appendix C Geometric Structures of Matrix Factorization Problem

In this Section we investigate the global geometric structures of the matrix factorization problem. These properties are summarized in Lemmas 6 and 7. Such structures allow us to apply our main Theorem and get fast convergence (as shown in Theorem 8).

Recall for vectors we use \|{\cdot}\| to denote the 2-norm, and for matrices we use \|{\cdot}\| and F\|{\cdot}\|_{\text{F}} to denote spectral norm, and Frobenius norm respectively. Furthermore, we always use σi()\sigma_{i}(\cdot) to denote the ii-th largest singular value of the matrix.

We first show how the geometric properties (Lemma 6 and Lemma 7) imply a fast convergence (Theorem 8).

There exists an absolute constant cmaxc_{\max} such that the following holds. For matrix factorization (2), for any δ>0\delta>0 and constant ccmaxc\leq c_{\max}, let Γ1/2:=2max{U0,3(σ1)1/2}\Gamma^{1/2}\mathrel{\mathop{:}}=2\max\{\|{\mathbf{U}_{0}}\|,3(\sigma^{\star}_{1})^{1/2}\}, suppose we run PGDli(U0,8Γ,12Γ1/2,(σr)2108Γ1/2,c,δ,rΓ22,10σ1)\text{PGDli}(\mathbf{U}_{0},8\Gamma,12\Gamma^{1/2},\frac{(\sigma^{\star}_{r})^{2}}{108\Gamma^{1/2}},c,\delta,\frac{r\Gamma^{2}}{2},10\sigma^{\star}_{1}), then:

With probability 1, the iterates satisfy UtΓ1/2\|{\mathbf{U}_{t}}\|\leq\Gamma^{1/2} for every t0t\geq 0.

With probability 1δ1-\delta, the output will be ϵ\epsilon-close to global minima set X\mathcal{X}^{\star} in following iterations:

Theorem 8 consists of two parts. In part 1 we show that the iterations never bring the matrix to a very large norm, while in part 2 we apply our main Theorem to get fast convergence. We will first prove the bound on number of iterations assuming the bound on the norm. We will then proceed to prove part 11.

Part 2: Assume part 11 of the theorem is true i.e., with probability 11, the iterates satisfy UtΓ1/2\|{\mathbf{U}_{t}}\|\leq\Gamma^{1/2} for every t0t\geq 0. In this case, although we are doing unconstrained optimization, we can still use the geometric properties that hold inside this region. .

By Lemma 6 and 7, we know objective function Eq.(2) is 8Γ8\Gamma-smooth, 12Γ1/212\Gamma^{1/2}-Lipschitz Hessian, (124(σr)3/2,13σr,13(σr)1/2)(\frac{1}{24}(\sigma^{\star}_{r})^{3/2},\frac{1}{3}\sigma^{\star}_{r},\frac{1}{3}(\sigma^{\star}_{r})^{1/2})-strict saddle, and holds (23σr,10σ1)(\frac{2}{3}\sigma^{\star}_{r},10\sigma^{\star}_{1})-regularity condition in 13(σr)1/2\frac{1}{3}(\sigma^{\star}_{r})^{1/2} neighborhood of local minima (also global minima) X\mathcal{X}^{\star}. Furthermore, note f=0f^{\star}=0 and recall Γ1/2=2max{U0,3(σ1)1/2}\Gamma^{1/2}=2\max\{\|{\mathbf{U}_{0}}\|,3(\sigma^{\star}_{1})^{1/2}\}, then, we have:

Thus, we can choose Δf=rΓ22\Delta_{f}=\frac{r\Gamma^{2}}{2}. Substituting the corresponding parameters from Theorem 5, we know by running PGDli(U0,8Γ,12Γ1/2,(σr)2108Γ1/2,c,δ,rΓ22,10σ1)\text{PGDli}(\mathbf{U}_{0},8\Gamma,12\Gamma^{1/2},\frac{(\sigma^{\star}_{r})^{2}}{108\Gamma^{1/2}},c,\delta,\frac{r\Gamma^{2}}{2},10\sigma^{\star}_{1}), with probability 1δ1-\delta, the output will be ϵ\epsilon-close to global minima set X\mathcal{X}^{\star} in iterations:

Part 1: We will now show part 11 of the theorem. Recall PGDli (Algorithm 3) runs PGD (Algorithm 2) first, and then runs gradient descent within 13(σr)1/2\frac{1}{3}(\sigma^{\star}_{r})^{1/2} neighborhood of X\mathcal{X}^{\star}. It is easy to verify that 13(σr)1/2\frac{1}{3}(\sigma^{\star}_{r})^{1/2} neighborhood of X\mathcal{X}^{\star} is a subset of {UU2Γ}\{\mathbf{U}|\|{\mathbf{U}}\|^{2}\leq\Gamma\}. Therefore, we only need to show that first phase PGD will not leave the region. Specifically, we now use induction to prove the following for PGD:

UtΓ\|{\mathbf{U}_{t}}\|\leq\Gamma for all t0t\geq 0.

By choice of parameters of Algorithm 2, we know η=c8Γ\eta=\frac{c}{8\Gamma}. First, consider gradient descent step without perturbations:

For the first term, we know function f(t)=tηt3f(t)=t-\eta t^{3} is monotonically increasing in [0,1/3η][0,1/\sqrt{3\eta}]. On the other hand, by induction assumption, we also know UtΓ1/28Γ/(3c)=1/3η\|{\mathbf{U}_{t}}\|\leq\Gamma^{1/2}\leq\sqrt{8\Gamma/(3c)}=1/\sqrt{3\eta}. Therefore, the max is taken when i=1i=1:

We seperate our discussion into following cases.

Case Ut>12Γ1/2\|{\mathbf{U}_{t}}\|>\frac{1}{2}\Gamma^{1/2}: In this case Utmax{U0,3(σ1)1/2}\|{\mathbf{U}_{t}}\|\geq\max\{\|{\mathbf{U}_{0}}\|,3(\sigma^{\star}_{1})^{1/2}\}. Recall Γ1/2=2max{U0,3(σ1)1/2}\Gamma^{1/2}=2\max\{\|{\mathbf{U}_{0}}\|,3(\sigma^{\star}_{1})^{1/2}\}. Clearly, Γ36σ1\Gamma\geq 36\sigma^{\star}_{1}, we know:

This means that in each iteration, the spectral norm would decrease by at least c72Γ1/2\frac{c}{72}\Gamma^{1/2}.

Case Ut12Γ1/2\|{\mathbf{U}_{t}}\|\leq\frac{1}{2}\Gamma^{1/2}: From (20), we know that as long as Ut2M\|{\mathbf{U}_{t}}\|^{2}\geq\|{\mathbf{M}^{\star}}\|, we will always have Ut+1Ut12Γ1/2\|{\mathbf{U}_{t+1}}\|\leq\|{\mathbf{U}_{t}}\|\leq\frac{1}{2}\Gamma^{1/2}. For Ut2M\|{\mathbf{U}_{t}}\|^{2}\leq\|{\mathbf{M}^{\star}}\|, we have:

Thus, in this case, we always have Ut+112Γ1/2\|{\mathbf{U}_{t+1}}\|\leq\frac{1}{2}\Gamma^{1/2}.

In conclusion, if we don’t add perturbation in iteration tt, we have:

If Ut>12Γ1/2\|{\mathbf{U}_{t}}\|>\frac{1}{2}\Gamma^{1/2}, then Ut+1Utc72Γ1/2\|{\mathbf{U}_{t+1}}\|\leq\|{\mathbf{U}_{t}}\|-\frac{c}{72}\Gamma^{1/2}.

If Ut12Γ1/2\|{\mathbf{U}_{t}}\|\leq\frac{1}{2}\Gamma^{1/2}, then Ut+112Γ1/2\|{\mathbf{U}_{t+1}}\|\leq\frac{1}{2}\Gamma^{1/2}.

Now consider the iterations where we add perturbation. By the choice of radius of perturbation in Algorithm 2 , we increase spectral norm by at most :

On the other hand, according to Algorithm 2, once we add perturbation, we will not add perturbation for next tthres=χ24Γc2σr24c248ct_{\text{thres}}=\frac{\chi\cdot 24\Gamma}{c^{2}\sigma^{\star}_{r}}\geq\frac{24}{c^{2}}\geq\frac{48}{c} iterations. Let T=min{infi{Ut+iUt+i12Γ1/2},tthres}T=\min\{\inf_{i}\{\mathbf{U}_{t+i}|\|{\mathbf{U}_{t+i}}\|\leq\frac{1}{2}\Gamma^{1/2}\},t_{\text{thres}}\}:

Finally, U012Γ1/2\|{\mathbf{U}_{0}}\|\leq\frac{1}{2}\Gamma^{1/2} by definition of Γ\Gamma, so the initial condition holds. This finishes induction and the proof of the theorem. ∎

In the next subsections we prove the geometric structures.

Before we start proofs of lemmas, we first state some properties about gradient and Hessians. The gradient of the objective function f(U)f(\mathbf{U}) is

For any Γσ1\Gamma\geq\sigma^{\star}_{1}, inside the region {UU2<Γ}\{\mathbf{U}|\|{\mathbf{U}}\|^{2}<\Gamma\}, f(U)f(\mathbf{U}) defined in Eq.(2) is 8Γ8\Gamma-smooth and 12Γ1/212\Gamma^{1/2}-Hessian Lipschitz.

Denote D={UU2<Γ}\mathcal{D}=\{\mathbf{U}|\|{\mathbf{U}}\|^{2}<\Gamma\}, and recall Γσ1\Gamma\geq\sigma^{\star}_{1}.

Smoothness: For any U,VD\mathbf{U},\mathbf{V}\in\mathcal{D}, we have:

The last line is due to following decomposition and triangle inequality:

The inequality is due to following decomposition and triangle inequality:

C.2 Strict-Saddle Property and Local Regularity

Recall the gradient and Hessian of objective function is calculated as in Eq.(21) and Eq.(22). We first prove an elementary inequality regarding to the trace of product of two symmetric PSD matrices. This lemma will be frequently used in the proof.

Let A=VDV\mathbf{A}=\mathbf{V}\mathbf{D}\mathbf{V}^{\top} be the eigendecomposition of A\mathbf{A}, where D\mathbf{D} is diagonal matrix, and V\mathbf{V} is orthogonal matrix. Then we have:

Since B\mathbf{B} is PSD, we know VBV\mathbf{V}^{\top}\mathbf{B}\mathbf{V} is also PSD, thus the diagonal entries are non-negative. That is, (VBV)ii0(\mathbf{V}^{\top}\mathbf{B}\mathbf{V})_{ii}\geq 0 for all i=1,,di=1,\ldots,d. Finally, the lemma follows from the fact that σmin(A)DiiA\sigma_{\min}(\mathbf{A})\leq\mathbf{D}_{ii}\leq\|{\mathbf{A}}\| and tr(VBV)=tr(BVV)=tr(B)\text{tr}(\mathbf{V}^{\top}\mathbf{B}\mathbf{V})=\text{tr}(\mathbf{B}\mathbf{V}\mathbf{V}^{\top})=\text{tr}(\mathbf{B}). ∎

For f(U)f(\mathbf{U}) defined in Eq.(2), all local minima are global minima. The set of global minima is X={VRRR=RR=I}\mathcal{X}^{\star}=\{\mathbf{V}^{\star}\mathbf{R}|\mathbf{R}\mathbf{R}^{\top}=\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}\}. Furthermore, f(U)f(\mathbf{U}) satisfies:

(124(σr)3/2,13σr,13(σr)1/2)(\frac{1}{24}(\sigma^{\star}_{r})^{3/2},\frac{1}{3}\sigma^{\star}_{r},\frac{1}{3}(\sigma^{\star}_{r})^{1/2})-strict saddle property, and

(23σr,10σ1)(\frac{2}{3}\sigma^{\star}_{r},10\sigma^{\star}_{1})-regularity condition in 13(σr)1/2\frac{1}{3}(\sigma^{\star}_{r})^{1/2} neighborhood of X\mathcal{X}^{\star}.

Let us denote the set X:={VRRR=RR=I}\mathcal{X}^{\star}\mathrel{\mathop{:}}=\{\mathbf{V}^{\star}\mathbf{R}|\mathbf{R}\mathbf{R}^{\top}=\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}\}, in the end of proof, we will show this set is the set of all local minima (which is also global minima).

Throughout the proof of this lemma, we always focus on the first-order and second-order property for one matrix U\mathbf{U}. For simplicity of calculation, when it is clear from the context, we denote U=PX(U)\mathbf{U}^{\star}=\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{U}) and Δ=UPX(U)\Delta=\mathbf{U}-\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{U}). By definition of X\mathcal{X}^{\star}, we know U=VRU\mathbf{U}^{\star}=\mathbf{V}^{\star}\mathbf{R}_{\mathbf{U}} and Δ=UVRU\Delta=\mathbf{U}-\mathbf{V}^{\star}\mathbf{R}_{\mathbf{U}}, where

We first prove following claim, which will used in many places across this proof:

This because by expanding the Frobenius norm, and letting the SVD of VU\mathbf{V}^{\star}{}^{\top}\mathbf{U} be ADB\mathbf{A}\mathbf{D}\mathbf{B}^{\top}, we have:

Since A,B,R\mathbf{A},\mathbf{B},\mathbf{R} are all orthonormal matrix, we know ARB\mathbf{A}^{\top}\mathbf{R}\mathbf{B} is also orthonormal matrix. Moreover for any orthonormal matrix T\mathbf{T}, we have:

The last inequality is because Dii\mathbf{D}_{ii} is singular value thus non-negative, and T\mathbf{T} is orthonormal, thus Tii1\mathbf{T}_{ii}\leq 1. This means the maximum of tr(DT)\text{tr}(\mathbf{D}\mathbf{T}) is achieved when T=I\mathbf{T}=\mathbf{I}, i.e., the minimum of tr(DARB)-\text{tr}(\mathbf{D}\mathbf{A}^{\top}\mathbf{R}\mathbf{B}) is achieved when R=AB\mathbf{R}=\mathbf{A}\mathbf{B}^{\top}. Therefore, UVRU=BDAAB=BDB\mathbf{U}^{\top}\mathbf{V}^{\star}\mathbf{R}_{\mathbf{U}}=\mathbf{B}\mathbf{D}\mathbf{A}^{\top}\mathbf{A}\mathbf{B}^{\top}=\mathbf{B}\mathbf{D}\mathbf{B}^{\top} is symmetric PSD matrix.

Strict Saddle Property: In order to show the strict saddle property, we only need to show that for any U\mathbf{U} satisfying f(U)F124(σr)3/2\|{\nabla f(\mathbf{U})}\|_{\text{F}}\leq\frac{1}{24}(\sigma^{\star}_{r})^{3/2} and ΔF=UUF13(σr)1/2\|{\Delta}\|_{\text{F}}=\|{\mathbf{U}-\mathbf{U}^{\star}}\|_{\text{F}}\geq\frac{1}{3}(\sigma^{\star}_{r})^{1/2}, we always have σmin(2f(U))13σr\sigma_{\min}(\nabla^{2}f(\mathbf{U}))\leq-\frac{1}{3}\sigma^{\star}_{r}.

Let’s consider Hessian 2(U)\nabla^{2}(\mathbf{U}) in the direction of Δ=UU\Delta=\mathbf{U}-\mathbf{U}^{\star}. Clearly, we have:

Therefore, by Eq.(22) and above two equalities, we have:

Consider the first two terms, by expanding, we have:

where the second last inequality is because U(U+Δ)ΔΔ=UUΔΔ\mathbf{U}^{\star}{}^{\top}(\mathbf{U}^{\star}+\Delta)\Delta^{\top}\Delta=\mathbf{U}^{\star}{}^{\top}\mathbf{U}\Delta^{\top}\Delta is the product of two symmetric PSD matrices (thus its trace is non-negative); the last inequality is by Lemma 18.

Finally, in case we have f(U)F124(σr)3/2\|{\nabla f(\mathbf{U})}\|_{\text{F}}\leq\frac{1}{24}(\sigma^{\star}_{r})^{3/2} and ΔF=UUF13(σr)1/2\|{\Delta}\|_{\text{F}}=\|{\mathbf{U}-\mathbf{U}^{\star}}\|_{\text{F}}\geq\frac{1}{3}(\sigma^{\star}_{r})^{1/2}

Local Regularity: In 13(σr)1/2\frac{1}{3}(\sigma^{\star}_{r})^{1/2} neigborhood of X\mathcal{X}^{\star}, by definition, we know,

Clearly, by Weyl’s inequality, we have UU+Δ43(σ1)1/2\|{\mathbf{U}}\|\leq\|{\mathbf{U}^{\star}}\|+\|{\Delta}\|\leq\frac{4}{3}(\sigma^{\star}_{1})^{1/2}, and σr(U)σr(U)Δ23(σr)1/2\sigma_{r}(\mathbf{U})\geq\sigma_{r}(\mathbf{U}^{\star})-\|{\Delta}\|\geq\frac{2}{3}(\sigma^{\star}_{r})^{1/2}. Moreover, since UU\mathbf{U}^{\star}{}^{\top}\mathbf{U} is symmetric matrix, we have:

At a highlevel, we will prove (α,β)(\alpha,\beta)-regularity property (1) by proving that:

f(x),xPX(x)αxPX(x)2\langle\nabla f(\mathbf{x}),\mathbf{x}-\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{x})\rangle\geq\alpha\|{\mathbf{x}-\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{x})}\|^{2}, and

f(x),xPX(x)1βf(x)2\langle\nabla f(\mathbf{x}),\mathbf{x}-\mathcal{P}_{\mathcal{X}^{\star}}(\mathbf{x})\rangle\geq\frac{1}{\beta}\|{\nabla f(\mathbf{x})}\|^{2}.

The last equality is because ΔU\Delta^{\top}\mathbf{U} is symmetric matrix. Since UU\mathbf{U}^{\star}{}^{\top}\mathbf{U} is symmetric PSD matrix, and recall σr(UU)23σr\sigma_{r}(\mathbf{U}^{\star}{}^{\top}\mathbf{U})\geq\frac{2}{3}\sigma^{\star}_{r}, by Lemma 18 we have:

For term A\mathfrak{A}, by Lemma 18, and ΔU\Delta^{\top}\mathbf{U} being a symmetric matrix, we have:

For term B\mathfrak{B}, by Eq.(23) we can denote C=UU=UU\mathbf{C}=\mathbf{U}^{\star}{}^{\top}\mathbf{U}=\mathbf{U}^{\top}\mathbf{U}^{\star} which is symmetric PSD matrix, by Lemma 18, we have: