When Are Nonconvex Problems Not Scary?

Ju Sun, Qing Qu, John Wright

Introduction

General nonconvex optimization problems (henceforth “NCVX problems” for brevity) are NP-hard, even the goal is computing only a local minimizer [MK87, Ber99]. In applied disciplines, however, NCVX problems abound, and heuristic algorithms such as gradient descent and alternating directions are often surprisingly effective. The ability of natural heuristics to find high-quality solutions for practical NCVX problems remains largely mysterious.

In this note, we study a family of NCVX problems that can be solved efficiently. This family cuts across central tasks in signal processing and machine learning, such as complete (sparse) dictionary learning [SQW15], generalized phase retrieval [SQW16], orthogonal tensor decomposition [GHJY15], and noisy phase synchronization and community detection [Bou16, BBV16].

Natural optimization formulations for these distinct tasks are nonconvex; surprisingly they exhibit a common characteristic structure. In each case, the goal is to estimate or recover an object from observed data. Under certain technical hypotheses, every local minimizer of the objective function exactly recovers the object of interest.

With this structure, the central issue is how to escape the saddle points and local maximizers.

Fortunately, for these problems, all saddle points and local maximizers are “typical” – the associated Hessian matrix has at least one negative eigenvalue. Geometrically, this means around any saddle point or local maximizer, the objective function has a negative curvature in a certain direction. Particularly, we call saddles of this type ridable saddles;They are also called “strict saddle” points in optimization literature, see, e.g., pp 38 of [RI10]; see also [GHJY15]. the importance of this apparently extraneous restriction is illustrated in Figure 1. Intuitively, at saddle points or local maximizers, in the direction of negative curvature the objective function is also locally descending. One can use this to design algorithms that escape from the saddle points and local maximizers concerned here. Indeed, consider a natural quadratic approximation to the objective ff around a saddle point x\mathbf{x}:

Thus, minimizing f^(δ;x)\widehat{f}(\mathbf{\delta};\mathbf{x}) locally provides a direction δ\mathbf{\delta}_{\star} that tends to decrease the objective ff, provided local approximation of f^\widehat{f} to ff is reasonably accurate.For general saddles that seem to demand higher-order approximations, the computation may quickly become intractable. For example, third order saddle points seem to generally demand studying spectral property of three-way tensors, which entails NP-hard computational problems [HL13]; see [AG16] for a recent attempt in this line. Based on this intuition, we derive an algorithmic framework that can exploit the second-order information to escape from saddle points and local maximizers and provably returns a global minimizer.

Nonconvex Optimization with Ridable Saddles

In this section, we present a more quantitative definition of the problem class we focus on and provide several concrete examples in this class.

We are interested in optimization problem of the form:

Here we assume ff is twice continuously differentiable, i.e., it has continuous first- and second-order derivatives, and M\mathcal{M} is a Riemannian manifold. Restricting ff to M\mathcal{M} and (with abuse of notation) writing the restricted function as ff also, one can effectively treat (2.1) as an unconstrained optimization on M\mathcal{M}. We further use gradf(x)\operatorname{grad}f(\mathbf{x}) and Hessf(x)\operatorname{Hess}f(\mathbf{x}) to denote the Riemannian gradient and Hessian of ff at point x\mathbf{x} Detailed introduction to these quantities can be found in [AMS09]. We prefer to keep this at an intuitive level not to obscure the main ideas. , which one can think of as Riemannian counterparts of Euclidean gradient and Hessian for functions, with the exception that gradf(x)[]\operatorname{grad}f(\mathbf{x})[\cdot] and Hessf(x)[]\operatorname{Hess}f(\mathbf{x})[\cdot] only act on vectors in tangent space of M\mathcal{M} at x\mathbf{x}, i.e., TxMT_{\mathbf{x}}\mathcal{M}.

In words, the function has no spurious local minimizers. Moreover, each point on the manifold M\mathcal{M} either has strong Riemannian gradient, or has Riemannian Hessian with at least one strictly negative eigenvalue, or lives in a small neighborhood of a local minimizer, such that the function is locally strongly convex. We remark in passing that requiring a function to be ridable may appear far too restrictive than it actually is. Indeed, one of the central results in Morse theory implies that a generic smooth function is ridable.

In this note, we deal exclusively with minimizing X\mathcal{X} functions. Though if the target is to compute any local minimizer of a ridable function, our method also applies. These functions indeed appear in natural nonconvex formulations of important practical problems.

Independent Component Analysis (ICA) and Orthogonal Tensor Decomposition [GHJY15]. Typical setting of ICA asks for a linear transformation A\mathbf{A} for a given data matrix Y\mathbf{Y}, such that rows of AY\mathbf{A}\mathbf{Y} achieve maximal statistical independence. Tensor decomposition generalizes spectral decomposition of matrices. Here we focus on orthogonally decomposable dd-th order tensors T\mathcal{T} which can be represented as

where \otimes generalizes the usual outer product of vectors. Tensor decomposition refers to finding (up to sign and permutation) the components ai\mathbf{a}_{i}’s given T\mathcal{T}. With appropriate processing and up to small perturbation, ICA is showed to be equivalent to decomposition of a certain form of 44-th order orthogonally decomposable tensors [FJK96, AGMS12]. Specifically, [GHJY15] showed (Section C.1.) [GHJY15] has not used the manifold language as we use here, but resorted to Lagrange multiplier and optimality of the Lagrangian function. For the two decomposition formulations we discussed here, one can verify that the gradient and Hessian they defined are exactly the Riemannian gradient and Hessian of the respective manifolds. the minimization problem

from noisy measurements of the form Cij=zizj+ΔijC_{ij}=z_{i}\overline{z_{j}}+\Delta_{ij}. The problem is interesting when the noise is nonzero yet controlled, which demands robust solution schemes. Turning to the optimization approach, a natural formulation (if one assumes a Gaussian noise model) is

where we have collected CijC_{{ij}} into a matrix C\mathbf{C}. Assuming the noise is symmetric (i.e., Δij=Δji\Delta_{ij}=\Delta_{ji}), the above formulation is equivalent to

Interestingly, for the phase synchronization model, i.e., C=zz+Δ\mathbf{C}=\mathbf{z}\mathbf{z}^{*}+\mathbf{\Delta} with Hermitian noise matrix Δ\mathbf{\Delta}, [Bou16] recently showed that (Theorem 4) when the noise Δ\mathbf{\Delta} is bounded in mild sense,

second-order necessary condition for optimality is also sufficient.

Particularly, this holds w.h.p. when the noise is i.i.d. complex Gaussians with small variance (Lemma 5). To understand the above statement, recall that second-order necessary condition asks for vanishing gradient and negative semidefinite Hessian at a point. The above statement asserts that such condition is sufficient to guarantee global optimality. In other words, at any critical points other than these verifying the condition have indefinite Hessians. Thus, [Bou16] has effectively showed that when Δ\mathbf{\Delta} is appropriately bounded,

all points verifying the second-order necessary condition are global optimizes.

By analogous argument as for the complex case, this implies:

A similar result was derived in [BBV16] for the two-block community detection problem based on the stochastic block model (Theorem 6).Both [BBV16] and [Mon16] also contain results that characterize local optimizers in terms of their correlation with the optimizer under less stringent/general conditions.

Second-order Trust-region Method and Proof of Convergence

The intuition that second-order information can help escape ridable saddles from the very start suggests a second-order method. We describe a second-order trust-region algorithm on manifolds [ABG07, AMS09] for this purpose.

For the generic problem (2.1), we start from any feasible x(0)M\mathbf{x}^{(0)}\in\mathcal{M}, and form a sequence of iterates x(1),x(2),M\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots\in\mathcal{M} as follows. For the current iterate x(k)\mathbf{x}^{(k)}, we consider the quadratic approximation

which is defined for all δTx(k)M\mathbf{\delta}\in T_{\mathbf{x}^{(k)}}\mathcal{M}. The next iterate is determined by minimizing the quadratic approximation within a small radius Δ\Delta (i.e., the trust region) of x(k)\mathbf{x}^{(k)}, i.e.,

which is called the Riemannian trust-region subproblem. The vector x(k)+δ(k+1)\mathbf{x}^{(k)}+\mathbf{\delta}^{(k+1)} is generally not a point on M\mathcal{M}. One then performs a retraction step Rx(k)R_{\mathbf{x}^{(k)}} that pulls the vector back to the manifold, resulting in the update formula

for which efficient numerical algorithms exist [MS83, CGT00, FW04, HK14].

Design choice of the retraction is often problem-specific, ranging from the classical exponential map to the Euclidean projection that works for many matrix manifolds [AM12].

To show the trust-region algorithm converges to a global minimizer, we assume Δ\Delta is small enough such that approximation error of (3.1) to ff is “negligible” locally. Each step around a negative-curvature or strong-gradient point decreases the objective by a certain amount. Indeed, it is clear there is always one descent direction in such cases. Thus, the trust-region step will approximately follow one descent direction and decrease the function value. When the iterate sequence moves into a strongly convex region around a global minimizer, a step is either constrained such that it also deceases the objective by an amount, or unconstrained, which is a good indicator that the target minimizer is within a radius Δ\Delta. In the latter case, the algorithm behaves like the classical Newton method and quadratic sequence convergence can be shown.

Quantitative convergence proof demands knowledge of the ridability parameters, smoothness parameters of the objective, and elements of Riemannian geometry. We refer the reader to [SQW15, SQW16] for practical examples of convergence analyses.

Discussion

Recently, there is a surge of interest in understanding nonconvex heuristics for practical problems [KMO10, JNS13, Har14, HW14, NNS+14, JN14, SL14, ZL15, TBSR15, CW15, NJS13, CLS15, CC15, WWS15, JO14, AGJ14b, AGJ14a, AJSN15, YCS13, SA14, LWB13, QSW14, LWB13, AAJ+13, AAN13, AGM13, AGMM15, ABGM14, JJKN15]. Majority of the work start from clever initializations, and then proceed with analysis of local convergence. In comparison, it is clear that for X\mathcal{X} functions, second-order trust-region algorithms with any initialization guarantee to retrieve one target minimizer. Identifying X\mathcal{X} functions has involved intensive technical work [SQW15, SQW16, GHJY15]. It is interesting to see if streamlined toolkits can be developed, say via operational rules or unified potential functions. This would facilitate study of other practical problems, such as the deep networks of which saddle points are believed to be prevalent and constitute significant computational bottleneck [PDGB14, DPG+14, CHM+14]. To match heuristics computationally, more practical algorithms other than the second-order trust-region methods are needed. Practical trust-region solvers with saddle-escaping capability may be possible for structured problems [BMAS14, SQW16]. Moreover, simulations with several practical problems suggest gradient-style algorithms with random initializations succeed. [GHJY15, LSJR16] are recent endeavors towards this direction.

References