The non-convex Burer-Monteiro approach works on smooth semidefinite programs

Nicolas Boumal, Vladislav Voroninski, Afonso S. Bandeira

Introduction

We consider semidefinite programs (SDPs) of the form

If (SDP) has a compact search space, then it admits a global optimum of rank at most rr, where r(r+1)2m\frac{r(r+1)}{2}\leq m (Pataki, 1998; Barvinok, 1995). Thus, if one restricts the search space of (SDP) to matrices of rank at most pp with p(p+1)2m\frac{p(p+1)}{2}\geq m, then the globally optimal value remains unchanged. This restriction is easily enforced by factorizing X=YY ⁣X=YY^{\top}\! where YY has size n×pn\times p, yielding an equivalent quadratically constrained quadratic program:

In general, (P) is non-convex, making it a priori unclear how to solve it globally. Still, the benefits are that it is lower dimensional than (SDP) and has no conic constraint. This has motivated Burer and Monteiro (2003, 2005) to try and solve (P) using local optimization methods, with surprisingly good results. They developed theory in support of this observation (details below). About their results, Burer and Monteiro (2005, §3) write (mutatis mutandis):

“How large must we take pp so that the local minima of (P) are guaranteed to map to global minima of (SDP)? Our theorem asserts that we need onlyThe condition on pp and mm is slightly, but inconsequentially, different in (Burer and Monteiro, 2005). p(p+1)2>m\frac{p(p+1)}{2}>m (with the important caveat that positive-dimensional faces of (SDP) which are ‘flat’ with respect to the objective function can harbor non-global local minima).”

The caveat—the existence or non-existence of non-global local optima, or their potentially adverse effect for local optimization algorithms—was not further discussed.

In this paper, assuming p(p+1)2>m\frac{p(p+1)}{2}>m, we show that if the search space of (SDP) is compact and if the search space of (P) is a regularly defined smooth manifold, then, for almost all cost matrices CC, if YY satisfies first- and second-order necessary optimality conditions for (P), then YY is a global optimum of (P) and, since p(p+1)2m\frac{p(p+1)}{2}\geq m, X=YY ⁣X=YY^{\top}\! is a global optimum of (SDP). In other words, first- and second-order necessary optimality conditions for (P) are also sufficient for global optimality—an unusual theoretical guarantee in non-convex optimization.

Notice that this is a statement about the optimization problem itself, not about specific algorithms. Interestingly, known algorithms for optimization on manifolds converge to second-order critical points,Second-order critical points satisfy first- and second-order necessary optimality conditions. regardless of initialization (Boumal et al., 2016).

For the specified class of SDPs, our result improves on those of (Burer and Monteiro, 2005) in two important ways. Firstly, for almost all CC, we formally exclude the existence of spurious local optima.Before Prop. 2.3 in (Burer and Monteiro, 2005), the authors write: “The change of variables X=YY ⁣X=YY^{\top}\! does not introduce any extraneous local minima.” This is sometimes misunderstood to mean (P) does not have spurious local optima, when it actually means that the local optima of (P) are in exact correspondence with the local optima of “(SDP) with the extra constraint rank(X)p\operatorname{rank}(X)\leq p,” which is also non-convex and thus also liable to having local optima. Unfortunately, this misinterpretation has led to some confusion in the literature. Secondly, we only require the computation of second-order critical points of (P) rather than local optima (which is hard in general (Vavasis, 1991)). Below, we make a statement about computational complexity, and we illustrate the practical efficiency of the proposed methods through numerical experiments.

Given an undirected graph, Max-Cut is the NP-hard problem of clustering the nn nodes of this graph in two classes, +1+1 and 1-1, such that as many edges as possible join nodes of different signs. If CC is the adjacency matrix of the graph, Max-Cut is expressed as

Introducing the positive semidefinite matrix X=xx ⁣X=xx^{\top}\!, both the cost and the constraints may be expressed linearly in terms of XX. Ignoring that XX has rank 1 yields the well-known convex relaxation in the form of a semidefinite program (up to an affine transformation of the cost):

If a solution XX of this SDP has rank 1, then X=xx ⁣X=xx^{\top}\! for some xx which is then an optimal cut. In the general case of higher rank XX, Goemans and Williamson (1995) exhibited the celebrated rounding scheme to produce approximately optimal cuts (within a ratio of .878) from XX.

The corresponding Burer–Monteiro non-convex problem with rank bounded by pp is:

For p=2np=\left\lceil\sqrt{2n}\,\right\rceil, for almost all CC, even though (Max-Cut BM) is non-convex, any local optimum YY is a global optimum (and so is X=YY ⁣X=YY^{\top}\!), and all saddle points have an escape (the Hessian has a negative eigenvalue).

We note that, for p>n/2p>n/2, the same holds for all CC (Boumal, 2015).

Notation

Main results

Our main result establishes conditions under which first- and second-order necessary optimality conditions for (P) are sufficient for global optimality. Under those conditions, it is a fortiori true that global optima of (P) map to global optima of (SDP), so that local optimization methods on (P) can be used to solve the higher-dimensional, cone-constrained (SDP).

We now specify the necessary optimality conditions of (P). Under the assumptions of our main result below (Theorem 2), the search space

is a smooth and compact manifold of dimension npmnp-m. As such, it can be linearized at each point YMY\in\mathcal{M} by a tangent space, differentiating the constraints (Absil et al., 2008, eq. (3.19)):

A (first-order) critical point for (P) is a point YMY\in\mathcal{M} such that

All local (and global) optima of (P) are second-order critical points.

See (Yang et al., 2014, Rem. 4.2 and Cor. 4.2). ∎

the search space of (SDP) is compact; and

The assumptions are discussed in the next section. The proof—see Appendix A—follows directly from the combination of two intermediate results:

If YY is rank deficient and second-order critical for (P), then it is globally optimal and X=YY ⁣X=YY^{\top}\! is optimal for (SDP); and

If p(p+1)2>m\frac{p(p+1)}{2}>m, then, for almost all CC, every first-order critical YY is rank-deficient.

The first step holds in a more general context, as previously established by Burer and Monteiro (2003, 2005). The second step is new and crucial, as it allows to formally exclude the existence of spurious local optima, generically in CC, thus resolving the caveat mentioned in the introduction.

The smooth structure of (P) naturally suggests using Riemannian optimization to solve it (Absil et al., 2008), which is something that was already proposed by Journée et al. (2010) in the same context. Importantly, known algorithms converge to second-order critical points regardless of initialization. We state here a recent computational result to that effect.

Under the numbered assumptions of Theorem 2, the Riemannian trust-region method (RTR) (Absil et al., 2007) initialized with any Y0MY_{0}\in\mathcal{M} returns in O(1/εg2εH+1/εH3)\mathcal{O}(1/\varepsilon_{g}^{2}\varepsilon_{H}+1/\varepsilon_{H}^{3}) iterations a point YMY\in\mathcal{M} such that

Proposition 3 bounds worst-case iteration counts for arbitrary initialization. In practice, a good initialization point may be available, making the local convergence rate of RTR more informative. For RTR, one may expect superlinear or even quadratic local convergence rates near isolated local minimizers (Absil et al., 2007). While minimizers are not isolated in our case (Journée et al., 2010), experiments show a characteristically superlinear local convergence rate in practice (Boumal, 2015). This means high accuracy solutions can be achieved, as demonstrated in Appendix B.

Thus, under the conditions of Theorem 2, generically in CC, RTR converges to global optima. In practice, the algorithm returns after a finite number of steps, and only approximate second-order criticality is guaranteed. Hence, it is interesting to bound the optimality gap in terms of the approximation quality. Unfortunately, we do not establish such a result for small pp. Instead, we give an a posteriori computable optimality gap bound which holds for all pp and for all CC. In the following statement, the dependence of M\mathcal{M} on pp is explicit, as Mp\mathcal{M}_{p}. The proof is in Appendix A.

If all feasible XX have the same trace RR and there exists a positive definite feasible XX, then the bound simplifies to

In particular, for p=n+1p=n+1, the bound can be controlled a priori: approximate second-order critical points are approximately optimal, for any CC.With p=n+1p=n+1, problem (P) is no longer lower dimensional than (SDP), but retains the advantage of not involving a positive semidefiniteness constraint.

Under the same condition as in Theorem 4, the bound can be simplified to RεHR\varepsilon_{H}.

This works well with Proposition 3. For any pp, equation (4) also implies the following:

Low-rank approaches to solve SDPs have featured in a number of recent research papers. We highlight just two which illustrate different classes of SDPs of interest.

Shah et al. (2016) tackle SDPs with linear cost and linear constraints (both equalities and inequalities) via low-rank factorizations, assuming the matrices appearing in the cost and constraints are positive semidefinite. They propose a non-trivial initial guess to partially overcome non-convexity with great empirical results, but do not provide optimality guarantees.

Bhojanapalli et al. (2016a) on the other hand consider the minimization of a convex cost function over positive semidefinite matrices, without constraints. Such problems could be obtained from generic SDPs by penalizing the constraints in a Lagrangian way. Here too, non-convexity is partially overcome via non-trivial initialization, with global optimality guarantees under some conditions.

Also of interest are recent results about the harmlessness of non-convexity in low-rank matrix completion (Ge et al., 2016; Bhojanapalli et al., 2016b). Similarly to the present work, the authors there show there is no need for special initialization despite non-convexity.

Discussion of the assumptions

Our main result, Theorem 2, comes with geometric assumptions on the search spaces of both (SDP) and (P) which we now discuss. Examples of SDPs which fit the assumptions of Theorem 2 are featured in the next section.

The assumption that the search space of (SDP),

is compact works in pair with the assumption p(p+1)2>m\frac{p(p+1)}{2}>m as follows. For (P) to reveal the global optima of (SDP), it is necessary that (SDP) admits a solution of rank at most pp. One way to ensure this is via the Pataki–Barvinok theorems (Pataki, 1998; Barvinok, 1995), which state that all extreme points of C\mathcal{C} have rank rr bounded as r(r+1)2m\frac{r(r+1)}{2}\leq m. Extreme points are faces of dimension zero (such as vertices for a cube). When optimizing a linear cost function C,X\left\langle{C},{X}\right\rangle over a compact convex set C\mathcal{C}, at least one extreme point is a global optimum (Rockafellar, 1970, Cor. 32.3.2)—this is not true in general if C\mathcal{C} is not compact. Thus, under the assumptions of Theorem 2, there is a point YMY\in\mathcal{M} such that X=YY ⁣X=YY^{\top}\! is an optimal extreme point of (SDP); then, of course, YY itself is optimal for (P).

In general, the Pataki–Barvinok bound is tight, in that there exist extreme points of rank up to that upper-bound (rounded down)—see for example (Laurent and Poljak, 1996) for the Max-Cut SDP and (Boumal, 2015) for the Orthogonal-Cut SDP. Let CC (the cost matrix) be the negative of such an extreme point. Then, the unique optimum of (SDP) is that extreme point, showing that p(p+1)2m\frac{p(p+1)}{2}\geq m is necessary for (SDP) and (P) to be equivalent for all CC. We further require a strict inequality because our proof relies on properties of rank deficient YY’s in M\mathcal{M}.

Finally, we note that Theorem 2 only applies for almost all CC, rather than all CC. To justify this restriction, if indeed it is justified, one should exhibit a matrix CC that leads to suboptimal second-order critical points while other assumptions are satisfied. We do not have such an example. We do observe that (Max-Cut SDP) on cycles of certain even lengths has a unique solution of rank 1, while the corresponding (Max-Cut BM) with p=2p=2 has suboptimal local optima (strictly, if we quotient out symmetries). This at least suggests it is not enough, for generic CC, to set pp just larger than the rank of the solutions of the SDP. (For those same examples, at p=3p=3, we consistently observe convergence to global optima.)

Examples of smooth SDPs

Certain concrete examples of SDPs include:

Their rank-constrained counterparts read as follows (matrix norms are Frobenius norms):

The first example has only one constraint: the SDP always admits an optimal rank 1 solution, corresponding to an eigenvector associated to the left-most eigenvalue of CC. This generalizes to the trust-region subproblem as well.

In the third example, YY of size n×pn\times p is divided in qq slices of size d×pd\times p, with pdp\geq d. Each slice has orthonormal rows. For p=dp=d, the slices are orthogonal (or unitary) matrices, allowing to capture Orthogonal-Cut (Bandeira et al., 2016b) and the related problems of synchronization of rotations (Wang and Singer, 2013) and permutations. Synchronization of rotations is an important step in simultaneous localization and mapping, for example. Here, it is sufficient for almost all CC to let p=d(d+1)qp=\left\lceil\sqrt{d(d+1)q}\,\right\rceil.

Conclusions

We further reference the Riemannian trust-region method (Absil et al., 2007) to solve the problem in YY, as it was recently guaranteed to converge from any starting point to a point which satisfies second-order optimality conditions, with global convergence rates (Boumal et al., 2016). In addition, for p=n+1p=n+1, we guarantee that approximate satisfaction of second-order conditions implies approximate global optimality. We note that the 1/ε31/\varepsilon^{3} convergence rate in our results may be pessimistic. Indeed, the numerical experiments clearly show that high accuracy solutions can be computed fast using optimization on manifolds, at least for certain applications.

Addressing a broader class of SDPs, such as those with inequality constraints or equality constraints that may violate our smoothness assumptions, could perhaps be handled by penalizing those constraints in the objective in an augmented Lagrangian fashion. We also note that, algorithmically, the Riemannian trust-region method we use applies just as well to nonlinear costs in the SDP. We believe that extending the theory presented here to broader classes of problems is a good direction for future work.

Acknowledgment

VV was partially supported by the Office of Naval Research. ASB was supported by NSF Grant DMS-1317308. Part of this work was done while ASB was with the Department of Mathematics at the Massachusetts Institute of Technology. We thank Wotao Yin and Michel Goemans for helpful discussions.

References

Appendix A Proofs and additional lemmas

be the (Euclidean) gradient and Hessian of the cost function (3). The Riemannian gradient and Hessian of ff on M\mathcal{M} are related to these as follows (see Absil et al., 2008, eqs (3.37), (5.15)):

First-order critical points then satisfy (using (12)):

We note in passing that μ(Y)\mu(Y) is feasible for the dual of (SDP) exactly when S(Y)0S(Y)\succeq 0:

which illustrates the importance of SS as a dual certificate for (SDP).

We now show that rank-deficient second-order critical points are globally optimal. We obtain this result as a corollary to a more informative statement about optimality gap at approximately second-order critical points (assuming exact rank deficiency). The lemmas also show how SS can be used to control the optimality gap at approximate critical points without requiring rank deficiency. This is valid for any pp and any CC.

This implies that MM is orthogonal to all XXX-X^{\prime}. These span kerA\ker\mathcal{A} since XX^{\prime} is interior. Indeed, for any HkerAH\in\ker\mathcal{A}, since X0X^{\prime}\succ 0, there exists ε>0\varepsilon>0 such that XX+εH0X\triangleq X^{\prime}+\varepsilon H\succeq 0 and A(X)=b\mathcal{A}(X)=b, so that XCX\in\mathcal{C}. Hence, MkerAM\in\ker\mathcal{A} is orthogonal to kerA\ker\mathcal{A}. Consequently, M=0M=0 and In=A(ν)I_{n}=\mathcal{A}^{*}(\nu). ∎

If YMpY\in\mathcal{M}_{p} is a column rank-deficient second-order critical point for (P), then it is optimal for (P) and X=YY ⁣X=YY^{\top}\! is optimal for (SDP). In particular, for p>np>n, all second-order critical points are optimal.

The first part of this corollary also appears as (Burer and Monteiro, 2003, Prop. 4), where the statement is made about local optima rather than second-order critical points.

At this point, we can already give a short proof of Theorem 4.

Under the assumptions of Theorem 2, for almost all CC, all critical points of (P) are rank deficient.

The set on the right hand side contains all “bad” CC’s, that is, those for which (20) offers no information about the rank of YY. The dimension of that set is bounded as follows, using that the dimension of a finite union is at most the maximal dimension, and the dimension of a finite sum of sets is at most the sum of the set dimensions:

Hence, if p(p+1)2>m\frac{p(p+1)}{2}>m, then, for almost all CC, critical points verify rank(Y)<p\operatorname{rank}(Y)<p. ∎

Theorem 2 follows as a corollary of Corollary 8 and Lemma 9.

Appendix B Numerical experiments

As an example, we run five different solvers on (Max-Cut SDP) with a collection of graphs used in (Burer and Monteiro, 2003, 2005) known as the Gset.Downloaded from: http://web.stanford.edu/~yyye/yyye/Gset/ on June 6, 2016. The solvers are as follows, all run in Matlab. The first three are based on a low-rank factorization while the last two are interior point methods (IPM).

Manopt+ runs the same algorithm as above, but with pp increasing from 22 to 8n+12\left\lceil\frac{\sqrt{8n+1}}{2}\right\rceil in 5 steps. The point YY computed at a lower pp is appended with columns of i.i.d. random Gaussian variables with standard deviation 10510^{-5} and mean 0, then rows are normalized to produce Y+Y_{+}: the initial point for the next value of pp. The randomization allows to escape near-saddle points (in practice). Code is in Matlab.

SDPLR runs the original Burer–Monteiro algorithm implemented by its authors (Burer and Monteiro, 2003). Code is in C interfaced through C-mex.

HRVW runs an IPM whose implementation is tailored to (Max-Cut SDP), implemented by its authors (Helmberg et al., 1996). Code is in Matlab.

CVX runs SDPT3 (Toh et al., 1999) on (Max-Cut SDP) via CVX (CVX, 2012). Code is in C interfaced through C-mex.

After the solvers return, we project their answers to the feasible set. Manopt and SDPLR return a matrix YY: it is sufficient to normalize each row to ensure X=YY ⁣X=YY^{\top}\! is feasible for (Max-Cut SDP) (for Manopt, this step is not necessary). HRVW and CVX return a symmetric matrix XX. We compute its Cholesky factorization X=RR ⁣X=RR^{\top}\!—if XX is not positive semidefinite, we first project using an eigenvalue decomposition. Then, each row of RR is normalized so that X=RR ⁣X=RR^{\top}\! is feasible for (Max-Cut SDP). Computation time required for these projections is not included in the timings.

We report three metrics for each graph and each solver.

Time: computation time in seconds for the solver to runMatlab R2015a on 2×62\times 6 cores processors with hyperthreading, Intel(R) Xeon(R) CPU E5-2640 @ 2.50GHz, 256Gb RAM, Springdale Linux 6. (this excludes time taken to project the solution to the feasible set and to compute the reported metrics.)

Appendix C Numerical experiments: results

Appendix D Regularity assumption

Originally, Theorems 2 and 4 had the assumption that the search space of the factorized problem,

In order to restore equality, we strengthened the assumption, requiring constraint qualifications to hold at all feasible points (see (7)):

We now describe an SDP such that M\mathcal{M} is indeed a manifold, yet (2) does not hold. Consider n=2,m=2n=2,m=2, b=(1,1) ⁣b=(1,1)^{\top}\! and

is degenerate but it is compact. Furthermore, the set M\mathcal{M} is a smooth manifold for p=1p=1:

This manifold has dimension 1 (and so do all its tangent spaces). Yet, KYK_{Y} has dimension 3 for all YMY\in\mathcal{M}. Indeed, we can parameterize Mp=2\mathcal{M}_{p=2} as the matrices