Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, Sujay Sanghavi

Introduction and Problem Formulation

Consider the following matrix sensing problem:

Now, (LABEL:eqn:formulation) has a different form of non-convexity due to the bilinearity of the variable space, which raises the question whether we introduce spurious local minima by doing this transformation.

Contributions: The goal of this paper is to answer negatively to this question: We show that, under standard regulatory assumptions on A\mathcal{A}, UVUV^{\top} parametrization does not introduce any spurious local minima. To do so, we non-trivially generalize recent developments for the square, PSD case to the non-square case for XX^{\star}. Our result requires a different (but equivalent) problem re-formulation and analysis, with the introduction of an appropriate regularizer in the objective.

There are several papers that consider similar questions, but for other objectives. characterizes the non-convex geometry of the complete dictionary recovery problem, and proves that all local minima are global; considers the problem of non-convex phase synchronization where the task is modeled as a non-convex least-squares optimization problem, and can be globally solved via a modified version of power method; show that a nonconvex fourth-order polynomial objective for phase retrieval has no local minimizers and all global minimizers are equivalent; show that the Burer-Monteiro approach works on smooth semidefinite programs, with applications in synchronization and community detection; consider the PCA problem under streaming settings and use martingale arguments to prove that stochastic gradient descent on the factors reaches to the global solution with non-negligible probability; introduces the notion of strict saddle points and shows that noisy stochastic gradient descent can escape saddle points for generic objectives ff; proves that gradient descent converges to (local) minimizers almost surely, using arguments drawn from dynamical systems theory.

More related to this paper are the works of and : they show that matrix completion and sensing have no spurious local minima, for the case where XX^{\star} is square and PSD. For both cases, extending these arguments for the more realistic non-square case is a non-trivial task.

1 Assumptions and Definitions

We first state the assumptions we make for the matrix sensing setting. We consider the case where the linear operator A\mathcal{A} satisfies the Restricted Isometry Property, according to the following definition :

Characteristic examples are Gaussian-based linear maps , Pauli-based measurement operators, used in quantum state tomography applications , Fourier-based measurement operators, which lead to computational gains in practice due to their structure , or even permuted and sub-sampled noiselet linear operators, used in image and video compressive sensing applications .

In this paper, we consider sensing mechanisms that can be expressed as:

A useful property derived from the RIP definition is the following :

An important issue in optimizing ff over the factored space is the existence of non-unique possible factorizations for a given XX. Since we are interested in obtaining a low-rank solution in the original space, we need a notion of distance to the low-rank solution XX^{\star} over the factors. Among infinitely many possible decompositions of XX^{\star}, we focus on the set of “equally-footed” factorizations :

Given a pair (U,V)(U,V), we define the distance to XX^{\star} as:

2 Problem Re-formulation

Before we delve into the main results, we need to further reformulate the objective (LABEL:eqn:formulation) for our analysis. First, we use a well-known trick to reduce (LABEL:eqn:formulation) to a semidefinite optimization. Let us define auxiliary variables

It is important to note that B\mathcal{B} operates on (m+n)×(m+n)(m+n)\times(m+n) matrices, while we assume RIP on A\mathcal{A} and m×nm\times n matrices. Making no other assumptions for B\mathcal{B}, we cannot directly apply on (4), but a rather different analysis is required.

This regularizer was first introduced in to prove convergence of its algorithm for non-square matrix sensing, and it is also used in this paper to analyze local minima of the problem. After setting λ=14\lambda=\frac{1}{4}, (LABEL:eqn:formulation) can be equivalently written as:

By equivalent, we note that the addition of gg in the objective does not change the problem, since for any rank-rr matrix XX there is a pair of factors (U,V)(U,V) such that g(W)=0g(W)=0. It merely reduces the set of optimal points from all possible factorizations of X{X^{\star}} to balanced factorizations of X{X^{\star}} in Xr\mathcal{X}^{\star}_{r}. U{U^{\star}} and V{V^{\star}} have the same set of singular values, which are the square roots of the singular values of X{X^{\star}}. A key property of the balanced factorizations is the following.

For any factorization of the form (3), it holds that

By “balanced factorizations” of X=UV{X^{\star}}={U^{\star}}{V^{\star}}^{\top}, we mean that factors U{U^{\star}} and V{V^{\star}} satisfy

Therefore, we have g(W)=0g({W^{\star}})=0, and (U,V)({U^{\star}},{V^{\star}}) is an optimal point of (5).

Main Results

This section describes our main results on the function landscape of the non-square matrix sensing problem. The following theorem bounds the distance of any local minima to the global minimum, by the function value at the global minimum.

Suppose W{W^{\star}} is any target matrix of the optimization problem (5), under the balanced singular values assumption for U{U^{\star}} and V{V^{\star}}. If WW is a critical point satisfying the first- and the second-order optimality conditions, i.e., (f+g)(W)=0\nabla(f+g)(W)=0 and 2(f+g)(W)0\nabla^{2}(f+g)(W)\succeq 0, then we have

Observe that for this bound to make sense, the term 15δ2r544δ4r21088δ2rδ4r28(40+68δ2r)(1+δ2r)\tfrac{1-5\delta_{2r}-544\delta_{4r}^{2}-1088\delta_{2r}\delta_{4r}^{2}}{8(40+68\delta_{2r})(1+\delta_{2r})} needs to be positive. We provide some intuition of this result next. Combined with Lemma 5.14 in , we can also obtain the distance between (U,V)(U,V) and (U,V)({U^{\star}},{V^{\star}}).

For W=[UV]W=\begin{bmatrix}U\\ V\end{bmatrix} and given the assumptions of Theorem 2.1, we have

Implications of these results are described next, where we consider specific settings.

Suppose that W=[UV]{W^{\star}}=\begin{bmatrix}{U^{\star}}\\ {V^{\star}}\end{bmatrix} is the underlying unknown true matrix, i.e., X=UVX^{\star}={U^{\star}}{V^{\star}}^{\top} is rank-rr and b=A(UV)b=\mathcal{A}({U^{\star}}{V^{\star}}^{\top}). We assume the noiseless setting, w=0w=0. If 0δ2rδ4r0.03630\leq\delta_{2r}\leq\delta_{4r}\lesssim 0.0363, then 15δ2r544δ4r21088δ2rδ4r210(40+68δ2r)(1+δ2r)>0\tfrac{1-5\delta_{2r}-544\delta_{4r}^{2}-1088\delta_{2r}\delta_{4r}^{2}}{10(40+68\delta_{2r})(1+\delta_{2r})}>0 in Corollary 2.2. Since the RHS of (8) is zero, this further implies that \textscDist(U,V;X)=0{\rm{\textsc{Dist}}}\left(U,V;X^{\star}\right)=0, i.e., any critical point WW that satisfies first- and second-order optimality conditions is global minimum.

Suppose that W{W^{\star}} is the underlying true matrix, such that X=UVX^{\star}={U^{\star}}{V^{\star}}^{\top} and is rank-rr, and b=A(UV)+wb=\mathcal{A}({U^{\star}}{V^{\star}}^{\top})+w, for some noise term ww. If 0δ2rδ4r<0.020\leq\delta_{2r}\leq\delta_{4r}<0.02, then it follows from (7) that for any local minima WW the distance to UV{U^{\star}}{V^{\star}}^{\top} is bounded by

Suppose that X{X^{\star}} is of arbitrary rank and let XrX^{\star}_{r} denote its best rank-rr approximation. Let b=A(X)+wb=\mathcal{A}({X^{\star}})+w where ww is some noise and let (U,V)({U^{\star}},{V^{\star}}) be a balanced factorization of XrX_{r}^{\star}. If 0δ2rδ4r<0.0050\leq\delta_{2r}\leq\delta_{4r}<0.005, then it follows from (8) that for any local minima (U,V)(U,V) the distance to (U,V)({U^{\star}},{V^{\star}}) is bounded by

Proof of Main Results

We first describe the first- and second-order optimality conditions for f+gf+g objective with WW variable. Then, we provide a detailed proof of the main results: by carefully analyzing the conditions, we study how a local optimum is related to the global optimum.

The gradients of ff and gg w.r.t. WW are given by:

2 Optimality conditions

Given the expressions above, we now describe first- and second-order optimality conditions on the composite objective f+gf+g.

By the first-order optimality condition of a pair (U, V)(U,~{}V) such that W=[UV]W=\begin{bmatrix}U\\ V\end{bmatrix}, we have (f+g)(W)=0\nabla(f+g)\left(W\right)=0. This further implies:

3 Proof of Theorem 2.1

Suppose that WW is a critical point satisfying the optimality conditions (9) and (3.2). The second order optimality is again written as

As in , we sum up the above condition for Z1(WWR)e1e1,,Zr(WWR)ererZ_{1}\triangleq(W-{W^{\star}}R)e_{1}e_{1}^{\top},\ldots,Z_{r}\triangleq(W-{W^{\star}}R)e_{r}e_{r}^{\top}. For simplicity, we first assume Z=WWRZ=W-{W^{\star}}R.

where (a) follows from that every BiB_{i} is symmetric, (b) follows from Proposition 1.2, and (c) follows from the AM-GM inequality. We also have

where at (a) we add the first-order optimality equation

and (b) follows from Proposition 1.2. Then we have

where (a) follows from the Cauchy-Schwarz inequality, and (b) follows from Proposition 1.2. We finally get

where the last inequality follows from the AM-GM inequality.

Now we apply Zj=ZejejZ_{j}=Ze_{j}e_{j}^{\top}. Since ZZ=j=1rZjZjZZ^{\top}=\sum_{j=1}^{r}Z_{j}Z_{j}^{\top} in (3.3), the analysis does not change for (B) and (E). For (A), (C), and (D), we obtain

where the inequality follows from the Cauchy-Schwarz inequality. Applying this bound, we get

Let WW and WW^{\star} be two matrices, and QQ is an orthonormal matrix that spans the column space of WW. Then, there exists an orthonormal matrix RR such that, for any stationary point WW of g(W)g(W) that satisfies first and second order condition, the following holds:

And we have the following variant of [5, Lemma 4.2].

For any pair of points (U, V)(U,~{}V) that satisfies the first-order optimality condition, and A\mathcal{A} be a linear operator satisfying the RIP condition with parameter δ4r\delta_{4r}, the following inequality holds:

Applying the above two lemmas, we can get

3.1 Proof of Lemma 3.2

The first-order optimality condition can be written as

Applying Proposition 1.2 and the Cauchy-Schwarz inequality to the condition, we obtain

Let Z=(WWWW)QR1Z=(WW^{\top}-{W^{\star}}{{W^{\star}}}^{\top})QR^{-1\top} where W=QRW=QR is the QR decomposition. Then we obtain

where (a) follows from Proposition 1.3, and (b) follows from that the inner product of two PSD matrices is non-negative. Then we obtain

Plugging the above bounds into (3.3.1), we get

In either case of (WWWW)QQF\left\|{(WW^{\top}-{W^{\star}}{{W^{\star}}}^{\top})QQ^{\top}}\right\|_{F} being zero or positive, we can obtain

What About Saddle Points?

Our discussion so far concentrates on whether UVUV^{\top} parametrization introduces spurious local minima. Our main results show that any point (U,V)(U,V) that satisfies both first- and second-order optimality conditionsNote here that the second-order optimality condition includes positive semi-definite second-order information; i.e., Theorem 2.1 also handles saddle points due to the semi-definiteness of the Hessian at these points. should be (or lie close to) the global optimum. However, we have not discussed what happens with saddle points, i.e., points (U,V)(U,V) where the Hessian matrix contains both positive and negative eigenvalues.Here, we do not consider the harder case where saddle points have Hessian with positive, negative and zero eigenvalues. This is important for practical reasons: first-order methods rely on gradient information and, thus, can easily get stuck to saddle points that may be far away from the global optimum.

studied conditions of the objective that guarantee that stochastic gradient descent—randomly initialized—converges to a local minimum; i.e., we can avoid getting stuck to non-degenerate saddle points. These conditions include f+gf+g being bounded and smooth, having Lipschitz Hessian, being locally strongly convex, and satisfying the strict saddle property, according to the following definition.

A twice differentiable function f+gf+g is strict saddle, if all its stationary points, that are not local minima, satisfy λmin(2(f+g)(W))<0\lambda_{\min}(\nabla^{2}(f+g)(W))<0.

relax some of these conditions and prove the following theorem (for standard gradient descent).

If the objective is twice differentiable and satisfies the strict saddle property, then gradient descent, randomly initialized and with sufficiently small step size, converges to a local minimum almost surely.

In this section, based on the analysis in , we show that f+gf+g satisfy the strict saddle property, which implies that gradient descent can avoid saddle points and converge to the global minimum, with high probability.

Consider noiseless measurements b=A(X)b=\mathcal{A}(X^{\star}), with A\mathcal{A} satisfying RIP with constant δ4r1100\delta_{4r}\leq\tfrac{1}{100}. Assume that rank(X)=r\text{rank}(X^{\star})=r. Let (U,V)(U,V) be a pair of factors that satisfies the first order optimality condition f(W)=0\nabla f(W)=0, for W=[UV]W=\begin{bmatrix}U\\ V\end{bmatrix}, and UVXUV^{\top}\neq X^{\star}. Then,

where the last inequality is due to the requirement δ4r1100\delta_{4r}\leq\tfrac{1}{100}. For the LHS of (4), we can lower bound as follows:

where the last equality is by setting Z=WWRZ=W-W^{\star}R. Combining this expression with (4), we obtain:

References