Global Optimality of Local Search for Low Rank Matrix Recovery

Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

Introduction

Low rank matrix recovery problem is heavily studied and has numerous applications in collaborative filtering, quantum state tomography, clustering, community detection, metric learning and multi-task learning .

This matrix sensing problem has received considerable attention recently . This and other rank-constrained problems are common in machine learning and related fields, and have been used for applications discussed above. A typical theoretical approach to low-rank problems, including (1) is to relax the low-rank constraint to a convex constraint, such as the trace-norm of $\bm{X}$ . Indeed, for matrix sensing, Recht et al. showed that if the measurements are noiseless and the measurement operator $\mathcal{A}$ satisfies a restricted isometry property, then a low-rank $\bm{X^{*}}$ can be recovered as the unique solution to a convex relaxation of (1). Subsequent work established similar guarantees also for the noisy and approximate case .

However, convex relaxations to the rank are not the common approach employed in practice. In this and other low-rank problems, the method of choice is typically unconstrained local optimization (via e.g. gradient descent, SGD or alternating minimization) on the factorized parametrization

where the rank constraint is enforced by limiting the dimensionality of $\bm{U}$ . Problem (2) is a non-convex optimization problem that could have many bad local minima (as we show in Section 5), as well as saddle points. Nevertheless, local optimization seems to work very well in practice. Working on (2) is much cheaper computationally and allows scaling to large-sized problems—the number of optimization variables is only $O(nr)$ rather than $O(n^{2})$ , and the updates are usually very cheap, especially compared to typical methods for solving the SDP resulting from the convex relaxation. There is therefore a significant disconnect between the theoretically studied and analyzed methods (based on convex relaxations) and the methods actually used in practice.

Recent attempts at bridging this gap showed that, some form of global “initialization”, typically relying on singular value decomposition, yields a solution that is already close enough to $\bm{X^{*}}$ ; that local optimization from this initializer gets to the global optima (or to a good enough solution). Jain et al. , Keshavan proved convergence for alternating minimization algorithm provided the starting point is close to the optimum, while Zheng and Lafferty , Zhao et al. , Tu et al. , Chen and Wainwright , Bhojanapalli et al. considered gradient descent methods on the factor space and proved local convergence. But all these studies rely on global initialization followed by local convergence, and do not tackle the question of the existence of spurious local minima or deal with optimization starting from random initialization. There is therefore still a disconnect between this theory and the empirical practice of starting from random initialization and relying only on the local search to find the global optimum.

In this paper we show that, under a suitable incoherence condition on the measurement operator $\mathcal{A}$ (defined in Section 2), with noiseless measurements and with $\operatorname{rank}(\bm{X^{*}})\leq r$ , the problem (2) has no spurious local minima (i.e. all local minima are global and satisfy $\bm{X^{*}}=\bm{U}\bm{U}^{\top}$ ). Furthermore, under the same conditions, all saddle points have a direction with significant negative curvature, and so using a recent result of Ge et al. we can establish that stochastic gradient descent from random initialization converges to $\bm{X^{*}}$ in polynomial number of iterations. We extend the results also to the noisy and approximately-low-rank settings, where we can guarantee that every local minima is close to a global minimum. The incoherence condition we require is weaker than conditions used to establish recovery through local search, and so our results also ensures recovery in polynomial time under milder conditions than what was previously known. In particular, with i.i.d. Gaussian measurements, we ensure no spurious local minima and recovery through local search with the optimal number $O(nr)$ of measurements.

Our work is heavily inspired by Bandeira et al. , who recently showed similar behavior for the problem of community detection—this corresponds to a specific rank- $1$ problem with a linear objective, elliptope constraints and a binary solution. Here we take their ideas, extend them and apply them to matrix sensing with general rank- $r$ matrices. In the past several months, similar type of results were also obtained for other non-convex problems (where the source of non-convexity is not a rank constraint), specifically complete dictionary learning and phase recovery . A related recent result of a somewhat different nature pertains to rank unconstrained linear optimization on the elliptope, showing that local minima of the rank-constrained problem approximate well the global optimum of the rank unconstrained convex problem, even though they might not be the global minima (in fact, the approximation guarantee for the actual global optimum is better) .

Another non-convex low-rank problem long known to not possess spurious local minima is the PCA problem, which can also be phrased as matrix approximation with full observations, namely $\min_{\operatorname{rank}(\bm{X})\leq r}\left\lVert{A-X}\right\rVert_{F}$ (e.g. ). Indeed, local search methods such as the power-method are routinely used for this problem. Recently local optimization methods for the PCA problem working more directly on the optimized formulation have also been studied, including SGD and Grassmannian optimization . These results are somewhat orthogonal to ours, as they study a setting in which it is well known there are never any spurious local minima, and the challenge is obtaining satisfying convergence rates.

The seminal work of Burer and Monteiro proposed low-rank factorized optimization for SDPs, and showed that for extremely high rank $r>\sqrt{m}$ (number of constraints), an Augmented Lagrangian method converges asymptotically to the optimum. It was also shown that (under mild conditions) any rank deficient local minima is a global minima , providing a post-hoc verifiable sufficient condition for global optimality. However, this does not establish any a-priori condition, based on problem structure, implying the lack of spurious local minima.

While preparing this manuscript, we also became aware of parallel work studying the same question for the related but different problem of matrix completion. For this problem they obtain a similar guarantee, though with suboptimal dependence on the incoherence parameters and so suboptimal sample complexity, and requiring adding a specific non-standard regularizer to the objective—this is not needed for our matrix sensing results.

We believe our work, together with the parallel work of , are the first to establish the lack of spurious local minima and the global convergence of local search from random initialization for a non-trivial rank-constrained problem (beyond PCA with full observations) with rank $r>1$ .

Formulation and Assumptions

Even if we know that $\operatorname{rank}(\bm{X^{*}})\leq r$ , having many measurements might not be sufficient for recovery if they are not “spread out” enough. E.g., if all measurements only involve the first $n/2$ rows and columns, we would never have any information on the bottom-right block. A sufficient condition for identifiability of a low-rank $\bm{X^{*}}$ from linear measurements by Recht et al. is based on restricted isometry property defined below.

In particular, $\bm{X^{*}}$ of rank $r$ is identifiable if $\delta_{2r}<1$ [see 20, Theorem 3.2]. One situation in which RIP is obtained is for random measurement operators. For example, matrices with i.i.d. $\mathcal{N}(0,1)$ entries satisfy $(r,\delta_{r})$ -RIP when $m=O(\frac{nr}{\delta^{2}})$ [see 6, Theorem 2.3]. This implies identifiability based on i.i.d. Gaussian measurement with $m=O(nr)$ measurements (coincidentally, the number of degrees of freedom in $\bm{X^{*}}$ , optimal up to a constant factor).

Main Results

We are now ready to present our main result about local minima for the matrix sensing problem (2). We first present the results for noisy sensing of exact low rank matrices, and then generalize the results also to approximately low rank matrices.

Now we will present our result characterizing local minima of $f(\bm{U})$ , for low-rank $\bm{X^{*}}$ . Recall that measurements are $\bm{y}=\mathcal{A}(\bm{X^{*}})+\bm{w}$ , where entries of $\bm{w}$ are i.i.d. Gaussian - $w_{i}\sim\mathcal{N}(0,\sigma_{w}^{2})$ .

Consider the optimization problem (2) where $\bm{y}=\mathcal{A}(\bm{X^{*}})+\bm{w}$ , $\bm{w}$ is i.i.d. $\mathcal{N}(0,\sigma_{w}^{2}),$ $\mathcal{A}$ satisfies $(2r,\delta_{2r})$ -RIP with $\delta_{2r}<\frac{1}{10}$ , and $\operatorname{rank}(\bm{X^{*}})\leq r$ . Then, with probability $\geq 1-\frac{10}{n^{2}}$ (over the noise), for any local minimum $\bm{U}$ of $f(\bm{U})$ :

In particular, in the noiseless case ( $\sigma_{w}=0$ ) we have $\bm{U}\bm{U}^{\top}=\bm{X^{*}}$ and so $f(\bm{U})=0$ and every local minima is global. In the noiseless case, we can also relax the RIP requirement to $\delta_{2r}<1/5$ (see Theorem 4.1 in Section 4). In the noisy case we cannot expect to ensure we always get to an exact global minima, since the noise might cause tiny fluctuations very close to the global minima possibly creating multiple very close local minima. But we show that all local minima are indeed very close to some factorization $\bm{U^{*}}{\bm{U^{*}}}^{\top}=\bm{X^{*}}$ of the true signal, and hence to a global optimum, and this “radius” of local minima decreases as we have more observations.

The proof of the Theorem for the noiseless case is presented in Section 4. The proof for the general setting follows along the same lines and can be found in the Appendix.

So far we have discussed how all local minima are global, or at least very close to a global minimum. Using a recent result by Ge et al. on the convergence of SGD for non-convex functions, we can further obtain a polynomial bound on the number of SGD iterations required to reach the global minima. The main condition that needs to be established in order to ensure this, is that all saddle points of (2) satisfy the “strict saddle point condition”, i.e. have a direction with significant negative curvature:

Consider the optimization problem (2) in the noiseless case, where $\bm{y}=\mathcal{A}(\bm{X^{*}})$ , $\mathcal{A}$ satisfies $(2r,\delta_{2r})$ -RIP with $\delta_{2r}<\frac{1}{10}$ , and $\operatorname{rank}(\bm{X^{*}})\leq r$ . Let $\bm{U}$ be a first order critical point of $f(\bm{U})$ with $\bm{U}\bm{U}^{\top}\neq\bm{X^{*}}$ . Then the smallest eigenvalue of the Hessian satisfies

Now consider the stochastic gradient descent updates,

where $\psi$ is uniformly distributed on the unit sphere and $\textrm{Proj}_{b}$ is a projection onto $\left\lVert{\bm{U}}\right\rVert_{F}\leq b$ . Using Theorem 3.2 and the result of Ge et al. we can establish:

Consider the optimization problem (2) under the same noiseless conditions as in Theorem 3.2. Using $b\geq\left\lVert{\bm{U}^{*}}\right\rVert_{F}$ , for some global optimum $\bm{U}^{*}$ of $f(\bm{U})$ , for any $\epsilon,c>0$ , after $T=poly\left(\frac{1}{\sigma_{r}(\bm{X^{*}})},\sigma_{1}(\bm{X^{*}}),b,\frac{1}{\epsilon},\log(1/c)\right)$ iterations of (4) with an appropriate stepsize $\eta$ , starting from a random point uniformly distributed on $\left\lVert{\bm{U}}\right\rVert_{F}=b$ , with probability at least $1-c$ , we reach an iterate $\bm{U}_{T}$ satisfying

The above result guarantees convergence of noisy gradient descent to a global optimum. Alternatively, second order methods such as cubic regularization (Nesterov and Polyak ) and trust region (Cartis et al. ) that have guarantees based on the strict saddle point property can also be used here.

RIP Requirement: Our results require $(2r,1/10)$ -RIP for the noisy case and $(2r,1/5)$ -RIP for the noiseless case. Requiring $(2r,\delta_{2r})$ -RIP with $\delta_{2r}<1$ is sufficient to ensure uniqueness of the global optimum of (1), and thus recovery in the noiseless setting , but all known efficient recovery methods require stricter conditions. The best guarantees we are aware of require $(5r,1/10)$ -RIP or $(4r,0.414)$ -RIP using a convex relaxation. Our requirement is not directly comparable to the latter, as we require RIP on a smaller set, but with a lower (stricter) $\delta$ . Alternatively, $(6r,1/10)$ -RIP is required for global initialization followed by non-convex optimization —our requirement is strictly better. In terms of requirements on $(2r,\delta_{2r})$ -RIP for non-convex methods, the best we are aware of is requiring $\delta_{2r}<\Omega(1/r)$ –this is a much stronger condition than ours, and it yields a suboptimal required number of spherical Gaussian measurements of $\Omega(nr^{3})$ . So, compared to prior work our requirement is very mild—it ensures efficient recovery even in a regime not previously covered by any guarantee on efficient recovery, and requires the optimal number of spherical Gaussian measurements (up to a constant factor) of $O(nr)$ .

We can also obtain similar results that deteriorate gracefully if $\bm{X^{*}}$ is not exactly low rank, but is close to being low-rank (see proof in the Appendix):

Consider the optimization problem (2) where $\bm{y}=\mathcal{A}(\bm{X^{*}})$ and $\mathcal{A}$ satisfies $(2r,\delta_{2r})$ -RIP with $\delta_{2r}<\frac{1}{100}$ , Then, for any local minima $\bm{U}$ of $f(\bm{U})$ :

where $\bm{X_{r}^{*}}$ is the best rank $r$ approximation of $\bm{X^{*}}$ .

This theorem guarantees that any local optimum of $f(\bm{U})$ is close to $\bm{X^{*}}$ upto an error depending on $\|\bm{X^{*}}-\bm{X_{r}^{*}}\|$ . For the low-rank noiseless case we have $\bm{X^{*}}=\bm{X_{r}^{*}}$ and the right hand side vanishes. When $\bm{X^{*}}$ is not exactly low rank, the best recovery error we can hope for is $\left\lVert{\bm{X^{*}}-\bm{X_{r}^{*}}}\right\rVert_{F}$ , since $\bm{U}\bm{U}^{\top}$ is at most rank $k$ . On the right hand side of Theorem 3.4, we have also a nuclear norm term, which might be higher, but it also gets scaled down by $\delta_{2r}$ , and so by the number of measurements.

Proof for the Noiseless Case

In this section we present the proof characterizing the local minima of problem (2). For ease of exposition we first present the results for the noiseless case ( $\bm{w}=0$ ). Proof for the general case can be found in the Appendix.

Consider the optimization problem (2) where $\bm{y}=\mathcal{A}(\bm{X^{*}})$ , $\mathcal{A}$ satisfies $(2r,\delta_{2r})$ -RIP with $\delta_{2r}<\frac{1}{5}$ , and $\operatorname{rank}(\bm{X^{*}})\leq r$ . Then, for any local minimum $\bm{U}$ of $f(\bm{U})$ :

For the proof of this theorem we first discuss the implications of the first and second order optimality conditions and then show how to combine them to yield the result.

Our proof techniques are different from existing results characterizing local minima of dictionary learning , phase retrieval and community detection . For most of these problems Hessian is PSD only for points close to optimum. However, Hessian of $f(\bm{U})$ can be PSD even for points far from to optima. Hence we need new directions to use the second order conditions.

Invariance of $f(\bm{U})$ over $r\times r$ orthonormal matrices introduces additional challenges in comparing a given stationary point to a global optimum. We have to find the best orthonormal matrix $R$ to align a given stationary point $\bm{U}$ to a global optimum $\bm{U^{*}}$ , where $\bm{U^{*}}{\bm{U^{*}}}^{\top}=\bm{X^{*}}$ , to combine results from the first and second order conditions, without degrading the isometry constants.

First we present the following consequence of the RIP assumption [see 5, Lemma 2.1].

Given two $n\times n$ rank- $r$ matrices $\bm{X}$ and $\bm{Y}$ , and a $(2r,\delta)$ -RIP measurement operator $\mathcal{A}$ , the following holds:

First we will consider the first order condition, $\nabla f(\bm{U})=0$ . For any stationary point $\bm{U}$ this implies

Now using the isometry property of $\bm{A}_{i}$ gives us the following result.

[First order condition] For any first order stationary point $\bm{U}$ of $f(\bm{U})$ , and $\mathcal{A}$ satisfying the $(2r,\delta)$ -RIP (3), the following holds:

where $Q$ is an orthonormal matrix that spans the column space of $\bm{U}$ .

This lemma states that any stationary point of $f(\bm{U})$ is close to a global optimum $\bm{U^{*}}$ in the subspace spanned by columns of $\bm{U}$ . Notice that the error along the orthogonal direction $\|\bm{X^{*}}Q_{\perp}Q_{\perp}^{\top}\|_{F}$ can still be large making the distance between $\bm{X}$ and $\bm{X^{*}}$ arbitrarily far.

Let $\bm{U}=QR$ , for some orthonormal $Q$ . Consider any matrix of the form $\bm{Z}Q{R^{-1}}^{\top}$ . The first order optimality condition then implies,

The above equation together with Restricted Isometry Property (equation (5)) gives us the following inequality:

Note that for any matrix $\bm{A}$ , $\left\langle\bm{A},QQ^{\top}\bm{Z}\right\rangle=\left\langle QQ^{\top}\bm{A},\bm{Z}\right\rangle$ . Furthermore, for any matrix $\bm{A}$ , $\sup_{\{\bm{Z}:\|\bm{Z}\|_{F}\leq 1\}}\left\langle\bm{A},\bm{Z}\right\rangle=\|\bm{A}\|_{F}$ . Hence the above inequality implies the lemma statement. ∎

2 Second order optimality

We now consider the second order condition to show that the error along $Q_{\perp}Q_{\perp}^{\top}$ is indeed bounded well. Let $\nabla^{2}f(\bm{U})$ be the hessian of the objective function. Note that this is an $n\cdot r\times n\cdot r$ matrix. Fortunately for our result we need to only evaluate the Hessian along the direction $\texttt{vec}(\bm{U}-\bm{U^{*}}R)$ for some orthonormal matrix $R$ . Here $\texttt{vec}(.)$ denotes writing a matrix in vector form.

[Hessian computation] Let $\bm{U}$ be a first order critical point of $f(\bm{U})$ . Then for any $r\times r$ orthonormal matrix $R$ and $\Delta_{j}=\Delta e_{j}e_{j}^{\top}$ ( $\Delta=\bm{U}-\bm{U^{*}}R$ ),

For any matrix $\bm{Z}$ , taking directional second derivative of the function $f(\bm{U})$ with respect to $\bm{Z}$ we get:

Setting $\bm{Z}=\Delta_{j}=(\bm{U}-\bm{U^{*}}R)e_{j}e_{j}^{\top}$ and using the first order optimality condition on $\bm{U}$ , we get,

$(i)$ and $(ii)$ follow from the first order optimality condition (6),

for $j=1\cdots r$ . Finally taking sum over $j$ from $1$ to $r$ gives the result. ∎

Hence from second order optimality of $\bm{U}$ we get,

[Second order optimality] Let $\bm{U}$ be a local minimum of $f(\bm{U})$ . For any $r\times r$ orthonormal matrix $R$ ,

Further for $\mathcal{A}$ satisfying $(2r,\delta)$ -RIP (equation (3)) we have,

The proof of this result follows simply by applying Lemma 4.3. The above Lemma gives a bound on the distance in the factor $(\bm{U})$ space $\|\bm{U}(\bm{U}-\bm{U^{*}}R)^{\top}\|_{F}^{2}$ . To be able to compare the second order condition to the first order condition we need a relation between $\|\bm{U}(\bm{U}-\bm{U^{*}}R)^{\top}\|_{F}^{2}$ and $\|\bm{X}-\bm{X^{*}}\|_{F}^{2}$ . Towards this we show the following result.

Let $\bm{U}$ and $\bm{U^{*}}$ be two $n\times r$ matrices, and $Q$ is an orthonormal matrix that spans the column space of $\bm{U}$ . Then there exists an $r\times r$ orthonormal matrix $R$ such that for any first order stationary point $\bm{U}$ of $f(\bm{U})$ , the following holds:

This Lemma bounds the distance in the factor space ( $\|(\bm{U}-\bm{U^{*}}R)\bm{U}^{\top}\|_{F}^{2}$ ) with $\|\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top}\|_{F}^{2}$ and $\|(\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top})QQ^{\top}\|_{F}^{2}$ . Combining this with the result from second order optimality (Corollary 4.1) shows $\|\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top}\|_{F}^{2}$ is bounded by a constant factor of $\|(\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top})QQ^{\top}\|_{F}^{2}$ . This implies $\|\bm{X^{*}}Q_{\perp}Q_{\perp}\|_{F}$ is bounded, opposite to what the first order condition implied (Lemma 4.2). The proof of the above lemma is in Section E. Hence from the above optimality conditions we get the proof of Theorem 4.1.

Assuming $\bm{U}\bm{U}^{\top}\neq\bm{U^{*}}{\bm{U^{*}}}^{\top}$ , from Lemmas 4.2, 4.4 and Corollary 4.1 we get,

If $\delta\leq\frac{1}{5}$ the above inequality holds only if $\bm{U}\bm{U}^{\top}=\bm{U^{*}}{\bm{U^{*}}}^{\top}$ . ∎

Necessity of RIP

We showed that there are no spurious local minima only under a restricted isometry assumption. A natural question is whether this is necessary, or whether perhaps the problem (2) never has any spurious local minima, perhaps similarly to the non-convex PCA problem $\min_{\bm{U}}\left\lVert{\bm{A}-\bm{U}\bm{U}^{\top}}\right\rVert$ .

A good indication that this is not the case is that (2) is NP-hard, even in the noiseless case when $\bm{y}=\mathcal{A}(\bm{X^{*}})$ for $\operatorname{rank}(\bm{X^{*}})\leq k$ (if we don’t require RIP, we can have each $\bm{A}_{i}$ be non-zero on a single entry in which case (2) becomes a matrix completion problem, for which hardness has been shown even under fairly favorable conditions ). That is, we are unlikely to have a poly-time algorithm that succeeds for any linear measurement operator. Although this doesn’t formally preclude the possibility that there are no spurious local minima, but it just takes a very long time to find a local minima, this scenario seems somewhat unlikely.

To resolve the question, we present an explicit example of a measurement operator $\mathcal{A}$ and $\bm{y}=\mathcal{A}(\bm{X^{*}})$ (i.e. $f(\bm{X^{*}})=0$ ), with $\operatorname{rank}(\bm{X^{*}})=r$ , for which (1), and so also (2), have a non-global local minima.

Example 1: Let $f(\bm{X})=(X_{11}+X_{22}-1)^{2}+(X_{11}-1)^{2}$ and consider (1) with $r=1$ (i.e. a rank- $1$ constraint). For $\bm{X^{*}}=\begin{bmatrix}1&0\\ 0&0\end{bmatrix}$ we have $f(\bm{X^{*}})=0$ and $\operatorname{rank}(\bm{X^{*}})=1$ . But $\bm{X}=\begin{bmatrix}0&0\\ 0&1\end{bmatrix}$ is a rank $1$ local minimum with $f(\bm{X})=1$ .

We can be extended the construction to any rank $r$ by simply adding $\sum_{i=3}^{r+2}(X_{ii}-1)^{2}$ to the objective, and padding both the global and local minimum with a diagonal beneath the leading $2\times 2$ block.

In Example 1, we had a rank- $r$ problem, with a rank- $r$ exact solution, and a rank- $r$ local minima. Another question we can ask is what happens if we allow a larger rank than the rank of the optimal solution. That is, if we have $f(\bm{X^{*}})=0$ with low $\operatorname{rank}(\bm{X^{*}})$ , even $\operatorname{rank}(\bm{X^{*}})=1$ , but consider (1) or (2) with a high $r$ . Could we still have non-global local minima? The answer is yes…

Example 2: Let $f(\bm{X})=(X_{11}+X_{22}+X_{33}-1)^{2}+(X_{11}-1)^{2}+(X_{22}-X_{33})^{2}$ and consider the problem (1) with a rank $r=2$ constraint. We can verify that $\bm{X^{*}}=\begin{bmatrix}1&0&0\\ 0&0&0\\ 0&0&0\end{bmatrix}$ is a rank= $1$ global minimum with $f(\bm{X^{*}})=0$ , but $\bm{X}=\begin{bmatrix}0&0&0\\ 0&\nicefrac{{1}}{{2}}&0\\ 0&0&\nicefrac{{1}}{{2}}\end{bmatrix}$ is a local minimum with $f(\bm{X})=1$ . Also for an arbitrary large rank constraint $r>1$ (taking $r$ to be odd for simplicity), extend the objective to $f(\bm{X})=(X_{11}-1)^{2}+\sum_{i=1}^{(r-1)/2}\left[(X_{11}+X_{2i,2i}+X_{(2i+1),(2i+1)}-1)^{2}\right.$ $\left.+(X_{2i,2i}-X_{(2i+1),(2i+1)})^{2}\right]$ . We still have a rank- $1$ global minimum $\bm{X^{*}}$ with a single non-zero entry $\bm{X^{*}}_{11}=1$ , while $\bm{X}=(I-\bm{X^{*}})/2$ is a local minimum with $f(\bm{X})=1$ .

Conclusion

We established that under conditions similar to those required for convex relaxation recovery guarantees, the non-convex formulation of matrix sensing (2) does not exhibit any spurious local minima (or, in the noisy and approximate settings, at least not outside some small radius around a global minima), and we can obtain theoretical guarantees on the success of optimizing it using SGD from random initialization. This matches the methods frequently used in practice, and can explain their success. This guarantee is very different in nature from other recent work on non-convex optimization for low-rank problems, which relied heavily on initialization to get close to the global optimum, and on local search just for the final local convergence to the global optimum. We believe this is the first result, together with the parallel work of Ge et al. , on the global convergence of local search for common rank-constrained problems that are worst-case hard.

Our result suggests that SVD initialization is not necessary for global convergence, and random initialization would succeed under similar conditions (in fact, our conditions are even weaker than in previous work that used SVD initialization). To investigate empirically whether SVD initialization is indeed helpful for ensuring global convergence, in Figure 1 we compare recovery probability of random rank- $k$ matrices for random and SVD initialization—there is no significant difference between the two.

Beyond the implications for matrix sensing, we are hoping these type of results could be a first step and serve as a model for understanding local search in deep networks. Matrix factorization, such as in (2), is a depth-two neural network with linear transfer—an extremely simple network, but already non-convex and arguably the most complicated network we have a good theoretical understanding of. Deep networks are also hard to optimize in the worst case, but local search seems to do very well in practice. Our ultimate goal is to use the study of matrix recovery as a guide in understating the conditions that enable efficient training of deep networks.

Authors would like to thank Afonso Bandeira for discussions, Jason Lee and Tengyu Ma for sharing and discussing their work. This research was supported in part by an NSF RI-AF award and by Intel ICRI-CI.

References

Appendix A Numerical Simulations

In this section we present simulation results for performance of gradient descent over $f(\bm{U})$ . We consider measurements $y_{i}=\left\langle\bm{A}_{i},\bm{X^{*}}\right\rangle$ , where $\bm{A}_{i}$ are i.i.d Gaussian with each entry distributed as $\mathcal{N}(0,\nicefrac{{1}}{{m}})$ . $\bm{X^{*}}$ is a $100\times 100$ rank $r$ random p.s.d matrix with $\|\bm{X^{*}}\|_{F}=1$ . $r$ is varied from 1 to 20 in the experiments.

We consider both standard gradient descent and noisy gradient descent (4) with step size $\frac{1}{\|\bm{U}\|_{2}}$ . We add noise of magnitude $1e-4$ for the noisy gradient updates. Each method is run until convergence (max of 200 iterations). Let the output of gradient descent be $\widehat{\bm{U}}$ . A run of this experiment is considered success if the final error $\|\widehat{\bm{U}}\widehat{\bm{U}}^{\top}-\bm{X^{*}}\|_{F}\leq 1e-2$ . Each experiment is repeated for 20 times and average probability of success is computed.

We repeat the above procedure starting from both random initialization and SVD initialization. For SVD initialization, the initial point is set to be the rank $r$ approximation of $\sum_{i=1}^{m}y_{i}\bm{A}_{i}$ as suggested by Jain et al. . In figure 2 we have the plots for the cases discussed above. All of them have phase transition around number of samples $m=2\cdot n\cdot r$ . This is in agreement with the results in Section 3. $f(\bm{U})$ has no local minima once $m\geq 2\cdot n\cdot r$ and random initialization has same performance as SVD initialization.

In figure 3, the left two plots show error $\nicefrac{{\|\widehat{\bm{U}}\widehat{\bm{U}}^{\top}-\bm{X^{*}}\|_{F}}}{{\|\bm{X^{*}}\|_{F}}}$ behaves with varying rank and number of samples for random and SVD initializations. The rightmost plot shows the phase transition for rank 10 case for all the methods. Again we notice no significant difference between these methods.

Appendix B Proof for the Noisy Case

In this section we present the proof characterizing the local minima of problem (2). Recall $\bm{y}=\mathcal{A}(\bm{X^{*}})+\bm{w}$ , where $\bm{X^{*}}$ is a rank- $r$ matrix and $\bm{w}$ is i.i.d. $\mathcal{N}(0,\sigma_{w}^{2}).$

First we will consider the first order condition, $\nabla f(\bm{U})=0$ . For any stationary point $\bm{U}$ this implies

Now using the isometry property of $\bm{A}_{i}$ gives us the following result.

[First order condition] For any first order stationary point $\bm{U}$ of $f(\bm{U})$ , and $\mathcal{A}$ satisfying the $(2r,\delta)$ -RIP (3), the following holds:

w.p. $\geq 1-\frac{1}{n^{2}}$ , where $Q$ is an orthonormal matrix that spans the column space of $\bm{U}$ .

Let $\bm{U}=QR$ , for some orthonormal $Q$ . Consider any matrix of the form $\bm{Z}Q{R^{-1}}^{\top}$ . The first order optimality condition then implies,

The above equation together with Restricted Isometry Property(equation (5)) gives us the following inequality:

by Cauchy Schwarz inequality and Lemma E.2. Note that for any matrix $A$ , $\left\langle\bm{A},QQ^{\top}\bm{Z}\right\rangle=\left\langle\bm{A}QQ^{\top},\bm{Z}\right\rangle$ . Furthermore, for any matrix $A$ , $\sup_{\{\bm{Z}:\|\bm{Z}\|_{F}\leq 1\}}\left\langle\bm{A},\bm{Z}\right\rangle=\|\bm{A}\|_{F}$ . Hence the above inequality implies the lemma statement. ∎

B.2 Second order optimality

We will now consider the second order condition to show that the error along $Q_{\perp}Q_{\perp}^{\top}$ is indeed bounded well. Let $\nabla^{2}f(\bm{U})$ be the hessian of the objective function. Note that this is an $n\cdot r\times n\cdot r$ matrix. Fortunately for our result we need to only evaluate the Hessian along the direction $\texttt{vec}(U-\bm{U^{*}}R)$ for some orthonormal matrix $R$ .

[Hessian computation] Let $\bm{U}$ be a first order critical point of $f(\bm{U})$ . Then for any $r\times r$ orthonormal matrix $R$ and $\Delta=\bm{U}-\bm{U^{*}}R$ ,

For any matrix $\bm{Z}$ , taking directional second derivative of the function $f(\bm{U})$ with respect to $\bm{Z}$ we get:

Setting $\bm{Z}=\Delta_{j}=(\bm{U}-\bm{U^{*}}R)e_{j}e_{j}^{\top}$ and using the first order optimality condition on $\bm{U}$ , we get,

where the last equality is again by the first order optimality condition (9). ∎

Hence from second order optimality of $U$ we get,

[Second order optimality] Let $\bm{U}$ be a local minimum of $f(\bm{U})$ . For any $r\times r$ orthonormal matrix $R$ , w.p. $\geq 1-\frac{1}{n^{2}}$ ,

Further for $\mathcal{A}$ satisfying $(2r,\delta)$ -RIP (equation (3)) we have,

Hence from the above optimality conditions we get the proof of Theorem 4.1.

Assuming $\bm{U}\bm{U}^{\top}\neq\bm{U^{*}}{\bm{U^{*}}}^{\top}$ , from Lemma 4.4 and Corollary B.1 we get, with probability $\geq 1-\frac{2}{n^{2}}$ ,

$(i)$ follows from Lemma B.1. The above inequality implies,

If $\delta\leq\frac{1}{10}$ , the above inequality reduces to $\|\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top}\|_{F}\leq c\sqrt{\frac{\log(n)}{m}}\sigma_{w}$ , for some constant $c\leq 17$ , w.p $\geq 1-\frac{2}{n^{2}}$ . ∎

Appendix C Proof for the High Rank Case

In this section we will present the proof for the inexact case, where $\operatorname{rank}(\bm{X^{*}})\geq r$ . Recall that measurements are $\bm{y}=\mathcal{A}(\bm{X^{*}})$ .

Let SVD of $\bm{X^{*}}$ be $Q^{*}\Sigma^{*}{Q^{*}}\top$ . With slight abuse of notation we use $\bm{X^{*}}_{jr+1:(j+1)r}$ to denote the $j$ th rank $r$ block $Q^{*}_{jr+1:(j+1)r}\Sigma^{*}_{jr+1:(j+1)r}{Q^{*}_{jr+1:(j+1)r}}^{\top}$ , where $Q^{*}_{jr+1:(j+1)r}$ denotes the restriction of $Q$ to columns $jr+1$ to $(j+1)r$ .

First we will consider the first order condition, $\nabla f(\bm{U})=0$ . For any stationary point $\bm{U}$ this implies

Now using the isometry property of $\bm{A}_{i}$ gives us the following result.

[First order condition] For any first order stationary point $\bm{U}$ of $f(\bm{U})$ , and $\{\bm{A}_{i}\}$ satisfying the $(2r,\delta)$ -RIP (3), the following holds:

where $Q$ is an orthonormal matrix that spans the column space of $\bm{U}$ .

Let $\bm{U}=QR$ , for some orthonormal $Q$ . Consider any matrix of the form $\bm{Z}QR^{-\top}$ . The first order optimality condition then implies,

Note that $\bm{X}-\bm{X_{r}^{*}}$ is atmost rank- $2r$ . Hence, the above equation together with Restricted Isometry Property(equation (5)) gives us the following inequality:

The last inequality follows from $\sum_{j}\|\bm{X^{*}}_{jr+1:(j+1)r}\|_{F}\leq\|\bm{X^{*}}-\bm{X_{r}^{*}}\|_{*}$ . The above inequalities are true for any $\bm{Z}$ .

Further note that for any matrix $\bm{A}$ , $\left\langle\bm{A},QQ^{\top}\bm{Z}\right\rangle=\left\langle\bm{A}QQ^{\top},\bm{Z}\right\rangle$ . Furthermore, for any matrix $A$ , $\sup_{\{\bm{Z}:\|\bm{Z}\|_{F}\leq 1\}}$ $\left\langle\bm{A},\bm{Z}\right\rangle=\|\bm{A}\|_{F}$ . Hence the above inequality implies the Lemma. ∎

C.2 Second order optimality

[Hessian computation] Let $\bm{U}$ be a first order critical point of $f(\bm{U})$ . Then for any $n\times r$ matrix $\bm{Z}$ ,

Further let $\bm{U}$ be a local minimum of $f(\bm{U})$ and $\mathcal{A}$ satisfying $(2r,\delta)$ -RIP (equation (3)). Then,

For any matrix $\bm{Z}$ , taking directional second derivative of the function $f(\bm{U})$ with respect to $\bm{Z}$ we get:

Setting $\bm{Z}=\Delta_{j}=(\bm{U}-\bm{U^{*}}R)e_{j}e_{j}^{\top}$ we get,

$(i)$ is by the first order optimality condition (12).

Hence from second order optimality of $U$ we get,

$(i)$ is from using RIP and splitting $\bm{X^{*}}-\bm{X_{r}^{*}}$ into rank- $r$ components $\bm{X^{*}}-\bm{X_{r}^{*}}=\sum_{j=1}^{\nicefrac{{n}}{{r}}-1}\bm{X^{*}}_{jr+1:(j+1)r}$ . $(ii)$ follows from using RIP (5). $(iii)$ follows from $\sum_{j}\|\bm{X^{*}}_{jr+1:(j+1)r}\|_{F}\leq\|\bm{X^{*}}-\bm{X_{r}^{*}}\|_{*}$ .

The Lemma now follows by combining equations (13), (14) and using RIP (3). ∎

Hence from the above optimality conditions we get the proof of Theorem 3.4.

Assuming $\bm{U}\bm{U}^{\top}\neq\bm{U_{r}^{*}}{\bm{U_{r}^{*}}}^{\top}$ , from Lemma 4.4 we know,

for some orthonormal $R$ . Hence combining equations (15),with Lemma C.2 we get,

The last inequality follows from just using $2ab\leq a^{2}+b^{2}$ .

Substituting $\delta=\frac{1}{100}$ gives,

Appendix D Proofs for Section 3

In this section we present the proofs for the strict saddle theorem (Theorem 3.2) and the convergence guarantees (Theorem 3.3). The proofs use the Lemmas developed in Section 4 and the supporting Lemmas from Section E.

where the last inequality follows from the RIP (3). Now applying Lemma 4.4 in equation (18) we get,

$(i)$ follows from Lemma 4.2. $(ii)$ follows from $\delta\leq\nicefrac{{1}}{{10}}$ . Now notice that from lemma E.1

Finally notice that $\Delta_{j}=\Delta e_{j}e_{j}^{\top}$ have orthogonal columns. Hence,

$(i)$ follows from equation (19). $(ii)$ follows from equation (20). ∎

To prove this theorem we use Theorem 6 of Ge et al. . We need to show that $f(\bm{U})$ satisfies, 1) strict saddle property, 2) local strong convexity, 3) $f$ is bounded, smooth and has Lipschitz Hessian.

The boundedness assumption easily follows from assuming we are optimizing over a bounded domain $b$ such that, $\|\bm{U^{*}}\|_{F}\leq b$ . Note that we can have any reasonable upper bound on the optimum and we can easily estimate this from $\sum_{i}y_{i}^{2}$ which is $\geq(1-\delta)\|\bm{X^{*}}\|_{F}^{2}$ for the noiseless case.

Finally all the calculations below are for scaled version of $f(x)$ by $\frac{1}{m}$ . Note that this does not change the number of iterations as both smoothness and strong convexity parameters are scaled by the same constant.

Smoothness constant $\beta$ : Recall that smoothness of $f$ is bounded by maximum eigenvalue of Hessian over the domain. Hence, $\beta=\max_{\bm{Z}:\|\bm{Z}\|_{F}\leq 1}\bm{Z}^{\top}\nabla^{2}f(\bm{U})\bm{Z}$ . We have computed this projection of Hessian in Lemma B.2. Hence,

$\rho$ - Lipschitz Hessian: Now we will compute the Lipschitz constant of Hessian of $f(\bm{U})$ . We will first bound the spectral norm of difference of Hessian at two points $\bm{U}$ , $\bm{V}$ in terms of $\|\bm{U}-\bm{V}\|_{F}$ along orthogonal direction $\bm{Z}_{i}$ and combine them to get bound on $\rho$ .. Given two $n\times r$ matrices $\bm{U},\bm{V}$ ,

Hence, using the variational characterization of the Frobenius norm, the Hessian Lipschitz constant is bounded by $\max{\{\bm{Z}_{i}\}}\sum_{i}\left\langle\nabla^{2}f(\bm{U})-\nabla^{2}f(\bm{V}),\bm{Z}_{i}\bm{Z}_{i}^{\top}\right\rangle$ , where $\bm{Z}_{i}$ are orthogonal with $\sum_{i}\|\bm{Z}_{i}\|_{F}^{2}\leq 1$ . Hence from equation (21) we get $\rho=O(b)$ .

Strict saddle property: So far we have shown regularity properties of $f(\bm{U})$ . Now we will discuss the strict saddle property. Theorem 3.2 shows that $\lambda_{\min}\left[\nabla^{2}(f(\bm{U}))\right]\leq\frac{-2}{5}\sigma_{r}(\bm{X^{*}}).$ To use results of we need to show this property over an $\epsilon$ neighborhood of any saddle point $\bm{U}$ . For this first recall by smoothness, $\|\nabla f(\bm{U})-\nabla f(\bm{V})\|_{F}\leq\beta\|\bm{U}-\bm{V}\|_{F}.$ Therefore $\nabla f(\bm{V})\leq\epsilon$ , when $\|\bm{U}-\bm{V}\|_{F}\leq\frac{\epsilon}{\beta}$ . Further we know the Hessian spectral norm is $\rho$ Lipschitz from equation (21). Hence, for any direction $\bm{Z}$ ,

In particular choosing $\bm{Z}$ to be the projection direction, $\bm{U}-\bm{U^{*}}$ implies from Theorem 3.2,

Hence for all $\bm{V}$ in the bowl of radius $\epsilon$ around $\bm{U}$ , where $\epsilon\leq\frac{\beta}{5\rho}\sigma_{r}(\bm{X^{*}})$ ,

Local strong convexity: Finally we need to show that the function is $\alpha$ strongly convex in a neighborhood $\theta$ around the optimum $\bm{U^{*}}R$ , for any orthonormal $R$ . This easily follows from existing local convergence results for this problem. For example, Lemma 6.1 of Bhojanapalli et al. states that, for $\|\bm{U}-\bm{U^{*}}R\|_{F}\leq\frac{\sigma_{r}(\bm{X^{*}})}{200\sigma_{1}(\bm{X^{*}})}\sigma_{r}(\bm{U^{*}}R)$ ,

for $\delta=\frac{1}{10}$ and some step size $\eta\propto\frac{1}{\|\bm{X^{*}}\|_{2}}$ . Hence $f(\bm{U})$ is locally strong convex with $\alpha=\frac{27}{200}\sigma_{r}(\bm{U^{*}}R)^{2}$ in the neighborhood of radius $\theta=\frac{\sigma_{r}(\bm{X^{*}})}{200\sigma_{1}(\bm{X^{*}})}\sigma_{r}(\bm{U^{*}}R)$ around the optimum.

Substituting these parameters in the Theorem 6 of Ge et al. gives the result. ∎

Appendix E Supporting Lemmas

In this section we present the supporting results used in the proofs above.

To prove this we will expand terms on the both sides in terms of $\bm{U}$ and $\Delta=\bm{U}-\bm{U^{*}}R$ and then compare. First notice the following properties of $R$ that minimizes $\|\bm{U^{*}}R-\bm{U}\|_{F}$ . Let $LSP^{\top}$ be the SVD of ${\bm{U^{*}}}^{\top}\bm{U}$ . Then, $R=LP^{\top}$ . Hence, $R^{\top}{\bm{U^{*}}}^{\top}\bm{U}=PSP^{\top}=\bm{U}^{\top}\bm{U^{*}}R$ is a PSD matrix. This implies, $\bm{U}^{\top}\Delta=\bm{U}^{\top}\bm{U}-\bm{U}^{\top}\bm{U^{*}}R=\bm{U}^{\top}\bm{U}-R^{\top}{\bm{U^{*}}}^{\top}\bm{U}=\Delta^{\top}\bm{U}$ .

Let columns of $U$ be orthogonal, else we can multiply $U$ by an orthonormal matrix and $UR$ will satisfy this. Since $UR$ is also local minimum, and $\bm{U}\bm{U}^{\top}=\bm{U}RR^{\top}\bm{U}^{\top}$ , results for $\bm{U}R$ will also hold for $\bm{U}$ . Let $Q$ be the orthonormal matrix that spans the column space of $\bm{U}$ and $Q_{\perp}Q_{\perp}^{\top}=I-QQ^{\top}$ . Similarly let $Q_{j}$ span $\bm{U}e_{j}e_{j}^{\top}$ . Note that $Q_{j}$ are orthonormal since columns of $U$ are orthogonal. Hence,

The last inequality follows from Lemma E.1 and the fact that $e_{j}^{\top}\bm{U}^{\top}\bm{U^{*}}Re_{j}\geq 0,\forall j$ as ${\bm{U}}^{\top}\bm{U^{*}}R$ is PSD. Now we will bound the second term in the above equation. The main idea here is to split this term into error between the subspaces of $\bm{X},\bm{X^{*}}$ and then error between their singular values, since both of them are bounded by distance $\|\bm{X}-\bm{X^{*}}QQ^{\top}\|_{F}.$ Let $Q^{*}$ be an orthonormal matrix that spans the column space of $\bm{X^{*}}$ . Also let $\bm{X}=Q\Sigma_{U}^{2}Q^{\top}$ .

where $(i)$ follows from Cauchy-Schwarz inequality.

We will use the following inequality through the rest of the proof. So we state it first for any matrix $\bm{T}$ .

Now we will bound each of the terms in equation . Term 1: Let, $\bm{T}=Q_{j\perp}Q_{j\perp}^{\top}\bm{U^{*}}R$ . Then applying inequaltiy from equation (26) we get,

$(i)$ follows from $e_{j}^{\top}\bm{U}^{\top}\bm{U}e_{j}=\|\bm{U}_{j}\|_{F}^{2}$ and $\|\bm{U}_{j}\|_{F}^{2}Q_{j}Q_{j}^{\top}=\bm{U}e_{j}e_{j}^{\top}\bm{U}^{\top}$ . Now from orthogonality of $Q_{j}$ we have,

Term 3: Finally we bound the last term in equation (25) similar to the first term, which gives,

Substituting the above equations (27), (28), (29) and (30) in (24) and (25) gives the result. ∎

The following lemma relates the error $\|(\bm{U}-\bm{Y})\bm{U}^{\top}\|_{F}$ with $\|\bm{U}\bm{U}^{\top}-\bm{Y}\bm{Y}^{\top}\|_{F}$ under some conditions on $\bm{U}$ and $\bm{Y}$ . This is a generalization of Lemma 5.4 in and the proof follows similarly.

Let $\bm{U}$ and $\bm{Y}$ be two $n\times r$ matrices. Further let $\bm{U}^{\top}\bm{Y}=\bm{Y}^{\top}\bm{U}$ be a PSD matrix. Then,

To prove this we will expand terms on the both sides in terms of $\bm{U}$ and $\Delta=\bm{U}-\bm{Y}$ and then compare.

$(i)$ follows from the following properties of trace: $\operatorname{trace}(\bm{A}\bm{B})=\operatorname{trace}(\bm{B}\bm{A})$ and $\operatorname{trace}(\bm{A})=\operatorname{trace}(\bm{A}^{\top})$ . $(ii)$ follows from completing the squares. $(iii)$ follows from $\operatorname{trace}(\bm{A}^{2})\geq 0$ . $(iv)$ follows from the hypothesis of the lemma ( $\bm{U}^{\top}\bm{Y}$ is PSD) and $\operatorname{trace}(\bm{A}\bm{B})\geq 0$ for PSD matrices $\bm{A}$ and $\bm{B}$ .

Finally notice that $\|(\bm{U}-\bm{Y})\bm{U}^{\top}\|_{F}^{2}=\operatorname{trace}(\bm{U}^{\top}\bm{U}\Delta^{\top}\Delta)$ . This completes the proof. ∎

We recall the standard Gaussian random variable concentration here.

Let $w_{i}\approx\mathcal{N}(0,\sigma_{w})$ , then

with probability $\geq 1-\frac{1}{n^{2}}$ .