Global Optimality of Local Search for Low Rank Matrix Recovery

Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

Introduction

Low rank matrix recovery problem is heavily studied and has numerous applications in collaborative filtering, quantum state tomography, clustering, community detection, metric learning and multi-task learning .

This matrix sensing problem has received considerable attention recently . This and other rank-constrained problems are common in machine learning and related fields, and have been used for applications discussed above. A typical theoretical approach to low-rank problems, including (1) is to relax the low-rank constraint to a convex constraint, such as the trace-norm of X\bm{X}. Indeed, for matrix sensing, Recht et al. showed that if the measurements are noiseless and the measurement operator A\mathcal{A} satisfies a restricted isometry property, then a low-rank X\bm{X^{*}} can be recovered as the unique solution to a convex relaxation of (1). Subsequent work established similar guarantees also for the noisy and approximate case .

However, convex relaxations to the rank are not the common approach employed in practice. In this and other low-rank problems, the method of choice is typically unconstrained local optimization (via e.g. gradient descent, SGD or alternating minimization) on the factorized parametrization

where the rank constraint is enforced by limiting the dimensionality of U\bm{U}. Problem (2) is a non-convex optimization problem that could have many bad local minima (as we show in Section 5), as well as saddle points. Nevertheless, local optimization seems to work very well in practice. Working on (2) is much cheaper computationally and allows scaling to large-sized problems—the number of optimization variables is only O(nr)O(nr) rather than O(n2)O(n^{2}), and the updates are usually very cheap, especially compared to typical methods for solving the SDP resulting from the convex relaxation. There is therefore a significant disconnect between the theoretically studied and analyzed methods (based on convex relaxations) and the methods actually used in practice.

Recent attempts at bridging this gap showed that, some form of global “initialization”, typically relying on singular value decomposition, yields a solution that is already close enough to X\bm{X^{*}}; that local optimization from this initializer gets to the global optima (or to a good enough solution). Jain et al. , Keshavan proved convergence for alternating minimization algorithm provided the starting point is close to the optimum, while Zheng and Lafferty , Zhao et al. , Tu et al. , Chen and Wainwright , Bhojanapalli et al. considered gradient descent methods on the factor space and proved local convergence. But all these studies rely on global initialization followed by local convergence, and do not tackle the question of the existence of spurious local minima or deal with optimization starting from random initialization. There is therefore still a disconnect between this theory and the empirical practice of starting from random initialization and relying only on the local search to find the global optimum.

In this paper we show that, under a suitable incoherence condition on the measurement operator A\mathcal{A} (defined in Section 2), with noiseless measurements and with rank(X)r\operatorname{rank}(\bm{X^{*}})\leq r, the problem (2) has no spurious local minima (i.e. all local minima are global and satisfy X=UU\bm{X^{*}}=\bm{U}\bm{U}^{\top}). Furthermore, under the same conditions, all saddle points have a direction with significant negative curvature, and so using a recent result of Ge et al. we can establish that stochastic gradient descent from random initialization converges to X\bm{X^{*}} in polynomial number of iterations. We extend the results also to the noisy and approximately-low-rank settings, where we can guarantee that every local minima is close to a global minimum. The incoherence condition we require is weaker than conditions used to establish recovery through local search, and so our results also ensures recovery in polynomial time under milder conditions than what was previously known. In particular, with i.i.d. Gaussian measurements, we ensure no spurious local minima and recovery through local search with the optimal number O(nr)O(nr) of measurements.

Our work is heavily inspired by Bandeira et al. , who recently showed similar behavior for the problem of community detection—this corresponds to a specific rank-11 problem with a linear objective, elliptope constraints and a binary solution. Here we take their ideas, extend them and apply them to matrix sensing with general rank-rr matrices. In the past several months, similar type of results were also obtained for other non-convex problems (where the source of non-convexity is not a rank constraint), specifically complete dictionary learning and phase recovery . A related recent result of a somewhat different nature pertains to rank unconstrained linear optimization on the elliptope, showing that local minima of the rank-constrained problem approximate well the global optimum of the rank unconstrained convex problem, even though they might not be the global minima (in fact, the approximation guarantee for the actual global optimum is better) .

Another non-convex low-rank problem long known to not possess spurious local minima is the PCA problem, which can also be phrased as matrix approximation with full observations, namely minrank(X)rAXF\min_{\operatorname{rank}(\bm{X})\leq r}\left\lVert{A-X}\right\rVert_{F} (e.g. ). Indeed, local search methods such as the power-method are routinely used for this problem. Recently local optimization methods for the PCA problem working more directly on the optimized formulation have also been studied, including SGD and Grassmannian optimization . These results are somewhat orthogonal to ours, as they study a setting in which it is well known there are never any spurious local minima, and the challenge is obtaining satisfying convergence rates.

The seminal work of Burer and Monteiro proposed low-rank factorized optimization for SDPs, and showed that for extremely high rank r>mr>\sqrt{m} (number of constraints), an Augmented Lagrangian method converges asymptotically to the optimum. It was also shown that (under mild conditions) any rank deficient local minima is a global minima , providing a post-hoc verifiable sufficient condition for global optimality. However, this does not establish any a-priori condition, based on problem structure, implying the lack of spurious local minima.

While preparing this manuscript, we also became aware of parallel work studying the same question for the related but different problem of matrix completion. For this problem they obtain a similar guarantee, though with suboptimal dependence on the incoherence parameters and so suboptimal sample complexity, and requiring adding a specific non-standard regularizer to the objective—this is not needed for our matrix sensing results.

We believe our work, together with the parallel work of , are the first to establish the lack of spurious local minima and the global convergence of local search from random initialization for a non-trivial rank-constrained problem (beyond PCA with full observations) with rank r>1r>1.

Formulation and Assumptions

Even if we know that rank(X)r\operatorname{rank}(\bm{X^{*}})\leq r, having many measurements might not be sufficient for recovery if they are not “spread out” enough. E.g., if all measurements only involve the first n/2n/2 rows and columns, we would never have any information on the bottom-right block. A sufficient condition for identifiability of a low-rank X\bm{X^{*}} from linear measurements by Recht et al. is based on restricted isometry property defined below.

In particular, X\bm{X^{*}} of rank rr is identifiable if δ2r<1\delta_{2r}<1 [see 20, Theorem 3.2]. One situation in which RIP is obtained is for random measurement operators. For example, matrices with i.i.d. N(0,1)\mathcal{N}(0,1) entries satisfy (r,δr)(r,\delta_{r})-RIP when m=O(nrδ2)m=O(\frac{nr}{\delta^{2}}) [see 6, Theorem 2.3]. This implies identifiability based on i.i.d. Gaussian measurement with m=O(nr)m=O(nr) measurements (coincidentally, the number of degrees of freedom in X\bm{X^{*}}, optimal up to a constant factor).

Main Results

We are now ready to present our main result about local minima for the matrix sensing problem (2). We first present the results for noisy sensing of exact low rank matrices, and then generalize the results also to approximately low rank matrices.

Now we will present our result characterizing local minima of f(U)f(\bm{U}), for low-rank X\bm{X^{*}}. Recall that measurements are y=A(X)+w\bm{y}=\mathcal{A}(\bm{X^{*}})+\bm{w}, where entries of w\bm{w} are i.i.d. Gaussian - wiN(0,σw2)w_{i}\sim\mathcal{N}(0,\sigma_{w}^{2}).

Consider the optimization problem (2) where y=A(X)+w\bm{y}=\mathcal{A}(\bm{X^{*}})+\bm{w}, w\bm{w} is i.i.d. N(0,σw2),\mathcal{N}(0,\sigma_{w}^{2}), A\mathcal{A} satisfies (2r,δ2r)(2r,\delta_{2r})-RIP with δ2r<110\delta_{2r}<\frac{1}{10}, and rank(X)r\operatorname{rank}(\bm{X^{*}})\leq r. Then, with probability 110n2\geq 1-\frac{10}{n^{2}} (over the noise), for any local minimum U\bm{U} of f(U)f(\bm{U}):

In particular, in the noiseless case (σw=0\sigma_{w}=0) we have UU=X\bm{U}\bm{U}^{\top}=\bm{X^{*}} and so f(U)=0f(\bm{U})=0 and every local minima is global. In the noiseless case, we can also relax the RIP requirement to δ2r<1/5\delta_{2r}<1/5 (see Theorem 4.1 in Section 4). In the noisy case we cannot expect to ensure we always get to an exact global minima, since the noise might cause tiny fluctuations very close to the global minima possibly creating multiple very close local minima. But we show that all local minima are indeed very close to some factorization UU=X\bm{U^{*}}{\bm{U^{*}}}^{\top}=\bm{X^{*}} of the true signal, and hence to a global optimum, and this “radius” of local minima decreases as we have more observations.

The proof of the Theorem for the noiseless case is presented in Section 4. The proof for the general setting follows along the same lines and can be found in the Appendix.

So far we have discussed how all local minima are global, or at least very close to a global minimum. Using a recent result by Ge et al. on the convergence of SGD for non-convex functions, we can further obtain a polynomial bound on the number of SGD iterations required to reach the global minima. The main condition that needs to be established in order to ensure this, is that all saddle points of (2) satisfy the “strict saddle point condition”, i.e. have a direction with significant negative curvature:

Consider the optimization problem (2) in the noiseless case, where y=A(X)\bm{y}=\mathcal{A}(\bm{X^{*}}), A\mathcal{A} satisfies (2r,δ2r)(2r,\delta_{2r})-RIP with δ2r<110\delta_{2r}<\frac{1}{10}, and rank(X)r\operatorname{rank}(\bm{X^{*}})\leq r. Let U\bm{U} be a first order critical point of f(U)f(\bm{U}) with UUX\bm{U}\bm{U}^{\top}\neq\bm{X^{*}}. Then the smallest eigenvalue of the Hessian satisfies

Now consider the stochastic gradient descent updates,

where ψ\psi is uniformly distributed on the unit sphere and Projb\textrm{Proj}_{b} is a projection onto UFb\left\lVert{\bm{U}}\right\rVert_{F}\leq b. Using Theorem 3.2 and the result of Ge et al. we can establish:

Consider the optimization problem (2) under the same noiseless conditions as in Theorem 3.2. Using bUFb\geq\left\lVert{\bm{U}^{*}}\right\rVert_{F}, for some global optimum U\bm{U}^{*} of f(U)f(\bm{U}), for any ϵ,c>0\epsilon,c>0, after T=poly(1σr(X),σ1(X),b,1ϵ,log(1/c))T=poly\left(\frac{1}{\sigma_{r}(\bm{X^{*}})},\sigma_{1}(\bm{X^{*}}),b,\frac{1}{\epsilon},\log(1/c)\right) iterations of (4) with an appropriate stepsize η\eta, starting from a random point uniformly distributed on UF=b\left\lVert{\bm{U}}\right\rVert_{F}=b, with probability at least 1c1-c, we reach an iterate UT\bm{U}_{T} satisfying

The above result guarantees convergence of noisy gradient descent to a global optimum. Alternatively, second order methods such as cubic regularization (Nesterov and Polyak ) and trust region (Cartis et al. ) that have guarantees based on the strict saddle point property can also be used here.

RIP Requirement: Our results require (2r,1/10)(2r,1/10)-RIP for the noisy case and (2r,1/5)(2r,1/5)-RIP for the noiseless case. Requiring (2r,δ2r)(2r,\delta_{2r})-RIP with δ2r<1\delta_{2r}<1 is sufficient to ensure uniqueness of the global optimum of (1), and thus recovery in the noiseless setting , but all known efficient recovery methods require stricter conditions. The best guarantees we are aware of require (5r,1/10)(5r,1/10)-RIP or (4r,0.414)(4r,0.414)-RIP using a convex relaxation. Our requirement is not directly comparable to the latter, as we require RIP on a smaller set, but with a lower (stricter) δ\delta. Alternatively, (6r,1/10)(6r,1/10)-RIP is required for global initialization followed by non-convex optimization —our requirement is strictly better. In terms of requirements on (2r,δ2r)(2r,\delta_{2r})-RIP for non-convex methods, the best we are aware of is requiring δ2r<Ω(1/r)\delta_{2r}<\Omega(1/r) –this is a much stronger condition than ours, and it yields a suboptimal required number of spherical Gaussian measurements of Ω(nr3)\Omega(nr^{3}). So, compared to prior work our requirement is very mild—it ensures efficient recovery even in a regime not previously covered by any guarantee on efficient recovery, and requires the optimal number of spherical Gaussian measurements (up to a constant factor) of O(nr)O(nr).

We can also obtain similar results that deteriorate gracefully if X\bm{X^{*}} is not exactly low rank, but is close to being low-rank (see proof in the Appendix):

Consider the optimization problem (2) where y=A(X)\bm{y}=\mathcal{A}(\bm{X^{*}}) and A\mathcal{A} satisfies (2r,δ2r)(2r,\delta_{2r})-RIP with δ2r<1100\delta_{2r}<\frac{1}{100}, Then, for any local minima U\bm{U} of f(U)f(\bm{U}):

where Xr\bm{X_{r}^{*}} is the best rank rr approximation of X\bm{X^{*}}.

This theorem guarantees that any local optimum of f(U)f(\bm{U}) is close to X\bm{X^{*}} upto an error depending on XXr\|\bm{X^{*}}-\bm{X_{r}^{*}}\|. For the low-rank noiseless case we have X=Xr\bm{X^{*}}=\bm{X_{r}^{*}} and the right hand side vanishes. When X\bm{X^{*}} is not exactly low rank, the best recovery error we can hope for is XXrF\left\lVert{\bm{X^{*}}-\bm{X_{r}^{*}}}\right\rVert_{F}, since UU\bm{U}\bm{U}^{\top} is at most rank kk. On the right hand side of Theorem 3.4, we have also a nuclear norm term, which might be higher, but it also gets scaled down by δ2r\delta_{2r}, and so by the number of measurements.

Proof for the Noiseless Case

In this section we present the proof characterizing the local minima of problem (2). For ease of exposition we first present the results for the noiseless case (w=0\bm{w}=0). Proof for the general case can be found in the Appendix.

Consider the optimization problem (2) where y=A(X)\bm{y}=\mathcal{A}(\bm{X^{*}}), A\mathcal{A} satisfies (2r,δ2r)(2r,\delta_{2r})-RIP with δ2r<15\delta_{2r}<\frac{1}{5}, and rank(X)r\operatorname{rank}(\bm{X^{*}})\leq r. Then, for any local minimum U\bm{U} of f(U)f(\bm{U}):

For the proof of this theorem we first discuss the implications of the first and second order optimality conditions and then show how to combine them to yield the result.

Our proof techniques are different from existing results characterizing local minima of dictionary learning , phase retrieval and community detection . For most of these problems Hessian is PSD only for points close to optimum. However, Hessian of f(U)f(\bm{U}) can be PSD even for points far from to optima. Hence we need new directions to use the second order conditions.

Invariance of f(U)f(\bm{U}) over r×rr\times r orthonormal matrices introduces additional challenges in comparing a given stationary point to a global optimum. We have to find the best orthonormal matrix RR to align a given stationary point U\bm{U} to a global optimum U\bm{U^{*}}, where UU=X\bm{U^{*}}{\bm{U^{*}}}^{\top}=\bm{X^{*}}, to combine results from the first and second order conditions, without degrading the isometry constants.

First we present the following consequence of the RIP assumption [see 5, Lemma 2.1].

Given two n×nn\times n rank-rr matrices X\bm{X} and Y\bm{Y}, and a (2r,δ)(2r,\delta)-RIP measurement operator A\mathcal{A}, the following holds:

First we will consider the first order condition, f(U)=0\nabla f(\bm{U})=0. For any stationary point U\bm{U} this implies

Now using the isometry property of Ai\bm{A}_{i} gives us the following result.

[First order condition] For any first order stationary point U\bm{U} of f(U)f(\bm{U}), and A\mathcal{A} satisfying the (2r,δ)(2r,\delta)-RIP (3), the following holds:

where QQ is an orthonormal matrix that spans the column space of U\bm{U}.

This lemma states that any stationary point of f(U)f(\bm{U}) is close to a global optimum U\bm{U^{*}} in the subspace spanned by columns of U\bm{U}. Notice that the error along the orthogonal direction XQQF\|\bm{X^{*}}Q_{\perp}Q_{\perp}^{\top}\|_{F} can still be large making the distance between X\bm{X} and X\bm{X^{*}} arbitrarily far.

Let U=QR\bm{U}=QR, for some orthonormal QQ. Consider any matrix of the form ZQR1\bm{Z}Q{R^{-1}}^{\top}. The first order optimality condition then implies,

The above equation together with Restricted Isometry Property (equation (5)) gives us the following inequality:

Note that for any matrix A\bm{A}, A,QQZ=QQA,Z\left\langle\bm{A},QQ^{\top}\bm{Z}\right\rangle=\left\langle QQ^{\top}\bm{A},\bm{Z}\right\rangle. Furthermore, for any matrix A\bm{A}, sup{Z:ZF1}A,Z=AF\sup_{\{\bm{Z}:\|\bm{Z}\|_{F}\leq 1\}}\left\langle\bm{A},\bm{Z}\right\rangle=\|\bm{A}\|_{F}. Hence the above inequality implies the lemma statement. ∎

2 Second order optimality

We now consider the second order condition to show that the error along QQQ_{\perp}Q_{\perp}^{\top} is indeed bounded well. Let 2f(U)\nabla^{2}f(\bm{U}) be the hessian of the objective function. Note that this is an nr×nrn\cdot r\times n\cdot r matrix. Fortunately for our result we need to only evaluate the Hessian along the direction vec(UUR)\texttt{vec}(\bm{U}-\bm{U^{*}}R) for some orthonormal matrix RR. Here vec(.)\texttt{vec}(.) denotes writing a matrix in vector form.

[Hessian computation] Let U\bm{U} be a first order critical point of f(U)f(\bm{U}). Then for any r×rr\times r orthonormal matrix RR and Δj=Δejej\Delta_{j}=\Delta e_{j}e_{j}^{\top} ( Δ=UUR\Delta=\bm{U}-\bm{U^{*}}R),

For any matrix Z\bm{Z}, taking directional second derivative of the function f(U)f(\bm{U}) with respect to Z\bm{Z} we get:

Setting Z=Δj=(UUR)ejej\bm{Z}=\Delta_{j}=(\bm{U}-\bm{U^{*}}R)e_{j}e_{j}^{\top} and using the first order optimality condition on U\bm{U}, we get,

(i)(i) and (ii)(ii) follow from the first order optimality condition (6),

for j=1rj=1\cdots r. Finally taking sum over jj from 11 to rr gives the result. ∎

Hence from second order optimality of U\bm{U} we get,

[Second order optimality] Let U\bm{U} be a local minimum of f(U)f(\bm{U}) . For any r×rr\times r orthonormal matrix RR,

Further for A\mathcal{A} satisfying (2r,δ)(2r,\delta) -RIP (equation (3)) we have,

The proof of this result follows simply by applying Lemma 4.3. The above Lemma gives a bound on the distance in the factor (U)(\bm{U}) space U(UUR)F2\|\bm{U}(\bm{U}-\bm{U^{*}}R)^{\top}\|_{F}^{2}. To be able to compare the second order condition to the first order condition we need a relation between U(UUR)F2\|\bm{U}(\bm{U}-\bm{U^{*}}R)^{\top}\|_{F}^{2} and XXF2\|\bm{X}-\bm{X^{*}}\|_{F}^{2}. Towards this we show the following result.

Let U\bm{U} and U\bm{U^{*}} be two n×rn\times r matrices, and QQ is an orthonormal matrix that spans the column space of U\bm{U}. Then there exists an r×rr\times r orthonormal matrix RR such that for any first order stationary point U\bm{U} of f(U)f(\bm{U}), the following holds:

This Lemma bounds the distance in the factor space ((UUR)UF2\|(\bm{U}-\bm{U^{*}}R)\bm{U}^{\top}\|_{F}^{2}) with UUUUF2\|\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top}\|_{F}^{2} and (UUUU)QQF2\|(\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top})QQ^{\top}\|_{F}^{2}. Combining this with the result from second order optimality (Corollary 4.1) shows UUUUF2\|\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top}\|_{F}^{2} is bounded by a constant factor of (UUUU)QQF2\|(\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top})QQ^{\top}\|_{F}^{2}. This implies XQQF\|\bm{X^{*}}Q_{\perp}Q_{\perp}\|_{F} is bounded, opposite to what the first order condition implied (Lemma 4.2). The proof of the above lemma is in Section E. Hence from the above optimality conditions we get the proof of Theorem 4.1.

Assuming UUUU\bm{U}\bm{U}^{\top}\neq\bm{U^{*}}{\bm{U^{*}}}^{\top}, from Lemmas 4.2, 4.4 and Corollary 4.1 we get,

If δ15\delta\leq\frac{1}{5} the above inequality holds only if UU=UU\bm{U}\bm{U}^{\top}=\bm{U^{*}}{\bm{U^{*}}}^{\top}. ∎

Necessity of RIP

We showed that there are no spurious local minima only under a restricted isometry assumption. A natural question is whether this is necessary, or whether perhaps the problem (2) never has any spurious local minima, perhaps similarly to the non-convex PCA problem minUAUU\min_{\bm{U}}\left\lVert{\bm{A}-\bm{U}\bm{U}^{\top}}\right\rVert.

A good indication that this is not the case is that (2) is NP-hard, even in the noiseless case when y=A(X)\bm{y}=\mathcal{A}(\bm{X^{*}}) for rank(X)k\operatorname{rank}(\bm{X^{*}})\leq k (if we don’t require RIP, we can have each Ai\bm{A}_{i} be non-zero on a single entry in which case (2) becomes a matrix completion problem, for which hardness has been shown even under fairly favorable conditions ). That is, we are unlikely to have a poly-time algorithm that succeeds for any linear measurement operator. Although this doesn’t formally preclude the possibility that there are no spurious local minima, but it just takes a very long time to find a local minima, this scenario seems somewhat unlikely.

To resolve the question, we present an explicit example of a measurement operator A\mathcal{A} and y=A(X)\bm{y}=\mathcal{A}(\bm{X^{*}}) (i.e. f(X)=0f(\bm{X^{*}})=0), with rank(X)=r\operatorname{rank}(\bm{X^{*}})=r, for which (1), and so also (2), have a non-global local minima.

Example 1: Let f(X)=(X11+X221)2+(X111)2f(\bm{X})=(X_{11}+X_{22}-1)^{2}+(X_{11}-1)^{2} and consider (1) with r=1r=1 (i.e. a rank-11 constraint). For X=[1000]\bm{X^{*}}=\begin{bmatrix}1&0\\ 0&0\end{bmatrix} we have f(X)=0f(\bm{X^{*}})=0 and rank(X)=1\operatorname{rank}(\bm{X^{*}})=1. But X=[0001]\bm{X}=\begin{bmatrix}0&0\\ 0&1\end{bmatrix} is a rank 11 local minimum with f(X)=1f(\bm{X})=1.

We can be extended the construction to any rank rr by simply adding i=3r+2(Xii1)2\sum_{i=3}^{r+2}(X_{ii}-1)^{2} to the objective, and padding both the global and local minimum with a diagonal beneath the leading 2×22\times 2 block.

In Example 1, we had a rank-rr problem, with a rank-rr exact solution, and a rank-rr local minima. Another question we can ask is what happens if we allow a larger rank than the rank of the optimal solution. That is, if we have f(X)=0f(\bm{X^{*}})=0 with low rank(X)\operatorname{rank}(\bm{X^{*}}), even rank(X)=1\operatorname{rank}(\bm{X^{*}})=1, but consider (1) or (2) with a high rr. Could we still have non-global local minima? The answer is yes…

Example 2: Let f(X)=(X11+X22+X331)2+(X111)2+(X22X33)2f(\bm{X})=(X_{11}+X_{22}+X_{33}-1)^{2}+(X_{11}-1)^{2}+(X_{22}-X_{33})^{2} and consider the problem (1) with a rank r=2r=2 constraint. We can verify that X=[100000000]\bm{X^{*}}=\begin{bmatrix}1&0&0\\ 0&0&0\\ 0&0&0\end{bmatrix} is a rank=11 global minimum with f(X)=0f(\bm{X^{*}})=0, but X=[0000\nicefrac12000\nicefrac12]\bm{X}=\begin{bmatrix}0&0&0\\ 0&\nicefrac{{1}}{{2}}&0\\ 0&0&\nicefrac{{1}}{{2}}\end{bmatrix} is a local minimum with f(X)=1f(\bm{X})=1. Also for an arbitrary large rank constraint r>1r>1 (taking rr to be odd for simplicity), extend the objective to f(X)=(X111)2+i=1(r1)/2[(X11+X2i,2i+X(2i+1),(2i+1)1)2f(\bm{X})=(X_{11}-1)^{2}+\sum_{i=1}^{(r-1)/2}\left[(X_{11}+X_{2i,2i}+X_{(2i+1),(2i+1)}-1)^{2}\right. +(X2i,2iX(2i+1),(2i+1))2]\left.+(X_{2i,2i}-X_{(2i+1),(2i+1)})^{2}\right]. We still have a rank-11 global minimum X\bm{X^{*}} with a single non-zero entry X11=1\bm{X^{*}}_{11}=1, while X=(IX)/2\bm{X}=(I-\bm{X^{*}})/2 is a local minimum with f(X)=1f(\bm{X})=1.

Conclusion

We established that under conditions similar to those required for convex relaxation recovery guarantees, the non-convex formulation of matrix sensing (2) does not exhibit any spurious local minima (or, in the noisy and approximate settings, at least not outside some small radius around a global minima), and we can obtain theoretical guarantees on the success of optimizing it using SGD from random initialization. This matches the methods frequently used in practice, and can explain their success. This guarantee is very different in nature from other recent work on non-convex optimization for low-rank problems, which relied heavily on initialization to get close to the global optimum, and on local search just for the final local convergence to the global optimum. We believe this is the first result, together with the parallel work of Ge et al. , on the global convergence of local search for common rank-constrained problems that are worst-case hard.

Our result suggests that SVD initialization is not necessary for global convergence, and random initialization would succeed under similar conditions (in fact, our conditions are even weaker than in previous work that used SVD initialization). To investigate empirically whether SVD initialization is indeed helpful for ensuring global convergence, in Figure 1 we compare recovery probability of random rank-kk matrices for random and SVD initialization—there is no significant difference between the two.

Beyond the implications for matrix sensing, we are hoping these type of results could be a first step and serve as a model for understanding local search in deep networks. Matrix factorization, such as in (2), is a depth-two neural network with linear transfer—an extremely simple network, but already non-convex and arguably the most complicated network we have a good theoretical understanding of. Deep networks are also hard to optimize in the worst case, but local search seems to do very well in practice. Our ultimate goal is to use the study of matrix recovery as a guide in understating the conditions that enable efficient training of deep networks.

Authors would like to thank Afonso Bandeira for discussions, Jason Lee and Tengyu Ma for sharing and discussing their work. This research was supported in part by an NSF RI-AF award and by Intel ICRI-CI.

References

Appendix A Numerical Simulations

In this section we present simulation results for performance of gradient descent over f(U)f(\bm{U}). We consider measurements yi=Ai,Xy_{i}=\left\langle\bm{A}_{i},\bm{X^{*}}\right\rangle, where Ai\bm{A}_{i} are i.i.d Gaussian with each entry distributed as N(0,\nicefrac1m)\mathcal{N}(0,\nicefrac{{1}}{{m}}). X\bm{X^{*}} is a 100×100100\times 100 rank rr random p.s.d matrix with XF=1\|\bm{X^{*}}\|_{F}=1. rr is varied from 1 to 20 in the experiments.

We consider both standard gradient descent and noisy gradient descent (4) with step size 1U2\frac{1}{\|\bm{U}\|_{2}}. We add noise of magnitude 1e41e-4 for the noisy gradient updates. Each method is run until convergence (max of 200 iterations). Let the output of gradient descent be U^\widehat{\bm{U}}. A run of this experiment is considered success if the final error U^U^XF1e2\|\widehat{\bm{U}}\widehat{\bm{U}}^{\top}-\bm{X^{*}}\|_{F}\leq 1e-2. Each experiment is repeated for 20 times and average probability of success is computed.

We repeat the above procedure starting from both random initialization and SVD initialization. For SVD initialization, the initial point is set to be the rank rr approximation of i=1myiAi\sum_{i=1}^{m}y_{i}\bm{A}_{i} as suggested by Jain et al. . In figure 2 we have the plots for the cases discussed above. All of them have phase transition around number of samples m=2nrm=2\cdot n\cdot r. This is in agreement with the results in Section 3. f(U)f(\bm{U}) has no local minima once m2nrm\geq 2\cdot n\cdot r and random initialization has same performance as SVD initialization.

In figure 3, the left two plots show error \nicefracU^U^XFXF\nicefrac{{\|\widehat{\bm{U}}\widehat{\bm{U}}^{\top}-\bm{X^{*}}\|_{F}}}{{\|\bm{X^{*}}\|_{F}}} behaves with varying rank and number of samples for random and SVD initializations. The rightmost plot shows the phase transition for rank 10 case for all the methods. Again we notice no significant difference between these methods.

Appendix B Proof for the Noisy Case

In this section we present the proof characterizing the local minima of problem (2). Recall y=A(X)+w\bm{y}=\mathcal{A}(\bm{X^{*}})+\bm{w}, where X\bm{X^{*}} is a rank-rr matrix and w\bm{w} is i.i.d. N(0,σw2).\mathcal{N}(0,\sigma_{w}^{2}).

First we will consider the first order condition, f(U)=0\nabla f(\bm{U})=0. For any stationary point U\bm{U} this implies

Now using the isometry property of Ai\bm{A}_{i} gives us the following result.

[First order condition] For any first order stationary point U\bm{U} of f(U)f(\bm{U}), and A\mathcal{A} satisfying the (2r,δ)(2r,\delta)-RIP (3), the following holds:

w.p. 11n2\geq 1-\frac{1}{n^{2}}, where QQ is an orthonormal matrix that spans the column space of U\bm{U}.

This lemma states that any stationary point of f(U)f(\bm{U}) is close to a global optimum U\bm{U^{*}} in the subspace spanned by columns of U\bm{U}. Notice that the error along the orthogonal direction XQQF\|\bm{X^{*}}Q_{\perp}Q_{\perp}^{\top}\|_{F} can still be large making the distance between X\bm{X} and X\bm{X^{*}} arbitrarily big.

Let U=QR\bm{U}=QR, for some orthonormal QQ. Consider any matrix of the form ZQR1\bm{Z}Q{R^{-1}}^{\top}. The first order optimality condition then implies,

The above equation together with Restricted Isometry Property(equation (5)) gives us the following inequality:

by Cauchy Schwarz inequality and Lemma E.2. Note that for any matrix AA, A,QQZ=AQQ,Z\left\langle\bm{A},QQ^{\top}\bm{Z}\right\rangle=\left\langle\bm{A}QQ^{\top},\bm{Z}\right\rangle. Furthermore, for any matrix AA, sup{Z:ZF1}A,Z=AF\sup_{\{\bm{Z}:\|\bm{Z}\|_{F}\leq 1\}}\left\langle\bm{A},\bm{Z}\right\rangle=\|\bm{A}\|_{F}. Hence the above inequality implies the lemma statement. ∎

B.2 Second order optimality

We will now consider the second order condition to show that the error along QQQ_{\perp}Q_{\perp}^{\top} is indeed bounded well. Let 2f(U)\nabla^{2}f(\bm{U}) be the hessian of the objective function. Note that this is an nr×nrn\cdot r\times n\cdot r matrix. Fortunately for our result we need to only evaluate the Hessian along the direction vec(UUR)\texttt{vec}(U-\bm{U^{*}}R) for some orthonormal matrix RR.

[Hessian computation] Let U\bm{U} be a first order critical point of f(U)f(\bm{U}). Then for any r×rr\times r orthonormal matrix RR and Δ=UUR\Delta=\bm{U}-\bm{U^{*}}R,

For any matrix Z\bm{Z}, taking directional second derivative of the function f(U)f(\bm{U}) with respect to Z\bm{Z} we get:

Setting Z=Δj=(UUR)ejej\bm{Z}=\Delta_{j}=(\bm{U}-\bm{U^{*}}R)e_{j}e_{j}^{\top} and using the first order optimality condition on U\bm{U}, we get,

where the last equality is again by the first order optimality condition (9). ∎

Hence from second order optimality of UU we get,

[Second order optimality] Let U\bm{U} be a local minimum of f(U)f(\bm{U}) . For any r×rr\times r orthonormal matrix RR, w.p. 11n2\geq 1-\frac{1}{n^{2}},

Further for A\mathcal{A} satisfying (2r,δ)(2r,\delta) -RIP (equation (3)) we have,

Hence from the above optimality conditions we get the proof of Theorem 4.1.

Assuming UUUU\bm{U}\bm{U}^{\top}\neq\bm{U^{*}}{\bm{U^{*}}}^{\top}, from Lemma 4.4 and Corollary B.1 we get, with probability 12n2\geq 1-\frac{2}{n^{2}},

(i)(i) follows from Lemma B.1. The above inequality implies,

If δ110\delta\leq\frac{1}{10}, the above inequality reduces to UUUUFclog(n)mσw\|\bm{U}\bm{U}^{\top}-\bm{U^{*}}{\bm{U^{*}}}^{\top}\|_{F}\leq c\sqrt{\frac{\log(n)}{m}}\sigma_{w}, for some constant c17c\leq 17, w.p 12n2\geq 1-\frac{2}{n^{2}}. ∎

Appendix C Proof for the High Rank Case

In this section we will present the proof for the inexact case, where rank(X)r\operatorname{rank}(\bm{X^{*}})\geq r. Recall that measurements are y=A(X)\bm{y}=\mathcal{A}(\bm{X^{*}}).

Let SVD of X\bm{X^{*}} be QΣQQ^{*}\Sigma^{*}{Q^{*}}\top. With slight abuse of notation we use Xjr+1:(j+1)r\bm{X^{*}}_{jr+1:(j+1)r} to denote the jjth rank rr block Qjr+1:(j+1)rΣjr+1:(j+1)rQjr+1:(j+1)rQ^{*}_{jr+1:(j+1)r}\Sigma^{*}_{jr+1:(j+1)r}{Q^{*}_{jr+1:(j+1)r}}^{\top}, where Qjr+1:(j+1)rQ^{*}_{jr+1:(j+1)r} denotes the restriction of QQ to columns jr+1jr+1 to (j+1)r(j+1)r.

First we will consider the first order condition, f(U)=0\nabla f(\bm{U})=0. For any stationary point U\bm{U} this implies

Now using the isometry property of Ai\bm{A}_{i} gives us the following result.

[First order condition] For any first order stationary point U\bm{U} of f(U)f(\bm{U}), and {Ai}\{\bm{A}_{i}\} satisfying the (2r,δ)(2r,\delta)-RIP (3), the following holds:

where QQ is an orthonormal matrix that spans the column space of U\bm{U}.

This lemma states that any stationary point of f(U)f(\bm{U}) is close to a global optimum U\bm{U^{*}} in the subspace spanned by columns of U\bm{U}. Notice that the error along the orthogonal direction XQQF\|\bm{X^{*}}Q_{\perp}Q_{\perp}^{\top}\|_{F} can still be large making the distance between X\bm{X} and X\bm{X^{*}} arbitrarily big.

Let U=QR\bm{U}=QR, for some orthonormal QQ. Consider any matrix of the form ZQR\bm{Z}QR^{-\top}. The first order optimality condition then implies,

Note that XXr\bm{X}-\bm{X_{r}^{*}} is atmost rank-2r2r. Hence, the above equation together with Restricted Isometry Property(equation (5)) gives us the following inequality:

The last inequality follows from jXjr+1:(j+1)rFXXr\sum_{j}\|\bm{X^{*}}_{jr+1:(j+1)r}\|_{F}\leq\|\bm{X^{*}}-\bm{X_{r}^{*}}\|_{*}. The above inequalities are true for any Z\bm{Z}.

Further note that for any matrix A\bm{A}, A,QQZ=AQQ,Z\left\langle\bm{A},QQ^{\top}\bm{Z}\right\rangle=\left\langle\bm{A}QQ^{\top},\bm{Z}\right\rangle. Furthermore, for any matrix AA, sup{Z:ZF1}\sup_{\{\bm{Z}:\|\bm{Z}\|_{F}\leq 1\}} A,Z=AF\left\langle\bm{A},\bm{Z}\right\rangle=\|\bm{A}\|_{F}. Hence the above inequality implies the Lemma. ∎

C.2 Second order optimality

We will now consider the second order condition to show that the error along QQQ_{\perp}Q_{\perp}^{\top} is indeed bounded well. Let 2f(U)\nabla^{2}f(\bm{U}) be the hessian of the objective function. Note that this is an nr×nrn\cdot r\times n\cdot r matrix. Fortunately for our result we need to only evaluate the Hessian along the direction vec(UUR)\texttt{vec}(U-\bm{U^{*}}R) for some orthonormal matrix RR.

[Hessian computation] Let U\bm{U} be a first order critical point of f(U)f(\bm{U}). Then for any n×rn\times r matrix Z\bm{Z},

Further let U\bm{U} be a local minimum of f(U)f(\bm{U}) and A\mathcal{A} satisfying (2r,δ)(2r,\delta) -RIP (equation (3)). Then,

For any matrix Z\bm{Z}, taking directional second derivative of the function f(U)f(\bm{U}) with respect to Z\bm{Z} we get:

Setting Z=Δj=(UUR)ejej\bm{Z}=\Delta_{j}=(\bm{U}-\bm{U^{*}}R)e_{j}e_{j}^{\top} we get,

(i)(i) is by the first order optimality condition (12).

Hence from second order optimality of UU we get,

(i)(i) is from using RIP and splitting XXr\bm{X^{*}}-\bm{X_{r}^{*}} into rank-rr components XXr=j=1\nicefracnr1Xjr+1:(j+1)r\bm{X^{*}}-\bm{X_{r}^{*}}=\sum_{j=1}^{\nicefrac{{n}}{{r}}-1}\bm{X^{*}}_{jr+1:(j+1)r}. (ii)(ii) follows from using RIP (5). (iii)(iii) follows from jXjr+1:(j+1)rFXXr\sum_{j}\|\bm{X^{*}}_{jr+1:(j+1)r}\|_{F}\leq\|\bm{X^{*}}-\bm{X_{r}^{*}}\|_{*}.

The Lemma now follows by combining equations (13), (14) and using RIP (3). ∎

Hence from the above optimality conditions we get the proof of Theorem 3.4.

Assuming UUUrUr\bm{U}\bm{U}^{\top}\neq\bm{U_{r}^{*}}{\bm{U_{r}^{*}}}^{\top}, from Lemma 4.4 we know,

for some orthonormal RR. Hence combining equations (15),with Lemma C.2 we get,

The last inequality follows from just using 2aba2+b22ab\leq a^{2}+b^{2}.

Substituting δ=1100\delta=\frac{1}{100} gives,

Appendix D Proofs for Section 3

In this section we present the proofs for the strict saddle theorem (Theorem 3.2) and the convergence guarantees (Theorem 3.3). The proofs use the Lemmas developed in Section 4 and the supporting Lemmas from Section E.

where the last inequality follows from the RIP (3). Now applying Lemma 4.4 in equation (18) we get,

(i)(i) follows from Lemma 4.2. (ii)(ii) follows from δ\nicefrac110\delta\leq\nicefrac{{1}}{{10}}. Now notice that from lemma E.1

Finally notice that Δj=Δejej\Delta_{j}=\Delta e_{j}e_{j}^{\top} have orthogonal columns. Hence,

(i)(i) follows from equation (19). (ii)(ii) follows from equation (20). ∎

To prove this theorem we use Theorem 6 of Ge et al. . We need to show that f(U)f(\bm{U}) satisfies, 1) strict saddle property, 2) local strong convexity, 3) ff is bounded, smooth and has Lipschitz Hessian.

The boundedness assumption easily follows from assuming we are optimizing over a bounded domain bb such that, UFb\|\bm{U^{*}}\|_{F}\leq b. Note that we can have any reasonable upper bound on the optimum and we can easily estimate this from iyi2\sum_{i}y_{i}^{2} which is (1δ)XF2\geq(1-\delta)\|\bm{X^{*}}\|_{F}^{2} for the noiseless case.

Finally all the calculations below are for scaled version of f(x)f(x) by 1m\frac{1}{m}. Note that this does not change the number of iterations as both smoothness and strong convexity parameters are scaled by the same constant.

Smoothness constant β\beta: Recall that smoothness of ff is bounded by maximum eigenvalue of Hessian over the domain. Hence, β=maxZ:ZF1Z2f(U)Z\beta=\max_{\bm{Z}:\|\bm{Z}\|_{F}\leq 1}\bm{Z}^{\top}\nabla^{2}f(\bm{U})\bm{Z}. We have computed this projection of Hessian in Lemma B.2. Hence,

ρ\rho- Lipschitz Hessian: Now we will compute the Lipschitz constant of Hessian of f(U)f(\bm{U}). We will first bound the spectral norm of difference of Hessian at two points U\bm{U}, V\bm{V} in terms of UVF\|\bm{U}-\bm{V}\|_{F} along orthogonal direction Zi\bm{Z}_{i} and combine them to get bound on ρ\rho.. Given two n×rn\times r matrices U,V\bm{U},\bm{V},

Hence, using the variational characterization of the Frobenius norm, the Hessian Lipschitz constant is bounded by max{Zi}i2f(U)2f(V),ZiZi\max{\{\bm{Z}_{i}\}}\sum_{i}\left\langle\nabla^{2}f(\bm{U})-\nabla^{2}f(\bm{V}),\bm{Z}_{i}\bm{Z}_{i}^{\top}\right\rangle, where Zi\bm{Z}_{i} are orthogonal with iZiF21\sum_{i}\|\bm{Z}_{i}\|_{F}^{2}\leq 1. Hence from equation (21) we get ρ=O(b)\rho=O(b).

Strict saddle property: So far we have shown regularity properties of f(U)f(\bm{U}). Now we will discuss the strict saddle property. Theorem 3.2 shows that λmin[2(f(U))]25σr(X).\lambda_{\min}\left[\nabla^{2}(f(\bm{U}))\right]\leq\frac{-2}{5}\sigma_{r}(\bm{X^{*}}). To use results of we need to show this property over an ϵ\epsilon neighborhood of any saddle point U\bm{U}. For this first recall by smoothness, f(U)f(V)FβUVF.\|\nabla f(\bm{U})-\nabla f(\bm{V})\|_{F}\leq\beta\|\bm{U}-\bm{V}\|_{F}. Therefore f(V)ϵ\nabla f(\bm{V})\leq\epsilon, when UVFϵβ\|\bm{U}-\bm{V}\|_{F}\leq\frac{\epsilon}{\beta}. Further we know the Hessian spectral norm is ρ\rho Lipschitz from equation (21). Hence, for any direction Z\bm{Z},

In particular choosing Z\bm{Z} to be the projection direction, UU\bm{U}-\bm{U^{*}} implies from Theorem 3.2,

Hence for all V\bm{V} in the bowl of radius ϵ\epsilon around U\bm{U}, where ϵβ5ρσr(X)\epsilon\leq\frac{\beta}{5\rho}\sigma_{r}(\bm{X^{*}}),

Local strong convexity: Finally we need to show that the function is α\alpha strongly convex in a neighborhood θ\theta around the optimum UR\bm{U^{*}}R, for any orthonormal RR. This easily follows from existing local convergence results for this problem. For example, Lemma 6.1 of Bhojanapalli et al. states that, for UURFσr(X)200σ1(X)σr(UR)\|\bm{U}-\bm{U^{*}}R\|_{F}\leq\frac{\sigma_{r}(\bm{X^{*}})}{200\sigma_{1}(\bm{X^{*}})}\sigma_{r}(\bm{U^{*}}R),

for δ=110\delta=\frac{1}{10} and some step size η1X2\eta\propto\frac{1}{\|\bm{X^{*}}\|_{2}}. Hence f(U)f(\bm{U}) is locally strong convex with α=27200σr(UR)2\alpha=\frac{27}{200}\sigma_{r}(\bm{U^{*}}R)^{2} in the neighborhood of radius θ=σr(X)200σ1(X)σr(UR)\theta=\frac{\sigma_{r}(\bm{X^{*}})}{200\sigma_{1}(\bm{X^{*}})}\sigma_{r}(\bm{U^{*}}R) around the optimum.

Substituting these parameters in the Theorem 6 of Ge et al. gives the result. ∎

Appendix E Supporting Lemmas

In this section we present the supporting results used in the proofs above.

Let U\bm{U} and U\bm{U^{*}} be two n×rn\times r matrices, and QQ is an orthonormal matrix that spans the column space of U\bm{U}. Then there exists an r×rr\times r orthonormal matrix RR such that for any first order stationary point U\bm{U} of f(U)f(\bm{U}), the following holds:

To prove this we will expand terms on the both sides in terms of U\bm{U} and Δ=UUR\Delta=\bm{U}-\bm{U^{*}}R and then compare. First notice the following properties of RR that minimizes URUF\|\bm{U^{*}}R-\bm{U}\|_{F}. Let LSPLSP^{\top} be the SVD of UU{\bm{U^{*}}}^{\top}\bm{U}. Then, R=LPR=LP^{\top}. Hence, RUU=PSP=UURR^{\top}{\bm{U^{*}}}^{\top}\bm{U}=PSP^{\top}=\bm{U}^{\top}\bm{U^{*}}R is a PSD matrix. This implies, UΔ=UUUUR=UURUU=ΔU\bm{U}^{\top}\Delta=\bm{U}^{\top}\bm{U}-\bm{U}^{\top}\bm{U^{*}}R=\bm{U}^{\top}\bm{U}-R^{\top}{\bm{U^{*}}}^{\top}\bm{U}=\Delta^{\top}\bm{U}.

Let columns of UU be orthogonal, else we can multiply UU by an orthonormal matrix and URUR will satisfy this. Since URUR is also local minimum, and UU=URRU\bm{U}\bm{U}^{\top}=\bm{U}RR^{\top}\bm{U}^{\top}, results for UR\bm{U}R will also hold for U\bm{U}. Let QQ be the orthonormal matrix that spans the column space of U\bm{U} and QQ=IQQQ_{\perp}Q_{\perp}^{\top}=I-QQ^{\top}. Similarly let QjQ_{j} span Uejej\bm{U}e_{j}e_{j}^{\top}. Note that QjQ_{j} are orthonormal since columns of UU are orthogonal. Hence,

The last inequality follows from Lemma E.1 and the fact that ejUURej0,je_{j}^{\top}\bm{U}^{\top}\bm{U^{*}}Re_{j}\geq 0,\forall j as UUR{\bm{U}}^{\top}\bm{U^{*}}R is PSD. Now we will bound the second term in the above equation. The main idea here is to split this term into error between the subspaces of X,X\bm{X},\bm{X^{*}} and then error between their singular values, since both of them are bounded by distance XXQQF.\|\bm{X}-\bm{X^{*}}QQ^{\top}\|_{F}. Let QQ^{*} be an orthonormal matrix that spans the column space of X\bm{X^{*}}. Also let X=QΣU2Q\bm{X}=Q\Sigma_{U}^{2}Q^{\top}.

where (i)(i) follows from Cauchy-Schwarz inequality.

We will use the following inequality through the rest of the proof. So we state it first for any matrix T\bm{T}.

Now we will bound each of the terms in equation . Term 1: Let, T=QjQjUR\bm{T}=Q_{j\perp}Q_{j\perp}^{\top}\bm{U^{*}}R. Then applying inequaltiy from equation (26) we get,

(i)(i) follows from ejUUej=UjF2e_{j}^{\top}\bm{U}^{\top}\bm{U}e_{j}=\|\bm{U}_{j}\|_{F}^{2} and UjF2QjQj=UejejU\|\bm{U}_{j}\|_{F}^{2}Q_{j}Q_{j}^{\top}=\bm{U}e_{j}e_{j}^{\top}\bm{U}^{\top}. Now from orthogonality of QjQ_{j} we have,

Term 3: Finally we bound the last term in equation (25) similar to the first term, which gives,

Substituting the above equations (27), (28), (29) and (30) in (24) and (25) gives the result. ∎

The following lemma relates the error (UY)UF\|(\bm{U}-\bm{Y})\bm{U}^{\top}\|_{F} with UUYYF\|\bm{U}\bm{U}^{\top}-\bm{Y}\bm{Y}^{\top}\|_{F} under some conditions on U\bm{U} and Y\bm{Y}. This is a generalization of Lemma 5.4 in and the proof follows similarly.

Let U\bm{U} and Y\bm{Y} be two n×rn\times r matrices. Further let UY=YU\bm{U}^{\top}\bm{Y}=\bm{Y}^{\top}\bm{U} be a PSD matrix. Then,

To prove this we will expand terms on the both sides in terms of U\bm{U} and Δ=UY\Delta=\bm{U}-\bm{Y} and then compare.

(i)(i) follows from the following properties of trace: trace(AB)=trace(BA)\operatorname{trace}(\bm{A}\bm{B})=\operatorname{trace}(\bm{B}\bm{A}) and trace(A)=trace(A)\operatorname{trace}(\bm{A})=\operatorname{trace}(\bm{A}^{\top}). (ii)(ii) follows from completing the squares. (iii)(iii) follows from trace(A2)0\operatorname{trace}(\bm{A}^{2})\geq 0. (iv)(iv) follows from the hypothesis of the lemma (UY\bm{U}^{\top}\bm{Y} is PSD) and trace(AB)0\operatorname{trace}(\bm{A}\bm{B})\geq 0 for PSD matrices A\bm{A} and B\bm{B}.

Finally notice that (UY)UF2=trace(UUΔΔ)\|(\bm{U}-\bm{Y})\bm{U}^{\top}\|_{F}^{2}=\operatorname{trace}(\bm{U}^{\top}\bm{U}\Delta^{\top}\Delta). This completes the proof. ∎

We recall the standard Gaussian random variable concentration here.

Let wiN(0,σw)w_{i}\approx\mathcal{N}(0,\sigma_{w}), then

with probability 11n2\geq 1-\frac{1}{n^{2}}.