Matrix Completion has No Spurious Local Minimum

Rong Ge, Jason D. Lee, Tengyu Ma

Introduction

Matrix completion is the problem of recovering a low rank matrix from partially observed entries. It has been widely used in collaborative filtering and recommender systems , dimension reduction and multi-class learning . There has been extensive work on designing efficient algorithms for matrix completion with guarantees. One earlier line of results (see and the references therein) rely on convex relaxations. These algorithms achieve strong statistical guarantees, but are quite computationally expensive in practice.

These algorithms are much faster than the convex relaxation algorithms, which is crucial for their empirical success in large-scale collaborative filtering applications .

Most of the theoretical analysis of the nonconvex procedures require careful initialization schemes: the initial point should already be close to optimumThe work of De Sa et al. is an exception, which gives an algorithm that uses fresh samples at every iteration to solve matrix completion (and other matrix problems) approximately.. In fact, Sun and Luo showed that after this initialization the problem is effectively strongly-convex, hence many different optimization procedures can be analyzed by standard techniques from convex optimization.

However, in practice people typically use a random initialization, which still leads to robust and fast convergence. Why can these practical algorithms find the optimal solution in spite of the non-convexity? In this work we investigate this question and show that the matrix completion objective has no spurious local minima. More precisely, we show that any local minimum XX of objective function f()f(\cdot) is also a global minimum with f(X)=0f(X)=0, and recovers the correct low rank matrix MM.

Our characterization of the structure in the objective function implies that (stochastic) gradient descent from arbitrary starting point converge to a global minimum. This is because gradient descent converges to a local minimum , and every local minimum is also a global minimum.

There are two known issues with matrix completion. First, the choice of ZZ is not unique since M=(ZR)(ZR)M=(ZR)(ZR)^{\top} for any orthonormal matrix ZZ. Our goal is to find one of these equivalent solutions.

Another issue is that matrix completion is impossible when MM is “aligned” with standard basis. For example, when MM is the identity matrix in its first r×rr\times r block, we will very likely be observing only 0 entries. To address this issue, we make the following standard assumption:

Moreover, ZZ has a bounded condition number σmax(Z)/σmin(Z)=κ\sigma_{\max}(Z)/\sigma_{\min}(Z)=\kappa.

Throughout this paper we think of μ\mu and κ\kappa as small constants, and the sample complexity depends polynomially on these two parameters. Also note that this assumption is independent of the choice of ZZ: all ZZ such that ZZT=MZZ^{T}=M have the same row norms and Frobenius norm.

This assumption is similar to the “incoherence” assumption . Our assumption is the same as the one used in analyzing non-convex algorithms .

We enforce XX to also satisfy this assumption by a regularizer

where R(X)R(X) is a function that penalizes XX when one of its rows is too large. See Section 4 and Section 5 for the precise definition. Our main result shows that in this setting, the regularized objective function has no spurious local minimum:

[Informal] All local minimum of the regularized objective (1.2) satisfy XXT=ZZT=MXX^{T}=ZZ^{T}=M when p\mboxpoly(κ,r,μ,logd)/dp\geqslant\mbox{poly}(\kappa,r,\mu,\log d)/d.

Combined with the results in (see Theorem 2.3) Theorem 1.1, as state informally above, doesn’t guarantee that ff satisfies the condition in Theorem 2.3. See its technical version, Theorem 5.3, which shows that the condition of Theorem 2.3 is satisfied. , we have,

With high probability, stochastic gradient descent on the regularized objective (1.2) will converge to a solution XX such that XXT=ZZT=MXX^{T}=ZZ^{T}=M in polynomial time from any starting point. Gradient descent will converge to such a point with probability 1 from a random starting point.

Our results are also robust to noise. Even if each entry is corrupted with Gaussian noise of standard deviation μ2ZF2/d\mu^{2}\|Z\|_{F}^{2}/d (comparable to the magnitude of the entry itself!), we can still guarantee that all the local minima satisfy XXTZZTFε\|XX^{T}-ZZ^{T}\|_{F}\leqslant\varepsilon when pp is large enough. See the discussion in Appendix B for results on noisy matrix completion.

Our main technique is to show that every point that satisfies the first and second order necessary conditions for optimality must be a desired solution. To achieve this we use new ideas to analyze the effect of the regularizer and show how it is useful in modifying the first and second order conditions to exclude any spurious local minimum.

2 Related Work

Non-convex Optimization

Recently, a line of work analyzes non-convex optimization by separating the problem into two aspects: the geometric aspect which shows the function has no spurious local minimum and the algorithmic aspect which designs efficient algorithms can converge to local minimum that satisfy first and (relaxed versions) of second order necessary conditions.

Our result is the first that explains the geometry of the matrix completion objective. Similar geometric results are only known for a few problems: SVD/PCA phase retrieval/synchronization, orthogonal tensor decomposition, dictionary learning . The matrix completion objective requires different tools due to the sampling of the observed entries, as well as carefully managing the regularizer to restrict the geometry. Parallel to our work Bhojanapalli et al. showed similar results for matrix sensing, which is closely related to matrix completion. Loh and Wainwright showed that for many statistical settings that involve missing/noisy data and non-convex regularizers, any stationary point of the non-convex objective is close to global optima; furthermore, there is a unique stationary point that is the global minimum under stronger assumptions .

On the algorithmic side, it is known that second order algorithms like cubic regularization and trust-region algorithms converge to local minima that approximately satisfy first and second order conditions. Gradient descent is also known to converge to local minima from a random starting point. Stochastic gradient descent can converge to a local minimum in polynomial time from any starting point . All of these results can be applied to our setting, implying various heuristics used in practice are guaranteed to solve matrix completion.

Preliminaries

For Ω[d]×[d]\Omega\subset[d]\times[d], let PΩP_{\Omega} be the linear operator that maps a matrix AA to PΩ(A)P_{\Omega}(A), where PΩ(A)P_{\Omega}(A) has the same values as AA on Ω\Omega, and outside of Ω\Omega.

We will use the following matrix norms: F\|\cdot\|_{F} the frobenius norm, \|\cdot\| spectral norm, A|A|_{\infty} elementwise infinity norm, and Apq=maxxp=1Aq|A|_{p\rightarrow q}=\max_{\|x\|_{p}=1}\|A\|_{q}. We use the shorthand AΩ=PΩAF\|A\|_{\Omega}=\|P_{\Omega}A\|_{F}. The trace inner product of two matrices is A,B=tr(AB)\langle A,B\rangle=\operatorname{tr}(A^{\top}B), and σmin(X)\sigma_{\min}(X), σmax(X)\sigma_{\max}(X) are the smallest and largest singular values of XX. We also use XiX_{i} to denote the ii-th row of a matrix XX.

2 Necessary conditions for Optimality

A point xx satisfies the first order necessary condition for optimality (later abbreviated as first order optimality condition) if f(x)=0\nabla f(x)=0. A point xx satisfies the second order necessary condition for optimality (later abbreviated as second order optimality condition)if 2f(x)0\nabla^{2}f(x)\succeq 0.

These conditions are necessary for a local minimum because otherwise it is easy to find a direction where the function value decreases. We will also consider a relaxed second order necessary condition, where we only require the smallest eigenvalue of the Hessian 2f(x)\nabla^{2}f(x) to be not very negative:

For τ0\tau\geqslant 0, a point xx satisfies the τ\tau-relaxed second order optimality condition, if 2f(x)τI\nabla^{2}f(x)\succeq-\tau\cdot I.

This relaxation to the second order condition makes the conditions more robust, and allows for efficient algorithms.

Proof Strategy: “simple” proofs are more generalizable

In this section, we demonstrate the key ideas behind our analysis using the rank r=1r=1 case. In particular, we first give a “simple” proof for the fully observed case. Then we show this simple proof can be easily generalized to the random observation case. We believe that this proof strategy is applicable to other statistical problems involving partial/noisy observations. The proof sketches in this section are only meant to be illustrative and may not be fully rigorous in various places. We refer the readers to Section 4 and Section 5 for the complete proofs.

In the rank r=1r=1 case, we assume M=zzM=zz^{\top}, where z=1\|z\|=1, and zμd\|z\|_{\infty}\leqslant\frac{\mu}{\sqrt{d}}. Let ε1\varepsilon\ll 1 be the target accuracy that we aim to achieve in this section and let p=\mboxpoly(μ,logd)/(dε)p=\mbox{poly}(\mu,\log d)/(d\varepsilon).

For simplicity, we focus on the following domain B\mathcal{B} of incoherent vectors where the regularizer R(x)R(x) vanishes,

Inside this domain B\mathcal{B}, we can restrict our attention to the objective function without the regularizer, defined as,

It turns out to be insightful to consider the full observation case when Ω=[d]×[d]\Omega=[d]\times[d]. The corresponding objective is

Under the setting of this section, in the domain B\mathcal{B}, the function g()g(\cdot) has only two local minima {±z}\{\pm z\} .

Before introducing the “simple” proof, let us first look at a delicate proof that does not generalize well.

We compute the gradient and Hessian of g(x)g(x),

Therefore, a critical point xx satisfies g(x)=Mxx2x=0\nabla g(x)=Mx-\|x\|^{2}x=0, and thus it must be an eigenvector of MM and x2\|x\|^{2} is the corresponding eigenvalue. Next, we prove that the hessian is only positive definite at the top eigenvector . Let xx be an eigenvector with eigenvalue λ=x2\lambda=\|x\|^{2}, and λ\lambda is strictly less than the top eigenvalue λ\lambda^{*}. Let zz be the top eigenvector. We have that z,2g(x)z=z,Mz+x2=λ+λ<0\langle z,\nabla^{2}g(x)z\rangle=-\langle z,Mz\rangle+\|x\|^{2}=-\lambda^{*}+\lambda<0, which shows that xx is not a local minimum. Thus only zz can be a local minimizer, and it is easily verified that 2g(z)\nabla^{2}g(z) is indeed positive definite.

“Simple” and Generalizable proof

The lessons from the subsection above suggest us find an alternative proof for the full observation case which is generalizable. The alternative proof will be simple in the sense that it doesn’t use the notion of eigenvectors and eigenvalues. Concretely, the key observation behind most of the analysis in this paper is the following,

Proofs that consist of inequalities that are linear in 1Ω\mathbf{1}_{\Omega} are often easily generalizable to partial observation case.

Here statements that are linear in 1Ω\mathbf{1}_{\Omega} mean the statements of the form ij1(i,j)ΩTija\sum_{ij}1_{(i,j)\in\Omega}T_{ij}\leqslant a. We will call these kinds of proofs “simple” proofs in this section. Roughly speaking, the observation follows from the law of large numbers — Suppose Tij,(i,j)[d]×[d]T_{ij},(i,j)\in[d]\times[d] is a sequence of bounded real numbers, then the sampled sum (i,j)ΩTij=i,j1(i,j)ΩTij\sum_{(i,j)\in\Omega}T_{ij}=\sum_{i,j}\mathbf{1}_{(i,j)\in\Omega}T_{ij} is an accurate estimate of the sum pi,jTijp\sum_{i,j}T_{ij}, when the sampling probability pp is relatively large. Then, the mathematical implications of pTijap\sum T_{ij}\leqslant a are expected to be similar to the implications of (i,j)ΩTija\sum_{(i,j)\in\Omega}T_{ij}\leqslant a, up to some small error introduced by the approximation. To make this concrete, we give below informal proofs for Lemma 3.2 and Lemma 3.1 that only consists of statements that are linear in 1Ω\mathbf{1}_{\Omega}. Readers will see that due to the linearity, the proof for the partial observation case (shown on the right column) is a direct generalization of the proof for the full observation case (shown on the left column) via concentration inequalities (which will be discussed more at the end of the section).

Suppose xBx\in\mathcal{B} satisfies g(x)=0\nabla g(x)=0, then x,z2=x4\langle x,z\rangle^{2}=\|x\|^{4}.

Intuitively, this proof says that the norm of a critical point xx is controlled by its correlation with zz. Here at the lasa sampling version of the f aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ∎

The last step uses the fact that equation (3.5) and (3.6) are approximately equal up to scaling factor pp for any xBx\in\mathcal{B}, since (3.6) is a sampled version of (3.5). ∎

Subtleties regarding uniform convergence

In the proof sketches above, our key idea is to use concentration inequalities to link the full observation objective g(x)g(x) with the partial observation counterpart. However, we require a uniform convergence result. For example, we need a statement like “w.h.p over the choice of Ω\Omega, equation (3.5) and (3.6) are similar to each other up to scaling”. This type of statement is often only true for xx inside the incoherent ball B\mathcal{B}. The fix to this is the regularizer. For non-incoherent xx, we will use a different argument that uses the property of the regularizer. This is besides the main proof strategy of this section and will be discussed in subsequent sections.

Warm-up: Rank-1 Case

In this section, using the general proof strategy described in previous section, we provide a formal proof for the rank-1 case. In subsection 4.1, we formally work out the proof sketches of Section 3. In subsection 4.2, we prove that due to the effect of the regularizer, outside incoherent ball B\mathcal{B}, the objective function doesn’t have any local minimum.

In the rank-1 case, the objective function simplifies to,

Here we use the the regularization R(x)R(x)

The parameters λ\lambda and α\alpha will be chosen later as in Theorem 4.2. We will choose α>10μ/d\alpha>10\mu/\sqrt{d} so that R(x)=0R(x)=0 for incoherent xx, and thus it only penalizes coherent xx. Moreover, we note R(x)R(x) has Lipschitz second order derivative. This is the main reason for us to choose 44-th power instead of 22-nd power.

We first state the optimality conditions, whose proof is deferred to Appendix A.

The first order optimality condition of objective (4.1) is,

and the second order optimality condition requires:

Moreover, The τ\tau-relaxed second order optimality condition requires

We give the precise version of Theorem 1.1 for the rank-11 case.

In the rest of this section, we will first prove that when xx is constrained to be incoherent (and hence the regularizer is 0 and concentration is straightforward) and satisfies the optimality conditions, then xx has to be zz or z-z. Then we go on to explain how the regularizer helps us to change the geometry of those points that are far away from zz so that we can rule out them from being local minimum. For simplicity, we will focus on the part that shows a local minimum xx must be close enough to zz.

In the setting of Theorem 4.2, suppose xx satisfies the first-order and second-order optimality condition (4.2) and (4.3). Then when pp is defined as in Theorem 4.2,

This turns out to be the main challenge. Once we proved xx is close, we can apply the result of Sun and Luo (see Lemma C.1), and obtain Theorem 4.2.

Note that the desired solution zz is in B\mathcal{B}, and the regularization R(x)R(x) vanishes inside B\mathcal{B}.

The following lemmas assume xx satisfies the first and second order optimality conditions, and deduce a sequence of properties that xx must satisfy.

Under the setting of Theorem 4.2 , with high probability over the choice of Ω\Omega, for any xBx\in\mathcal{B} that satisfies second-order optimality condition (4.3) we have,

The same is true if xBx\in\mathcal{B} only satisfies τ\tau-relaxed second order optimality condition for τ0.1p\tau\leqslant 0.1p.

We plug in v=zv=z in the second-order optimality condition (4.3), and obtain that

Intuitively, when restricted to Ω\Omega, the squared Frobenius on the LHS and the quadratic form on the RHS should both be approximately a pp fraction of the unrestricted case. In fact, both LHS and RHS can be written as the sum of terms of the form PΩ(uvT),PΩ(stT)\langle P_{\Omega}(uv^{T}),P_{\Omega}(st^{T})\rangle, because

Therefore we can use concentration inequalities (Theorem D.1), and simplify the equation

where ε=O(μ2logdpd)\varepsilon=O(\mu^{2}\sqrt{\frac{\log d}{pd}}). Similarly, by Theorem D.1 again, we have

(Note that even we use the τ\tau-relaxed second order optimality condition, the RHS only becomes 1.99pz42px,z2±O(pε)1.99p\|z\|^{4}-2p\langle x,z\rangle^{2}\pm O(p\varepsilon) which does not effect the later proofs.)

Therefore plugging in estimates above back into equation (4.6), we have that

which implies that 6px2z22px2z2+4px,z22pz4O(pε)6p\|x\|^{2}\|z\|^{2}\geqslant 2p\|x\|^{2}\|z\|^{2}+4p\langle x,z\rangle^{2}\geqslant 2p\|z\|^{4}-O(p\varepsilon). Using z2=1\|z\|^{2}=1, and ε\varepsilon being sufficiently small, we complete the proof. ∎

Next we use first order optimality condition to pin down another property of xx – it has to be close to zz after scaling. Note that this doesn’t mean directly that xx has to be close to zz since x=0x=0 also satisfies first order optimality condition (and therefore the conclusion (4.7) below).

With high probability over the randomness of Ω\Omega, for any xBx\in\mathcal{B} that satisfies first-order optimality condition (4.2), we have that xx also satisfies

Note that since xBx\in\mathcal{B}, we have R(x)=0R(x)=0. Therefore first-order optimality condition says that

Again, intuitively we hope PΩ(zzT)pzzTP_{\Omega}(zz^{T})\approx pzz^{T} and PΩ(xxT)xpx2xP_{\Omega}(xx^{T})x\approx p\|x\|^{2}x. These are made precise by the concentration inequalities Lemma D.4 and Theorem D.2 respectively.

By Theorem D.2, we have that with high probability over the choice of Ω\Omega, for every xBx\in\mathcal{B},

Plugging in estimates (4.10) and (4.9) into equation (4.8), we complete the proof. ∎

Finally we combine the two optimality conditions and show equation (4.7) implies xxTxx^{T} must be close to zzTzz^{T}.

Suppose vector xx satisfies that x21/4\|x\|^{2}\geqslant 1/4, and that z,xzx2xδ.\left\lVert\langle z,x\rangle z-\|x\|^{2}x\right\rVert\leqslant\delta\,. Then for δ(0,0.1)\delta\in(0,0.1),

In particular, we know 1u24δ|1-u^{2}|\leqslant 4\delta and uv4δu\|v\|\leqslant 4\delta. This means u1±3δ|u|\in 1\pm 3\delta and v8δ\|v\|\leqslant 8\delta. Now we expand xxTzzTxx^{T}-zz^{T}:

It is clear that all the terms have norm bounded by O(δ)O(\delta), therefore xxzzF2O(δ)\left\lVert xx^{\top}-zz^{\top}\right\rVert_{F}^{2}\leqslant O(\delta). ∎

2 Extension to general x𝑥x

We have shown when xx is incoherent and satisfies first and second order optimality conditions, then it must be close to zz or z-z. Now we need to consider more general cases when xx may have some very large coordinates. Here the main intuition is that the first order optimality condition with a proper regularizer is enough to guarantee that xx cannot have a entry that is too much bigger than μ/d\mu/\sqrt{d}.

With high probability over the choice of Ω\Omega, for any xx that satisfies first-order order optimality condition (4.2), we have

Here we recall that α\alpha was chosen to be 10μ/d10\mu/\sqrt{d} and λ\lambda is chosen to be large so that the α\alpha dominates the second term μp/λ\mu\sqrt{p/\lambda} in the setting of Theorem 4.2.

Suppose i=maxjxji^{\star}=\max_{j}|x_{j}|. Without loss of generality, suppose xi0x_{i^{\star}}\geqslant 0. Suppose ii^{\star}-th row of Ω\Omega consists of entries with index [i]×Si[i]\times S_{i^{\star}}. If xi2α|x_{i^{\star}}|\leqslant 2\alpha, we are done. Therefore in the rest of the proof we assume xi>2α|x_{i^{\star}}|>2\alpha. Note that when pc(logd)/dp\geqslant c(\log d)/d for sufficiently large constant cc, with high probability over the choice of Ω\Omega, we have Si2pd|S_{i^{\star}}|\leqslant 2pd. In the rest of argument we are working with such an Ω\Omega with Si2pd|S_{i^{\star}}|\leqslant 2pd.

We will compare the ii^{\star}-th coordinate of LHS and RHS of first-order optimality condition (4.2). For preparation, we have

where the last step we used the fact that Si2pd|S_{i^{\star}}|\leqslant 2pd. Moreover, we have that

Now plugging in the bounds above into the ii^{\star}-th coordinate of equation (4.2), we obtain

which implies that xi4pμ2/λ|x_{i^{\star}}|\leqslant 4\sqrt{p\mu^{2}/\lambda}. ∎

Setting λμ2p/α2\lambda\geqslant\mu^{2}p/\alpha^{2} and α=10μ1/d\alpha=10\mu\sqrt{1/d}, Lemma 4.7 ensures that any xx that satisfies first-order optimality condition is the following ball,

Then we would like to continue to use arguments similar to Lemma 4.4 and 4.5. However, things have become more complicated as now we need to consider the contribution of the regularizer.

In the setting of Theorem 4.2, with high probability over the choice of Ω\Omega, suppose xBx\in\mathcal{B}^{\prime} satisfies second-order optimality condition (4.3) or τ\tau-relaxed condition for τ0.1p\tau\leqslant 0.1p, we have x21/8\|x\|^{2}\geqslant 1/8.

The guarantees and proofs are very similar to Lemma 4.4. The main intuition is that we can restrict our attentions to coordinates whose regularizer is equal to 0. See Section A for details.

We will now deal with first order optimality condition. We first write out the basic extension of Lemma 4.5, which follows from the same proof except we now include the regularizer term.

With high probability over the randomness of Ω\Omega, for any xBx\in\mathcal{B}^{\prime} that satisfies first-order optimality condition (4.2), we have that xx also satisfies

Next we will show that we can remove the regularizer term, the main observation here is nonzero entries R(x)\nabla R(x) all have the same sign as the corresponding entries in xx. See Section A for details.

Suppose xBx\in\mathcal{B}^{\prime} satisfies that x21/8\|x\|^{2}\geqslant 1/8, under the same assumption as Lemma 4.9. we have,

Rank-r case

In this section we show how to extend the results in Section 4 to recover matrices of rank rr. Here we still use the same proof strategy of Section 3. Though for simplicity we only write down the proof for the partial observation case, while the analysis for the full observation case (which was our starting point) can be obtained by substituting [d]×[d][d]\times[d] for Ω\Omega everywhere.

Without loss of generality, we assume that ZF2=r\|Z\|_{F}^{2}=r in this section. This implies that σmax(Z)1σmin(Z)\sigma_{\max}(Z)\geqslant 1\geqslant\sigma_{\min}(Z). Now we shall state the first and second order optimality conditions:

If XX is a local optimum of objective function (5.1), its first order optimality condition is,

and the second order optimality condition is equivalent to

Note that the regularizer now is more complicated than the one dimensional case, but luckily we still have the following nice property.

Now we are ready to state the precise version of Theorem 1.1:

Suppose pCmax{μ6κ16r4,μ4κ4r6}d1log2dp\geqslant C\max\{\mu^{6}\kappa^{16}r^{4},\mu^{4}\kappa^{4}r^{6}\}d^{-1}\log^{2}d where CC is a large enough constant. Let α=32μκr/d,λμ2rp/α2\alpha=32\mu\kappa r/\sqrt{d},\lambda\geqslant\mu^{2}rp/\alpha^{2}. Then with high probability over the randomness of Ω\Omega, any local minimum XX of f()f(\cdot) satisfies that f(X)=0f(X)=0, and in particular, ZZ=XXZZ^{\top}=XX^{\top}.

Moreover, If XX satisfies that f(X)Fδpσ3(Z)/C\|\nabla f(X)\|_{F}\leqslant\delta\leqslant p\sigma^{3}(Z)/C and 2f(X)1/Cμ2κr2p1/2d1/2I\nabla^{2}f(X)\succeq-1/C\cdot\mu^{2}\kappa r^{2}p^{1/2}d^{-1/2}I, then XX is an approximate global minimum in the sense that XXMF2O(δ/p).\|XX^{\top}-M\|_{F}^{2}\leqslant O(\delta/p).

The proof of this Theorem follows from a similar path as Theorem 4.2. We first notice that because of the regularizer, any matrix XX that satisfies first order optimality condition must be somewhat incoherent (this is analogues to Lemma 4.7):

Suppose Si2pd|S_{i}|\leqslant 2pd. Then for any XX satisfies 1st order optimality (5.2), we have

Assume i=argmaxiXii^{\star}=\operatorname{argmax}_{i}\|X_{i}\|. Suppose the iith row of Ω\Omega consists of entries with index [i]×Si[i]\times S_{i}. If Xi2α\|X_{i^{\star}}\|\leqslant 2\alpha, then we are done. Therefore in the rest of the proof we assume Xi2α\|X_{i^{\star}}\|\geqslant 2\alpha.

We will compare the ii-th row of LHS and RHS of (5.2). For preparation, we have

Next we lowerbound the norm of the RHS of equation (5.2). We have that

Therefore plugging in equation above and equation (5.6) into 1st order optimality condition (5.2). We obtain that Xi8μ2rp/λ\|X_{i^{\star}}\|\leqslant\sqrt{8\mu^{2}rp/\lambda} which completes the proof. ∎

Next, we prove a property implied by first order optimality condition, which is similar to Lemma 4.9.

In the setting of Theorem 5.3, with high probability over the choice of Ω\Omega, for any XX that satisfies 1st order optimality condition (5.2), we have

where δ=O(μ3κ3r2log0.75(d)σmax(Z)3(dp)1/2)\delta=O(\mu^{3}\kappa^{3}r^{2}\log^{0.75}(d)\sigma_{\max}(Z)^{-3}(dp)^{-1/2}) and γ=λ/(2p)0\gamma=\lambda/(2p)\geqslant 0.

If XFrσmax(Z)2\|X\|_{F}\leqslant\sqrt{r\sigma_{\max}(Z)^{2}} we are done. When XFrσmax(Z)2\|X\|_{F}\geqslant\sqrt{r\sigma_{\max}(Z)^{2}}, by Lemma 5.4, we have that maxXi4α=O(μκr/d)\max\|X_{i}\|\leqslant 4\alpha=O(\mu\kappa r/\sqrt{d}), and therefore maxXiνXF\max\|X_{i}\|\leqslant\nu\|X\|_{F} with ν=O(μκr/σmax(Z))\nu=O(\mu\kappa\sqrt{r}/\sigma_{\max}(Z)). Then by Theorem D.2, we have that

where δ=O(μ3κ3r2log0.75(d)σmax(Z)3(dp)1/2)\delta=O(\mu^{3}\kappa^{3}r^{2}\log^{0.75}(d)\sigma_{\max}(Z)^{-3}(dp)^{-1/2}). These two imply equation (5.11). Moreover, we have

Suppose XX has singular value σ1σr\sigma_{1}\geqslant\dots\geqslant\sigma_{r}. Then we have ZZXF2ZZ2XF2σmax(Z)4XF2=σmax(Z)4(σ12++σr2)\left\lVert ZZ^{\top}X\right\rVert_{F}^{2}\leqslant\|ZZ^{\top}\|^{2}\|X\|_{F}^{2}\leqslant\sigma_{\max}(Z)^{4}\|X\|_{F}^{2}=\sigma_{\max}(Z)^{4}(\sigma_{1}^{2}+\dots+\sigma_{r}^{2}). On the other hand, XXXF2=σ16++σr6\left\lVert XX^{\top}X\right\rVert_{F}^{2}=\sigma_{1}^{6}+\dots+\sigma_{r}^{6}. Therefore, equation (5.12) implies that

Then we have (by Proposition E.2) we complete the proof.

Now we look at the second order optimality condition, this condition implies the smallest singular value of XX is large (similar to Lemma 4.8). Note that this lemma is also true even if xx only satisfies relaxed second order optimality condition with τ=0.01pσmin(Z)\tau=0.01p\sigma_{\min}(Z).

In the setting of Theorem 5.3. With high probability over the choice of Ω\Omega, suppose XX satisfies equation (5.9), (5.4) the 2nd order optimality condition (5.3). Then,

We claim that σmin(ZJ)12σmin(Z)\sigma_{\min}(Z_{J})\geqslant\frac{1}{2}\sigma_{\min}(Z). Let L=[d]JL=[d]-J. Since for any iLi\in L it holds that Xiα\|X_{i}\|\geqslant\alpha, we have Lα2XF22rσmax(Z)2|L|\alpha^{2}\leqslant\|X\|_{F}^{2}\leqslant 2r\sigma_{\max}(Z)^{2} (by equation (5.9)), and it follows that L2rσmax(Z)2/α2|L|\leqslant 2r\sigma_{\max}(Z)^{2}/\alpha^{2}. Therefore,

Therefore, ZJZ_{J} has column rank exactly rr. By variational characterization of singular values, we have that for there exists unit vector zJcol-span(ZJ)z_{J}\in\textup{col-span}(Z_{J}) such that XzJσmin(X)\|X^{\top}z_{J}\|\leqslant\sigma_{\min}(X). Since zJcol-span(ZJ)z_{J}\in\textup{col-span}(Z_{J}) is a unit vector, we have that zJz_{J} can be written as zJ=ZJβz_{J}=Z_{J}\beta where β1σmin(ZJ)O(1/σmin(Z)).\|\beta\|\leqslant\frac{1}{\sigma_{\min}(Z_{J})}\leqslant O(1/\sigma_{\min}(Z)). Therefore this in turn implies that zJZJ2βO(μr/d/σmin(Z))O(μκr/d)\|z_{J}\|_{\infty}\leqslant\|Z_{J}\|_{2\rightarrow\infty}\|\beta\|\leqslant O(\mu\sqrt{r/d}/\sigma_{\min}(Z))\leqslant O(\mu\kappa\sqrt{r/d}).

We will plug in V=zJvTV=z_{J}v^{T} in the 2nd order optimality condition (5.3). Note that since zJcol-span(ZJ)z_{J}\in\textup{col-span}(Z_{J}), it is supported on subset JJ, and therefore 2R(X)V=0\nabla^{2}R(X)V=0. Therefore the term about regularization in (5.3) will vanish. For simplicity, let y=XzJy=X^{\top}z_{J}, w=Xvw=Xv We obtain that taking V=zJvV=z_{J}v^{\top} in equation (5.3) will result in

Note that we have that wX2vμr/d\|w\|_{\infty}\leqslant\|X\|_{2\rightarrow\infty}\|v\|\leqslant\mu\sqrt{r/d}. Recalling that zJO(μκr/d)\|z_{J}\|_{\infty}\leqslant O(\mu\kappa\sqrt{r/d}), by Theorem D.1, we have that

where δ=O(μ2κr2(pd)1/2)\delta=O(\mu^{2}\kappa r^{2}(pd)^{-1/2}). Then simple algebraic manipulation gives that

Note that w,zJ=v,XzJ=y,v\langle w,z_{J}\rangle=\langle v,X^{\top}z_{J}\rangle=\langle y,v\rangle. Recall that zJ=1\|z_{J}\|=1 and zcol-span(ZJ)z\in\textup{col-span}(Z_{J}), and therefore ZzJ=ZJzJσmin2(ZJ)\|Z^{\top}z_{J}\|=\|Z_{J}^{\top}z_{J}\|\geqslant\sigma_{\min}^{2}(Z_{J}). Moreover, recall that y=XzJσmin(X)\|y\|=\|X^{\top}z_{J}\|\leqslant\sigma_{\min}(X). Using these with equation (5.14) we obtain that

Therefore together with equation (5.14) and ZzJσmin2(ZJ)\|Z^{\top}z_{J}\|\geqslant\sigma_{\min}^{2}(Z_{J}) we obtain that

Therefore combining equation (5.15) and the lower bound on σmin(ZJ)\sigma_{\min}(Z_{J}) we complete the proof. ∎

Similar as before, we show it is possible to remove the regularizer term here, again the intuition is that the regularizer is always in the same direction as XX.

Suppose XX satisfies equation (5.4) and (5.13) and (5.10), then for any γ0\gamma\geqslant 0,

Let L={i:Xiα}L=\{i:\|X_{i}\|\geqslant\alpha\}. For i∉Li\not\in L, we have that (R(X))i=0(\nabla R(X))_{i}=0. Therefore it suffices to prove that for every iLi\in L,

By proposition 5.2, we have R(X))i=ΓiiXi\nabla R(X))_{i}=\Gamma_{ii}X_{i} for Γii0\Gamma_{ii}\geqslant 0. Then

Therefore combining two equations above we obtain equation (5.17) which completes the proof. ∎

Finally we show the form in Equation (5.16) implies ZZTZZ^{T} is close to XXTXX^{T} (this is similar to Lemma 4.6).

Suppose XX and ZZ satisfies that σmin(X)1/4σmin(Z)\sigma_{\min}(X)\geqslant 1/4\cdot\sigma_{\min}(Z) and that

where δσmin3(Z)/C\delta\leqslant\sigma_{min}^{3}(Z)/C for a large enough constant CC, then

The proof is similar to the one-dimensional case, we will separate ZZ into the directions that are in column span of XX and its orthogonal subspace. We will then show the projection of ZZ in the column span is close to XX, and the projection on the orthogonal subspace must be small.

Let Z=U+VZ=U+V where U=\mboxProjspan(X)ZU=\mbox{Proj}_{span(X)}Z is the projection of ZZ to the column span of XX, and VV is the projection to the orthogonal subspace. Then since VTX=0V^{T}X=0 we know

Here columns of the first term UUTXUU^{T}X are in the column span of XX, and the columns second term VUTXVU^{T}X are in the orthogonal subspace. Therefore,

In particular, both terms should be bounded by δ2\delta^{2}. Therefore UUTXXTF2δ2/σmin2(X)16δ2/σmin2(Z)\|UU^{T}-XX^{T}\|_{F}^{2}\leqslant\delta^{2}/\sigma_{min}^{2}(X)\leqslant 16\delta^{2}/\sigma_{min}^{2}(Z).

Also, we know σmin(UUTX)σmin(XXTX)δσmin(Z)3/128\sigma_{min}(UU^{T}X)\geqslant\sigma_{min}(XX^{T}X)-\delta\geqslant\sigma_{min}(Z)^{3}/128 if δσmin(Z)3/128\delta\leqslant\sigma_{min}(Z)^{3}/128. Therefore σmin(UTX)\sigma_{min}(U^{T}X) is at least σmin(Z)3/Z128\sigma_{min}(Z)^{3}/\|Z\|128. Now VF2δ2/σmin(UTX)2O(δ2Z2/σmin(Z)6)\|V\|_{F}^{2}\leqslant\delta^{2}/\sigma_{min}(U^{T}X)^{2}\leqslant O(\delta^{2}\|Z\|^{2}/\sigma_{min}(Z)^{6}).

Finally, we can bound UVTF\|UV^{T}\|_{F} by UVFZVF\|U\|\|V\|_{F}\leqslant\|Z\|\|V\|_{F} (last inequality is because UU is a projection of ZZ), which at least Ω(VF2)\Omega(\|V\|_{F}^{2}) when δσmin(Z)3/128\delta\leqslant\sigma_{min}(Z)^{3}/128, therefore

Last thing we need to prove the main theorem is a result from Sun and Luo, which shows whenever XXTXX^{T} is close to ZZTZZ^{T}, the function is essentially strongly convex, and the only points that have gradient are points where XXT=ZZTXX^{T}=ZZ^{T}, this is explained in Lemma C.1. Now we are ready to prove Theorem 5.3:

Suppose XX satisfies 1st and 2nd order optimality condition. Then by Lemma 5.5 and Lemma 5.4, we have that XX satisfies equation (5.4), (5.9), (5.10) and (5.11). Then by Lemma 5.6, we obtain that σmin(X)1/6σmin(Z)\sigma_{\min}(X)\geqslant 1/6\cdot\sigma_{\min}(Z). Now by Lemma 5.7 and equation (5.11), we have that ZZTXXXTXFδ\left\lVert ZZ^{T}X-XX^{T}X\right\rVert_{F}\leqslant\delta for δcσmin(Z)3/κ2\delta\leqslant c\sigma_{\min}(Z)^{3}/\kappa^{2} for sufficiently small constant cc. Then by Lemma 5.8 we obtain that ZZXXFcσmin(Z)2\|ZZ^{\top}-XX^{\top}\|_{F}\leqslant c\sigma_{\min}(Z)^{2} for sufficiently small constant cc. By Lemma C.1, in this region the only points that satisfy the first order optimality condition must satisfy XXT=ZZTXX^{T}=ZZ^{T}. ∎

To handle noise, notice that we can only hope to get an approximate solution in presence of noise, and to get that our Lemmas only depend on concentration bounds which still apply in the noisy setting. See Section B for details.

Conclusions

Although the matrix completion objective is non-convex, we showed the objective function has very nice properties that ensures the local minima are also global. This property gives guarantees for many basic optimization algorithms. An important open problem is the robustness of this property under different model assumptions: Can we extend the result to handle asymmetric matrix completion? Is it possible to add weights to different entries (similar to the settings studied in )? Can we replace the objective function with a different distance measure rather than Frobenius norm (which is related to works on 1-bit matrix sensing )? We hope this framework of analyzing the geometry of objective function can be applied to other problems.

References

Appendix A Omitted Proofs in Section 4

We first prove the equivalent form of the first and second order optimality conditions:

The first order optimality condition of objective (4.1) is,

and the second order optimality condition requires:

Moreover, The τ\tau-relaxed second order optimality condition requires

We take the Taylor’s expansion around point xx. Let δ\delta be an infinitesimal vector, we have

By symmetry PΩ(Mxx),xδ=PΩ(Mxx),δx=PΩ(Mxx)x,δ\langle P_{\Omega}(M-xx^{\top}),x\delta^{\top}\rangle=\langle P_{\Omega}(M-xx^{\top}),\delta x^{\top}\rangle=\langle P_{\Omega}(M-xx^{\top})x,\delta\rangle, so the first order optimality condition is δ,2PΩ(Mxx)x+λR(x),δ=0\forall\delta,\langle-2P_{\Omega}(M-xx^{\top})x+\lambda\nabla R(x),\delta\rangle=0, which is equivalent to that 2PΩ(Mxx)x=λR(x)2P_{\Omega}(M-xx^{\top})x=\lambda\nabla R(x).

The second order optimality condition says PΩ(Mxx),δδ+12xδ+δxF2+12λδ2R(x)δ0-\langle P_{\Omega}(M-xx^{\top}),\delta\delta^{\top}\rangle+\frac{1}{2}\|x\delta^{\top}+\delta x^{\top}\|_{F}^{2}+\frac{1}{2}\lambda\delta^{\top}\nabla^{2}R(x)\delta\geqslant 0 for every δ\delta, which is exactly equivalent to Equation (4.3). ∎

Next we show the full proof for the second order optimality condition:

In the setting of Theorem 4.2, with high probability over the choice of Ω\Omega, suppose xBx\in\mathcal{B}^{\prime} satisfies second-order optimality condition (4.3) or τ\tau-relaxed condition for τ0.1p\tau\leqslant 0.1p, we have x21/8\|x\|^{2}\geqslant 1/8.

If x1\|x\|\geqslant 1, then we are done. Therefore in the rest of the proof we assume x1\|x\|\leqslant 1. The proof is very similar to Lemma 4.4. We plug in v=zJv=z_{J} instead into equation (4.3), where J={i:xiα}J=\{i:|x_{i}|\leqslant\alpha\}. Note that zJ2R(zJ)zJz_{J}^{\top}\nabla^{2}R(z_{J})z_{J} vanishes. We plug in v=zJv=z_{J} in the equation (4.3) and obtain that xx satisfies that

(Again notice that using τ\tau-relaxed second order optimality condition does not effect the RHS by too much, so it does not change later steps.) Therefore plugging the estimates above back into equation (A.1), we have that

Using Cauchy-Schwarz, we have x2zJ2x,zJ2\|x\|^{2}\|z_{J}\|^{2}\geqslant\langle x,z_{J}\rangle^{2}, and therefore we obtain that zJ2x213zJ4O(ε)\|z_{J}\|^{2}\|x\|^{2}\geqslant\frac{1}{3}\|z_{J}\|^{4}-O(\varepsilon).

Finally, we claim that zJ21/2\|z_{J}\|^{2}\geqslant 1/2, which completes the proof since x213zJ2O(ε)1/8\|x\|^{2}\geqslant\frac{1}{3}\|z_{J}\|^{2}-O(\varepsilon)\geqslant 1/8.

Suppose α4μd\alpha\geqslant\frac{4\mu}{\sqrt{d}} and xx satisfies x4α\|x\|_{\infty}\leqslant 4\alpha and x2\|x\|\leqslant 2. Let J={i:xiα}J=\{i:|x_{i}|\leqslant\alpha\}. Then we have that zJ1/2\|z_{J}\|\geqslant 1/2.

The claim can be simply proved as follows: Since x22\|x\|^{2}\leqslant 2 we have that Jc2/α2|J^{c}|\leqslant 2/\alpha^{2} and therefore zJc22μ2/(dα2)\|z_{J^{c}}\|^{2}\leqslant 2\mu^{2}/(d\alpha^{2}). This further implies that zJ2=z2zL2(12μ2/(dα2))12\|z_{J}\|^{2}=\|z\|^{2}-\|z_{L}\|^{2}\geqslant(1-2\mu^{2}/(d\alpha^{2}))\geqslant\frac{1}{2} because α2μd\alpha\geqslant\frac{2\mu}{\sqrt{d}}. ∎

Suppose xBx\in\mathcal{B}^{\prime} satisfies that x21/8\|x\|^{2}\geqslant 1/8, under the same assumption as Lemma 4.9. we have,

Let L={i:xiα}L=\{i:\|x_{i}\|\geqslant\alpha\}. For i∉Li\not\in L, we have that (R(x))i=0(\nabla R(x))_{i}=0. Therefore it suffices to prove that for every iLi\in L,

Since we have R(x)i=γixi\nabla R(x)_{i}=\gamma_{i}x_{i} for some γi0\gamma_{i}\geqslant 0, we have

Therefore combining two equations above we obtain equation (A.2) which completes the proof. ∎

Appendix B Handling Noise

Suppose instead of observing the matrix ZZTZZ^{T}, we actually observe a noisy version M=ZZT+NM=ZZ^{T}+N, where NN is a Gaussian matrix with independent N(0,σ2)N(0,\sigma^{2}) entries. In this case we should not hope to exactly recover ZZTZZ^{T} (as two close ZZ’s may generate the same observation). In this Section we show even with fairly large noise our arguments can still hold.

Let μ^=max{μ,4σdlogdr}\hat{\mu}=\max\{\mu,\sqrt{\frac{4\sigma d\sqrt{\log d}}{r}}\}. Suppose pCμ^6κ12r4d1ε2log1.5dp\geqslant C\hat{\mu}^{6}\kappa^{12}r^{4}d^{-1}\varepsilon^{-2}\log^{1.5}d where CC is a large enough constant. Let α=2μ^κr/d,λμ^2rp/α2\alpha=2\hat{\mu}\kappa r/\sqrt{d},\lambda\geqslant\hat{\mu}^{2}rp/\alpha^{2}. Then with high probability over the randomness of Ω\Omega, any local minimum XX of f()f(\cdot) satisfies

In fact, a noise level σlogdμ2r/d\sigma\sqrt{\log d}\leqslant\mu^{2}r/d (when the noise is almost as large as the maximum possible entry) does not change the conclusions of Lemmas in this Section.

There are only three places in the proof where the noise will make a difference. These are: 1. The infinity norm bound of MM, used in Lemma 5.4. 2. The LHS of first order optimality condition (Equation (5.2)). 3. The RHS of second order optimality condition (Equation (5.3)).

What we require in these three steps are: 1. M|M|_{\infty} should be smaller than μ2r/d\mu^{2}r/d. 2. PΩ(N),W\langle P_{\Omega}(N),W\rangle should be smaller than PΩ(N),PΩ(W)O(σZdrlogd+pd2rσ2WWFlogd)|\langle P_{\Omega}(N),P_{\Omega}(W)\rangle|\leqslant O(\sigma|Z|_{\infty}dr\log d+\sqrt{pd^{2}r\sigma^{2}|W|_{\infty}\|W\|_{F}\log d}). 3. PΩ(N)εpZZTF\|P_{\Omega}(N)\|\leqslant\varepsilon p\|ZZ^{T}\|_{F}. When we define the μ^=max{μ,4σdlogdr}\hat{\mu}=\max\{\mu,\sqrt{\frac{4\sigma d\sqrt{\log d}}{r}}\}, all of these are satisfied (by Lemma D.5 and D.6).

Now we can follow the proof and see δcεσmin(Z)/κ2\delta\leqslant c\varepsilon\sigma_{\min}(Z)/\kappa^{2} for small enough constant cc, and By Lemma 5.8 we know XXTZZTFε\|XX^{T}-ZZ^{T}\|_{F}\leqslant\varepsilon. ∎

Appendix C Finding the Exact Factorization

In Section 5, we showed that any point that satisfies the first and second order necessary condition must satisfy XXTZZTFc\|XX^{T}-ZZ^{T}\|_{F}\leqslant c for a small enough constant cc. In this section we will show that in fact XXTXX^{T} must be exactly equal to ZZTZZ^{T}. The proof technique here is mostly based on the work of Sun and Luo. However we have to modify their proof because we use slightly different regularizers, and we work in the symmetric case. The main Lemma in can be rephrased as follows in our setting:

Suppose pCμ4r6κ4d1logdp\geqslant C\mu^{4}r^{6}\kappa^{4}d^{-1}\log d for large enough absolute constant CC, and ε=σmin(Z)2/100\varepsilon=\sigma_{min}(Z)^{2}/100. with high probability over the randomness of Ω\Omega, we have that for any point XX in the set

there exists a matrix UU such that UUT=ZZTUU^{T}=ZZ^{T} and

As a consequence, any point XX in the set B\mathcal{B} that satisfies first order optimality condition must be a global optimum (or, equivalently, satisfy XXT=ZZTXX^{T}=ZZ^{T}).

Recall f(X)=12PΩ(MXXT)F2+λR(X)f(X)=\frac{1}{2}\|P_{\Omega}(M-XX^{T})\|_{F}^{2}+\lambda R(X). The proof of Lemma C.1 consists of three steps:

1. The regularizer has nonnegative correlation with (XU)(X-U): for any UU such that UUT=ZZTUU^{T}=ZZ^{T}, we have R(X),XU0\langle\nabla R(X),X-U\rangle\geqslant 0. (Claim C.3).

2. There exists a matrix UU such that UUT=ZZTUU^{T}=ZZ^{T}, and UU is close to XX. (Claim C.4)

3. Argue that f(x),XUp4PΩ(MXXT)F2\langle\nabla f(x),X-U\rangle\geqslant\frac{p}{4}\|P_{\Omega}(M-XX^{T})\|_{F}^{2} when UU is close to XX. (See proof of Lemma C.1).

Before going into details, the first useful observation is that all matrices UU with UUT=ZZTUU^{T}=ZZ^{T} have the same row norm.

Suppose UU=ZZUU^{\top}=ZZ^{\top}, then we have U=ZRU=ZR where RR is an orthonormal matrix. In particular, the ii-th row of UU is equal to

Note that this simple observation is only true in the symmetric case. This Claims serves as the same role of the bounds on row norms of U,VU,V in the asymmetric case (Propositions 4.1 and 4.2 of ).

Next we are ready to argue that the regularizer is always positively correlated with XUX-U.

For any UU such that UUT=ZZTUU^{T}=ZZ^{T}, we have,

Since the regularizer is applied independently to individual rows, we can rewrite R(X),XU=i=1nR(Xi),XiUi\langle\nabla R(X),X-U\rangle=\sum_{i=1}^{n}\langle\nabla R(X_{i}),X_{i}-U_{i}\rangle, and focus on ii-th row.

For each row XiX_{i}, R(Xi)\nabla R(X_{i}) is 0 when Xi2μr/d\|X_{i}\|\leqslant 2\mu\sqrt{r}/\sqrt{d}. In that case R(Xi),XiUi=0\langle\nabla R(X_{i}),X_{i}-U_{i}\rangle=0.

When Xi\|X_{i}\| is larger than 2μ/d2\mu/\sqrt{d}, we know R(Xi)\nabla R(X_{i}) is always in the same direction as XiX_{i}. In this case λR(Xi)=γXi\lambda\nabla R(X_{i})=\gamma X_{i} for some γ>0\gamma>0 and Xi2μr/d2Zi=2Ui\|X_{i}\|\geqslant 2\mu\sqrt{r}/\sqrt{d}\geqslant 2\|Z_{i}\|=2\|U_{i}\| (where last equality is by Claim C.2). Therefore by triangle inequality

This then implies λR(Xi),XiUi=γXi,XiUi>0\langle\lambda\nabla R(X_{i}),X_{i}-U_{i}\rangle=\gamma\langle X_{i},X_{i}-U_{i}\rangle>0.

Next we will prove the gradient of 12PΩ(MXXT)F2\frac{1}{2}\|P_{\Omega}(M-XX^{T})\|_{F}^{2} has a large correlation with XUX-U. This is analogous to Proposition 4.2 in .

Suppose XXTMF=εσmin(Z)2/100\|XX^{T}-M\|_{F}=\varepsilon\leqslant\sigma_{min}(Z)^{2}/100, there exists a matrix UU such that UUT=MUU^{T}=M and XUF5εr/σmin(Z)2\|X-U\|_{F}\leqslant 5\varepsilon\sqrt{r}/\sigma_{min}(Z)^{2}.

Without loss of generality we assume MM is a diagonal matrix with first rr diagonal terms being σ1(Z)2,σ2(Z)2,...,σr(Z)2\sigma_{1}(Z)^{2},\sigma_{2}(Z)^{2},...,\sigma_{r}(Z)^{2} (this can be done by a change of basis). That is, we assume M=diag(σ1(Z)2,,σr(Z)2),0,,0)M=\mathop{\mathbf{diag}}(\sigma_{1}(Z)^{2},\dots,\sigma_{r}(Z)^{2}),0,\dots,0). We use MM^{\prime} to denote the first r×rr\times r principle submatrix of MM.

In order to construct UU, we first notice that QQ must be constructed as a zero matrix since MM has non-zero diagonal only on the top-left corner. A natural guess of PP then becomes a “normalized” version of VV.

Concretely, we construct P:=VS=V(VT(M)1V)1/2P:=VS=V(V^{T}(M^{\prime})^{-1}V)^{-1/2} (where S:=(VT(M)1V)1/2S:=(V^{T}(M^{\prime})^{-1}V)^{-1/2}). Thus, the difference between UU and XX is equal to UXFPVF+WF\|U-X\|_{F}\leqslant\|P-V\|_{F}+\|W\|_{F}.

Since XXTMFε\|XX^{T}-M\|_{F}\leqslant\varepsilon, we know MVVTF2+2VWTF2ε2\|M^{\prime}-VV^{T}\|_{F}^{2}+2\|VW^{T}\|_{F}^{2}\leqslant\varepsilon^{2}. In particular both terms are smaller than ε2\varepsilon^{2}.

First, we bound WF\|W\|_{F}. Note that since MVVTFεσmin(Z)2/100\|M^{\prime}-VV^{T}\|_{F}\leqslant\varepsilon\leqslant\sigma_{min}(Z)^{2}/100, we know σmin(V)20.99σmin(Z)2\sigma_{min}(V)^{2}\geqslant 0.99\sigma_{min}(Z)^{2}. Therefore σmin(V)0.9σmin(Z)\sigma_{min}(V)\geqslant 0.9\sigma_{min}(Z). Now we know WFVWTF/σmin(V)2ε/σmin(Z)\|W\|_{F}\leqslant\|VW^{T}\|_{F}/\sigma_{min}(V)\leqslant 2\varepsilon/\sigma_{min}(Z).

Next we bound PVF2\|P-V\|_{F}^{2}. Since MVVTFεσmin(Z)2/100\|M^{\prime}-VV^{T}\|_{F}\leqslant\varepsilon\leqslant\sigma_{min}(Z)^{2}/100, we know (12ε/σmin(Z)2)VVTM(1+2ε2/σmin(Z)2)VVT(1-2\varepsilon/\sigma_{min}(Z)^{2})VV^{T}\preceq M^{\prime}\preceq(1+2\varepsilon^{2}/\sigma_{min}(Z)^{2})VV^{T}. This implies VF1.1ZF\|V\|_{F}\leqslant 1.1\|Z\|_{F}, and (12ε/σmin(Z)2)IVTM1V(1+2ε/σmin(Z)2)I(1-2\varepsilon/\sigma_{min}(Z)^{2})I\preceq V^{T}M^{-1}V\preceq(1+2\varepsilon/\sigma_{min}(Z)^{2})I. Therefore the matrix SS is also very close to identity, in particular, SI2ε/σmin(Z)2\|S-I\|\leqslant 2\varepsilon/\sigma_{min}(Z)^{2}. Now we know PVF=VFSI3εZF/σmin(Z)2\|P-V\|_{F}=\|V\|_{F}\|S-I\|\leqslant 3\varepsilon\|Z\|_{F}/\sigma_{min}(Z)^{2}. Using the fact that ZF=1\|Z\|_{F}=1 we know UXFPVF+WF5εr/σmin(Z)2\|U-X\|_{F}\leqslant\|P-V\|_{F}+\|W\|_{F}\leqslant 5\varepsilon\sqrt{r}/\sigma_{min}(Z)^{2}.

We can now combine this Claim with a sampling lemma to show PΩ((XU)(XU)T)F2\|P_{\Omega}((X-U)(X-U)^{T})\|_{F}^{2} is small:

Under the same setting of Lemma C.1, with probability at least 11/(2n)41-1/(2n)^{4} over the choice of Ω\Omega, if UU satisfies conclusion of Claim C.4, then,

Intuitively, this Lemma is true because (XU)(XU)TF25MXXTF2r/σmin(Z)4\|(X-U)(X-U)^{T}\|_{F}\leqslant 25\|M-XX^{T}\|_{F}^{2}r/\sigma_{min}(Z)^{4}, which is much smaller than MXXTF\|M-XX^{T}\|_{F} when MXXTF\|M-XX^{T}\|_{F} is small. By concentration inequalities we expect PΩ((XU)(XU)T)F2\|P_{\Omega}((X-U)(X-U)^{T})\|_{F}^{2} to be roughly equal to p(XU)(XU)TFp\|(X-U)(X-U)^{T}\|_{F}, therefore it must be much smaller than pMXXTF2p\|M-XX^{T}\|_{F}^{2}. The proof of this Lemma is exactly the same as Proposition 4.3 in (in fact, it is directly implied by Proposition 4.3), so we omit the proof here. We also need a different concentration bound for the projection of the norm of the matrix a=U(XU)T+(XU)UTa=U(X-U)^{T}+(X-U)U^{T}. Unlike the previous lemma, here we want PΩ(a)F\|P_{\Omega}(a)\|_{F} to be large.

Under the same setting of Lemma C.1, let a=U(XU)T+(XU)UTa=U(X-U)^{T}+(X-U)U^{T} where UU is constructed as in Claim C.4. Then, with high probability, we have that for any XBεX\in\mathcal{B}_{\varepsilon},

Intuitively this should be true because aa is in the tangent space {Z:Z=UWT+(W)UT}\{Z:Z=UW^{T}+(W^{\prime})U^{T}\} which has rank O(nr)O(nr). The proof of this follows from Theorem 3.4 , and is written in detail in Equations (37) - (41) in .

Finally we are ready to prove the main lemma. The proof is the same as the outline given in Section 4.1 of . We give it here for completeness.

Note that f(X)f(X) is equal to h(X)+λR(X)h(X)+\lambda R(X) where where h(X)=12PΩ(MXXT)F2h(X)=\frac{1}{2}\|P_{\Omega}(M-XX^{T})\|_{F}^{2}, and R(X)R(X) is the regularizer. By Claim C.3 we know R(X),XU0\langle\nabla R(X),X-U\rangle\geqslant 0, so we only need to prove there exists a UU such that UUT=ZUU^{T}=Z and g(X),XUp4MXXTF2\langle\nabla g(X),X-U\rangle\geqslant\frac{p}{4}\|M-XX^{T}\|_{F}^{2}.

Define a=U(XU)T+(XU)UTa=U(X-U)^{T}+(X-U)U^{T}, b=(UX)(UX)Tb=(U-X)(U-X)^{T}, then XXTM=a+bXX^{T}-M=a+b and (XU)XT+X(XU)T=a+2b(X-U)X^{T}+X(X-U)^{T}=a+2b.

Let ε=MXXTF\varepsilon=\|M-XX^{T}\|_{F}. Note that from Claim C.4 and Lemma C.5, we know

Therefore as long as we can show PΩ(a)F\|P_{\Omega}(a)\|_{F} is large we are done. This is true because aFMXXTFbF9ε/10\|a\|_{F}\geqslant\|M-XX^{T}\|_{F}-\|b\|_{F}\geqslant 9\varepsilon/10. Hence by Lemma C.6 we know

Combining the bounds for PΩ(a)F\|P_{\Omega}(a)\|_{F}, PΩ(b)F\|P_{\Omega}(b)\|_{F}, we know g(X),XUp4MXXTF2\langle\nabla g(X),X-U\rangle\geqslant\frac{p}{4}\|M-XX^{T}\|_{F}^{2}. Together with the fact that R(X),XU0\langle\nabla R(X),X-U\rangle\geqslant 0, we know

Appendix D Concentration inequality

In this section we prove the concentration inequalities used in the main part. We first show that the inner-product of two low rank matrices is preserved after restricting to the observed entries. This is mostly used in arguments about the second order necessary conditions.

Since both LHS and RHS are bilinaer in both WW and ZZ, without loss of generality we assume the Frobenius norms of WW and ZZ are all equal to 1. Note that in this case we should expect W1/d|W|_{\infty}\geqslant 1/d.

Let δi,j\delta_{i,j} be the indicator variable for Ω\Omega, we know

Now we can apply Bernstein’s inequality, with probability at most η\eta,

By Proposition E.3, there is a set Γ\Gamma of size dO(dr)d^{O(dr)} such that for any rank rr matrix XX, there is a matrix X^Γ\hat{X}\in\Gamma such that XX^F1/d3\|X-\hat{X}\|_{F}\leqslant 1/d^{3}. When WW and ZZ come from this set, we can set η=dCdr\eta=d^{-Cdr} for a large enough constant CC. By union bound, with high probability

When WW and ZZ are not from this set Γ\Gamma, let W^\hat{W} and Z^\hat{Z} be the closest matrix in Γ\Gamma, then we know PΩ(W),PΩ(Z)pW,Z(PΩ(W^),PΩ(Z^)pW^,Z^)O(1/d3)WZdrlogd|\langle P_{\Omega}(W),P_{\Omega}(Z)\rangle-p\langle W,Z\rangle-(\langle P_{\Omega}(\hat{W}),P_{\Omega}(\hat{Z})\rangle-p\langle\hat{W},\hat{Z}\rangle)|\leqslant O(1/d^{3})\ll|W|_{\infty}|Z|_{\infty}dr\log d. Therefore we still have

Next Theorem shows PΩ(XXT)XP_{\Omega}(XX^{T})X is roughly equal to pXXTXpXX^{T}X, this is one of the major terms in the gradient.

Without loss of generality we assume XF=1\|X\|_{F}=1. Let δi,j\delta_{i,j} be the indicator variable for Ω\Omega, we first prove the result when δi,j\delta_{i,j} are independent, then we will use standard techniques to show the same argument works for δi,j=δj,i\delta_{i,j}=\delta_{j,i}.

Let Zi\overline{Z}_{i} be a truncated version of ZiZ_{i}. That is, Zi=Zi\overline{Z}_{i}=Z_{i} when Zi[2pdν3(1/d)3/2]2Z_{i}\leqslant[2pd\nu^{3}(1/d)^{3/2}]^{2}, and Zi=[2pdν3(1/d)3/2]2\overline{Z}_{i}=[2pd\nu^{3}(1/d)^{3/2}]^{2} otherwise. It’s not hard to see Zi\overline{Z}_{i} has smaller mean and variance compared to ZiZ_{i}. Also, by vector’s Bernstein’s inequality (Lemma E.1), we know

Notice that this is only relevant when tO(pν3d1/2)t\leqslant O(p\nu^{3}d^{-1/2}) (because otherwise the probability is 0) and in that regime the variance term always dominates. Therefore Zi\overline{Z}_{i} is the square of a subgaussian random variable.

Now we can use the concentration bound for quadratics of the subgaussian random variables: we know that with probability exp(t)\exp(-t),

this means with probability exp(Cdrlogd)\exp(-Cdrlogd) with some large constant CC, we know i=1dZiO(pν6rlog2d/d)\sum_{i=1}^{d}\overline{Z}_{i}\leqslant O(p\nu^{6}r\log^{2}d/d). The probability is low enough for us to union bound over all XX in a standard ε\varepsilon-net such that every other XX is within distance (ε/d)6(\varepsilon/d)^{6}. Therefore we know with high probability for all XX in the ε\varepsilon-net we have i=1dZiO(pν6rlog2d/d)\sum_{i=1}^{d}\overline{Z}_{i}\leqslant O(p\nu^{6}r\log^{2}d/d), which is smaller than p2ε2p^{2}\varepsilon^{2} when pCν6rlog1.5ddε2p\geqslant\frac{C\nu^{6}r\log^{1.5}d}{d\varepsilon^{2}} for a large enough constant CC.

For any X^\hat{X} that is not in the ε\varepsilon-net, let XX be the closest point of XX in the net, then XX^F1/d6\|X-\hat{X}\|_{F}\leqslant 1/d^{6}, therefore the bound of X^\hat{X} clearly follows from the bound of XX.

Now to convert sum of Zi\overline{Z}_{i} to sum of ZiZ_{i}, notice that with high probability there are at most 2pd2pd entries in Ω\Omega for every row. When that happens ZiZ_{i} is always bounded by [2pdν3(1/d)3/2]2[2pd\nu^{3}(1/d)^{3/2}]^{2} so Zi=ZiZ_{i}=\overline{Z}_{i}. Let event 11 be i=1dZip2ε2\sum_{i=1}^{d}\overline{Z}_{i}\leqslant p^{2}\varepsilon^{2} for all XX, and let event 22 be that there are at most 2pd2pd entries per row, we know with high probability both event happens, and in that case i=1dZip2ε2\sum_{i=1}^{d}Z_{i}\leqslant p^{2}\varepsilon^{2} for all XX.

First notice that the diagonal entries δi,i\delta_{i,i}’s cannot change the Frobenius norm by more than O(ν3(1/d)3/2d)pεO(\nu^{3}(1/d)^{3/2}\cdot\sqrt{d})\leqslant p\varepsilon so we can ignore the diagonal terms. Now for independent terms δi,j\delta_{i,j}, let γj,i=δi,j\gamma_{j,i}=\delta_{i,j}, then by union bound both δi,j\delta_{i,j} and γi,j\gamma_{i,j} satisfy the equation, and by triangle’s inequality (δi,j+γi,j)/2(\delta_{i,j}+\gamma_{i,j})/2 also satisfies the inequality. Let τi,j\tau_{i,j} be the true indicator of Ω\Omega (hence τi,j=τj,i\tau_{i,j}=\tau_{j,i}), and τi,j\tau^{\prime}_{i,j} be an independent copy, we know (τi,j+τi,j)/2(\tau_{i,j}+\tau^{\prime}_{i,j})/2 has the same distribution as (δi,j+γi,j)/2(\delta_{i,j}+\gamma_{i,j})/2 (for off-diagonal entries), therefore with high probability the equation is true for (τi,j+τi,j)/2(\tau_{i,j}+\tau^{\prime}_{i,j})/2. The Theorem then follows from the standard Claim below for decoupling (note that supXF=1PΩ(XXT)XpXXTXF\sup_{\|X\|_{F}=1}\|P_{\Omega}(XX^{T})X-pXX^{T}X\|_{F} is a norm for the indicator variables of Ω\Omega):

Let X,YX,Y be two iid random variables, then

Let X,Y,ZX,Y,Z be iid random variables then,

Finally we argue that random sampling of a matrix gives a nice spectral approximation. This is a standard Lemma that is used in arguing about the PΩ(M)XP_{\Omega}(M)X term in the gradient (PΩ(MXXT)XP_{\Omega}(M-XX^{T})X).

where ε=O(νlogd/(pd))\varepsilon=O(\nu\sqrt{\log d/(pd)}).

Without loss of generality we assume WF=1\|W\|_{F}=1. The proof follows simply from application of Bernstein inequality. We view PΩ(W)P_{\Omega}(W) as

Concentration Lemmas for Noise Matrix N𝑁N

Next we will state the concentration lemmas that are necessary when observed matrix is perturbed by Gaussian noise. The proof of these Lemmas are really exactly the same (in fact even simpler) than the corresponding Theorem that we have just proven. The first Lemma is used in the same settings as Theorem D.1.

Let NN be a random matrix with independent Gaussian entries N(0,σ2)N(0,\sigma^{2}). With high probability over the support Ω\Omega and the Gaussian NN, for any low rank matrix WW, we have

The proof is exactly the same as Theorem D.1 as PΩ(N),PΩ(W)|\langle P_{\Omega}(N),P_{\Omega}(W)\rangle| is a sum of independent entries that follows from the same Bernstein’s inequality. ∎

Next we show that random sampling entries of a Gaussian matrix gives a matrix with low spectral norm.

Let NN be a random Gaussian matrix with independent Gaussian entries N(0,σ2)N(0,\sigma^{2}), with high probability over the choice of Ω\Omega and NN, we have

Again the proof follows from the same argument as Lemma D.4. ∎

Appendix E Auxiliary Lemmas

Let a1,,ar0a_{1},\dots,a_{r}\geqslant 0, C0C\geqslant 0. Then C4(a12++ar2)a16++ar6C^{4}(a_{1}^{2}+\dots+a_{r}^{2})\geqslant a_{1}^{6}+\dots+a_{r}^{6} implies that a12++ar2C2ra_{1}^{2}+\dots+a_{r}^{2}\leqslant C^{2}r and that maxaiCr1/6\max a_{i}\leqslant Cr^{1/6}.

Using the assumption and equation above we have that a12++ar2C2ra_{1}^{2}+\dots+a_{r}^{2}\leqslant C^{2}r. This implies with the condition that a16++ar6C6ra_{1}^{6}+\dots+a_{r}^{6}\leqslant C^{6}r, which implis that maxaiCr1/6\max a_{i}\leqslant Cr^{1/6}. ∎

For any ζ(0,1)\zeta\in(0,1), there is a set Γ\Gamma of rank rr d×dd\times d matrices, such that for any rank rr d×dd\times d matrix XX with Frobenius norm at most 11, there is a matrix X^Γ\hat{X}\in\Gamma with XX^Fζ\|X-\hat{X}\|_{F}\leqslant\zeta. The size of Γ\Gamma is bounded by (d/ζ)O(dr)(d/\zeta)^{O(dr)}.

Here we let ε=0.1ζ\varepsilon=0.1\zeta, and construct three sets P1,P2,P3P_{1},P_{2},P_{3} where P1P_{1} is an ε\varepsilon-net for d×rd\times r matrices with Frobenius norm at most r\sqrt{r}, P2P_{2} is an ε\varepsilon-net for r×rr\times r diagonal matrices whose Frobenius norm is bounded by 11, and P3P_{3} is an ε\varepsilon-net for r×dr\times d matrices with Frobenius norm at most r\sqrt{r}.

Now we define Γ={U^D^V^U^P1,D^P2,V^P3}\Gamma=\{\hat{U}\hat{D}\hat{V}|\hat{U}\in P_{1},\hat{D}\in P_{2},\hat{V}\in P_{3}\}. Clearly the size of Γ\Gamma is as promised. For any rank rr d×dd\times d matrix XX, suppose its Singular Value Decomposition is UDVUDV, we can find U^P1\hat{U}\in P_{1}, D^P2\hat{D}\in P_{2} and V^P3\hat{V}\in P_{3} that are ε\varepsilon close to U,D,VU,D,V respectively. Therefore U^D^V^Γ\hat{U}\hat{D}\hat{V}\in\Gamma and it is easy to check