Incoherence-Optimal Matrix Completion

Yudong Chen

Introduction

The matrix completion problem concerns recovering a low-rank matrix from an observed subset of its entries. Recent research has demonstrated the following remarkable fact: if a rank-rr n×nn\times n matrix satisfies certain incoherence properties, then it is possible to exactly reconstruct the matrix with high probability from nr\mboxpolylog(n)n2nr\mbox{polylog}(n)\ll n^{2} uniformly sampled entries using efficient polynomial-time algorithms.

In previous work, the sample complexity Θ(nr\mboxpolylog(n))\Theta(nr\mbox{polylog}(n)) is achieved only for matrices that satisfy two types of incoherence conditions with constant parameters. The first condition, known as standard incoherence, is a natural and necessary requirement; it prevents the information of the row and column spaces of the matrix from being too concentrated in a few rows or columns. A second condition, called joint incoherence (or strong incoherence), is also needed. It requires the left and right singular vectors of the matrix to be unaligned with each other. This condition is quite unintuitive, and does not seem to have a natural interpretation. As we demonstrate later, this condition is often restrictive and precludes a large class of otherwise well-conditioned matrices. For example, positive semidefinite matrices have a non-constant joint incoherence parameter on the order of Ω(r)\Omega(r), and previous results thus require the number of observations to be proportional to nr2nr^{2} instead of nrnr. In several applications of matrix completion discussed later, the joint incoherence condition leads to artificial and undesired constraints. In contrast, numerical experiments suggest that this condition is not needed.

In this paper, we prove that the joint incoherence condition is not necessary and can be completely eliminated. With Ω(nrlog2n)\Omega(nr\log^{2}n) uniformly sampled entries, one can recover a matrix that satisfies the standard incoherence condition (with a constant parameter) but is not jointly incoherent (e.g., a positive semidefinite matrix). As we show in Section 2, our sample complexity bounds are order-wise optimal with respect to not only the matrix dimensions nn and rr but also to its incoherence parameters except for a logn\log n factor. As a consequence, we improve the sample complexity of recovering a positive semidefinite matrix from O(nr2log2n)O(nr^{2}\log^{2}n) to O(nrlog2n)O(nr\log^{2}n), and the highest allowable rank from Θ(n/logn)\Theta(\sqrt{n}/\log n) to Θ(n/log2n)\Theta(n/\log^{2}n).

the analysis of a Singular Value Decomposition (SVD) projection algorithm for matrix completion;

structured matrix completion and semi-supervised clustering with side information.

Finally, we turn to the closely related problem of matrix decomposition, where one is asked to recover a low-rank matrix and a sparse matrix from their sum. We show that the joint incoherence condition is necessary in this setting based on the computational complexity assumption of the Planted Clique problem. In particular, any decomposition algorithm that does not require the joint incoherence condition would solve Planted Clique with clique size o(n)o(\sqrt{n}), a problem that has been extensively studied and is widely believed to be intractable in polynomial time. This implies that it is computationally hard in general to separate a rank-ω(n)\omega(\sqrt{n}) positive semidefinite matrix and a sparse matrix. Interestingly, our results show that the standard incoherence condition is inherently associated the information-theoretic (or statistical) aspect of the problem, whereas the joint incoherence condition reflects the computational aspect.

We briefly survey existing related work; detailed comparisons with these results are provided after we present our theorems. Matrix completion is first studied in , which initiates the use of the nuclear norm minimization approach. The work in provides state-of-the-art theoretical guarantees on exact completion. Alternative algorithms for matrix completion are considered in . All these works require the joint incoherence condition (or a sample complexity that is at least quadratic in rr). Our extensions to SVD projection, structured matrix completion and semi-supervised clustering are inspired by the work in ; we improve upon their results. The low-rank and sparse matrix decomposition problem is considered first in and subsequently in . The work in prove the success of specific algorithms assuming the standard and joint incoherence conditions. Our results show that these two incoherence conditions are in fact necessary for all algorithms, or all polynomial-time algorithms, due to statistical and computational reasons. The seminal work in is the first to use the Planted Clique problem to establish statistical limits under computational constraints; they consider the problem of sparse Principal Component Analysis (PCA). A similar approach is taken in for the submatrix detection problem.

In Section 2 we present our main result and show that the joint incoherence condition is not needed in matrix completion. We discuss extensions to SVD projection and structured matrix completion in Section 3. In Section 4 we turn to the matrix decomposition problem and show that the joint incoherence condition is unavoidable there. We prove our main theorem in Section 5, with some technical aspects of the proofs deferred to the appendix. The paper is concluded with a discussion in Section 6.

Main Results

where X\left\|\bm{X}\right\|_{*} is the nuclear norm of the matrix X\bm{X}, defined as the sum of its singular values. Our goal is to obtain sufficient conditions under which the optimal solution to the problem (1) is unique and equal to M\bm{M} with high probability.

It is observed in that if M\bm{M} is equal to zero in nearly all of rows or columns, then it is impossible to complete M\bm{M} unless all of its entries of are observed. To avoid such pathological situations, it has become standard to assume M\bm{M} to have additional properties known as incoherence. Suppose the rank-rr SVD of M\bm{M} is UΣV\bm{U}\mathbf{\Sigma}\bm{V}^{\top}. M\bm{M} is said to satisfy the standard incoherence condition with parameter μ0\mu_{0} if

where ei\bm{\bm{e}}_{i} are the ii-th standard basis with appropriate dimension. Note that 1μ0min{n1,n2}r1\leq\mu_{0}\leq\frac{\min\{n_{1},n_{2}\}}{r}. Previous work also requires M\bm{M} to satisfy an additional joint incoherence (or strong incoherence) condition with parameter μ1\mu_{1}, defined as

In the following main theorem of the paper, we show that the joint incoherence is not necessary. The theorem only requires the standard incoherence condition.

Suppose M\bm{M} satisfies the standard incoherence condition (2) with parameter μ0\mu_{0}. There exist universal constants c0,c1,c2>0c_{0},c_{1},c_{2}>0 such that if

then M\bm{M} is the unique optimal solution to (1) with probability at least 1c1(n1+n2)c21-c_{1}(n_{1}+n_{2})^{-c_{2}} .

We provide comments and discussion in the next two sub-sections.

Candes and Tao prove the following lower-bound on the sample complexity of matrix completion.

Suppose n1=n2=nn_{1}=n_{2}=n and Ω\Omega is sampled as above. If we do not have the condition

and the RHS above is less than 11, then with probability at least 14\frac{1}{4}, there exist infinitely many pairs of distinct matrices MM\bm{M}^{{}^{\prime}}\neq\bm{M}^{{}^{\prime\prime}} of rank at most rr and obeying the standard incoherence condition (2) with parameter μ0\mu_{0} such that Mij=MijM{}_{ij}^{{}^{\prime}}=M_{ij}^{{}^{\prime\prime}} for all (i,j)Ω(i,j)\in\Omega.

This shows that that pμ0rlog(n)/np\gtrsim\mu_{0}r\log(n)/n is necessary for any method to determine M\bm{M} (even if one knows rr and μ0\mu_{0} ahead of time). With an additional clog(n)c^{\prime}\log(n) factor, Theorem 1 matches this lower bound. In particular, it is optimal in terms of its scaling with the incoherence parameter μ0\mu_{0}.

We note that the condition in Proposition 1 is an information/statistical lower-bound: when the value of pp is below this bound, there is not enough information in the observed entries to uniquely determine an rank-rr, μ0\mu_{0}-incoherent matrix even if one has infinite computational power. In Section 4, we show that in the closely related problem of matrix decomposition, the incoherence parameters are associated with both information and computational lower bounds.

2 Consequences and Comparison with Prior Work

Using an alternative algorithm, Keshavan et al. show that recovery can be achieved with

Similar results are given in , which also requires the sample complexity to be proportional to μ1\mu_{1} (or equivalently, quadratic in rr). In light of Proposition 1, these results are not optimal with respect to the incoherence parameters due to the dependence on the joint incoherence μ1\mu_{1}. Theorem 1 eliminates this extra dependence.

The improvement in Theorem 1 is significant both qualitatively and quantitatively. The standard incoherence condition (2) is natural and necessary. A small standard incoherence parameter μ0\mu_{0} ensures that the information of the row and column spaces of M\bm{M} is not too concentrated on a small number of rows/columns. In contrast, the joint incoherence assumption (3), which requires the matrices U\bm{U} and V\bm{V} containing the left and right singular vectors to be “unaligned” with each other, does not have a natural explanation. In applications, the quantity μ0\mu_{0} often has clear physical meanings while μ1\mu_{1} does not. For example, in the application to recovering the affinity matrices between clustered objects from partial observations (discussed in Section 3), μ0\mu_{0} is a function of the minimum cluster size, but a bound on μ1\mu_{1} bears no natural motivation. As another example, the work in uses Hankel matrix completion to recover spectrally sparse signals obeying two types of conditions. The first condition can be traced to standard incoherence and is equivalent to (the natural requirement of) the supporting frequencies being spread out. On the other hand, the second set of conditions, which resemble joint incoherence, cannot be reduced to a property of only the frequencies. The manuscript , which appeared after this paper was posted online , removes these second set of conditions using similar techniques as in Theorem 1.

Quantitatively, the joint incoherence condition is much more restrictive than standard incoherence. By Cauchy-Schwarz inequality we always have μ1rμ02r2\mu_{1}r\leq\mu_{0}^{2}r^{2}. The equality

Extensions

Our first example is the derivation of error bounds for an SVD-projection algorithm for matrix completion. Let PΩM\mathcal{P}_{\Omega}\bm{M} be the matrix obtained from M\bm{M} by setting all the unobserved entries to zero. Given the partial observation PΩM\mathcal{P}_{\Omega}\bm{M}, Keshavan et al. propose the following two-step algorithm for approximating M\bm{M}. Step 1: Set to zero all columns and rows in PΩM\mathcal{P}_{\Omega}\bm{M} with degrees larger that 2pn2pn, where the degree of a column or row is the number of non-zero entries of this column/row. Let M~Ω\widetilde{\bm{M}}^{\Omega} be the output. Step 2: Compute the SVD of M~Ω\widetilde{\bm{M}}^{\Omega}

that is, M,2\left\|\bm{M}\right\|_{\infty,2} is the maximum of the row and column norms of M\bm{M}.

Suppose pc0lognnp\geq c_{0}\frac{\log n}{n}. With high probability, we have

2 Structured Matrix Completion and Semi-Supervised Clustering

Given PΩM\mathcal{P}_{\Omega}\bm{M}, Uˉ\bar{\bm{U}} and Vˉ\bar{\bm{V}}, we solve the following modified nuclear norm minimization problem:

For this formulation we have the following guarantee.

Suppose U\bm{U} and V\bm{V} satisfy the standard incoherence condition (2) with parameter μ0\mu_{0}, and Uˉ\bar{\bm{U}} and Vˉ\bar{\bm{V}} satisfies (2) with parameter μˉ0\bar{\mu}_{0}. For some universal constants c0,c1c_{0},c_{1} and c2c_{2}, X:=UˉMVˉ\bm{X}^{*}:=\bar{\bm{U}}^{\top}\bm{M}\bar{\bm{V}} is the unique optimal solution to the program (6) with probability at least 1c1nc21-c_{1}n^{-c_{2}} provided

Given X\bm{X}^{*}, we can recover M\bm{M} by M=UˉXVˉ\bm{M}=\bar{\bm{U}}\bm{X}^{*}\bar{\bm{V}}^{\top} since UˉUˉU=U\bar{\bm{U}}\bar{\bm{U}}^{\top}\bm{U}=\bm{U} and VˉVˉV=V\bar{\bm{V}}\bar{\bm{V}}^{\top}\bm{V}=\bm{V}. We prove this theorem in Appendix D.

Theorem 2 shows that with the knowledge of the rˉ\bar{r}-dimensional subspaces col(Uˉ)\text{col}(\bar{\bm{U}}) and col(Vˉ)\text{col}(\bar{\bm{V}}), the number of observations needed to complete M\bm{M} is on the order of pn2μ0μˉ0rrˉlog(μˉ0rˉ)log(μˉ0rˉ)lognpn^{2}\asymp\mu_{0}\bar{\mu}_{0}r\bar{r}\log(\bar{\mu}_{0}\bar{r})\log(\bar{\mu}_{0}\bar{r})\log n, which is Θ(rrˉlogrˉlogn)\Theta(r\bar{r}\log\bar{r}\log n) for constant μ0\mu_{0} and μˉ0\bar{\mu}_{0}. If rˉn\bar{r}\ll n, meaning that we have strong structural information, then this number is much smaller than the usual requirement Θ(nrlog2n)\Theta(nr\log^{2}n). On the other hand, setting rˉ=n\bar{r}=n recovers Theorem 1 for standard matrix completion where there is no additional structural information. We note that we assume M\bm{M} is a square matrix here for simplicity; the results can be trivially extended to general rectangular matrices.

Near the completion of the writing of this paper, an independent study on structured matrix completion was made available. There they require among other things the following condition:In they consider the sampling without replacement model for the observed entries. Their results can be translated to the Bernoulli model considered in this paper, as we have done here. See also foot note 1.

where μ1\mu_{1} is the joint incoherence parameter of U\bm{U} and V\bm{V} defined in (3). Theorem 2 is better than the result in (7) in two ways. First, Theorem 2 avoids the superfluous dependence on the joint incoherence parameter μ1\mu_{1}, which can be as large as μ02r\mu_{0}^{2}r as previously discussed. Second, even in the ideal setting with μ1=μ0\mu_{1}=\mu_{0}, the bound in (7) requires pp to scale with max(μ02,μˉ02)\max\left(\mu_{0}^{2},\bar{\mu}_{0}^{2}\right), whereas the bound in Theorem (2) scales with μ0μˉ0\mu_{0}\bar{\mu}_{0}, which is strictly smaller whenever μ0μˉ0\mu_{0}\neq\bar{\mu}_{0}. We note that discusses a nice application of structured matrix completion to the multi-label learning problem.

Specifically, considers the setup where the set of observed entries Ω\Omega are distributed according to the Bernoulli model with probability pp,To be precise, the diagonal entries Mii=1M_{ii}=1 are known; clearly having more observations cannot decrease the probability that the program (6) outputs the correct solution. Moreover, since the affinity matrix satisfies Mij=MjiM_{ij}=M_{ji}, each observation is a pair of entries of M\bm{M}. This technicality can be easily handled, and we omit the details here. the smallest cluster size is nminn_{\min}, and Uˉ\bar{\bm{U}} has standard incoherence parameter μˉ0\bar{\mu}_{0} as defined in (2). Note that the standard incoherence parameter of U\bm{U} is n/(rnmin)n/(rn_{\min}) due to the block diagonal structure of the affinity matrix M\bm{M}. Using previous techniques in matrix completion, it is shown in that X:=UˉMUˉ\bm{X}^{*}:=\bar{\bm{U}}^{\top}\bm{M}\bar{\bm{U}} is the unique optimal solution to the program (6) w.h.p. provided

Note the quadratic term nmin2n_{\min}^{2} on the RHS, which is due to the joint incoherence parameter of U\bm{U} taking the value n2/(rnmin2)n^{2}/(rn_{\min}^{2}). Suppose rˉ=n\bar{r}=n; a consequence of (7) is that, even if M\bm{M} is fully observed (p=1p=1), the cluster size must be at least nmin=Θ(n)n_{\min}=\Theta(\sqrt{n}) and thus the possible number of clusters rr cannot exceed n/nmin=Θ(n)n/n_{\min}=\Theta(\sqrt{n}). These restrictions are undesirable, and clearly unnecessary when p=1p=1.

Using Theorem 2, we can eliminate these n\sqrt{n} restrictions and significantly reduce the sample complexity. Plugging μ0=n/(rnmin)\mu_{0}=n/(rn_{\min}) into the theorem, we obtain that the program (6) succeeds with high probability provided

The last RHS is order-wise smaller than the RHS of the previous bound (8) by a multiplicative factor of nminnlog(μˉ0rˉ)logn\frac{n_{\min}}{n}\cdot\frac{\log(\bar{\mu}_{0}\bar{r})}{\log n}. In particular, when rˉ=n\bar{r}=n and ignoring logarithm factors, we allow the size of the clusters to be as small as nmin=Θ(1)n_{\min}=\Theta\left(1\right) and the number of clusters be as large as r=Θ(n)r=\Theta\left(n\right). These significantly improve over the results in which require nmin=Ω(n)n_{\min}=\Omega(\sqrt{n}) and r=O(n)r=O(\sqrt{n}). Moreover, if nmin=nn_{\min}=\sqrt{n}, then our result require n/nmin=nn/n_{\min}=\sqrt{n} times fewer observations than the previous bound 8.

Incoherence in Matrix Decomposition: Information and Computational Lower Bounds

Having shown that the joint incoherence is not needed in matrix completion, we now turn to a closely related problem, namely low-rank and sparse matrix decomposition . In contrast to matrix completion, we show that the joint incoherence condition is unavoidable in matrix decomposition, at least if one asks for polynomial-time algorithms.

In fact, we show that the joint incoherence condition is not specific to the formulation (9), but is in fact required by all polynomial-time algorithms under a widely-believed computational complexity assumption. We prove this by connecting the matrix decomposition problem to the Planted Clique problem , defined as follows. A graph on nn nodes is generated by connecting each pairs of nodes independently with probability 12\frac{1}{2}, and then randomly picking a subset of nminn_{\min} nodes and making them fully connected (hence a clique). The goal is to find the planted clique given the graph. The Planted Clique problem has been extensively studied; cf. for an overview of the known results. In the regime of nmin=o(n)n_{\min}=o(\sqrt{n}), there is no known polynomial-time algorithm for this problem despite years of effort. In fact, this regime is widely believed to be intractable in polynomial time. The average case hardness of this regime has been proved under certain computational models , and has been utilized in cryptography and other applications . The work is the first to use this hardness assumption to obtain bounds on statistical accuracy of sparse PCA given computational constraints, and a similar approach is taken in for submatrix detection. We therefore adopt the following computational assumption on the Planted Clique problem, where we recall that a size nminn_{\min} clique is planted in an Erdos-Renyi random graph G(n,12)G(n,\frac{1}{2}) with nn nodes and edge probability 12\frac{1}{2}.

A1 For any constant ϵ>0\epsilon>0, there is no algorithm with running time polynomial in nn that, for all nn and with probability at least 12\frac{1}{2}, finds the planted clique with size nminn12ϵn_{\min}\leq n^{\frac{1}{2}-\epsilon} given the random graph.

This version of the assumption is similar to Conjecture 4.3 in .

The following theorem provides necessary conditions for the success of matrix decomposition algorithms. The proof is given in Appendix E.

The following two statements are true for the matrix decomposition problem with τ=1/3\tau=1/3.

Suppose r=1r=1 and the assumption A1 holds. For any constant ϵ>0\epsilon^{\prime}>0, there is no algorithm with running time polynomial in nn that, for all nn and with probability at least 12\frac{1}{2}, solves the matrix decomposition problemThis statement still holds if we restrict to matrix decomposition problems with L\bm{L}^{*} and S\bm{S}^{*} taking finitely many values, which can be encoded using a finite number of bits. This can be easily seen from the proof of the theorem. with

Suppose μ02\mu_{0}\geq 2. There is no algorithm that, for all nn and with probability at least 12\frac{1}{2}, solves the matrix decomposition problem with

If we modify the assumption A1 by assuming that the Planted rr-Clique problem with rr disjoint planted cliques of size o(n)o(\sqrt{n}) is intractable in polynomial time, then the first part of the theorem holds with

Together with the second part of the theorem, this result shows that, under the planted clique assumption, the standard and joint incoherence conditions are both necessary for solving matrix decomposition in polynomial time. Therefore, the bound in (10) is unlikely to be improvable (up to a polylog factor) using polynomial-time algorithms. In particular, this implies that the matrix decomposition problem is intractable in general for positive semidefinite matrices with rank r=ω(n)r=\omega(\sqrt{n}) since in this case μ1r=μ02r2r2.\mu_{1}r=\mu_{0}^{2}r^{2}\geq r^{2}.

We note that the first part of Theorem 3 is a computational limit. It is proved by showing that if there is a matrix decomposition algorithm that does not require the joint incoherence condition, then the algorithm would solve the computationally hard problem of finding a planted clique with size nmin=o(n)n_{\min}=o(\sqrt{n}). On the other hand, the second part of the theorem is an information/statistical limit applicable to all algorithms regardless of their computational complexity, and is proved by an information-theoretic argument. Interestingly, Theorem 3 shows that the standard incoherence and the joint incoherence are associated with the statistical and computational aspects of the matrix decomposition problem, respectively.

Proof of Theorem 1

We now turn to the details. To simplify notion, we prove the results for square matrices (n1=n2=nn_{1}=n_{2}=n); the results for non-square matrices are proven in exactly the same fashion. Some additional notation is needed. We use cc and its derivatives (c,c0c^{\prime},c_{0}, etc.) for universal positive constants. By with high probability (w.h.p.) we mean with probability at least 1c1nc21-c_{1}n{}^{-c_{2}} for some constants c1,c2>0c_{1},c_{2}>0 independent of the problem parameters (n,r,p,μ0,μ1n,r,p,\mu_{0},\mu_{1}). Throughout the proof the constant c2c_{2} can be made arbitrarily large by choosing the constant c0c_{0} in Theorem 1 sufficiently large. The proof below involves 80logn+180\log n+1 random events, each of which is shown to occur with high probability. By the union bound their intersection also occurs with high probability.

A few additional notations are needed. The inner product between two matrices is given by X,Z:=\mboxtrace(XZ)\left\langle\bm{X},\bm{Z}\right\rangle:=\mbox{trace}(\bm{X}^{\top}\bm{Z}). The projections PT\mathcal{P}_{T} and PT\mathcal{P}_{T^{\bot}} are given by

Following our proof roadmap, we now state a sufficient condition for M\bm{M} to be the unique optimal solution to the optimization problem (1).

Suppose p1np\geq\frac{1}{n}. The matrix M\bm{M} is the unique optimal solution to (1) if the following conditions hold:

PTRΩPTPTop12.\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}.

PT(Y)12\left\|\mathcal{P}_{T^{\bot}}(\bm{Y})\right\|\leq\frac{1}{2},

PT(Y)UVF14n.\left\|\mathcal{P}_{T}(\bm{Y})-\bm{U}\bm{V}^{\top}\right\|_{F}\leq\frac{1}{4n}.

A somewhat different version of the proposition appears in . We prove the proposition in Appendix A.

The requirement p1np\geq\frac{1}{n} in Proposition 2 clearly holds under the conditions of Theorem 1. The following standard result shows that the approximate isometry in Condition 1 is also satisfied.

If pc0μ0rlognnp\geq c_{0}\frac{\mu_{0}r\log n}{n} for some c0c_{0} sufficiently large, then w.h.p.

Suppose Z\bm{Z} is a fixed n×nn\times n matrix. For a universal constant c>1c>1, we have w.h.p.

Suppose Z\bm{Z} is a fixed matrix. If pc0μ0rlognnp\geq c_{0}\frac{\mu_{0}r\log n}{n} for some c0c_{0} sufficiently large, then w.h.p.

Suppose Z\bm{Z} is a fixed n×nn\times n matrix in TT. If pc0μrlognnp\geq c_{0}\frac{\mu r\log n}{n} for some c0c_{0} sufficiently large, then w.h.p.

Equipped with the lemmas above, we are ready to validate Condition 2 in Proposition 2.

Set Dk:=UVPT(Wk)\bm{D}_{k}:=\bm{U}\bm{V}^{\top}-\mathcal{P}_{T}(\bm{W}_{k}) for k=0,,k0k=0,\ldots,k_{0}. By definition of Wk\bm{W}_{k}, we have D0=UV\bm{D}_{0}=\bm{U}\bm{V}^{\top}and

Note that Ωk\Omega_{k} is independent of Dk1\bm{D}_{k-1} and qp/k0c0μ0rlog(n)/nq\geq p/k_{0}\geq c_{0}\mu_{0}r\log(n)/n under the conditions in Theorem 1. Applying Lemma 1 with Ω\Omega replaced by Ωk\Omega_{k}, we obtain that w.h.p.

for each kk. Applying the above inequality recursively with k=k0,k01,,1k=k_{0,}k_{0}-1,\ldots,1 gives

Note that Y=k=1k0RΩkPT(Dk1)\bm{Y}=\sum_{k=1}^{k_{0}}\mathcal{R}_{\Omega_{k}}\mathcal{P}_{T}\left(\bm{D}_{k-1}\right) by construction. We therefore have

Applying Lemma 2 with Ω\Omega replaced by Ωk\Omega_{k} to each summand of the last R.H.S., we get that w.h.p.

where the last inequality follows from qc0μ0rlog(n)/nq\geq c_{0}\mu_{0}r\log(n)/n. We proceed by bounding Dk1\left\|\bm{D}_{k-1}\right\|_{\infty} and Dk1,2\left\|\bm{D}_{k-1}\right\|_{\infty,2}. Using (13), and repeatedly applying Lemma 4 with Ω\Omega replaced by Ωk\Omega_{k}, we obtain that w.h.p.

By Lemma 3 with Ω\Omega replaced by Ωk\Omega_{k}, we obtain that w.h.p.

Using (13) and combining the last two display equations gives w.h.p.

But the standard incoherence condition (2) implies that

provided c0c_{0} is sufficiently large. This completes the proof of Theorem 1.

Discussion

In this paper, we consider exact matrix completion and show that the joint incoherence condition imposed by all previous work is in fact not necessary. We discuss two extensions of this result, namely in bounding the approximation errors of SVD projection, and in structured matrix completion and semi-supervised clustering. We then show that the joint incoherence condition is unavoidable in the apparently similar problem of low-rank and sparse matrix decomposition based on the computational hardness assumption of the Planted Clique problem.

Acknowledgment

The author would like to thank Constantine Caramanis, Yuxin Chen, Sujay Sanghavi and Rachel Ward for their support and helpful comments. This work is supported by NSF grant EECS-1056028 and DTRA grant HDTRA 1-08-0029.

Appendix A Proof of Proposition 2

Consider any feasible solution X\bm{X} to (1) with PΩ(X)=PΩ(M)\mathcal{P}_{\Omega}(\bm{X})=\mathcal{P}_{\Omega}(\bm{M}). Let G\bm{G} be an n×nn\times n matrix which satisfies PTG=1\left\|\mathcal{P}_{T^{\bot}}\bm{G}\right\|=1 and PTG,PT(XM)=PT(XM)\left\langle\mathcal{P}_{T^{\bot}}\bm{G},\mathcal{P}_{T^{\bot}}(\bm{X}-\bm{M})\right\rangle=\left\|\mathcal{P}_{T^{\bot}}(\bm{X}-\bm{M})\right\|_{*}. Such GG always exists by duality between the nuclear norm and the spectral norm. Because UV+PTG\bm{U}\bm{V}^{\top}+\mathcal{P}_{T^{\bot}}\bm{G} is a sub-gradient of Z\left\|\bm{Z}\right\|_{*} at Z=M\bm{Z}=\bm{M}, we get

We also have Y,XM=PΩ(Y),PΩ(XM)=0\left\langle\bm{Y},\bm{X}-\bm{M}\right\rangle=\left\langle\mathcal{P}_{\Omega}(\bm{Y}),\mathcal{P}_{\Omega}(\bm{X}-\bm{M})\right\rangle=0 since PΩ(Y)=Y\mathcal{P}_{\Omega}(\bm{Y})=\bm{Y}. It follows that

where in the last inequality we use Conditions 1 and 2 in the proposition. Applying Lemma 5 below, we further obtain

The RHS is strictly positive for all X\bm{X} with PΩ(XM)=0\mathcal{P}_{\Omega}(\bm{X}-\bm{M})=0 and XM\bm{X}\neq\bm{M}. Otherwise we must have PT(XM)=XM\mathcal{P}_{T}(\bm{X}-\bm{M})=\bm{X}-\bm{M} and PTRΩPT(XM)=0\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}(\bm{X}-\bm{M})=0, contradicting the assumption PTRΩPTPTop12\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}. This proves that M\bm{M} is the unique optimum.

If p1n10p\geq\frac{1}{n^{10}} and PTRΩPTPTop12\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}, then we have

where the last inequality follows from the assumption PTRΩPTPTop12\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}. On the other hand, PΩ(Z)=0\mathcal{P}_{\Omega}(\bm{Z})=0 implies RΩ(Z)=0\mathcal{R}_{\Omega}(\bm{Z})=0 and thus

Combining the last two display equations gives

Appendix B Proofs of Technical Lemmas in Section 5

We prove the technical lemmas that are used in the proof of Theorem 1. The proofs use the matrix Bernstein inequality, restated below.

and XkB\left\|\bm{X}_{k}\right\|\leq B almost surely for all kk. Then for any c>1c>1, we have

with probability at least 1(2n)(c1).1-(2n)^{-(c-1)}.

We also make use of the following facts: for all ii and jj, we have

This follows from the definition of PT\mathcal{P}_{T} and the standard incoherence condition (2).

B.2 Proof of Lemma 3

Fix b[n]b\in[n]. The bb-th column of the matrix (PTRΩPT)Z\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z} can be written as

where the last inequality follows from the assumption of pp in the statement of the lemma. We also have

using the incoherence condition (2). It follows that

provided c0c_{0} in the lemma statement is large enough. In a similar fashion we prove that ea((PTRΩPT)Z)\left\|\bm{e}_{a}^{\top}\left(\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z}\right)\right\| is bounded by the same quantity w.h.p. The lemma follows from a union bound over all (a,b)[n]×[n](a,b)\in[n]\times[n].

Appendix C Proof of Corollary 1

When plog2nnp\gtrsim\frac{\log^{2}n}{n}, the standard Bernstein inequality and a union bound implies that w.h.p. the degrees (i.e., the number of observed entries) of the rows and columns of PΩM\mathcal{P}_{\Omega}\bm{M} are bounded by 2pn2pn. This means 1pM~Ω=1pPΩM=RΩM\frac{1}{p}\widetilde{\bm{M}}^{\Omega}=\frac{1}{p}\mathcal{P}_{\Omega}\bm{M}=\mathcal{R}_{\Omega}\bm{M}. By Lemma 4, we have

where we use (16) and (17) in the last inequality. Since the rank of MTr(M~Ω)\bm{M}-\textsf{T}_{r}(\widetilde{\bm{M}}^{\Omega}) is at most rr, we have MTr(M~Ω)FrMTr(M~Ω)\left\|\bm{M}-\textsf{T}_{r}(\widetilde{\bm{M}}^{\Omega})\right\|_{F}\leq\sqrt{r}\left\|\bm{M}-\textsf{T}_{r}(\widetilde{\bm{M}}^{\Omega})\right\| and the corollary follows.

Appendix D Proof of Theorem 2

The proof is similar to that of Theorem 1, and we shall point out where they differ. We use the same notations as in the proof of Theorem 1, except that throughout this section we re-define the two projections:

Note that PTZ+PTZ=UˉUˉZVˉVˉ\mathcal{P}_{T}\bm{Z}+\mathcal{P}_{T^{\bot}}\bm{Z}=\bar{\bm{U}}\bar{\bm{U}}^{\top}\bm{Z}\bar{\bm{V}}\bar{\bm{V}}^{\top}. Since \mboxcol(U)\mboxcol(Uˉ)\mbox{col}(\bm{U})\subseteq\mbox{col}(\bar{\bm{U}}) and μ0rnμˉ0rˉn\frac{\mu_{0}r}{n}\leq\frac{\bar{\mu}_{0}\bar{r}}{n}, one can verify that under the incoherence assumption on U\bm{U} and Uˉ\bar{\bm{U}} in the theorem statement, we have for all i,j,b[n]i,j,b\in[n],

We have the following subgradient optimality condition.

X:=UˉMVˉ\bm{X}^{*}:=\bar{\bm{U}}^{\top}\bm{M}\bar{\bm{V}} is the unique optimal solution to the program (6) if the following conditions are satisfied: 1. PTPTRΩPTop12\left\|\mathcal{P}_{T}-\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2} and 1pPΩPTop2μˉ0rˉμ0r\frac{1}{\sqrt{p}}\left\|\mathcal{P}_{\Omega}\mathcal{P}_{T^{\bot}}\right\|_{op}\leq\sqrt{\frac{2\bar{\mu}_{0}\bar{r}}{\mu_{0}r}}; 2. there exist a dual certificate Y\bm{Y} with PΩY=Y\mathcal{P}_{\Omega}\bm{Y}=\bm{Y} and obeys (a) PTYUVFμ0r32μˉ0rˉ\left\|\mathcal{P}_{T}\bm{Y}-\bm{U}\bm{V}^{\top}\right\|_{F}\leq\sqrt{\frac{\mu_{0}r}{32\bar{\mu}_{0}\bar{r}}} and (b) PTY12.\left\|\mathcal{P}_{T^{\bot}}\bm{Y}\right\|\leq\frac{1}{2}.

On the other hand, since PTRΩPTPTop12\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2} by assumption, we have

Because UˉUˉΔVˉVˉ=Δ\bar{\bm{U}}\bar{\bm{U}}^{\top}\boldsymbol{\Delta}\bar{\bm{V}}\bar{\bm{V}}^{\top}=\boldsymbol{\Delta}, we have 0=PΩ(Δ)=PΩ(PT+PT)Δ0=\mathcal{P}_{\Omega}(\boldsymbol{\Delta})=\mathcal{P}_{\Omega}\left(\mathcal{P}_{T}+\mathcal{P}_{T^{\bot}}\right)\boldsymbol{\Delta} and thus

where the last inequality follows from Condition 1 in the statement of the proposition. Combining the last two display equations gives PTΔF2μˉ0rˉμ0rPTΔ.\left\|\mathcal{P}_{T}\boldsymbol{\Delta}\right\|_{F}\leq\sqrt{\frac{2\bar{\mu}_{0}\bar{r}}{\mu_{0}r}}\left\|\mathcal{P}_{T^{\bot}}\boldsymbol{\Delta}\right\|_{*}. It follows from (21) that

The last RHS is strictly positive for all Δ\boldsymbol{\Delta} with PΩΔ=0\mathcal{P}_{\Omega}\boldsymbol{\Delta}=0 and Δ0\boldsymbol{\Delta}\neq 0; otherwise we would have PTΔ=(PT+PT)Δ=Δ\mathcal{P}_{T}\boldsymbol{\Delta}=\left(\mathcal{P}_{T}+\mathcal{P}_{T^{\bot}}\right)\boldsymbol{\Delta}=\boldsymbol{\Delta} and thus PTRΩPTΔ=0\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}\boldsymbol{\Delta}=0, contradicting PTRΩPTPTop12\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}. This proves that X:=UˉMVˉ\bm{X}^{*}:=\bar{\bm{U}}^{\top}\bm{M}\bar{\bm{V}} is the unique optimal solution to (6). ∎

We proceed by showing that Condition 1 in Proposition 3 is satisfied w.h.p. under the conditions of Theorem 2. This is done in the lemma below, which is proved in Section D.1 to follow.

If pc0μ0μˉ0rrˉn2lognp\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}}{n^{2}}\log n for some sufficiently large constant c0c_{0}, then w.h.p. we have

We now construct a dual certificate Y\bm{Y} using the golfing scheme. This is done similarly as before; in particular, we let k0:=20log(32μˉ0rˉ)k_{0}:=20\log(32\bar{\mu}_{0}\bar{r}), q:=1(1p)1/k0pk0q:=1-(1-p)^{1/k_{0}}\geq\frac{p}{k_{0}}, Wk\bm{W}_{k} be given by (12) (with the re-defined PT\mathcal{P}_{T}) and Y:=Wk0\bm{Y}:=\bm{W}_{k_{0}}. Clearly PΩ(Y)=Y\mathcal{P}_{\Omega}(\bm{Y})=\bm{Y} by construction. Note that for k[k0]k\in[k_{0}], the matrix Dk:=UVPT(Wk)\bm{D}_{k}:=\bm{U}\bm{V}^{\top}-\mathcal{P}_{T}(\bm{W}_{k}) again satisfies (13). It follows that DkF12Dk1F\left\|\bm{D}_{k}\right\|_{F}\leq\frac{1}{2}\left\|\bm{D}_{k-1}\right\|_{F} by the first inequality in Lemma 6 and thus

proving Condition 2(a) in Proposition 3. To prove Condition 2(b), we need three lemmas which are analogues of Lemmas 2, 3 and 4 in the proof of Theorem 1.

Suppose Z\bm{Z} is a fixed n×nn\times n matrix. For some universal constant c>1c>1, we have w.h.p.

Suppose Z\bm{Z} is a fixed matrix. If pc0μ0μˉ0rrˉlognnp\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}\log n}{n} for some c0c_{0} sufficiently large, then w.h.p.

Suppose Z\bm{Z} is a fixed n×nn\times n matrix. If pc0μ0μˉ0rrˉlognnp\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}\log n}{n} for some c0c_{0} sufficiently large, then w.h.p.

We prove these lemmas in Sections D.2–D.4 to follow. Following the same lines as in the proof of Theorem 1, we obtain

Applying Lemma 7 with Ω\Omega replaced by Ωk\Omega_{k} to each summand of the last R.H.S, we get that w.h.p.

where the last inequality follows from qp20log(32μˉ0rˉ)c0μ0μˉ0rrˉlogn20n2.q\geq\frac{p}{20\log(32\bar{\mu}_{0}\bar{r})}\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}\log n}{20n^{2}}. Again following the same lines as in the proof of Theorem 1, but using the new Lemmas 9 and 8, we can bound the two terms above as

The inequality PT(Y)12\left\|\mathcal{P}_{T^{\bot}}(\bm{Y})\right\|\leq\frac{1}{2} then follows from the incoherence conditions (2) of U\bm{U} and V\bm{V} provided c0c_{0} is sufficiently large. This proves Condition 2(b) in Proposition 3 and hence completes the proof of Theorem 2.

The proof of the first inequality is identical to that of Lemma 1 except that we use (18) instead of (15) (cf. Theorem 4.1 in and Lemma 11 in ).

Applying the the matrix Bernstein inequality in Theorem 4, we obtain w.h.p.

for some constant cc^{\prime}, where the last inequality follows from μ0rμˉ0rˉ\mu_{0}r\leq\bar{\mu}_{0}\bar{r} and the assumption pc0μ0μˉ0rrˉn2log(2n)p\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}}{n^{2}}\log(2n). It follows that

where in the last inequality we again use the assumption pc0μ0μˉ0rrˉn2log(2n).p\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}}{n^{2}}\log(2n).

D.2 Proof of Lemma 7

by (19). Moreover, since \mboxcol(U)\mboxcol(Uˉ)\mbox{col}(\bm{U})\subseteq\mbox{col}(\bar{\bm{U}}), \mboxcol(V)\mboxcol(Vˉ)\mbox{col}(\bm{V})\subseteq\mbox{col}(\bar{\bm{V}}) and UˉUˉUU\bar{\bm{U}}\bar{\bm{U}}^{\top}-\bm{U}\bm{U}^{\top}, VˉVˉVV\bar{\bm{V}}\bar{\bm{V}}^{\top}-\bm{V}\bm{V}^{\top} are projections, we have

D.3 Proof of Lemma 8

Fix b[n]b\in[n]. The bb-th column of the matrix (PTRΩPT)Z\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z} can be written as

by (20) and the assumption on pp. We also have

provided c0c_{0} in the statement of the lemma is sufficiently large. In a similar fashion we can prove that the ea((PTRΩPT)Z)2\left\|\bm{e}_{a}^{\top}\left(\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z}\right)\right\|_{2} is bounded by the same quantity w.h.p. The lemma follows from a union bound over all (a,b)[n]×[n](a,b)\in[n]\times[n].

D.4 Proof of Lemma 9

Fix (a,b)[n]×[n](a,b)\in[n]\times[n]. We can write the (a,b)(a,b) entry of the matrix (PTRΩPT)Z\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z} as

Applying the Bernstein inequality in Theorem 4, we conclude that w.h.p. [(PTRΩPTPT)Z]ab12Z()\left|\left[\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right)\bm{Z}\right]_{ab}\right|\leq\frac{1}{2}\left\|\bm{Z}\right\|_{(\infty)} for c0c_{0} sufficiently large. The lemma follows from a union bound over all (a,b)[n]×[n](a,b)\in[n]\times[n].

Appendix E Proof of Theorem 3

We reduce the planted problem above to the matrix decomposition problem using subsampling. Given the matrix Aˉ\bar{\bm{A}}, we set each Aˉij\bar{A}_{ij} to zero with probability 23\frac{2}{3} independently, and let A\bm{A} be the resulting matrix. If we let S:=AL\bm{S}^{*}:=\bm{A}-\bm{L}^{*}, then each pair Sij=SjiS_{ij}^{*}=S_{ji}^{*} is non-zero with probability τ=13\tau=\frac{1}{3}. Moreover, the matrix L\bm{L}^{*} has rank 11 and satisfies the standard and joint incoherence conditions (2) and (3) with parameters μ0=1/nmin\mu_{0}=1/n_{\min} and μ1=n2/nmin2\mu_{1}=n^{2}/n_{\min}^{2}. Hence recovering (L,S)\left(\bm{L}^{*},\bm{S}^{*}\right) from A\bm{A} is a special case of the matrix decomposition problem. If there exists a polynomial-time algorithm that, for all nn, finds L\bm{L}^{*} given A\bm{A} with probability at least 12\frac{1}{2} when

then it means this algorithm recovers the planted clique with nminn12ϵ2(1ϵ)n_{\min}\leq n^{\frac{1}{2}-\frac{\epsilon^{\prime}}{2(1-\epsilon^{\prime})}} from Aˉ\bar{\bm{A}}, which violates the assumption A1.

E.2 Part 2 of the theorem

With slight abuse of notation, we use D(q1q2):=q1logq1q2+(1q1)log1q11q2D(q_{1}\|q_{2}):=q_{1}\log\frac{q_{1}}{q_{2}}+(1-q_{1})\log\frac{1-q_{1}}{1-q_{2}} to denote the KL divergence between two Bernoulli distributions with parameters q1q_{1} and q2q_{2}. Direct computation gives

for all l,l=1,,Ml,l^{\prime}=1,\ldots,M, where the inequality above follows from logxx1.\log x\leq x-1. It follows that I(L;A)K+1.I(\bm{L}^{*};\bm{A})\leq K+1.

We now apply the Fano’s inequality to obtain that for any measurable function L^\hat{\bm{L}} of A\bm{A},

where the probability is with respect to the randomness of L\bm{L}^{*} and S\bm{S}^{*}, and the last inequality holds when logn12K=μ0rlogn12n1\frac{\log n}{12K}=\frac{\mu_{0}r\log n}{12n}\geq 1 and n10n\geq 10. Because the supremum is lower bounded by the average, we obtain

where the probability is with respect to the randomness of S\bm{S}^{*}.

References