Incoherence-Optimal Matrix Completion
Yudong Chen
Introduction
The matrix completion problem concerns recovering a low-rank matrix from an observed subset of its entries. Recent research has demonstrated the following remarkable fact: if a rank- matrix satisfies certain incoherence properties, then it is possible to exactly reconstruct the matrix with high probability from uniformly sampled entries using efficient polynomial-time algorithms.
In previous work, the sample complexity is achieved only for matrices that satisfy two types of incoherence conditions with constant parameters. The first condition, known as standard incoherence, is a natural and necessary requirement; it prevents the information of the row and column spaces of the matrix from being too concentrated in a few rows or columns. A second condition, called joint incoherence (or strong incoherence), is also needed. It requires the left and right singular vectors of the matrix to be unaligned with each other. This condition is quite unintuitive, and does not seem to have a natural interpretation. As we demonstrate later, this condition is often restrictive and precludes a large class of otherwise well-conditioned matrices. For example, positive semidefinite matrices have a non-constant joint incoherence parameter on the order of , and previous results thus require the number of observations to be proportional to instead of . In several applications of matrix completion discussed later, the joint incoherence condition leads to artificial and undesired constraints. In contrast, numerical experiments suggest that this condition is not needed.
In this paper, we prove that the joint incoherence condition is not necessary and can be completely eliminated. With uniformly sampled entries, one can recover a matrix that satisfies the standard incoherence condition (with a constant parameter) but is not jointly incoherent (e.g., a positive semidefinite matrix). As we show in Section 2, our sample complexity bounds are order-wise optimal with respect to not only the matrix dimensions and but also to its incoherence parameters except for a factor. As a consequence, we improve the sample complexity of recovering a positive semidefinite matrix from to , and the highest allowable rank from to .
the analysis of a Singular Value Decomposition (SVD) projection algorithm for matrix completion;
structured matrix completion and semi-supervised clustering with side information.
Finally, we turn to the closely related problem of matrix decomposition, where one is asked to recover a low-rank matrix and a sparse matrix from their sum. We show that the joint incoherence condition is necessary in this setting based on the computational complexity assumption of the Planted Clique problem. In particular, any decomposition algorithm that does not require the joint incoherence condition would solve Planted Clique with clique size , a problem that has been extensively studied and is widely believed to be intractable in polynomial time. This implies that it is computationally hard in general to separate a rank- positive semidefinite matrix and a sparse matrix. Interestingly, our results show that the standard incoherence condition is inherently associated the information-theoretic (or statistical) aspect of the problem, whereas the joint incoherence condition reflects the computational aspect.
We briefly survey existing related work; detailed comparisons with these results are provided after we present our theorems. Matrix completion is first studied in , which initiates the use of the nuclear norm minimization approach. The work in provides state-of-the-art theoretical guarantees on exact completion. Alternative algorithms for matrix completion are considered in . All these works require the joint incoherence condition (or a sample complexity that is at least quadratic in ). Our extensions to SVD projection, structured matrix completion and semi-supervised clustering are inspired by the work in ; we improve upon their results. The low-rank and sparse matrix decomposition problem is considered first in and subsequently in . The work in prove the success of specific algorithms assuming the standard and joint incoherence conditions. Our results show that these two incoherence conditions are in fact necessary for all algorithms, or all polynomial-time algorithms, due to statistical and computational reasons. The seminal work in is the first to use the Planted Clique problem to establish statistical limits under computational constraints; they consider the problem of sparse Principal Component Analysis (PCA). A similar approach is taken in for the submatrix detection problem.
In Section 2 we present our main result and show that the joint incoherence condition is not needed in matrix completion. We discuss extensions to SVD projection and structured matrix completion in Section 3. In Section 4 we turn to the matrix decomposition problem and show that the joint incoherence condition is unavoidable there. We prove our main theorem in Section 5, with some technical aspects of the proofs deferred to the appendix. The paper is concluded with a discussion in Section 6.
Main Results
where is the nuclear norm of the matrix , defined as the sum of its singular values. Our goal is to obtain sufficient conditions under which the optimal solution to the problem (1) is unique and equal to with high probability.
It is observed in that if is equal to zero in nearly all of rows or columns, then it is impossible to complete unless all of its entries of are observed. To avoid such pathological situations, it has become standard to assume to have additional properties known as incoherence. Suppose the rank- SVD of is . is said to satisfy the standard incoherence condition with parameter if
where are the -th standard basis with appropriate dimension. Note that . Previous work also requires to satisfy an additional joint incoherence (or strong incoherence) condition with parameter , defined as
In the following main theorem of the paper, we show that the joint incoherence is not necessary. The theorem only requires the standard incoherence condition.
Suppose satisfies the standard incoherence condition (2) with parameter . There exist universal constants such that if
then is the unique optimal solution to (1) with probability at least .
We provide comments and discussion in the next two sub-sections.
Candes and Tao prove the following lower-bound on the sample complexity of matrix completion.
Suppose and is sampled as above. If we do not have the condition
and the RHS above is less than , then with probability at least , there exist infinitely many pairs of distinct matrices of rank at most and obeying the standard incoherence condition (2) with parameter such that for all .
This shows that that is necessary for any method to determine (even if one knows and ahead of time). With an additional factor, Theorem 1 matches this lower bound. In particular, it is optimal in terms of its scaling with the incoherence parameter .
We note that the condition in Proposition 1 is an information/statistical lower-bound: when the value of is below this bound, there is not enough information in the observed entries to uniquely determine an rank-, -incoherent matrix even if one has infinite computational power. In Section 4, we show that in the closely related problem of matrix decomposition, the incoherence parameters are associated with both information and computational lower bounds.
2 Consequences and Comparison with Prior Work
Using an alternative algorithm, Keshavan et al. show that recovery can be achieved with
Similar results are given in , which also requires the sample complexity to be proportional to (or equivalently, quadratic in ). In light of Proposition 1, these results are not optimal with respect to the incoherence parameters due to the dependence on the joint incoherence . Theorem 1 eliminates this extra dependence.
The improvement in Theorem 1 is significant both qualitatively and quantitatively. The standard incoherence condition (2) is natural and necessary. A small standard incoherence parameter ensures that the information of the row and column spaces of is not too concentrated on a small number of rows/columns. In contrast, the joint incoherence assumption (3), which requires the matrices and containing the left and right singular vectors to be “unaligned” with each other, does not have a natural explanation. In applications, the quantity often has clear physical meanings while does not. For example, in the application to recovering the affinity matrices between clustered objects from partial observations (discussed in Section 3), is a function of the minimum cluster size, but a bound on bears no natural motivation. As another example, the work in uses Hankel matrix completion to recover spectrally sparse signals obeying two types of conditions. The first condition can be traced to standard incoherence and is equivalent to (the natural requirement of) the supporting frequencies being spread out. On the other hand, the second set of conditions, which resemble joint incoherence, cannot be reduced to a property of only the frequencies. The manuscript , which appeared after this paper was posted online , removes these second set of conditions using similar techniques as in Theorem 1.
Quantitatively, the joint incoherence condition is much more restrictive than standard incoherence. By Cauchy-Schwarz inequality we always have . The equality
Extensions
Our first example is the derivation of error bounds for an SVD-projection algorithm for matrix completion. Let be the matrix obtained from by setting all the unobserved entries to zero. Given the partial observation , Keshavan et al. propose the following two-step algorithm for approximating . Step 1: Set to zero all columns and rows in with degrees larger that , where the degree of a column or row is the number of non-zero entries of this column/row. Let be the output. Step 2: Compute the SVD of
that is, is the maximum of the row and column norms of .
Suppose . With high probability, we have
2 Structured Matrix Completion and Semi-Supervised Clustering
Given , and , we solve the following modified nuclear norm minimization problem:
For this formulation we have the following guarantee.
Suppose and satisfy the standard incoherence condition (2) with parameter , and and satisfies (2) with parameter . For some universal constants and , is the unique optimal solution to the program (6) with probability at least provided
Given , we can recover by since and . We prove this theorem in Appendix D.
Theorem 2 shows that with the knowledge of the -dimensional subspaces and , the number of observations needed to complete is on the order of , which is for constant and . If , meaning that we have strong structural information, then this number is much smaller than the usual requirement . On the other hand, setting recovers Theorem 1 for standard matrix completion where there is no additional structural information. We note that we assume is a square matrix here for simplicity; the results can be trivially extended to general rectangular matrices.
Near the completion of the writing of this paper, an independent study on structured matrix completion was made available. There they require among other things the following condition:In they consider the sampling without replacement model for the observed entries. Their results can be translated to the Bernoulli model considered in this paper, as we have done here. See also foot note 1.
where is the joint incoherence parameter of and defined in (3). Theorem 2 is better than the result in (7) in two ways. First, Theorem 2 avoids the superfluous dependence on the joint incoherence parameter , which can be as large as as previously discussed. Second, even in the ideal setting with , the bound in (7) requires to scale with , whereas the bound in Theorem (2) scales with , which is strictly smaller whenever . We note that discusses a nice application of structured matrix completion to the multi-label learning problem.
Specifically, considers the setup where the set of observed entries are distributed according to the Bernoulli model with probability ,To be precise, the diagonal entries are known; clearly having more observations cannot decrease the probability that the program (6) outputs the correct solution. Moreover, since the affinity matrix satisfies , each observation is a pair of entries of . This technicality can be easily handled, and we omit the details here. the smallest cluster size is , and has standard incoherence parameter as defined in (2). Note that the standard incoherence parameter of is due to the block diagonal structure of the affinity matrix . Using previous techniques in matrix completion, it is shown in that is the unique optimal solution to the program (6) w.h.p. provided
Note the quadratic term on the RHS, which is due to the joint incoherence parameter of taking the value . Suppose ; a consequence of (7) is that, even if is fully observed (), the cluster size must be at least and thus the possible number of clusters cannot exceed . These restrictions are undesirable, and clearly unnecessary when .
Using Theorem 2, we can eliminate these restrictions and significantly reduce the sample complexity. Plugging into the theorem, we obtain that the program (6) succeeds with high probability provided
The last RHS is order-wise smaller than the RHS of the previous bound (8) by a multiplicative factor of . In particular, when and ignoring logarithm factors, we allow the size of the clusters to be as small as and the number of clusters be as large as . These significantly improve over the results in which require and . Moreover, if , then our result require times fewer observations than the previous bound 8.
Incoherence in Matrix Decomposition: Information and Computational Lower Bounds
Having shown that the joint incoherence is not needed in matrix completion, we now turn to a closely related problem, namely low-rank and sparse matrix decomposition . In contrast to matrix completion, we show that the joint incoherence condition is unavoidable in matrix decomposition, at least if one asks for polynomial-time algorithms.
In fact, we show that the joint incoherence condition is not specific to the formulation (9), but is in fact required by all polynomial-time algorithms under a widely-believed computational complexity assumption. We prove this by connecting the matrix decomposition problem to the Planted Clique problem , defined as follows. A graph on nodes is generated by connecting each pairs of nodes independently with probability , and then randomly picking a subset of nodes and making them fully connected (hence a clique). The goal is to find the planted clique given the graph. The Planted Clique problem has been extensively studied; cf. for an overview of the known results. In the regime of , there is no known polynomial-time algorithm for this problem despite years of effort. In fact, this regime is widely believed to be intractable in polynomial time. The average case hardness of this regime has been proved under certain computational models , and has been utilized in cryptography and other applications . The work is the first to use this hardness assumption to obtain bounds on statistical accuracy of sparse PCA given computational constraints, and a similar approach is taken in for submatrix detection. We therefore adopt the following computational assumption on the Planted Clique problem, where we recall that a size clique is planted in an Erdos-Renyi random graph with nodes and edge probability .
A1 For any constant , there is no algorithm with running time polynomial in that, for all and with probability at least , finds the planted clique with size given the random graph.
This version of the assumption is similar to Conjecture 4.3 in .
The following theorem provides necessary conditions for the success of matrix decomposition algorithms. The proof is given in Appendix E.
The following two statements are true for the matrix decomposition problem with .
Suppose and the assumption A1 holds. For any constant , there is no algorithm with running time polynomial in that, for all and with probability at least , solves the matrix decomposition problemThis statement still holds if we restrict to matrix decomposition problems with and taking finitely many values, which can be encoded using a finite number of bits. This can be easily seen from the proof of the theorem. with
Suppose . There is no algorithm that, for all and with probability at least , solves the matrix decomposition problem with
If we modify the assumption A1 by assuming that the Planted -Clique problem with disjoint planted cliques of size is intractable in polynomial time, then the first part of the theorem holds with
Together with the second part of the theorem, this result shows that, under the planted clique assumption, the standard and joint incoherence conditions are both necessary for solving matrix decomposition in polynomial time. Therefore, the bound in (10) is unlikely to be improvable (up to a polylog factor) using polynomial-time algorithms. In particular, this implies that the matrix decomposition problem is intractable in general for positive semidefinite matrices with rank since in this case
We note that the first part of Theorem 3 is a computational limit. It is proved by showing that if there is a matrix decomposition algorithm that does not require the joint incoherence condition, then the algorithm would solve the computationally hard problem of finding a planted clique with size . On the other hand, the second part of the theorem is an information/statistical limit applicable to all algorithms regardless of their computational complexity, and is proved by an information-theoretic argument. Interestingly, Theorem 3 shows that the standard incoherence and the joint incoherence are associated with the statistical and computational aspects of the matrix decomposition problem, respectively.
Proof of Theorem 1
We now turn to the details. To simplify notion, we prove the results for square matrices (); the results for non-square matrices are proven in exactly the same fashion. Some additional notation is needed. We use and its derivatives (, etc.) for universal positive constants. By with high probability (w.h.p.) we mean with probability at least for some constants independent of the problem parameters (). Throughout the proof the constant can be made arbitrarily large by choosing the constant in Theorem 1 sufficiently large. The proof below involves random events, each of which is shown to occur with high probability. By the union bound their intersection also occurs with high probability.
A few additional notations are needed. The inner product between two matrices is given by . The projections and are given by
Following our proof roadmap, we now state a sufficient condition for to be the unique optimal solution to the optimization problem (1).
Suppose . The matrix is the unique optimal solution to (1) if the following conditions hold:
,
A somewhat different version of the proposition appears in . We prove the proposition in Appendix A.
The requirement in Proposition 2 clearly holds under the conditions of Theorem 1. The following standard result shows that the approximate isometry in Condition 1 is also satisfied.
If for some sufficiently large, then w.h.p.
Suppose is a fixed matrix. For a universal constant , we have w.h.p.
Suppose is a fixed matrix. If for some sufficiently large, then w.h.p.
Suppose is a fixed matrix in . If for some sufficiently large, then w.h.p.
Equipped with the lemmas above, we are ready to validate Condition 2 in Proposition 2.
Set for . By definition of , we have and
Note that is independent of and under the conditions in Theorem 1. Applying Lemma 1 with replaced by , we obtain that w.h.p.
for each . Applying the above inequality recursively with gives
Note that by construction. We therefore have
Applying Lemma 2 with replaced by to each summand of the last R.H.S., we get that w.h.p.
where the last inequality follows from . We proceed by bounding and . Using (13), and repeatedly applying Lemma 4 with replaced by , we obtain that w.h.p.
By Lemma 3 with replaced by , we obtain that w.h.p.
Using (13) and combining the last two display equations gives w.h.p.
But the standard incoherence condition (2) implies that
provided is sufficiently large. This completes the proof of Theorem 1.
Discussion
In this paper, we consider exact matrix completion and show that the joint incoherence condition imposed by all previous work is in fact not necessary. We discuss two extensions of this result, namely in bounding the approximation errors of SVD projection, and in structured matrix completion and semi-supervised clustering. We then show that the joint incoherence condition is unavoidable in the apparently similar problem of low-rank and sparse matrix decomposition based on the computational hardness assumption of the Planted Clique problem.
Acknowledgment
The author would like to thank Constantine Caramanis, Yuxin Chen, Sujay Sanghavi and Rachel Ward for their support and helpful comments. This work is supported by NSF grant EECS-1056028 and DTRA grant HDTRA 1-08-0029.
Appendix A Proof of Proposition 2
Consider any feasible solution to (1) with . Let be an matrix which satisfies and . Such always exists by duality between the nuclear norm and the spectral norm. Because is a sub-gradient of at , we get
We also have since . It follows that
where in the last inequality we use Conditions 1 and 2 in the proposition. Applying Lemma 5 below, we further obtain
The RHS is strictly positive for all with and . Otherwise we must have and , contradicting the assumption . This proves that is the unique optimum.
If and , then we have
where the last inequality follows from the assumption . On the other hand, implies and thus
Combining the last two display equations gives
Appendix B Proofs of Technical Lemmas in Section 5
We prove the technical lemmas that are used in the proof of Theorem 1. The proofs use the matrix Bernstein inequality, restated below.
and almost surely for all . Then for any , we have
with probability at least
We also make use of the following facts: for all and , we have
This follows from the definition of and the standard incoherence condition (2).
B.2 Proof of Lemma 3
Fix . The -th column of the matrix can be written as
where the last inequality follows from the assumption of in the statement of the lemma. We also have
using the incoherence condition (2). It follows that
provided in the lemma statement is large enough. In a similar fashion we prove that is bounded by the same quantity w.h.p. The lemma follows from a union bound over all .
Appendix C Proof of Corollary 1
When , the standard Bernstein inequality and a union bound implies that w.h.p. the degrees (i.e., the number of observed entries) of the rows and columns of are bounded by . This means . By Lemma 4, we have
where we use (16) and (17) in the last inequality. Since the rank of is at most , we have and the corollary follows.
Appendix D Proof of Theorem 2
The proof is similar to that of Theorem 1, and we shall point out where they differ. We use the same notations as in the proof of Theorem 1, except that throughout this section we re-define the two projections:
Note that . Since and , one can verify that under the incoherence assumption on and in the theorem statement, we have for all ,
We have the following subgradient optimality condition.
is the unique optimal solution to the program (6) if the following conditions are satisfied: 1. and ; 2. there exist a dual certificate with and obeys (a) and (b)
On the other hand, since by assumption, we have
Because , we have and thus
where the last inequality follows from Condition 1 in the statement of the proposition. Combining the last two display equations gives It follows from (21) that
The last RHS is strictly positive for all with and ; otherwise we would have and thus , contradicting . This proves that is the unique optimal solution to (6). ∎
We proceed by showing that Condition 1 in Proposition 3 is satisfied w.h.p. under the conditions of Theorem 2. This is done in the lemma below, which is proved in Section D.1 to follow.
If for some sufficiently large constant , then w.h.p. we have
We now construct a dual certificate using the golfing scheme. This is done similarly as before; in particular, we let , , be given by (12) (with the re-defined ) and . Clearly by construction. Note that for , the matrix again satisfies (13). It follows that by the first inequality in Lemma 6 and thus
proving Condition 2(a) in Proposition 3. To prove Condition 2(b), we need three lemmas which are analogues of Lemmas 2, 3 and 4 in the proof of Theorem 1.
Suppose is a fixed matrix. For some universal constant , we have w.h.p.
Suppose is a fixed matrix. If for some sufficiently large, then w.h.p.
Suppose is a fixed matrix. If for some sufficiently large, then w.h.p.
We prove these lemmas in Sections D.2–D.4 to follow. Following the same lines as in the proof of Theorem 1, we obtain
Applying Lemma 7 with replaced by to each summand of the last R.H.S, we get that w.h.p.
where the last inequality follows from Again following the same lines as in the proof of Theorem 1, but using the new Lemmas 9 and 8, we can bound the two terms above as
The inequality then follows from the incoherence conditions (2) of and provided is sufficiently large. This proves Condition 2(b) in Proposition 3 and hence completes the proof of Theorem 2.
The proof of the first inequality is identical to that of Lemma 1 except that we use (18) instead of (15) (cf. Theorem 4.1 in and Lemma 11 in ).
Applying the the matrix Bernstein inequality in Theorem 4, we obtain w.h.p.
for some constant , where the last inequality follows from and the assumption . It follows that
where in the last inequality we again use the assumption
D.2 Proof of Lemma 7
by (19). Moreover, since , and , are projections, we have
D.3 Proof of Lemma 8
Fix . The -th column of the matrix can be written as
by (20) and the assumption on . We also have
provided in the statement of the lemma is sufficiently large. In a similar fashion we can prove that the is bounded by the same quantity w.h.p. The lemma follows from a union bound over all .
D.4 Proof of Lemma 9
Fix . We can write the entry of the matrix as
Applying the Bernstein inequality in Theorem 4, we conclude that w.h.p. for sufficiently large. The lemma follows from a union bound over all .
Appendix E Proof of Theorem 3
We reduce the planted problem above to the matrix decomposition problem using subsampling. Given the matrix , we set each to zero with probability independently, and let be the resulting matrix. If we let , then each pair is non-zero with probability . Moreover, the matrix has rank and satisfies the standard and joint incoherence conditions (2) and (3) with parameters and . Hence recovering from is a special case of the matrix decomposition problem. If there exists a polynomial-time algorithm that, for all , finds given with probability at least when
then it means this algorithm recovers the planted clique with from , which violates the assumption A1.
E.2 Part 2 of the theorem
With slight abuse of notation, we use to denote the KL divergence between two Bernoulli distributions with parameters and . Direct computation gives
for all , where the inequality above follows from It follows that
We now apply the Fano’s inequality to obtain that for any measurable function of ,
where the probability is with respect to the randomness of and , and the last inequality holds when and . Because the supremum is lower bounded by the average, we obtain
where the probability is with respect to the randomness of .