Incoherence-Optimal Matrix Completion

Yudong Chen

Introduction

The matrix completion problem concerns recovering a low-rank matrix from an observed subset of its entries. Recent research has demonstrated the following remarkable fact: if a rank- $r$ $n\times n$ matrix satisfies certain incoherence properties, then it is possible to exactly reconstruct the matrix with high probability from $nr\mbox{polylog}(n)\ll n^{2}$ uniformly sampled entries using efficient polynomial-time algorithms.

In previous work, the sample complexity $\Theta(nr\mbox{polylog}(n))$ is achieved only for matrices that satisfy two types of incoherence conditions with constant parameters. The first condition, known as standard incoherence, is a natural and necessary requirement; it prevents the information of the row and column spaces of the matrix from being too concentrated in a few rows or columns. A second condition, called joint incoherence (or strong incoherence), is also needed. It requires the left and right singular vectors of the matrix to be unaligned with each other. This condition is quite unintuitive, and does not seem to have a natural interpretation. As we demonstrate later, this condition is often restrictive and precludes a large class of otherwise well-conditioned matrices. For example, positive semidefinite matrices have a non-constant joint incoherence parameter on the order of $\Omega(r)$ , and previous results thus require the number of observations to be proportional to $nr^{2}$ instead of $nr$ . In several applications of matrix completion discussed later, the joint incoherence condition leads to artificial and undesired constraints. In contrast, numerical experiments suggest that this condition is not needed.

In this paper, we prove that the joint incoherence condition is not necessary and can be completely eliminated. With $\Omega(nr\log^{2}n)$ uniformly sampled entries, one can recover a matrix that satisfies the standard incoherence condition (with a constant parameter) but is not jointly incoherent (e.g., a positive semidefinite matrix). As we show in Section 2, our sample complexity bounds are order-wise optimal, up to a $\log n$ factor, with respect to not only the matrix dimensions $n$ and $r$ but also its incoherence parameters. As a consequence, we improve the sample complexity of recovering a positive semidefinite matrix from $O(nr^{2}\log^{2}n)$ to $O(nr\log^{2}n)$, and the highest allowable rank from $\Theta(\sqrt{n}/\log n)$ to $\Theta(n/\log^{2}n)$.
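The improvement for positive semidefinite matrices can be illustrated numerically. The following is a minimal sketch; the function name `psd_sample_bounds` and the constant `c` are hypothetical placeholders, since the text gives only the scaling and not the constants.

```python
import math


def psd_sample_bounds(n, r, c=1.0):
    """Order-of-magnitude sample counts for a rank-r PSD n x n matrix.

    The constant c is a placeholder (assumption): only the scaling in n and r
    is taken from the text.
    """
    log_sq = math.log(n) ** 2
    previous = c * n * r**2 * log_sq  # O(n r^2 log^2 n): earlier guarantees
    improved = c * n * r * log_sq     # O(n r log^2 n): standard incoherence only
    return previous, improved


previous, improved = psd_sample_bounds(n=10_000, r=50)
print(f"previous ~ {previous:.3g}, improved ~ {improved:.3g}")
print(f"saving factor = {previous / improved:.0f}")
```

Whatever value is chosen for `c`, the ratio of the two counts equals $r$, which is the factor saved by removing the joint incoherence condition.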

We also discuss two extensions of this result:

the analysis of a Singular Value Decomposition (SVD) projection algorithm for matrix completion;

structured matrix completion and semi-supervised clustering with side information.

Finally, we turn to the closely related problem of matrix decomposition, where one is asked to recover a low-rank matrix and a sparse matrix from their sum. We show that the joint incoherence condition is necessary in this setting based on the computational complexity assumption of the Planted Clique problem. In particular, any decomposition algorithm that does not require the joint incoherence condition would solve Planted Clique with clique size $o(\sqrt{n})$, a problem that has been extensively studied and is widely believed to be intractable in polynomial time. This implies that it is computationally hard in general to separate a rank-$\omega(\sqrt{n})$ positive semidefinite matrix and a sparse matrix. Interestingly, our results show that the standard incoherence condition is inherently associated with the information-theoretic (or statistical) aspect of the problem, whereas the joint incoherence condition reflects the computational aspect.

We briefly survey existing related work; detailed comparisons with these results are provided after we present our theorems. Matrix completion is first studied in , which initiates the use of the nuclear norm minimization approach. The work in provides state-of-the-art theoretical guarantees on exact completion. Alternative algorithms for matrix completion are considered in . All these works require the joint incoherence condition (or a sample complexity that is at least quadratic in $r$ ). Our extensions to SVD projection, structured matrix completion and semi-supervised clustering are inspired by the work in ; we improve upon their results. The low-rank and sparse matrix decomposition problem is considered first in and subsequently in . The work in prove the success of specific algorithms assuming the standard and joint incoherence conditions. Our results show that these two incoherence conditions are in fact necessary for all algorithms, or all polynomial-time algorithms, due to statistical and computational reasons. The seminal work in is the first to use the Planted Clique problem to establish statistical limits under computational constraints; they consider the problem of sparse Principal Component Analysis (PCA). A similar approach is taken in for the submatrix detection problem.

In Section 2 we present our main result and show that the joint incoherence condition is not needed in matrix completion. We discuss extensions to SVD projection and structured matrix completion in Section 3. In Section 4 we turn to the matrix decomposition problem and show that the joint incoherence condition is unavoidable there. We prove our main theorem in Section 5, with some technical aspects of the proofs deferred to the appendix. The paper is concluded with a discussion in Section 6.

Main Results

where $\left\|\bm{X}\right\|_{*}$ is the nuclear norm of the matrix $\bm{X}$ , defined as the sum of its singular values. Our goal is to obtain sufficient conditions under which the optimal solution to the problem (1) is unique and equal to $\bm{M}$ with high probability.
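For orientation, the program referred to as (1) is presumably the standard nuclear norm minimization over matrices that agree with $\bm{M}$ on the observed index set $\Omega$. The form below is an assumption, inferred from the feasibility constraint $\mathcal{P}_{\Omega}(\bm{X})=\mathcal{P}_{\Omega}(\bm{M})$ used in Appendix A:

```latex
\begin{equation*}
\min_{\bm{X}\in\mathbb{R}^{n_{1}\times n_{2}}}\ \left\|\bm{X}\right\|_{*}
\quad\text{s.t.}\quad X_{ij}=M_{ij}\ \text{ for all } (i,j)\in\Omega .
\end{equation*}
```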

It is observed in that if $\bm{M}$ is equal to zero in nearly all of its rows or columns, then it is impossible to complete $\bm{M}$ unless all of its entries are observed. To avoid such pathological situations, it has become standard to assume that $\bm{M}$ has additional properties known as incoherence. Suppose the rank-$r$ SVD of $\bm{M}$ is $\bm{U}\mathbf{\Sigma}\bm{V}^{\top}$. $\bm{M}$ is said to satisfy the standard incoherence condition with parameter $\mu_{0}$ if

where $\bm{e}_{i}$ is the $i$-th standard basis vector of appropriate dimension. Note that $1\leq\mu_{0}\leq\frac{\min\{n_{1},n_{2}\}}{r}$. Previous work also requires $\bm{M}$ to satisfy an additional joint incoherence (or strong incoherence) condition with parameter $\mu_{1}$, defined as

In the following main theorem of the paper, we show that the joint incoherence is not necessary. The theorem only requires the standard incoherence condition.
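The two incoherence parameters can be computed directly from the singular vectors. The sketch below assumes the usual normalizations, namely $\max_{i}\|\bm{U}^{\top}\bm{e}_{i}\|^{2}\leq\mu_{0}r/n_{1}$, $\max_{j}\|\bm{V}^{\top}\bm{e}_{j}\|^{2}\leq\mu_{0}r/n_{2}$ and $\|\bm{U}\bm{V}^{\top}\|_{\infty}\leq\sqrt{\mu_{1}r/(n_{1}n_{2})}$; these forms are consistent with the relation $\mu_{1}r\leq\mu_{0}^{2}r^{2}$ used later, but they are assumptions here. The helper name `incoherence_parameters` is hypothetical.

```python
def incoherence_parameters(U, V):
    """Return (mu0, mu1) for factors U (n1 x r) and V (n2 x r), given as lists of rows.

    Assumes the columns of U and V are orthonormal, and assumes the
    normalizations stated in the lead-in above.
    """
    n1, n2, r = len(U), len(V), len(U[0])
    # Largest squared row norm, i.e. max_i ||U^T e_i||^2 (and likewise for V).
    row_energy_u = max(sum(x * x for x in row) for row in U)
    row_energy_v = max(sum(x * x for x in row) for row in V)
    mu0 = max(n1 * row_energy_u, n2 * row_energy_v) / r
    # Largest entry of U V^T in absolute value.
    max_entry = max(abs(sum(a * b for a, b in zip(u, v))) for u in U for v in V)
    mu1 = n1 * n2 * max_entry**2 / r
    return mu0, mu1


# Positive semidefinite toy example: U = V, with two orthonormal columns
# supported on disjoint pairs of coordinates (n = 4, r = 2).
s = 0.5 ** 0.5
U = [[s, 0.0], [s, 0.0], [0.0, s], [0.0, s]]
mu0, mu1 = incoherence_parameters(U, U)
print(mu0, mu1)
```

In this example $\mu_{0}$ is approximately $1$, the smallest possible value, while $\mu_{1}$ is approximately $2=r$, which illustrates why a positive semidefinite matrix cannot have a constant joint incoherence parameter.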

Suppose $\bm{M}$ satisfies the standard incoherence condition (2) with parameter $\mu_{0}$ . There exist universal constants $c_{0},c_{1},c_{2}>0$ such that if

then $\bm{M}$ is the unique optimal solution to (1) with probability at least $1-c_{1}(n_{1}+n_{2})^{-c_{2}}$ .

We provide comments and discussion in the next two sub-sections.

Candès and Tao prove the following lower bound on the sample complexity of matrix completion.

Suppose $n_{1}=n_{2}=n$ and $\Omega$ is sampled as above. If we do not have the condition

and the RHS above is less than $1$, then with probability at least $\frac{1}{4}$, there exist infinitely many pairs of distinct matrices $\bm{M}^{\prime}\neq\bm{M}^{\prime\prime}$ of rank at most $r$ and obeying the standard incoherence condition (2) with parameter $\mu_{0}$ such that $M^{\prime}_{ij}=M^{\prime\prime}_{ij}$ for all $(i,j)\in\Omega$.

This shows that $p\gtrsim\mu_{0}r\log(n)/n$ is necessary for any method to determine $\bm{M}$ (even if one knows $r$ and $\mu_{0}$ ahead of time). With an additional $c^{\prime}\log(n)$ factor, Theorem 1 matches this lower bound. In particular, it is optimal in terms of its scaling with the incoherence parameter $\mu_{0}$.

We note that the condition in Proposition 1 is an information/statistical lower bound: when the value of $p$ is below this bound, there is not enough information in the observed entries to uniquely determine a rank-$r$, $\mu_{0}$-incoherent matrix even if one has infinite computational power. In Section 4, we show that in the closely related problem of matrix decomposition, the incoherence parameters are associated with both information and computational lower bounds.

Consequences and Comparison with Prior Work

Using an alternative algorithm, Keshavan et al. show that recovery can be achieved with

Similar results are given in , which also requires the sample complexity to be proportional to $\mu_{1}$ (or equivalently, quadratic in $r$ ). In light of Proposition 1, these results are not optimal with respect to the incoherence parameters due to the dependence on the joint incoherence $\mu_{1}$ . Theorem 1 eliminates this extra dependence.

The improvement in Theorem 1 is significant both qualitatively and quantitatively. The standard incoherence condition (2) is natural and necessary. A small standard incoherence parameter $\mu_{0}$ ensures that the information of the row and column spaces of $\bm{M}$ is not too concentrated on a small number of rows/columns. In contrast, the joint incoherence assumption (3), which requires the matrices $\bm{U}$ and $\bm{V}$ containing the left and right singular vectors to be “unaligned” with each other, does not have a natural explanation. In applications, the quantity $\mu_{0}$ often has a clear physical meaning while $\mu_{1}$ does not. For example, in the application to recovering the affinity matrices between clustered objects from partial observations (discussed in Section 3), $\mu_{0}$ is a function of the minimum cluster size, but a bound on $\mu_{1}$ bears no natural motivation. As another example, the work in uses Hankel matrix completion to recover spectrally sparse signals obeying two types of conditions. The first condition can be traced to standard incoherence and is equivalent to (the natural requirement of) the supporting frequencies being spread out. On the other hand, the second set of conditions, which resembles joint incoherence, cannot be reduced to a property of only the frequencies. The manuscript , which appeared after this paper was posted online , removes this second set of conditions using techniques similar to those in Theorem 1.

Quantitatively, the joint incoherence condition is much more restrictive than standard incoherence. By the Cauchy-Schwarz inequality we always have $\mu_{1}r\leq\mu_{0}^{2}r^{2}$. The equality

Extensions

Our first example is the derivation of error bounds for an SVD-projection algorithm for matrix completion. Let $\mathcal{P}_{\Omega}\bm{M}$ be the matrix obtained from $\bm{M}$ by setting all the unobserved entries to zero. Given the partial observation $\mathcal{P}_{\Omega}\bm{M}$, Keshavan et al. propose the following two-step algorithm for approximating $\bm{M}$. Step 1: Set to zero all columns and rows in $\mathcal{P}_{\Omega}\bm{M}$ with degrees larger than $2pn$, where the degree of a column or row is the number of non-zero entries of this column/row. Let $\widetilde{\bm{M}}^{\Omega}$ be the output. Step 2: Compute the SVD of $\widetilde{\bm{M}}^{\Omega}$

that is, $\left\|\bm{M}\right\|_{\infty,2}$ is the maximum of the row and column norms of $\bm{M}$ .
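The trimming in Step 1 and the $(\infty,2)$ norm just described can be sketched in a few lines. This is an illustration only: the function names are hypothetical, unobserved entries are assumed to be stored as zeros, and degrees are assumed to be counted on the input matrix before any row or column is zeroed.

```python
def trim_high_degree(observed, p):
    """Step 1: zero out every row and column whose degree exceeds 2 * p * n.

    `observed` is a square list of rows holding P_Omega(M); unobserved entries
    are stored as 0. Degrees are computed once, on the untrimmed input
    (an assumption about the order of operations).
    """
    n = len(observed)
    threshold = 2 * p * n
    row_deg = [sum(1 for x in row if x != 0) for row in observed]
    col_deg = [sum(1 for i in range(n) if observed[i][j] != 0) for j in range(n)]
    return [
        [
            0 if row_deg[i] > threshold or col_deg[j] > threshold else observed[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]


def norm_inf_2(M):
    """Largest Euclidean norm among all rows and all columns of M."""
    row_norms = [sum(x * x for x in row) ** 0.5 for row in M]
    col_norms = [sum(row[j] ** 2 for row in M) ** 0.5 for j in range(len(M[0]))]
    return max(row_norms + col_norms)


observed = [
    [1, 2, 3, 4],
    [0, 5, 0, 0],
    [0, 0, 6, 0],
    [0, 0, 0, 7],
]
# With p = 0.3 the degree threshold is 2 * 0.3 * 4 = 2.4, so only the first
# row (degree 4) is removed; every column has degree at most 2.
trimmed = trim_high_degree(observed, p=0.3)
print(trimmed, norm_inf_2(trimmed))
```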

Suppose $p\geq c_{0}\frac{\log n}{n}$ . With high probability, we have

Structured Matrix Completion and Semi-Supervised Clustering

Given $\mathcal{P}_{\Omega}\bm{M}$ , $\bar{\bm{U}}$ and $\bar{\bm{V}}$ , we solve the following modified nuclear norm minimization problem:

For this formulation we have the following guarantee.

Suppose $\bm{U}$ and $\bm{V}$ satisfy the standard incoherence condition (2) with parameter $\mu_{0}$, and $\bar{\bm{U}}$ and $\bar{\bm{V}}$ satisfy (2) with parameter $\bar{\mu}_{0}$. For some universal constants $c_{0},c_{1}$ and $c_{2}$, $\bm{X}^{*}:=\bar{\bm{U}}^{\top}\bm{M}\bar{\bm{V}}$ is the unique optimal solution to the program (6) with probability at least $1-c_{1}n^{-c_{2}}$ provided

Given $\bm{X}^{*}$ , we can recover $\bm{M}$ by $\bm{M}=\bar{\bm{U}}\bm{X}^{*}\bar{\bm{V}}^{\top}$ since $\bar{\bm{U}}\bar{\bm{U}}^{\top}\bm{U}=\bm{U}$ and $\bar{\bm{V}}\bar{\bm{V}}^{\top}\bm{V}=\bm{V}$ . We prove this theorem in Appendix D.
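The recovery identity $\bm{M}=\bar{\bm{U}}\bm{X}^{*}\bar{\bm{V}}^{\top}$ can be checked on a toy instance. The matrices below are hypothetical values chosen only so that $\text{col}(\bm{U})\subseteq\text{col}(\bar{\bm{U}})$ holds; the example takes $\bm{V}=\bm{U}$ and $\bar{\bm{V}}=\bar{\bm{U}}$ for brevity.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Hypothetical toy instance with n = 3, r = 1 and r_bar = 2.
U_bar = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # orthonormal basis of the known subspace
u = [[0.6], [0.8], [0.0]]                      # unit vector lying inside col(U_bar)
M = matmul(u, transpose(u))                    # rank-1 PSD matrix

# X* = U_bar^T M U_bar is the small r_bar x r_bar unknown of the program.
X_star = matmul(matmul(transpose(U_bar), M), U_bar)
# Lifting X* back with U_bar reproduces the full n x n matrix M.
M_rebuilt = matmul(matmul(U_bar, X_star), transpose(U_bar))

max_err = max(abs(a - b) for ra, rb in zip(M, M_rebuilt) for a, b in zip(ra, rb))
print(X_star, max_err)
```

The unknown shrinks from $n^{2}$ entries to $\bar{r}^{2}$ entries, which is the source of the reduced sample complexity discussed next.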

Theorem 2 shows that with the knowledge of the $\bar{r}$-dimensional subspaces $\text{col}(\bar{\bm{U}})$ and $\text{col}(\bar{\bm{V}})$, the number of observations needed to complete $\bm{M}$ is on the order of $pn^{2}\asymp\mu_{0}\bar{\mu}_{0}r\bar{r}\log(\bar{\mu}_{0}\bar{r})\log n$, which is $\Theta(r\bar{r}\log\bar{r}\log n)$ for constant $\mu_{0}$ and $\bar{\mu}_{0}$. If $\bar{r}\ll n$, meaning that we have strong structural information, then this number is much smaller than the usual requirement $\Theta(nr\log^{2}n)$. On the other hand, setting $\bar{r}=n$ recovers Theorem 1 for standard matrix completion where there is no additional structural information. We note that we assume $\bm{M}$ is a square matrix here for simplicity; the results can be trivially extended to general rectangular matrices.
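The scaling in the preceding paragraph can be evaluated numerically. The sketch below reads the bound with a single $\log(\bar{\mu}_{0}\bar{r})$ factor, which is the reading consistent with the $\Theta(r\bar{r}\log\bar{r}\log n)$ summary, and sets the universal constant to $1$. Both function names and the constant are assumptions made for illustration.

```python
import math


def structured_observation_order(n, r, r_bar, mu0=1.0, mu0_bar=1.0):
    """Order of p * n^2 with side information (universal constant set to 1)."""
    return mu0 * mu0_bar * r * r_bar * math.log(mu0_bar * r_bar) * math.log(n)


def standard_observation_order(n, r, mu0=1.0):
    """Order of p * n^2 for standard completion: mu0 * n * r * log^2 n."""
    return mu0 * n * r * math.log(n) ** 2


n, r = 10_000, 10
with_side_info = structured_observation_order(n, r, r_bar=100)
# Taking r_bar = n forces mu0_bar = 1, and the count reduces to the standard one.
no_side_info = structured_observation_order(n, r, r_bar=n)
print(with_side_info, no_side_info, standard_observation_order(n, r))
```

With $\bar{r}=n$ the two functions agree, mirroring the statement that Theorem 1 is recovered when no structural information is available.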

Near the completion of the writing of this paper, an independent study on structured matrix completion was made available. That study considers the sampling without replacement model for the observed entries; its results can be translated to the Bernoulli model considered in this paper, as we have done here (see also footnote 1). Among other things, it requires the following condition:

where $\mu_{1}$ is the joint incoherence parameter of $\bm{U}$ and $\bm{V}$ defined in (3). Theorem 2 is better than the result in (7) in two ways. First, Theorem 2 avoids the superfluous dependence on the joint incoherence parameter $\mu_{1}$, which can be as large as $\mu_{0}^{2}r$ as previously discussed. Second, even in the ideal setting with $\mu_{1}=\mu_{0}$, the bound in (7) requires $p$ to scale with $\max\left(\mu_{0}^{2},\bar{\mu}_{0}^{2}\right)$, whereas the bound in Theorem 2 scales with $\mu_{0}\bar{\mu}_{0}$, which is strictly smaller whenever $\mu_{0}\neq\bar{\mu}_{0}$. We note that discusses a nice application of structured matrix completion to the multi-label learning problem.

Specifically, considers the setup where the set of observed entries $\Omega$ is distributed according to the Bernoulli model with probability $p$, the smallest cluster size is $n_{\min}$, and $\bar{\bm{U}}$ has standard incoherence parameter $\bar{\mu}_{0}$ as defined in (2). (To be precise, the diagonal entries $M_{ii}=1$ are known; clearly having more observations cannot decrease the probability that the program (6) outputs the correct solution. Moreover, since the affinity matrix satisfies $M_{ij}=M_{ji}$, each observation is a pair of entries of $\bm{M}$. This technicality can be easily handled, and we omit the details here.) Note that the standard incoherence parameter of $\bm{U}$ is $n/(rn_{\min})$ due to the block diagonal structure of the affinity matrix $\bm{M}$. Using previous techniques in matrix completion, it is shown in that $\bm{X}^{*}:=\bar{\bm{U}}^{\top}\bm{M}\bar{\bm{U}}$ is the unique optimal solution to the program (6) w.h.p. provided

Note the quadratic term $n_{\min}^{2}$ on the RHS, which is due to the joint incoherence parameter of $\bm{U}$ taking the value $n^{2}/(rn_{\min}^{2})$ . Suppose $\bar{r}=n$ ; a consequence of (7) is that, even if $\bm{M}$ is fully observed ( $p=1$ ), the cluster size must be at least $n_{\min}=\Theta(\sqrt{n})$ and thus the possible number of clusters $r$ cannot exceed $n/n_{\min}=\Theta(\sqrt{n})$ . These restrictions are undesirable, and clearly unnecessary when $p=1$ .
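The two incoherence values quoted for the affinity matrix, $n/(rn_{\min})$ and $n^{2}/(rn_{\min}^{2})$, can be reproduced from the cluster structure. The sketch assumes that the singular vectors of the block diagonal affinity matrix are the normalized cluster indicator vectors, and that the joint incoherence is normalized as $\|\bm{U}\bm{U}^{\top}\|_{\infty}\leq\sqrt{\mu_{1}r}/n$; the function names are hypothetical.

```python
def cluster_singular_vectors(cluster_sizes):
    """Rows of U for a block-diagonal affinity matrix (normalized indicators)."""
    r = len(cluster_sizes)
    U = []
    for k, size in enumerate(cluster_sizes):
        for _ in range(size):
            row = [0.0] * r
            row[k] = size ** -0.5  # each indicator column has unit norm
            U.append(row)
    return U


def clustering_incoherence(cluster_sizes):
    """Return (mu0, mu1) computed directly from U, under the assumed normalizations."""
    U = cluster_singular_vectors(cluster_sizes)
    n, r = len(U), len(cluster_sizes)
    mu0 = n / r * max(sum(x * x for x in row) for row in U)
    max_entry = max(abs(sum(a * b for a, b in zip(u, v))) for u in U for v in U)
    mu1 = n**2 / r * max_entry**2
    return mu0, mu1


sizes = [2, 3, 5]  # hypothetical clusters: n = 10, r = 3, n_min = 2
mu0, mu1 = clustering_incoherence(sizes)
print(mu0, 10 / (3 * 2))        # both approximately n / (r * n_min)
print(mu1, 10**2 / (3 * 2**2))  # both approximately n^2 / (r * n_min^2)
```

For this matrix $\mu_{1}r=\mu_{0}^{2}r^{2}$ up to rounding, so the Cauchy-Schwarz bound is attained and the joint incoherence parameter is as large as it can be.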

Using Theorem 2, we can eliminate these $\sqrt{n}$ restrictions and significantly reduce the sample complexity. Plugging $\mu_{0}=n/(rn_{\min})$ into the theorem, we obtain that the program (6) succeeds with high probability provided

The last RHS is order-wise smaller than the RHS of the previous bound (8) by a multiplicative factor of $\frac{n_{\min}}{n}\cdot\frac{\log(\bar{\mu}_{0}\bar{r})}{\log n}$. In particular, when $\bar{r}=n$ and ignoring logarithmic factors, we allow the size of the clusters to be as small as $n_{\min}=\Theta\left(1\right)$ and the number of clusters to be as large as $r=\Theta\left(n\right)$. These significantly improve over the results in which require $n_{\min}=\Omega(\sqrt{n})$ and $r=O(\sqrt{n})$. Moreover, if $n_{\min}=\sqrt{n}$, then our result requires $n/n_{\min}=\sqrt{n}$ times fewer observations than the previous bound (8).

Incoherence in Matrix Decomposition: Information and Computational Lower Bounds

Having shown that the joint incoherence is not needed in matrix completion, we now turn to a closely related problem, namely low-rank and sparse matrix decomposition . In contrast to matrix completion, we show that the joint incoherence condition is unavoidable in matrix decomposition, at least if one asks for polynomial-time algorithms.

In fact, we show that the joint incoherence condition is not specific to the formulation (9), but is required by all polynomial-time algorithms under a widely-believed computational complexity assumption. We prove this by connecting the matrix decomposition problem to the Planted Clique problem , defined as follows. A graph on $n$ nodes is generated by connecting each pair of nodes independently with probability $\frac{1}{2}$, and then randomly picking a subset of $n_{\min}$ nodes and making them fully connected (hence a clique). The goal is to find the planted clique given the graph. The Planted Clique problem has been extensively studied; cf. for an overview of the known results. In the regime of $n_{\min}=o(\sqrt{n})$, there is no known polynomial-time algorithm for this problem despite years of effort. In fact, this regime is widely believed to be intractable in polynomial time. The average case hardness of this regime has been proved under certain computational models , and has been utilized in cryptography and other applications . The work is the first to use this hardness assumption to obtain bounds on statistical accuracy of sparse PCA given computational constraints, and a similar approach is taken in for submatrix detection. We therefore adopt the following computational assumption on the Planted Clique problem, where we recall that a size $n_{\min}$ clique is planted in an Erdős-Rényi random graph $G(n,\frac{1}{2})$ with $n$ nodes and edge probability $\frac{1}{2}$.

A1 For any constant $\epsilon>0$ , there is no algorithm with running time polynomial in $n$ that, for all $n$ and with probability at least $\frac{1}{2}$ , finds the planted clique with size $n_{\min}\leq n^{\frac{1}{2}-\epsilon}$ given the random graph.

This version of the assumption is similar to Conjecture 4.3 in .
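The Planted Clique instance described above can be sampled as follows. This is a minimal sketch using only the standard library; the function name and the `seed` parameter are hypothetical conveniences, and the code generates instances only (it does not attempt to find the clique).

```python
import random


def planted_clique_graph(n, clique_size, seed=None):
    """Sample G(n, 1/2) and plant a clique on a uniformly random node subset.

    Returns the symmetric 0/1 adjacency matrix (no self-loops) and the sorted
    list of planted nodes.
    """
    rng = random.Random(seed)
    adj = [[0] * n for _ in range(n)]
    # Connect each pair of nodes independently with probability 1/2.
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < 0.5:
                adj[i][j] = adj[j][i] = 1
    # Pick the clique nodes uniformly at random and connect all of them.
    clique = sorted(rng.sample(range(n), clique_size))
    for a in clique:
        for b in clique:
            if a != b:
                adj[a][b] = 1
    return adj, clique


adj, clique = planted_clique_graph(n=30, clique_size=5, seed=0)
print(clique)
```

The hard regime in assumption A1 corresponds to `clique_size` growing more slowly than the square root of `n`.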

The following theorem provides necessary conditions for the success of matrix decomposition algorithms. The proof is given in Appendix E.

The following two statements are true for the matrix decomposition problem with $\tau=1/3$ .

Suppose $r=1$ and the assumption A1 holds. For any constant $\epsilon^{\prime}>0$, there is no algorithm with running time polynomial in $n$ that, for all $n$ and with probability at least $\frac{1}{2}$, solves the matrix decomposition problem (this statement still holds if we restrict to matrix decomposition problems with $\bm{L}^{*}$ and $\bm{S}^{*}$ taking finitely many values, which can be encoded using a finite number of bits; this can be easily seen from the proof of the theorem) with

Suppose $\mu_{0}\geq 2$ . There is no algorithm that, for all $n$ and with probability at least $\frac{1}{2}$ , solves the matrix decomposition problem with

If we modify the assumption A1 by assuming that the Planted $r$ -Clique problem with $r$ disjoint planted cliques of size $o(\sqrt{n})$ is intractable in polynomial time, then the first part of the theorem holds with

Together with the second part of the theorem, this result shows that, under the planted clique assumption, the standard and joint incoherence conditions are both necessary for solving matrix decomposition in polynomial time. Therefore, the bound in (10) is unlikely to be improvable (up to a polylog factor) using polynomial-time algorithms. In particular, this implies that the matrix decomposition problem is intractable in general for positive semidefinite matrices with rank $r=\omega(\sqrt{n})$ since in this case $\mu_{1}r=\mu_{0}^{2}r^{2}\geq r^{2}.$

We note that the first part of Theorem 3 is a computational limit. It is proved by showing that if there is a matrix decomposition algorithm that does not require the joint incoherence condition, then the algorithm would solve the computationally hard problem of finding a planted clique with size $n_{\min}=o(\sqrt{n})$ . On the other hand, the second part of the theorem is an information/statistical limit applicable to all algorithms regardless of their computational complexity, and is proved by an information-theoretic argument. Interestingly, Theorem 3 shows that the standard incoherence and the joint incoherence are associated with the statistical and computational aspects of the matrix decomposition problem, respectively.

Proof of Theorem 1

We now turn to the details. To simplify notation, we prove the results for square matrices ( $n_{1}=n_{2}=n$ ); the results for non-square matrices are proven in exactly the same fashion. Some additional notation is needed. We use $c$ and its derivatives ( $c^{\prime},c_{0}$, etc.) for universal positive constants. By with high probability (w.h.p.) we mean with probability at least $1-c_{1}n^{-c_{2}}$ for some constants $c_{1},c_{2}>0$ independent of the problem parameters ( $n,r,p,\mu_{0},\mu_{1}$ ). Throughout the proof the constant $c_{2}$ can be made arbitrarily large by choosing the constant $c_{0}$ in Theorem 1 sufficiently large. The proof below involves $80\log n+1$ random events, each of which is shown to occur with high probability. By the union bound their intersection also occurs with high probability.
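To make the last step concrete, the union bound can be written out as follows. Here $\mathcal{E}_{i}$ denotes the $i$-th of the random events (notation introduced only for this illustration), and the constant $81$ comes from the crude estimate $80\log n+1\leq 81n$ for $n\geq 1$, which is an illustrative choice rather than part of the proof:

```latex
\begin{align*}
\Pr\Bigl[\,\bigcup_{i=1}^{80\log n+1}\mathcal{E}_{i}^{c}\Bigr]
\;\leq\;\sum_{i=1}^{80\log n+1}\Pr\bigl[\mathcal{E}_{i}^{c}\bigr]
\;\leq\;(80\log n+1)\,c_{1}n^{-c_{2}}
\;\leq\;81\,c_{1}\,n^{-(c_{2}-1)} .
\end{align*}
```

Since $c_{2}$ can be made arbitrarily large, the intersection of all the events still holds with probability of the form $1-c_{1}^{\prime}n^{-c_{2}^{\prime}}$.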

A few additional notations are needed. The inner product between two matrices is given by $\left\langle\bm{X},\bm{Z}\right\rangle:=\mbox{trace}(\bm{X}^{\top}\bm{Z})$ . The projections $\mathcal{P}_{T}$ and $\mathcal{P}_{T^{\bot}}$ are given by

Following our proof roadmap, we now state a sufficient condition for $\bm{M}$ to be the unique optimal solution to the optimization problem (1).

Suppose $p\geq\frac{1}{n}$ . The matrix $\bm{M}$ is the unique optimal solution to (1) if the following conditions hold:

$\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}.$

$\left\|\mathcal{P}_{T^{\bot}}(\bm{Y})\right\|\leq\frac{1}{2}$ ,

$\left\|\mathcal{P}_{T}(\bm{Y})-\bm{U}\bm{V}^{\top}\right\|_{F}\leq\frac{1}{4n}.$

A somewhat different version of the proposition appears in . We prove the proposition in Appendix A.

The requirement $p\geq\frac{1}{n}$ in Proposition 2 clearly holds under the conditions of Theorem 1. The following standard result shows that the approximate isometry in Condition 1 is also satisfied.

If $p\geq c_{0}\frac{\mu_{0}r\log n}{n}$ for some $c_{0}$ sufficiently large, then w.h.p.

Suppose $\bm{Z}$ is a fixed $n\times n$ matrix. For a universal constant $c>1$ , we have w.h.p.

Suppose $\bm{Z}$ is a fixed matrix. If $p\geq c_{0}\frac{\mu_{0}r\log n}{n}$ for some $c_{0}$ sufficiently large, then w.h.p.

Suppose $\bm{Z}$ is a fixed $n\times n$ matrix in $T$. If $p\geq c_{0}\frac{\mu_{0}r\log n}{n}$ for some $c_{0}$ sufficiently large, then w.h.p.

Equipped with the lemmas above, we are ready to validate Condition 2 in Proposition 2.

Set $\bm{D}_{k}:=\bm{U}\bm{V}^{\top}-\mathcal{P}_{T}(\bm{W}_{k})$ for $k=0,\ldots,k_{0}$ . By definition of $\bm{W}_{k}$ , we have $\bm{D}_{0}=\bm{U}\bm{V}^{\top}$ and

Note that $\Omega_{k}$ is independent of $\bm{D}_{k-1}$ and $q\geq p/k_{0}\geq c_{0}\mu_{0}r\log(n)/n$ under the conditions in Theorem 1. Applying Lemma 1 with $\Omega$ replaced by $\Omega_{k}$ , we obtain that w.h.p.

for each $k$. Applying the above inequality recursively with $k=k_{0},k_{0}-1,\ldots,1$ gives

Note that $\bm{Y}=\sum_{k=1}^{k_{0}}\mathcal{R}_{\Omega_{k}}\mathcal{P}_{T}\left(\bm{D}_{k-1}\right)$ by construction. We therefore have

Applying Lemma 2 with $\Omega$ replaced by $\Omega_{k}$ to each summand of the last R.H.S., we get that w.h.p.

where the last inequality follows from $q\geq c_{0}\mu_{0}r\log(n)/n$ . We proceed by bounding $\left\|\bm{D}_{k-1}\right\|_{\infty}$ and $\left\|\bm{D}_{k-1}\right\|_{\infty,2}$ . Using (13), and repeatedly applying Lemma 4 with $\Omega$ replaced by $\Omega_{k}$ , we obtain that w.h.p.

By Lemma 3 with $\Omega$ replaced by $\Omega_{k}$ , we obtain that w.h.p.

Using (13) and combining the last two display equations gives w.h.p.

But the standard incoherence condition (2) implies that

provided $c_{0}$ is sufficiently large. This completes the proof of Theorem 1.

Discussion

In this paper, we consider exact matrix completion and show that the joint incoherence condition imposed by all previous work is in fact not necessary. We discuss two extensions of this result, namely in bounding the approximation errors of SVD projection, and in structured matrix completion and semi-supervised clustering. We then show that the joint incoherence condition is unavoidable in the apparently similar problem of low-rank and sparse matrix decomposition based on the computational hardness assumption of the Planted Clique problem.

Acknowledgment

The author would like to thank Constantine Caramanis, Yuxin Chen, Sujay Sanghavi and Rachel Ward for their support and helpful comments. This work is supported by NSF grant EECS-1056028 and DTRA grant HDTRA 1-08-0029.

Appendix A Proof of Proposition 2

Consider any feasible solution $\bm{X}$ to (1) with $\mathcal{P}_{\Omega}(\bm{X})=\mathcal{P}_{\Omega}(\bm{M})$. Let $\bm{G}$ be an $n\times n$ matrix which satisfies $\left\|\mathcal{P}_{T^{\bot}}\bm{G}\right\|=1$ and $\left\langle\mathcal{P}_{T^{\bot}}\bm{G},\mathcal{P}_{T^{\bot}}(\bm{X}-\bm{M})\right\rangle=\left\|\mathcal{P}_{T^{\bot}}(\bm{X}-\bm{M})\right\|_{*}$. Such a $\bm{G}$ always exists by duality between the nuclear norm and the spectral norm. Because $\bm{U}\bm{V}^{\top}+\mathcal{P}_{T^{\bot}}\bm{G}$ is a sub-gradient of $\left\|\bm{Z}\right\|_{*}$ at $\bm{Z}=\bm{M}$, we get

We also have $\left\langle\bm{Y},\bm{X}-\bm{M}\right\rangle=\left\langle\mathcal{P}_{\Omega}(\bm{Y}),\mathcal{P}_{\Omega}(\bm{X}-\bm{M})\right\rangle=0$ since $\mathcal{P}_{\Omega}(\bm{Y})=\bm{Y}$ . It follows that

where in the last inequality we use Conditions 1 and 2 in the proposition. Applying Lemma 5 below, we further obtain

The RHS is strictly positive for all $\bm{X}$ with $\mathcal{P}_{\Omega}(\bm{X}-\bm{M})=0$ and $\bm{X}\neq\bm{M}$ . Otherwise we must have $\mathcal{P}_{T}(\bm{X}-\bm{M})=\bm{X}-\bm{M}$ and $\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}(\bm{X}-\bm{M})=0$ , contradicting the assumption $\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}$ . This proves that $\bm{M}$ is the unique optimum.

If $p\geq\frac{1}{n^{10}}$ and $\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}$ , then we have

where the last inequality follows from the assumption $\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}$ . On the other hand, $\mathcal{P}_{\Omega}(\bm{Z})=0$ implies $\mathcal{R}_{\Omega}(\bm{Z})=0$ and thus

Combining the last two display equations gives

Appendix B Proofs of Technical Lemmas in Section 5

We prove the technical lemmas that are used in the proof of Theorem 1. The proofs use the matrix Bernstein inequality, restated below.

and $\left\|\bm{X}_{k}\right\|\leq B$ almost surely for all $k$ . Then for any $c>1$ , we have

with probability at least $1-(2n)^{-(c-1)}.$

We also make use of the following facts: for all $i$ and $j$ , we have

This follows from the definition of $\mathcal{P}_{T}$ and the standard incoherence condition (2).

B.2 Proof of Lemma 3

Fix $b\in[n]$ . The $b$ -th column of the matrix $\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z}$ can be written as

where the last inequality follows from the assumption of $p$ in the statement of the lemma. We also have

using the incoherence condition (2). It follows that

provided $c_{0}$ in the lemma statement is large enough. In a similar fashion we prove that $\left\|\bm{e}_{a}^{\top}\left(\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z}\right)\right\|$ is bounded by the same quantity w.h.p. The lemma follows from a union bound over all $(a,b)\in[n]\times[n]$ .

Appendix C Proof of Corollary 1

When $p\gtrsim\frac{\log^{2}n}{n}$, the standard Bernstein inequality and a union bound imply that w.h.p. the degrees (i.e., the number of observed entries) of the rows and columns of $\mathcal{P}_{\Omega}\bm{M}$ are bounded by $2pn$. This means $\frac{1}{p}\widetilde{\bm{M}}^{\Omega}=\frac{1}{p}\mathcal{P}_{\Omega}\bm{M}=\mathcal{R}_{\Omega}\bm{M}$. By Lemma 4, we have

where we use (16) and (17) in the last inequality. Since $\bm{M}$ and $\textsf{T}_{r}(\widetilde{\bm{M}}^{\Omega})$ each have rank at most $r$, the rank of $\bm{M}-\textsf{T}_{r}(\widetilde{\bm{M}}^{\Omega})$ is at most $2r$, so we have $\left\|\bm{M}-\textsf{T}_{r}(\widetilde{\bm{M}}^{\Omega})\right\|_{F}\leq\sqrt{2r}\left\|\bm{M}-\textsf{T}_{r}(\widetilde{\bm{M}}^{\Omega})\right\|$ and the corollary follows.

Appendix D Proof of Theorem 2

The proof is similar to that of Theorem 1, and we shall point out where they differ. We use the same notations as in the proof of Theorem 1, except that throughout this section we re-define the two projections:

Note that $\mathcal{P}_{T}\bm{Z}+\mathcal{P}_{T^{\bot}}\bm{Z}=\bar{\bm{U}}\bar{\bm{U}}^{\top}\bm{Z}\bar{\bm{V}}\bar{\bm{V}}^{\top}$ . Since $\mbox{col}(\bm{U})\subseteq\mbox{col}(\bar{\bm{U}})$ and $\frac{\mu_{0}r}{n}\leq\frac{\bar{\mu}_{0}\bar{r}}{n}$ , one can verify that under the incoherence assumption on $\bm{U}$ and $\bar{\bm{U}}$ in the theorem statement, we have for all $i,j,b\in[n]$ ,

We have the following subgradient optimality condition.

$\bm{X}^{*}:=\bar{\bm{U}}^{\top}\bm{M}\bar{\bm{V}}$ is the unique optimal solution to the program (6) if the following conditions are satisfied: 1. $\left\|\mathcal{P}_{T}-\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}$ and $\frac{1}{\sqrt{p}}\left\|\mathcal{P}_{\Omega}\mathcal{P}_{T^{\bot}}\right\|_{op}\leq\sqrt{\frac{2\bar{\mu}_{0}\bar{r}}{\mu_{0}r}}$; 2. there exists a dual certificate $\bm{Y}$ that satisfies $\mathcal{P}_{\Omega}\bm{Y}=\bm{Y}$ and obeys (a) $\left\|\mathcal{P}_{T}\bm{Y}-\bm{U}\bm{V}^{\top}\right\|_{F}\leq\sqrt{\frac{\mu_{0}r}{32\bar{\mu}_{0}\bar{r}}}$ and (b) $\left\|\mathcal{P}_{T^{\bot}}\bm{Y}\right\|\leq\frac{1}{2}.$

On the other hand, since $\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}$ by assumption, we have

Because $\bar{\bm{U}}\bar{\bm{U}}^{\top}\boldsymbol{\Delta}\bar{\bm{V}}\bar{\bm{V}}^{\top}=\boldsymbol{\Delta}$ , we have $0=\mathcal{P}_{\Omega}(\boldsymbol{\Delta})=\mathcal{P}_{\Omega}\left(\mathcal{P}_{T}+\mathcal{P}_{T^{\bot}}\right)\boldsymbol{\Delta}$ and thus

where the last inequality follows from Condition 1 in the statement of the proposition. Combining the last two display equations gives $\left\|\mathcal{P}_{T}\boldsymbol{\Delta}\right\|_{F}\leq\sqrt{\frac{2\bar{\mu}_{0}\bar{r}}{\mu_{0}r}}\left\|\mathcal{P}_{T^{\bot}}\boldsymbol{\Delta}\right\|_{*}.$ It follows from (21) that

The last RHS is strictly positive for all $\boldsymbol{\Delta}$ with $\mathcal{P}_{\Omega}\boldsymbol{\Delta}=0$ and $\boldsymbol{\Delta}\neq 0$ ; otherwise we would have $\mathcal{P}_{T}\boldsymbol{\Delta}=\left(\mathcal{P}_{T}+\mathcal{P}_{T^{\bot}}\right)\boldsymbol{\Delta}=\boldsymbol{\Delta}$ and thus $\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}\boldsymbol{\Delta}=0$ , contradicting $\left\|\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|_{op}\leq\frac{1}{2}$ . This proves that $\bm{X}^{*}:=\bar{\bm{U}}^{\top}\bm{M}\bar{\bm{V}}$ is the unique optimal solution to (6). ∎

We proceed by showing that Condition 1 in Proposition 3 is satisfied w.h.p. under the conditions of Theorem 2. This is done in the lemma below, which is proved in Section D.1 to follow.

If $p\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}}{n^{2}}\log n$ for some sufficiently large constant $c_{0}$ , then w.h.p. we have

We now construct a dual certificate $\bm{Y}$ using the golfing scheme. The construction is the same as before; in particular, we let $k_{0}:=20\log(32\bar{\mu}_{0}\bar{r})$ , $q:=1-(1-p)^{1/k_{0}}\geq\frac{p}{k_{0}}$ , $\bm{W}_{k}$ be given by (12) (with the re-defined $\mathcal{P}_{T}$ ) and $\bm{Y}:=\bm{W}_{k_{0}}$ . Clearly $\mathcal{P}_{\Omega}(\bm{Y})=\bm{Y}$ by construction. Note that for $k\in[k_{0}]$ , the matrix $\bm{D}_{k}:=\bm{U}\bm{V}^{\top}-\mathcal{P}_{T}(\bm{W}_{k})$ again satisfies (13). It follows that $\left\|\bm{D}_{k}\right\|_{F}\leq\frac{1}{2}\left\|\bm{D}_{k-1}\right\|_{F}$ by the first inequality in Lemma 6 and thus

proving Condition 2(a) in Proposition 3. To prove Condition 2(b), we need three lemmas which are analogues of Lemmas 2, 3 and 4 in the proof of Theorem 1.
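
The golfing-scheme parameters and the geometric decay behind Condition 2(a) can be sanity-checked numerically. The following is a minimal sketch, not part of the proof. It assumes that $k_{0}$ is rounded up to an integer, that $\bm{D}_{0}=\bm{U}\bm{V}^{\top}$ (so $\|\bm{D}_{0}\|_{F}=\sqrt{r}$), and it uses hypothetical parameter values chosen only so that $\mu_{0}r\leq\bar{\mu}_{0}\bar{r}$. The function names are illustrative.

```python
import math

def golfing_parameters(p, mu0_bar, r_bar):
    """Return (k0, q) for the golfing scheme.

    k0 = 20*log(32*mu0_bar*r_bar) is rounded up to an integer here
    (an assumption made for illustration), and q = 1 - (1-p)**(1/k0).
    """
    k0 = math.ceil(20 * math.log(32 * mu0_bar * r_bar))
    q = 1 - (1 - p) ** (1 / k0)
    return k0, q

def residual_bound(k0, r):
    """Upper bound on ||D_{k0}||_F after k0 halvings, assuming D_0 = U V^T
    so that ||D_0||_F = sqrt(r)."""
    return 0.5 ** k0 * math.sqrt(r)

# Hypothetical parameter values, chosen only so that mu0*r <= mu0_bar*r_bar.
p, mu0, r, mu0_bar, r_bar = 0.1, 1.5, 2, 2.0, 5
k0, q = golfing_parameters(p, mu0_bar, r_bar)
target = math.sqrt(mu0 * r / (32 * mu0_bar * r_bar))  # RHS of Condition 2(a)

# Prints k0, whether q >= p/k0, and whether the decayed residual meets 2(a).
print(k0, q >= p / k0, residual_bound(k0, r) <= target)
```

The check $q\geq p/k_{0}$ is an instance of Bernoulli's inequality $(1-p)^{1/k_{0}}\leq 1-p/k_{0}$, which holds for any $k_{0}\geq 1$ and $p\in[0,1]$.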

Suppose $\bm{Z}$ is a fixed $n\times n$ matrix. For some universal constant $c>1$ , we have w.h.p.

Suppose $\bm{Z}$ is a fixed matrix. If $p\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}\log n}{n}$ for some $c_{0}$ sufficiently large, then w.h.p.

Suppose $\bm{Z}$ is a fixed $n\times n$ matrix. If $p\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}\log n}{n}$ for some $c_{0}$ sufficiently large, then w.h.p.

We prove these lemmas in Sections D.2–D.4 to follow. Following the same lines as in the proof of Theorem 1, we obtain

Applying Lemma 7 with $\Omega$ replaced by $\Omega_{k}$ to each summand of the last RHS, we get that w.h.p.

where the last inequality follows from $q\geq\frac{p}{20\log(32\bar{\mu}_{0}\bar{r})}\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}\log n}{20n^{2}}.$ Again following the same lines as in the proof of Theorem 1, but using the new Lemmas 9 and 8, we can bound the two terms above as

The inequality $\left\|\mathcal{P}_{T^{\bot}}(\bm{Y})\right\|\leq\frac{1}{2}$ then follows from the incoherence conditions (2) of $\bm{U}$ and $\bm{V}$ provided $c_{0}$ is sufficiently large. This proves Condition 2(b) in Proposition 3 and hence completes the proof of Theorem 2.

D.1 Proof of Lemma 6

The proof of the first inequality is identical to that of Lemma 1 except that we use (18) instead of (15) (cf. Theorem 4.1 in and Lemma 11 in ).

Applying the matrix Bernstein inequality in Theorem 4, we obtain w.h.p.

for some constant $c^{\prime}$ , where the last inequality follows from $\mu_{0}r\leq\bar{\mu}_{0}\bar{r}$ and the assumption $p\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}}{n^{2}}\log(2n)$ . It follows that

where in the last inequality we again use the assumption $p\geq c_{0}\frac{\mu_{0}\bar{\mu}_{0}r\bar{r}}{n^{2}}\log(2n).$

D.2 Proof of Lemma 7

by (19). Moreover, since $\mbox{col}(\bm{U})\subseteq\mbox{col}(\bar{\bm{U}})$ , $\mbox{col}(\bm{V})\subseteq\mbox{col}(\bar{\bm{V}})$ and $\bar{\bm{U}}\bar{\bm{U}}^{\top}-\bm{U}\bm{U}^{\top}$ , $\bar{\bm{V}}\bar{\bm{V}}^{\top}-\bm{V}\bm{V}^{\top}$ are projections, we have

D.3 Proof of Lemma 8

Fix $b\in[n]$ . The $b$ -th column of the matrix $\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z}$ can be written as

by (20) and the assumption on $p$ . We also have

provided $c_{0}$ in the statement of the lemma is sufficiently large. In a similar fashion we can prove that $\left\|\bm{e}_{a}^{\top}\left(\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z}\right)\right\|_{2}$ is bounded by the same quantity w.h.p. The lemma follows from a union bound over all $(a,b)\in[n]\times[n]$ .

D.4 Proof of Lemma 9

Fix $(a,b)\in[n]\times[n]$ . We can write the $(a,b)$ entry of the matrix $\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}-\mathcal{P}_{T}\right)\bm{Z}$ as

Applying the Bernstein inequality in Theorem 4, we conclude that w.h.p. $\left|\left[\left(\mathcal{P}_{T}\mathcal{R}_{\Omega}\mathcal{P}_{T}-\mathcal{P}_{T}\right)\bm{Z}\right]_{ab}\right|\leq\frac{1}{2}\left\|\bm{Z}\right\|_{(\infty)}$ for $c_{0}$ sufficiently large. The lemma follows from a union bound over all $(a,b)\in[n]\times[n]$ .

Appendix E Proof of Theorem 3

E.1 Part 1 of the theorem

We reduce the planted problem above to the matrix decomposition problem using subsampling. Given the matrix $\bar{\bm{A}}$ , we set each $\bar{A}_{ij}$ to zero with probability $\frac{2}{3}$ independently, and let $\bm{A}$ be the resulting matrix. If we let $\bm{S}^{*}:=\bm{A}-\bm{L}^{*}$ , then each pair $S_{ij}^{*}=S_{ji}^{*}$ is non-zero with probability $\tau=\frac{1}{3}$ . Moreover, the matrix $\bm{L}^{*}$ has rank $1$ and satisfies the standard and joint incoherence conditions (2) and (3) with parameters $\mu_{0}=n/n_{\min}$ and $\mu_{1}=n^{2}/n_{\min}^{2}$ . Hence recovering $\left(\bm{L}^{*},\bm{S}^{*}\right)$ from $\bm{A}$ is a special case of the matrix decomposition problem. If there exists a polynomial-time algorithm that, for all $n$ , finds $\bm{L}^{*}$ given $\bm{A}$ with probability at least $\frac{1}{2}$ when

then this algorithm recovers the planted clique with $n_{\min}\leq n^{\frac{1}{2}-\frac{\epsilon^{\prime}}{2(1-\epsilon^{\prime})}}$ from $\bar{\bm{A}}$ , which violates Assumption A1.

E.2 Part 2 of the theorem

With a slight abuse of notation, we use $D(q_{1}\|q_{2}):=q_{1}\log\frac{q_{1}}{q_{2}}+(1-q_{1})\log\frac{1-q_{1}}{1-q_{2}}$ to denote the KL divergence between two Bernoulli distributions with parameters $q_{1}$ and $q_{2}$ . Direct computation gives

for all $l,l^{\prime}=1,\ldots,M$ , where the inequality above follows from $\log x\leq x-1.$ It follows that $I(\bm{L}^{*};\bm{A})\leq K+1.$
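
The Bernoulli KL divergence $D(q_{1}\|q_{2})$ defined above, together with the elementary inequality $\log x\leq x-1$, can be illustrated numerically. Applying $\log x\leq x-1$ to both logarithms gives the general bound $D(q_{1}\|q_{2})\leq\frac{(q_{1}-q_{2})^{2}}{q_{2}(1-q_{2})}$. The sketch below checks this on an illustrative grid; the grid values and function names are assumptions and are not the specific parameters used in the proof.

```python
import math

def bernoulli_kl(q1, q2):
    """KL divergence D(q1 || q2) between Bernoulli(q1) and Bernoulli(q2),
    for 0 < q2 < 1, with the convention 0*log(0) = 0."""
    total = 0.0
    if q1 > 0:
        total += q1 * math.log(q1 / q2)
    if q1 < 1:
        total += (1 - q1) * math.log((1 - q1) / (1 - q2))
    return total

def chi_square_bound(q1, q2):
    """Upper bound on D(q1 || q2) obtained by applying log x <= x - 1
    to both logarithms: (q1 - q2)^2 / (q2 * (1 - q2))."""
    return (q1 - q2) ** 2 / (q2 * (1 - q2))

# Illustrative grid of parameters (not taken from the paper).
grid = [0.05, 0.2, 1 / 3, 0.5, 0.8]
for q1 in grid:
    for q2 in grid:
        # Small tolerance guards against floating-point rounding at q1 == q2.
        assert bernoulli_kl(q1, q2) <= chi_square_bound(q1, q2) + 1e-12
```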

We now apply Fano’s inequality to obtain that for any measurable function $\hat{\bm{L}}$ of $\bm{A}$ ,

where the probability is with respect to the randomness of $\bm{L}^{*}$ and $\bm{S}^{*}$ , and the last inequality holds when $\frac{\log n}{12K}=\frac{\mu_{0}r\log n}{12n}\geq 1$ and $n\geq 10$ . Because the supremum is lower bounded by the average, we obtain

where the probability is with respect to the randomness of $\bm{S}^{*}$ .
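
For reference, the standard form of Fano's inequality used in arguments of this type states that, when the hypothesis is drawn uniformly from $M$ candidates, any estimator errs with probability at least $1-\frac{I+\log 2}{\log M}$, where $I$ is the mutual information between the hypothesis and the observation. The sketch below evaluates this generic bound with natural logarithms. The function name and the numerical inputs are hypothetical; in particular, the values of $M$ and $K$ from the proof are not reproduced here.

```python
import math

def fano_error_lower_bound(mutual_information, num_hypotheses):
    """Generic Fano lower bound on the error probability.

    mutual_information: an upper bound on I(hypothesis; observation), in nats.
    num_hypotheses: number M >= 2 of equally likely hypotheses.
    The result is clipped at 0 because the raw expression can be negative.
    """
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    raw = 1 - (mutual_information + math.log(2)) / math.log(num_hypotheses)
    return max(0.0, raw)

# Hypothetical inputs: a mutual-information bound of the form K + 1 with K = 3,
# and an assumed hypothesis count M = 10**6.
print(fano_error_lower_bound(3 + 1, 10**6))
```

The bound is informative only when $\log M$ is large compared with the mutual information, which is why the proof needs the mutual information to be controlled by a quantity such as $K+1$.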