Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent

Qinqing Zheng, John Lafferty

Introduction

A growing body of recent research is shedding new light on the role of nonconvex optimization for tackling large scale problems in machine learning, signal processing, and convex programming. This work is developing techniques that help to explain the surprising effectiveness of relatively simple first-order algorithms for certain nonconvex optimizations.

When applied to problems that can be formulated as semidefinite programs, these techniques can often be viewed as part of a framework proposed by Burer and Monteiro . The Burer-Monteiro technique is based on factoring the semidefinite variable, and applying classical optimization techniques to the resulting nonconvex objective over the factor. While worst-case complexity considerations imply that such an approach cannot succeed in general, a series of recent papers has shown the strategy to be remarkably effective for a number of problems of practical interest, with analytical convergence guarantees and strong empirical performance.

In order for this problem to be well-posed, it is important to understand when $X^{\star}$ is identifiable and, in particular, the unique minimizer of (1). Moreover, because the problem is in general NP-hard, it is essential to identify tractable families of instances, together with efficient algorithms having global convergence guarantees.

In the following section we give a full description of our approach. Our theoretical results are presented in Section 3, with detailed proofs contained in the appendix. Our analysis subsumes the case where $X^{\star}$ is positive semidefinite. In Section 4 we briefly review related work. The experimental results are presented in Section 5, and we conclude with a brief discussion of future work in Section 6.

Semidefinite Lifting, Factorization, and Gradient Descent

In this paper, we focus on completing an incoherent or “non-spiky” matrix $X^{\star}$. With $U^{\star}\Sigma^{\star}{V^{\star}}^{\top}$ denoting the rank-$r$ SVD of $X^{\star}$, we assume $X^{\star}$ is $\mu$-incoherent, as defined below.

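The matrix $X^{\star}$ is $\mu$-incoherent with respect to the canonical basis if its singular vectors satisfy

$$\left\|U^{\star}\right\|_{2,\infty}\leq\sqrt{\frac{\mu r}{n_{1}}},\qquad\left\|V^{\star}\right\|_{2,\infty}\leq\sqrt{\frac{\mu r}{n_{2}}},\tag{3}$$

where $\mu$ is a constant. Note that $\mu\geq 1$, since $r=\|U^{\star}\|^{2}_{F}=\sum_{i\in[n_{1}]}\left\|U^{\star}_{(i)}\right\|^{2}_{2}\leq\mu r$.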
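As a small illustration of this definition, the sketch below computes the smallest $\mu$ for which a given matrix is $\mu$-incoherent. It assumes the condition bounds the squared row norms of $U^{\star}$ and $V^{\star}$ by $\mu r/n_{1}$ and $\mu r/n_{2}$, which is the reading implied by the inequality $\sum_{i}\|U^{\star}_{(i)}\|_{2}^{2}\leq\mu r$ above. The function name `incoherence` is illustrative and not part of the paper.

```python
import numpy as np

def incoherence(X, r):
    """Smallest mu such that the rank-r singular vectors of X are mu-incoherent.

    Assumption of this sketch: squared row norms are bounded by
    mu * r / n1 (for U) and mu * r / n2 (for V).
    """
    n1, n2 = X.shape
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T
    mu_u = n1 / r * np.max(np.sum(U**2, axis=1))
    mu_v = n2 / r * np.max(np.sum(V**2, axis=1))
    return max(mu_u, mu_v)

# A flat rank-1 matrix spreads its mass evenly over all rows and columns,
# while a single spike concentrates everything in one row and one column.
flat = np.ones((6, 4))
spike = np.zeros((6, 6))
spike[0, 0] = 1.0
mu_flat = incoherence(flat, 1)
mu_spike = incoherence(spike, 1)
```

The flat matrix attains the lower bound $\mu=1$, whereas the spike gives $\mu=n$, the least favorable case for completion from random entries.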

Our main interest is the uniform model where $m$ entries of $X^{\star}$ are observed uniformly at random, though we shall analyze a Bernoulli sampling model, where each entry of $X^{\star}$ is observed with probability $p={m}/{n_{1}n_{2}}$ . One can transfer the results back to the uniform model, as the probability of failure under the uniform model is at most twice that under the Bernoulli model; see .
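The two sampling models can be sketched as follows. This is a minimal illustration, assuming the uniform model draws exactly $m$ entries without replacement; the helper names `bernoulli_mask` and `uniform_mask` are hypothetical.

```python
import numpy as np

def bernoulli_mask(n1, n2, p, rng):
    """Bernoulli model: observe each entry independently with probability p."""
    return rng.random((n1, n2)) < p

def uniform_mask(n1, n2, m, rng):
    """Uniform model: observe exactly m entries chosen uniformly at random."""
    idx = rng.choice(n1 * n2, size=m, replace=False)
    mask = np.zeros(n1 * n2, dtype=bool)
    mask[idx] = True
    return mask.reshape(n1, n2)

rng = np.random.default_rng(0)
n1, n2, m = 200, 100, 3000
p = m / (n1 * n2)                      # matching Bernoulli rate p = m / (n1 n2)
mask_b = bernoulli_mask(n1, n2, p, rng)
mask_u = uniform_mask(n1, n2, m, rng)
```

Under the Bernoulli model the number of observed entries is random with mean $m$, whereas the uniform model fixes it at exactly $m$.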

Using the rank- $r$ SVD of $X^{\star}$ , we can lift $X^{\star}$ to

The symmetric decomposition of $Y^{\star}$ is not unique; our goal is to find a matrix in the set

since for any $\widetilde{Z}\in\mathcal{S}$ we have $X^{\star}=\widetilde{Z}_{U}{\widetilde{Z}_{V}}^{\top}$ . Let $\underline{\Omega}$ denote the corresponding observed entries of $Y^{\star}$ , and consider minimization of the squared error

Note that $Y^{\star}$ is not the unique minimizer of (6), nor is it the only possible positive semidefinite lifting of $X^{\star}$ . For example, let $P$ be an $r\times r$ nonsingular matrix, and form the matrices

Since $\underline{\Omega}$ does not contain any entry in the top-left or bottom-right block, $Y^{\prime}$ is also a minimizer of (6). Thus, the solution set of the lifted problem is much larger than the set $\mathcal{S}$ of actual interest. For the sake of simple analysis, we shall focus on exact recovery of $Y^{\star}$ only, and thus impose an additional regularizer to align the column spaces of $Z_{U}$ and $Z_{V}$ , as in . The regularized loss is
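The non-uniqueness of the lifting can be checked numerically. The sketch below assumes the lifted factor has the form $Z^{\star}=[U^{\star};V^{\star}]{\Sigma^{\star}}^{1/2}$ (the same form as the initialization described in Appendix B) and builds an alternative factor by multiplying the top block by $P$ and the bottom block by $P^{-\top}$. This is one natural way to realize the example; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 8, 5, 2

# Ground truth X* of exact rank r, together with its rank-r SVD.
X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, s, V = U[:, :r], s[:r], Vt[:r, :].T

# Assumed lifting: Z* = [U; V] Sigma^{1/2} and Y* = Z* Z*^T.
Z = np.vstack([U, V]) * np.sqrt(s)
Y = Z @ Z.T

# Alternative PSD lifting obtained from a nonsingular r x r matrix P.
P = np.array([[2.0, 1.0], [0.0, 0.5]])
Z_alt = np.vstack([Z[:n1] @ P, Z[n1:] @ np.linalg.inv(P).T])
Y_alt = Z_alt @ Z_alt.T

off_diag_equal = np.allclose(Y[:n1, n1:], Y_alt[:n1, n1:])  # both equal X*
top_left_equal = np.allclose(Y[:n1, :n1], Y_alt[:n1, :n1])  # differs in general
```

Both liftings agree on the off-diagonal blocks, which are the only blocks that contain observed entries, while their diagonal blocks differ because $PP^{\top}\neq I$.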

While this apparently introduces an extra tuning parameter, our analysis establishes linear convergence of the projected gradient descent algorithm when $\lambda=\frac{1}{2}$ , and thus one may treat $\lambda$ as a fixed number.

It is discussed in that one needs to ensure the iterates stay incoherent. Let $\mathcal{C}$ be the set of incoherent matrices

where we assume $\mu$ is known and $Z^{0}$ will be determined.

Our algorithm is simply gradient descent on $f(Z)$ , with projection onto $\mathcal{C}$ . Let $M=p^{-1}\mathcal{P}_{\Omega}(UV^{\top}-X^{\star})$ . Then the gradient of $f$ is given by

The projection $\mathcal{P}_{\mathcal{C}}$ onto the feasible set $\mathcal{C}$ has a closed-form solution, given by row-wise clipping:

Note that $X^{0}\equiv p^{-1}\mathcal{P}_{\Omega}(X^{\star})$ is an unbiased estimator of $X^{\star}$ under the Bernoulli model. To initialize, we thus construct $Z^{0}$ from the top rank- $r$ factors of $X^{0}$ . This leads to the following algorithm.
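The procedure described above (spectral initialization, gradient steps on the factors, row-wise clipping) can be sketched in NumPy as follows. This is a minimal sketch and not the authors' implementation. It assumes the objective $\frac{1}{2p}\|\mathcal{P}_{\Omega}(UV^{\top}-X^{\star})\|_{F}^{2}+\frac{\lambda}{4}\|U^{\top}U-V^{\top}V\|_{F}^{2}$, uses the empirical sampling rate in place of $p$, and takes the clipping radius $\sqrt{2\mu r/(n_{1}\land n_{2})}\,\|Z^{0}\|$ from the remarks that follow. The names `complete` and `clip_rows`, the default $\eta=0.4$, and the small dense test problem are all choices made for this illustration.

```python
import numpy as np

def clip_rows(Z, bound):
    """Row-wise clipping: rescale any row whose l2 norm exceeds `bound`."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.minimum(1.0, bound / np.maximum(norms, 1e-12))
    return Z * scale

def complete(X_obs, mask, r, mu, eta=0.4, lam=0.5, iters=1000):
    """Projected gradient descent on the stacked factor Z = [U; V].

    X_obs holds the observed entries (zeros elsewhere); mask is boolean.
    """
    n1, n2 = X_obs.shape
    p = mask.mean()                              # empirical sampling rate
    # Spectral initialization from the top-r factors of X0 = P_Omega(X) / p.
    A, s, Bt = np.linalg.svd(X_obs / p, full_matrices=False)
    U = A[:, :r] * np.sqrt(s[:r])
    V = Bt[:r, :].T * np.sqrt(s[:r])
    z0_sq = 2.0 * s[0]                           # ||Z0||^2 for Z0 = [U; V]
    bound = np.sqrt(2.0 * mu * r / min(n1, n2) * z0_sq)
    step = eta / z0_sq
    Z = clip_rows(np.vstack([U, V]), bound)
    for _ in range(iters):
        U, V = Z[:n1], Z[n1:]
        M = mask * (U @ V.T - X_obs) / p         # p^{-1} P_Omega(U V^T - X)
        G = U.T @ U - V.T @ V                    # column-space mismatch
        grad = np.vstack([M @ V + lam * U @ G, M.T @ U - lam * V @ G])
        Z = clip_rows(Z - step * grad, bound)
    return Z[:n1], Z[n1:]

# Small synthetic instance with singular values 10 and 5.
rng = np.random.default_rng(0)
n1, n2, r = 60, 40, 2
Qu, _ = np.linalg.qr(rng.standard_normal((n1, r)))
Qv, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X = Qu @ np.diag([10.0, 5.0]) @ Qv.T
mask = rng.random((n1, n2)) < 0.8
mu = max(n1 * (Qu**2).sum(1).max(), n2 * (Qv**2).sum(1).max()) / r
U_hat, V_hat = complete(X * mask, mask, r, mu)
rel_err = np.linalg.norm(U_hat @ V_hat.T - X) / np.linalg.norm(X)
```

Here $\mu$ is computed from the true singular vectors only to keep the example self-contained; in practice it would have to be supplied or estimated, or the clipping step omitted.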

Remarks. (i) The step size $\eta$ is normalized by $\left\|Z^{0}\right\|^{2}$. Our analysis will establish linear convergence when taking step sizes of the form $\eta/\sigma^{\star}_{1}$, where $\eta$ is a sufficiently small constant. We replace $\sigma^{\star}_{1}$ by $\left\|Z^{0}\right\|^{2}$ in the actual algorithm since $\sigma^{\star}_{1}$ is unknown in practice. (ii) The feasible set (9) depends on $\left\|Z^{0}\right\|$ as well. Under the above spectral initialization, our analysis shows that when $p\geq O({\mu\kappa^{2}r^{2}\log n}/{n_{1}\land n_{2}})$, the term $\sqrt{\frac{2\mu r}{n_{1}\land n_{2}}}\left\|Z^{0}\right\|$ is an upper bound on $\left\|Z^{\star}\right\|_{2,\infty}$ with high probability (see Corollary 1 below). This means $\mathcal{S}$ is a subset of $\mathcal{C}$. Note that this does not change the global optimality of $Z^{\star}$ and its equivalent elements, since $f(Z^{\star})=0$. In practice, we find that the iterates of our algorithm remain incoherent, so that one may drop the projection step. (iii) The column space regularizer (8) is needed in our analysis. We also found that when $\lambda=0$, our algorithm typically converges to another PSD lifted matrix of $X^{\star}$, with minor differences from $Y^{\star}$ in the top-left and bottom-right blocks.

In the following section we state and sketch a proof of our main convergence result for this algorithm.

Main Result: Convergence Analysis

Suppose that $X^{\star}$ is of rank $r$ , with condition number $\kappa=\sigma^{\star}_{1}/\sigma^{\star}_{r}$ , and $\mu$ -incoherent as defined in Definition 1. Suppose further that we observe $m$ entries of $X^{\star}$ chosen uniformly at random. Let $Y^{\star}=Z^{\star}{Z^{\star}}^{\top}$ be the lifted matrix as in (4) and write $n=\max(n_{1},n_{2})$ . Then there exist universal constants $c_{0},c_{1},c_{2},c_{3}$ such that if

then with probability at least $1-c_{1}n^{-c_{2}}$ the iterates of Algorithm 1 converge to $Z^{\star}$ geometrically, when using regularization parameter $\lambda=1/2$ , correctly specified input rank $r$ , and constant step size $\eta/\sigma^{\star}_{1}$ with $\eta\leq\displaystyle{c_{3}}/{\mu^{2}r^{2}\kappa}$ .

We shall analyze the Bernoulli sampling model, as justified in Section 2. Let us define the distance to $Z^{\star}$ in terms of the solution set $\mathcal{S}$ .

Define the distance between $Z$ and $Z^{\star}$ as

The next theorem establishes the global convergence of Algorithm 1, assuming that the input rank is correctly specified. The proof sketch is given in the next subsection.

There exist universal constants $c_{0},c_{1},c_{2}$ such that if $p\geq\dfrac{c_{0}\mu r^{2}\kappa^{2}\log n}{n_{1}\land n_{2}}$ , with probability at least $1-c_{1}n^{-c_{2}}$ , the initialization $Z^{1}\in\mathcal{C}$ satisfies

Moreover, there exist universal constants $c_{3},c_{4},c_{5},c_{6}$ such that if $p\geq\dfrac{c_{3}\max(\mu^{2}r^{2}\kappa^{2},\mu r\log n)}{n_{1}\land n_{2}}$, when using constant step size $\eta/\sigma^{\star}_{1}$ with $\eta\leq\dfrac{c_{4}}{\mu^{2}r^{2}\kappa}$ and initial value $Z^{1}\in\mathcal{C}$ obeying (13), the $k$th step of Algorithm 1 with $\lambda=1/2$ satisfies

with probability at least $1-c_{5}n^{-c_{6}}$ .

After each update, the distance of our iterates to $Z^{\star}$ is reduced by at least a factor of $1-O(1/\mu^{2}r^{2}\kappa^{2})$ .

Hence, the output $\widehat{Z}$ satisfies $d(\widehat{Z},Z^{\star})\leq\varepsilon$ after at most $\left\lceil 2\log^{-1}\left({1}/{(1-\frac{99}{256}\cdot\frac{\eta}{\kappa})}\right)\log\left({\sqrt{\sigma^{\star}_{r}}}/{4\varepsilon}\right)\right\rceil$ iterations.
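The iteration bound above is easy to evaluate numerically. The sketch below transcribes the expression and checks it under the reading that each step contracts the squared distance by a factor $1-\frac{99}{256}\cdot\frac{\eta}{\kappa}$, starting from a squared distance of at most $\sigma^{\star}_{r}/16$. The function name and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def iterations_needed(eta, kappa, sigma_r, eps):
    """ceil( 2 * log(sqrt(sigma_r) / (4 eps)) / log(1 / (1 - rho)) ),
    where rho = (99 / 256) * eta / kappa is the per-step contraction."""
    rho = 99.0 / 256.0 * eta / kappa
    return math.ceil(2.0 * math.log(math.sqrt(sigma_r) / (4.0 * eps))
                     / math.log(1.0 / (1.0 - rho)))

# Hypothetical values, not taken from the experiments.
eta, kappa, sigma_r, eps = 1e-3, 2.0, 1.0, 1e-6
k = iterations_needed(eta, kappa, sigma_r, eps)
rho = 99.0 / 256.0 * eta / kappa
# After k steps the squared distance (1 - rho)^k * sigma_r / 16 is at most eps^2.
reached = (1.0 - rho) ** k * sigma_r / 16.0 <= eps ** 2
```

Since $\log(1/(1-\rho))\approx\rho$ for small $\rho$, the count grows roughly like $(\kappa/\eta)\log(1/\varepsilon)$, which is the usual signature of linear convergence.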

Our proof idea is of the same nature as the analysis in . We show two appealing properties when sufficiently many entries are observed. First, our spectral initialization produces a starting point within the $O(\sqrt{\sigma^{\star}_{r}})$ neighborhood of the solution set.

There exist universal constants $c,c_{1},c_{2}$ , such that if $p\geq\dfrac{c\mu r^{2}\kappa^{2}\log n}{n_{1}\land n_{2}}$ then with probability at least $1-c_{1}n^{-c_{2}}$ ,

To demonstrate this, we exploit the concentration around the mean of $p^{-1}\mathcal{P}_{\Omega}(X^{\star})$ . See Appendix B for the proof. Using this lemma, we can immediately show that $Z^{\star}$ and all other elements of $\mathcal{S}$ are contained in the feasible set (9).

With probability at least $1-c_{1}n^{-c_{2}}$ , $\left\|Z^{\star}\right\|_{2,\infty}\leq\sqrt{\frac{2\mu r}{n_{1}\land n_{2}}}\left\|Z^{0}\right\|.$

The second crucial property is that $f(Z)$ is well-behaved within the $O(\sqrt{\sigma^{\star}_{r}})$ neighborhood, so that the iterates move closer to the optima in every iteration. The key step is to set up a local regularity condition similar to Nesterov’s conditions .

Let $\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F}$ denote the matrix closest to $Z$ in the solution set. We say that $f$ satisfies the regularity condition $RC(\varepsilon,\alpha,\beta)$ if there exist constants $\alpha$ , $\beta$ such that for any $Z\in\mathcal{C}$ satisfying $d(Z,Z^{\star})\leq\varepsilon$ , we have

Using this condition, one can show the iterates converge linearly to the optima if we start close enough to $Z^{\star}$ .

Consider the update $Z^{k+1}=\mathcal{P}_{\mathcal{C}}\left(Z^{k}-\dfrac{\eta}{\sigma^{\star}_{1}}\nabla f(Z^{k})\right)$. If $f$ satisfies $RC(\varepsilon,\alpha,\beta)$, $d(Z^{k},Z^{\star})\leq\varepsilon$ and $0<\eta\leq\min(\alpha/2,2/\beta)$, then

The following lemma illustrates the local regularity of $f(Z)$. Nesterov’s criterion is established upon strong convexity and strong smoothness of the objective. Here we show that analogous curvature and smoothness conditions hold for $f(Z)$ locally, within the $O(\sqrt{\sigma^{\star}_{r}})$ neighborhood, with high probability. Interestingly, we found that to show the local curvature condition holds, it suffices to set $\lambda=\frac{1}{2}$. The proof can be found in Appendix C, for which we have generalized some technical lemmas of .

Let the regularization constant be set to $\lambda=\frac{1}{2}$. There exist universal constants $c,c_{1},c_{2}$ such that if $p\geq\dfrac{c\max(\mu^{2}r^{2}\kappa^{2},\mu r\log n)}{n_{1}\land n_{2}}$, then $f$ satisfies $RC(\frac{1}{4}\sqrt{\sigma^{\star}_{r}},512/99,13196\mu^{2}r^{2}\kappa)$ with probability at least $1-c_{1}n^{-c_{2}}$.

Related Work

Matrix completion is one instance of the general low rank linear inverse problem

where $\mathcal{A}$ is an affine transformation and $b=\mathcal{A}(X^{\star})$ is the measurement of the ground truth $X^{\star}$. Considerable progress has been made towards algorithms for recovering $X^{\star}$, including both convex and nonconvex approaches. One of the most popular methods is nuclear norm minimization, a convenient convex relaxation of rank minimization. It was first proposed in , and analyzed under a certain restricted isometry property (RIP). Subsequent work clarified the conditions for reconstruction, and studied recovery guarantees for both exact and approximately low rank matrices, with or without noise. One significant advantage of this approach is its near-optimal sample complexity. Under the same incoherence assumption as ours, Chen establishes the currently best-known lower bound of $O(\mu rn\log^{2}n)$ samples. Using a closely related notion of incoherence, Negahban and Wainwright show that if $X^{\star}$ is “$\alpha$-nonspiky” with $\frac{\left\|X^{\star}\right\|_{\infty}}{\left\|X^{\star}\right\|_{F}}\leq\frac{\alpha}{\sqrt{n_{1}n_{2}}}$, then $O(\alpha^{2}rn\log n)$ samples are sufficient for exact recovery. However, convexity and low sample complexity aside, in practice the power of nuclear norm relaxation is limited by its high computational cost. The popular algorithms for nuclear norm minimization are proximal methods that perform iterative singular value thresholding. Such algorithms do not scale to large instances because the per-iteration SVD is expensive.

In the same year, Jain et al. suggested minimizing the squared residual $\left\|\mathcal{A}(X)-b\right\|^{2}$ under a rank constraint $\operatorname{\text{rank}}(X)\leq r$. While this constraint is nonconvex, projection onto the feasible set can be computed using a low rank SVD. Under a certain RIP assumption on $\mathcal{A}$, Jain et al. establish the global convergence of projected gradient descent for (14). This algorithm is named Singular Value Projection (SVP). Yet in the setting of completion, only experimental support for the effectiveness of SVP is provided. More importantly, SVP also suffers from an expensive per-iteration SVD for large scale problems.

Another theoretical disadvantage of the resampling scheme is that the sample complexity depends on the desired accuracy $\varepsilon$ , as established by . As the accuracy goes to zero, the sample complexity increases. In contrast, our algorithm doesn’t require resampling, and the sample complexity is independent of $\varepsilon$ .

In 2014, Candès et al. proposed Wirtinger flow for phase retrieval. Wirtinger flow is a fast first-order algorithm that minimizes a fourth order (nonconvex) objective, geometrically converging to the global optimum. While previous work lifts the phase retrieval problem into an SDP where the solution is rank one, this work bridges SDP and first-order algorithms via the Burer-Monteiro technique. It has inspired further research on related topics; last year, the authors of considered factorizations for (14), assuming $X^{\star}$ is semidefinite, and proved global optimality of first-order algorithms under appropriate initializations. Tu et al. have extended this algorithm to handle rectangular matrices via asymmetric factorization, and have shown exact recovery of $X^{\star}$, assuming $\mathcal{A}$ satisfies a certain RIP. They use lifting implicitly, factorizing $X=Z_{U}Z_{V}^{\top}$ and applying gradient updates on both factors $Z_{U}$ and $Z_{V}$ simultaneously, with the nonconvex objective function

Their proof strategy also shows convergence of $Z$ in the lifted space. For the specific case of matrix completion, Chen and Wainwright obtained guarantees when $X^{\star}$ is semidefinite. Our work generalizes the results obtained in , extending the recent literature on first-order algorithms for factorized models.

After completing this work we learned of independent research of Sun and Luo, who also analyzed a gradient algorithm for rectangular matrix completion. Their formulation is similar to ours, with additional Frobenius norm constraints on the factors. The authors established a sample complexity of $O(r^{7}\kappa^{6})$ observations; in comparison our bound scales as $O(r^{2}\kappa^{2})$. The authors also analyzed block coordinate descent type alternating minimization, which cyclically updates the rows of $U$ and then the rows of $V$, showing exact recovery of this algorithm without resampling. Recent independent work of Yi et al. analyzes a gradient scheme for Robust PCA. Under the setting of partial observation without corruption, this is the standard matrix completion problem. In other related work, also study nonconvex optimization methods for matrix completion, using algorithms that still require low rank SVD in each iteration.

Experiments

where $\lambda=0$ forces the minimizer to fit the observed values exactly. We use ADMM to solve (16). It is based on the algorithm for the matrix approach in , and can neatly handle the case $\lambda=0$. We emphasize that there is no computational difference between the cases $\lambda=0$ and $\lambda>0$. All methods are implemented in MATLAB. We use the toolbox Manopt for trustRegion and the implementation of OptSpace from the authors. For AltMin, we use the same sample sets in every iteration. The experiments were run on a Linux machine with a 3.4GHz Intel Core i7 processor and 8 GB memory.

Runtime Comparison

We randomly generated a true matrix $X^{\star}$ of size $4000\times 2000$ and rank 3. It is constructed from the rank-$3$ SVD of a random $4000\times 2000$ matrix with i.i.d. normal entries. We sampled $m=199057$ entries of $X^{\star}$ uniformly at random, where $m$ is roughly equal to $2nr\log n$ with $n=4000$ and $r=3$. For simplicity, we feed SVP, OptSpace and GD with the true rank. For all these methods, we use the randomized algorithm of Halko et al. to compute the low rank SVD, which is approximately 15 times faster than the MATLAB built-in SVD on instances of this size. We report relative error measured in the Frobenius norm, defined as $\|\widehat{X}-X^{\star}\|_{F}/\left\|X^{\star}\right\|_{F}$. For nuclear, we set $\lambda=0$ to enforce exact fitting. The convergence speed of ADMM depends mildly on the choice of penalty parameter. We tested the $5$ values $0.1,0.2,0.5,1,1.5$ and selected $0.2$, which leads to the fastest convergence. Similarly, for SVP, we would like to choose the largest step size for which the algorithm converges. We evaluated $15,20,30,35,40$ and selected $30$. The step size for GD is chosen in the same way: five values $20,50,70,75,80$ were tested for $\eta$ and we picked $70$. For OptSpace, we compared the fixed step sizes $0.5,0.1,0.05,0.01,0.005$ with line search, and found the algorithm converged fastest under line search. Figure 1a shows the results. GD is slightly slower than trustRegion and faster than the other approaches.
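The sample size and the error metric in this setup can be reproduced with a few lines. This is a sketch for checking the arithmetic, assuming the natural logarithm in $2nr\log n$; the helper name `relative_error` is illustrative.

```python
import math
import numpy as np

n, r = 4000, 3
m_target = 2 * n * r * math.log(n)        # natural logarithm assumed
sampling_rate = 199057 / (4000 * 2000)    # fraction of entries observed

def relative_error(X_hat, X_true):
    """Relative error in the Frobenius norm."""
    return np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true)
```

With these numbers, `m_target` rounds to the reported $m=199057$, so roughly $2.5\%$ of the $8\times 10^{6}$ entries are observed.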

To further illustrate how runtime scales as the dimension increases, we run larger instances of size $10000\times 5000$ and $20000\times 5000$, where the true rank is $40$. The parameters are selected in the same manner, and we terminate the computation once the relative error is below $10^{-9}$. We report the results of AltMin, GD, SVP and trustRegion in Figure 2a; nuclear and OptSpace do not scale well to such sizes, so we did not include them. The runtime of AltMin scales the slowest, while the runtimes of GD and trustRegion increase more slowly than that of SVP.

Sample Complexity

We evaluate the number of observations required by GD for exact recovery. For simplicity, we consider square but asymmetric $X^{\star}$. We conducted experiments in 4 cases, where the randomly generated $X^{\star}$ is of size $500\times 500$ or $1000\times 1000$, and of rank $10$ or $20$. In each case, we compute the solutions of GD given $m$ random observations, and a solution with relative error below $10^{-6}$ is considered successful. We run 20 trials and compute the empirical probability of successful recovery. The results are shown in Figure 2b. For all four cases, the phase transitions occur around $m\approx 3.5nr$. This suggests that the actual sample complexity of GD may scale linearly with both the dimension $n$ and the rank $r$.
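To give a sense of scale for the reported transition $m\approx 3.5nr$, the sketch below tabulates the implied number of observations and the fraction of entries observed in the four settings. The constant $3.5$ is taken from the text; the function name is illustrative.

```python
# (n, r) pairs from the four experimental settings.
cases = [(500, 10), (500, 20), (1000, 10), (1000, 20)]

def transition_point(n, r, c=3.5):
    """Approximate number of observations m = c * n * r at the phase transition."""
    return c * n * r

# Maps (n, r) to (m, fraction of the n * n entries that are observed).
summary = {
    (n, r): (transition_point(n, r), transition_point(n, r) / (n * n))
    for n, r in cases
}
```

For example, with $n=500$ and $r=10$ the transition sits near $m=17500$, about $7\%$ of the $250000$ entries.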

Conclusion

We propose a lifting procedure together with Burer-Monteiro factorization and a first-order algorithm to carry out rectangular matrix completion. While optimizing a nonconvex objective, we establish linear convergence of our method to the global optimum with $O(\mu r^{2}\kappa^{2}n\max(\mu,\log n))$ random observations. We conjecture that $O(nr)$ observations are sufficient for exact recovery, and that the column space regularizer can be dropped. We provide empirical evidence showing this simple algorithm is fast and scalable, suggesting that lifting techniques may be promising for much more general classes of problems.

Acknowledgements

The authors thank Rina Foygel Barber, Yudong Chen and Ruoyu Sun for helpful comments. Research supported in part by ONR grant N00014-15-1-2379 and NSF grant DMS-1513594.

Appendix A Technical Lemmas

Another way of writing the objective function is

where $l$ is an index of $\underline{\Omega}$, and $A_{l}$ is a matrix with $1$ at the corresponding observed entry and $0$ elsewhere. Letting $H=Z-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu$, the gradient can be written as

We will use the following facts throughout the proof:

Inequality (17) is a direct result of Definition 1. To see (18), note that $\left\|H\right\|_{2,\infty}\leq\left\|Z\right\|_{2,\infty}+\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|_{2,\infty}\leq\sqrt{\frac{2\mu r}{n_{1}\land n_{2}}\sigma_{1}}+\sqrt{\frac{\mu r}{n_{1}\land n_{2}}\sigma^{\star}_{1}}$ , and $|\sigma_{1}-\sigma^{\star}_{1}|\leq\frac{1}{16}\sigma^{\star}_{1}$ by the discussion of initialization in Appendix B. For (20), it holds that

where $A\Lambda B^{\top}$ is the SVD of ${Z^{\star}}^{\top}Z$ . Clearly, $Z^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu$ is positive semidefinite, and $H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=Z^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu-{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=B\Lambda B^{\top}-{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu$ is symmetric.

Next, we list several technical lemmas that are utilized later. We will use $c$ to denote a numerical constant, whose value may vary from line to line.

For any $Z$ of the form $Z=\begin{bmatrix}Z_{U}\\ Z_{V}\end{bmatrix}=\begin{bmatrix}U\Sigma^{\frac{1}{2}}R\\ V\Sigma^{\frac{1}{2}}R\end{bmatrix}$ , where $U,V,R$ are unitary matrices and $\Sigma\succeq 0$ is a diagonal matrix, we have

where $X^{\star}=U^{\star}\Sigma^{\star}{V^{\star}}^{\top}$ . We have

Plugging (24) back into the expansion above, we obtain the lemma. ∎

Recall that $n=\max(n_{1},n_{2})$ . We will exploit the following two known concentration results.

Let $\mathcal{P}_{T}$ be the Euclidean projection onto $T$. There is a numerical constant $c$ such that for any $\delta\in(0,1]$, if $p\geq\dfrac{c}{\delta^{2}}\dfrac{\mu r\log n}{n_{1}\land n_{2}}$, then with probability at least $1-3n^{-3}$, we have

Lemma 7 upper bounds the spectral norm of the adjacency matrix of a random Erdős-Rényi graph. It is a variant of Lemma 7.1 of Keshavan et al., which uses known results of Feige and Ofek.

We refer readers to for a complete proof, in particular noticing that one can choose $p$ large enough so that the constant factor in the first term in (26) is only $1+\delta$ .

Lemmas 8, 9 and 10 are direct generalizations of Lemmas 4 and 5 of .

There exists a constant $c$ such that, for any $\delta\in(0,1]$ , if $p\geq\frac{c}{\delta^{2}}\max\left(\frac{\log(n_{1}+n_{2})}{n_{1}+n_{2}},\frac{\mu^{2}r^{2}\kappa^{2}}{n_{1}\land n_{2}}\right)$ , then with probability at least $1-\frac{1}{2}(n_{1}+n_{2})^{-4}$ , uniformly for all $H$ such that $\left\|H\right\|_{2,\infty}\leq 3\sqrt{\frac{\mu r}{n_{1}\land n_{2}}\sigma^{\star}_{1}}$ , we have

where $(a)$ follows from Lemma 7 and $(b)$ follows from $\left\|H\right\|_{2,\infty}\leq 3\sqrt{\frac{\mu r}{n_{1}\land n_{2}}\sigma^{\star}_{1}}$ .

Let us further assume $p\geq\frac{162c^{2}_{2}\mu^{2}r^{2}\kappa^{2}\gamma}{\delta^{2}(n_{1}\land n_{2})}$, where $\gamma=n/(n_{1}\land n_{2})$ is a fixed constant. Then we can bound

The final threshold we obtain is thus $p\geq\frac{c}{\delta^{2}}\max\left(\frac{\log(n_{1}+n_{2})}{n_{1}+n_{2}},\frac{\mu^{2}r^{2}\kappa^{2}}{n_{1}\land n_{2}}\right)$ for some constant $c$ . ∎

There exists a constant $c$ such that if $p\geq\dfrac{c\log n}{n_{1}\land n_{2}}$, then with probability at least $1-2n_{1}^{-4}-2n_{2}^{-4}$, uniformly for all matrices $A$, $B$ such that $AB^{\top}$ is of size $(n_{1}+n_{2})\times(n_{1}+n_{2})$,

Let $\Omega_{Y_{i}}=\left\{j:(i,j)\in\underline{\Omega}\right\}$ denote the set of entries sampled in the $i$th row of $AB^{\top}$. Note that because of the structure of $\underline{\Omega}$, at most $n_{2}$ entries are sampled in each of the first $n_{1}$ rows, and at most $n_{1}$ entries are sampled in each of the remaining $n_{2}$ rows.

Using a binomial tail bound, if $p\geq\dfrac{c\log n_{2}}{n_{2}}$ for sufficiently large $c$, the event $\max_{i\in[n_{1}]}|\Omega_{Y_{i}}|\leq 2pn_{2}$ holds with probability at least $1-n_{2}^{-4}$. The same argument applies to the remaining $n_{2}$ rows. Hence, if $p\geq\dfrac{c\log n}{n_{1}\land n_{2}}$ for some constant $c$, with probability at least $1-n_{1}^{-4}-n_{2}^{-4}$, we have $\max_{i\in[n_{1}+n_{2}]}|\Omega_{Y_{i}}|\leq 2pn$.

Conditioning on this event, we then have for all $A,B$ of proper size,

Similarly we can prove with probability at least $1-n_{1}^{-4}-n_{2}^{-4}$ ,

The following lemma establishes restricted strong convexity and smoothness of the observation operator for matrices in $T$ .

Let $T$ be the subspace defined in (25). There exists a universal constant $c$ such that, if $p\geq\dfrac{c}{\delta^{2}}\dfrac{\mu r\log n}{n_{1}\land n_{2}}$ , with probability at least $1-3n^{-3}$ , uniformly for all $A\in T$ , we have

Consequently, uniformly for all $A,B\in T$ ,

Let $A$ be a matrix in $T$ . Rewriting $\left\|\mathcal{P}_{\Omega}(A)\right\|^{2}_{F}=\langle\mathcal{P}_{\Omega}\mathcal{P}_{T}(A),\mathcal{P}_{\Omega}\mathcal{P}_{T}(A)\rangle=\langle A,\mathcal{P}_{T}\mathcal{P}_{\Omega}\mathcal{P}_{T}(A)\rangle$ , and using the Cauchy-Schwarz inequality and (32) we can bound

where $(a)$ follows from Lemma 6. Combining (33) and (34) proves (30). To show (31), let $A^{\prime}=\frac{A}{\left\|A\right\|_{F}}$ and $B^{\prime}=\frac{B}{\left\|B\right\|_{F}}$ . Both $A^{\prime}+B^{\prime}$ and $A^{\prime}-B^{\prime}$ are in $T$ . We have

where $(b)$ follows from (30). Thus, we have

Last, we want to show that the projection onto the feasible set $\mathcal{C}$ is a contraction.

If $\left\|x\right\|_{2}\leq\theta$, then $\mathcal{P}_{\left\|\cdot\right\|_{2}\leq\theta}(x)=x$. Otherwise $\mathcal{P}_{\left\|\cdot\right\|_{2}\leq\theta}(x)=\theta\bar{x}$, where $\bar{x}=\frac{x}{\left\|x\right\|_{2}}$. Writing $y=(y^{\top}\bar{x})\bar{x}+\mathcal{P}^{\perp}_{x}(y)$, we have

If $y^{\top}\bar{x}\leq 0$ , then (39) holds because $\left\|x\right\|>\theta$ . If $y^{\top}\bar{x}>0$ , (39) still holds since $\left\|x\right\|>\theta\geq\left\|y\right\|\geq y^{\top}\bar{x}$ . ∎
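The property used here can be sanity-checked numerically. The sketch below assumes the lemma states that, for any $y$ with $\|y\|_{2}\leq\theta$, clipping $x$ to the ball of radius $\theta$ does not increase its distance to $y$. The function name `project_ball` and the random test points are illustrative.

```python
import numpy as np

def project_ball(x, theta):
    """Euclidean projection of x onto the l2 ball of radius theta."""
    norm = np.linalg.norm(x)
    return x if norm <= theta else theta * x / norm

rng = np.random.default_rng(0)
theta = 1.0
worst_gap = np.inf
for _ in range(1000):
    x = 3.0 * rng.standard_normal(5)                     # typically outside the ball
    y = rng.standard_normal(5)
    y = theta * rng.random() * y / np.linalg.norm(y)     # a point inside the ball
    gap = np.linalg.norm(x - y) - np.linalg.norm(project_ball(x, theta) - y)
    worst_gap = min(worst_gap, gap)
```

Because the feasible set $\mathcal{C}$ constrains each row separately, applying this check row by row covers the row-wise clipping used in the algorithm.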

Appendix B Initialization

Let $\delta$ denote the upper bound of $\left\|p^{-1}\mathcal{P}_{\Omega}(X^{\star})-X^{\star}\right\|$ as in Lemma 5, and let $\sigma_{1}\geq\ldots\geq\sigma_{n}$ denote the singular values of $p^{-1}\mathcal{P}_{\Omega}(X^{\star})$ . By Weyl’s theorem, we have

Note this implies $\sigma_{r+1}\leq\delta$ , as $\sigma^{\star}_{r+1}=0$ .
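Weyl's bound and its consequence for $\sigma_{r+1}$ can be illustrated on a small example. In the sketch below, `E` is a generic dense perturbation standing in for $p^{-1}\mathcal{P}_{\Omega}(X^{\star})-X^{\star}$, and `delta` is its spectral norm; the sizes and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 30, 20, 3
X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))   # exact rank r
E = 0.1 * rng.standard_normal((n1, n2))                           # perturbation

s_true = np.linalg.svd(X, compute_uv=False)
s_pert = np.linalg.svd(X + E, compute_uv=False)
delta = np.linalg.norm(E, 2)                     # spectral norm of the perturbation

max_shift = np.max(np.abs(s_pert - s_true))      # Weyl: no singular value moves more than delta
tail = s_pert[r]                                 # sigma_{r+1} of the perturbed matrix
```

Since the true matrix has $\sigma^{\star}_{r+1}=0$, the $(r+1)$th singular value of the perturbed matrix is bounded by `delta`, which is what controls the error of the rank-$r$ truncation used in the initialization.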

By definition, $Z^{0}=[U;V]\Sigma^{\frac{1}{2}}$ , where $U\Sigma V^{\top}$ is the rank- $r$ SVD of $p^{-1}\mathcal{P}_{\Omega}(X^{\star})$ . According to Lemma 4, one has

where $(a)$ holds because $\operatorname{\text{rank}}(U\Sigma V^{\top}-X^{\star})\leq 2r$ , $(b)$ holds since $\left\|U\Sigma V^{\top}-p^{-1}\mathcal{P}_{\Omega}(X^{\star})\right\|=\sigma_{r+1}\leq\delta$ .

Let $H=Z^{0}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}$ . We want to bound $d(Z^{0},Z^{\star})^{2}=\left\|H\right\|^{2}_{F}$ . According to (20), $H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}$ is symmetric and ${Z^{0}}^{\top}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}}$ is positive semidefinite. Hence we can write

where in the second line we used that $H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}$ is symmetric. Besides, as ${Z^{0}}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}$ is positive semidefinite, $(4-\sqrt{2})\operatorname{\text{tr}}((H^{\top}H)({Z^{0}}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}))$ is nonnegative. Therefore,

where in $(a)$ we replaced $\delta$ using Lemma 5, and $(b)$ holds since by our incoherence assumption (3) we have

Note that for $(c)$ we used $\left\|AB^{\top}\right\|_{2,\infty}\leq\left\|A\right\|_{2,\infty}\left\|B\right\|$ .

Hence, to obtain $d(Z^{0},Z^{\star})^{2}\leq\frac{1}{16}\sigma^{\star}_{r}$ , it suffices to have

Since $\mathcal{P}_{\mathcal{C}}$ is just row-wise clipping, by Lemma 11 we have

B.2 Proof of Corollary 1

By the incoherence assumption, we have $\left\|Z^{\star}\right\|_{2,\infty}\leq\sqrt{\frac{\mu r}{n_{1}\land n_{2}}\sigma^{\star}_{1}}$ , see (17). It suffices to show $2\sigma_{1}\geq\sigma^{\star}_{1}$ . From the above discussion, we can see that

By Weyl’s theorem, we have $|\sigma_{1}-\sigma^{\star}_{1}|\leq\frac{1}{16}\sigma^{\star}_{r}$. As a result, $2\sigma_{1}\geq\sigma^{\star}_{1}$.

Appendix C Regularity Condition

Analogous to the restricted strong convexity (RSC) and restricted strong smoothness (RSS), we show that with high probability our objective function $f$ satisfies the local curvature and local smoothness conditions defined below.

There exist constants $c_{1},c_{2}$ such that for any $Z\in\mathcal{C}$ satisfying $d(Z,Z^{\star})\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}}$,

There exist constants $c_{3},c_{4}$ such that for any $Z\in\mathcal{C}$ satisfying $d(Z,Z^{\star})\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}}$ ,

where we used equation (19) for $(i)$ , the Cauchy-Schwarz inequality for $(ii)$ , inequality $(a-b)^{2}\geq\frac{a^{2}}{2}-b^{2}$ for $(iii)$ . Finally, in the last line we used $\sum_{l=1}^{2m}\langle A_{l},M\rangle^{2}=\left\|\mathcal{P}_{\underline{\Omega}}(M)\right\|^{2}_{F}$ .

We first lower bound $\frac{1}{2}p^{-1}\left\|\mathcal{P}_{\underline{\Omega}}(H{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}+\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top})\right\|^{2}_{F}$ . By the symmetry of $\underline{\Omega}$ , it is equal to $p^{-1}\left\|\mathcal{P}_{\Omega}(H_{U}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}_{V}+\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu_{U}H^{\top}_{V})\right\|^{2}_{F}$ , which expands to

As both $H_{U}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}_{V}$ and $\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu_{U}H^{\top}_{V}$ belong to $T$, we use Lemma 10 to lower bound the three terms above, respectively. This gives us

where we used $\left\|H_{U}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}_{V}\right\|^{2}_{F}\geq\sigma^{\star}_{r}\left\|H_{U}\right\|^{2}_{F}$ and $\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu_{U}H^{\top}_{V}\right\|^{2}_{F}\geq\sigma^{\star}_{r}\left\|H_{V}\right\|^{2}_{F}$ for $(iv)$ .

Next, we lower bound $2\langle H_{U}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}_{V},\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu_{U}H^{\top}_{V}\rangle+\lambda\operatorname{\text{tr}}(H^{\top}\Gamma)$ together. Rewriting

and plugging in $\Gamma=DZZ^{\top}DZ$ , we then have

Equality $(a)$ holds because ${\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}D\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=0$ . We plug in (53) in $(b)$ . For $(c)$ , we use ${\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}D\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=0$ and that $H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu$ is symmetric. Finally, we take $\lambda=\frac{1}{2}$ and use Lemma 8 to upper bound $p^{-1}\left\|\mathcal{P}_{\underline{\Omega}}(HH^{\top})\right\|^{2}_{F}$ :

For simplicity, we take $\delta=\frac{1}{16}$ . We also have $\left\|H\right\|^{2}_{F}\leq\frac{1}{16}\sigma^{\star}_{r}$ . This leads to

Note that this lower bound holds with high probability uniformly for all $Z\in\mathcal{C}$ such that $d(Z,Z^{\star})\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}}$, since Lemmas 8 and 10 hold uniformly.
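The symmetry of $H^{\top}\overline{Z}$ used in step $(c)$ is a consequence of how $\overline{Z}$ is aligned to $Z$. The following is a minimal sketch, assuming the distance is defined as $d(Z,Z^{\star})=\min_{R}\|Z-Z^{\star}R\|_{F}$ over orthogonal $R$ (the definition is not restated in this appendix) and that $\overline{Z}=Z^{\star}R$ for the minimizing $R$. The helper name `align` and the test data are hypothetical.

```python
import numpy as np

def align(Z, Z_star):
    """Return Z_star @ R, where R is the orthogonal matrix minimizing
    ||Z - Z_star @ R||_F (the orthogonal Procrustes solution)."""
    A, _, Bt = np.linalg.svd(Z_star.T @ Z)
    return Z_star @ (A @ Bt)

rng = np.random.default_rng(1)
n, r = 40, 3  # hypothetical sizes
Z_star = rng.standard_normal((n, r))
Z = Z_star + 0.1 * rng.standard_normal((n, r))

Zbar = align(Z, Z_star)
H = Z - Zbar
dist = np.linalg.norm(H, "fro")  # d(Z, Z*) under the assumed definition

# At the optimal alignment, H^T Zbar is symmetric.
M = H.T @ Zbar
asym = np.linalg.norm(M - M.T)

# Any other rotation gives a distance that is at least as large.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
other = np.linalg.norm(Z - Z_star @ Q, "fro")

print(dist, asym, other)
```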

When the ground truth $X^{\star}$ is positive semidefinite, neither the lifting nor the regularizer is needed. Using Lemma 10, we can lower bound $\frac{1}{2}p^{-1}\left\|\mathcal{P}_{\underline{\Omega}}(H{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}+\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top})\right\|^{2}_{F}\gtrsim(1-\delta)\sigma^{\star}_{r}\left\|H\right\|^{2}_{F}$ directly. Choosing the constants appropriately, we obtain the standard restricted strong convexity condition:

C.2 Proof of the Local Smoothness Condition

To upper bound $\left\|\nabla f(Z)\right\|^{2}_{F}=\max_{\left\|W\right\|_{F}=1}|\langle\nabla f(Z),W\rangle|^{2}$, it suffices to show that $|\langle\nabla f(Z),W\rangle|^{2}$ is upper bounded for any $n\times r$ matrix $W$ of unit Frobenius norm. We first write

where we used (19) for $(i)$ . Since $(a+b+c+d+e)^{2}\leq 5(a^{2}+b^{2}+c^{2}+d^{2}+e^{2})$ , we have

where we used the Cauchy-Schwarz inequality for $(ii)$, and $(a+b)^{2}\leq 2(a^{2}+b^{2})$ for $(iii)$. We then use Lemma 9 to upper bound terms 1, 2, 4, 5, 6, and 7, and Lemma 8 to upper bound term 3. Since $\left\|W\right\|_{F}=1$, one has

where in the last line we plugged in $\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|_{2,\infty}\leq\sqrt{\dfrac{\mu r}{n}\sigma^{\star}_{1}}$ and $\left\|H\right\|_{2,\infty}\leq 3\sqrt{\dfrac{\mu r}{n}\sigma^{\star}_{1}}$ , i.e. (17) and (18).

Inequality $(a)$ holds because $\left\|AB\right\|_{F}\leq\left\|A\right\|\left\|B\right\|_{F}$ and $\left\|D\right\|=1$. To get $(b)$, for the first term in the third line we expand $ZZ^{\top}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}$, and for the second term we expand $Z=\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu+H$ and use ${\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}D\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=0$. For $(c)$, we use $\left\|AB\right\|_{F}\leq\left\|A\right\|\left\|B\right\|_{F}\leq\left\|A\right\|_{F}\left\|B\right\|_{F}$. Finally, $(d)$ holds because $\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}=2\sigma^{\star}_{1}$.
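The structural facts invoked in steps $(a)$ through $(d)$ can be verified on a small synthetic instance. This is a minimal sketch, assuming $D=\operatorname{diag}(I_{n_1},-I_{n_2})$ and $\overline{Z}=[U^{\star}(\Sigma^{\star})^{1/2};\,V^{\star}(\Sigma^{\star})^{1/2}]$ (neither is restated in this appendix); all sizes and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r = 25, 15, 3  # hypothetical sizes

U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
sigma = np.array([4.0, 1.5, 0.7])  # sigma_1 >= ... >= sigma_r
root = np.diag(np.sqrt(sigma))

# Assumed stacked balanced factor and sign matrix D.
Zbar = np.vstack([U @ root, V @ root])
D = np.diag(np.concatenate([np.ones(n1), -np.ones(n2)]))

balance = np.linalg.norm(Zbar.T @ D @ Zbar)   # Zbar^T D Zbar = 0
spec_D = np.linalg.norm(D, 2)                 # ||D|| = 1
spec_Zbar_sq = np.linalg.norm(Zbar, 2) ** 2   # ||Zbar||^2 = 2 sigma_1

# Norm inequalities: ||AB||_F <= ||A|| ||B||_F <= ||A||_F ||B||_F.
A = rng.standard_normal((n1 + n2, n1 + n2))
B = rng.standard_normal((n1 + n2, r))
fro_AB = np.linalg.norm(A @ B, "fro")
mixed = np.linalg.norm(A, 2) * np.linalg.norm(B, "fro")
fro_fro = np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")

print(balance, spec_D, spec_Zbar_sq, fro_AB <= mixed <= fro_fro)
```

Under these assumptions, $\overline{Z}^{\top}D\overline{Z}=\Sigma^{\star}-\Sigma^{\star}=0$ and $\overline{Z}^{\top}\overline{Z}=2\Sigma^{\star}$, which is where the factor of 2 in $(d)$ comes from.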

Finally, we combine (59) and (60). As before, taking $\lambda=\frac{1}{2}$, $\delta=\frac{1}{16}$, and $\left\|H\right\|^{2}_{F}\leq\frac{1}{16}\sigma^{\star}_{r}$, we obtain

where for $(a)$ we used $\left\|Z\right\|\leq\left\|H\right\|+\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}}+\sqrt{2\sigma^{\star}_{1}}\leq\frac{7}{4}\sqrt{\sigma^{\star}_{1}}$, and for $(b)$ we used $\mu,r\geq 1$.
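The constant $\frac{7}{4}$ in step $(a)$ can be checked by hand. The chain below is a worked version that uses only $\left\|H\right\|\leq\left\|H\right\|_{F}\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}}$, $\left\|\overline{Z}\right\|^{2}=2\sigma^{\star}_{1}$, and $\sigma^{\star}_{r}\leq\sigma^{\star}_{1}$. Here $\overline{Z}$ is written with a plain overline as shorthand.

```latex
\begin{align*}
\|Z\| &\le \|H\| + \|\overline{Z}\|
       \le \|H\|_F + \sqrt{2\sigma^{\star}_{1}} \\
      &\le \tfrac{1}{4}\sqrt{\sigma^{\star}_{r}} + \sqrt{2}\,\sqrt{\sigma^{\star}_{1}}
       \le \bigl(\tfrac{1}{4} + \sqrt{2}\bigr)\sqrt{\sigma^{\star}_{1}} \\
      &< 1.665\,\sqrt{\sigma^{\star}_{1}}
       < \tfrac{7}{4}\sqrt{\sigma^{\star}_{1}}.
\end{align*}
```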

As before, this condition holds uniformly for all $Z$ such that $d(Z,Z^{\star})\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}}$ and satisfying the incoherence condition.

When $X^{\star}$ is positive semidefinite, the regularizer is not needed, and the standard restricted strong smoothness condition follows:

C.3 Proof of Lemma 3

Rearranging the terms in the smoothness condition (61), we can further bound

Combining equations (56) and (62), it follows that

Finally, by upper bounding the probability that any of Lemmas 8, 9, or 10 fails, and collecting the lower bounds on the sampling probability $p$ that these lemmas require, we conclude that once

the regularity condition (63) holds with probability at least $1-c_{1}n^{-c_{2}}$, where $c,c_{1},c_{2}$ are constants.

Appendix D Linear Convergence

Let $H^{k}=Z^{k}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{k}$ . Our iterate is $Z^{k+1}=\mathcal{P}_{\mathcal{C}}(Z^{k}-\eta\nabla f(Z^{k}))$ . Since $\mathcal{P}_{\mathcal{C}}$ is just row-wise clipping, by Lemma 11 we have

where we use the definition of $RC(\varepsilon,\alpha,\beta)$ for $(a)$ and $0<\eta\leq\min\left\{\alpha/2,2/\beta\right\}$ for $(b)$ . Therefore,