Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent

Qinqing Zheng, John Lafferty

Introduction

A growing body of recent research is shedding new light on the role of nonconvex optimization for tackling large scale problems in machine learning, signal processing, and convex programming. This work is developing techniques that help to explain the surprising effectiveness of relatively simple first-order algorithms for certain nonconvex optimizations.

When applied to problems that can be formulated as semidefinite programs, these techniques can often be viewed as part of a framework proposed by Burer and Monteiro . The Burer-Monteiro technique is based on factoring the semidefinite variable, and applying classical optimization techniques to the resulting nonconvex objective over the factor. While worst-case complexity considerations imply that such an approach cannot succeed in general, a series of recent papers has shown the strategy to be remarkably effective for a number of problems of practical interest, with analytical convergence guarantees and strong empirical performance.

In order for this problem to be well-posed, it is important to understand when XX^{\star} is identifiable and, in particular, the unique minimizer of (1). Moreover, because the problem is in general NP-hard, it is essential to identify tractable families of instances, together with efficient algorithms having global convergence guarantees.

In the following section we give a full description of our approach. Our theoretical results are presented in Section 3, with detailed proofs contained in the appendix. Our analysis subsumes the case where XX^{\star} is positive semidefinite. In Section 4 we briefly review related work. The experimental results are presented in Section 5, and we conclude with a brief discussion of future work in Section 6.

Semidefinite Lifting, Factorization, and Gradient Descent

In this paper, we focus on completing an incoherent or “non-spiky” matrix XX^{\star}. With UΣVU^{\star}\Sigma^{\star}V^{\star} denoting the rank-rr SVD of XX^{\star}, we assume XX^{\star} is μ\mu-incoherent, as defined below.

The matrix XX^{\star} is μ\mu-incoherent with respect to the canonical basis if its singular vectors satisfy

where μ\mu is a constant.Note that μ1\mu\geq 1, since r=UF2=i[n1]U(i)22μrr=\|U^{\star}\|^{2}_{F}=\sum_{i\in[n_{1}]}\left\|U^{\star}_{(i)}\right\|^{2}_{2}\leq\mu r.

Our main interest is the uniform model where mm entries of XX^{\star} are observed uniformly at random, though we shall analyze a Bernoulli sampling model, where each entry of XX^{\star} is observed with probability p=m/n1n2p={m}/{n_{1}n_{2}}. One can transfer the results back to the uniform model, as the probability of failure under the uniform model is at most twice that under the Bernoulli model; see .

Using the rank-rr SVD of XX^{\star}, we can lift XX^{\star} to

The symmetric decomposition of YY^{\star} is not unique; our goal is to find a matrix in the set

since for any Z~S\widetilde{Z}\in\mathcal{S} we have X=Z~UZ~VX^{\star}=\widetilde{Z}_{U}{\widetilde{Z}_{V}}^{\top}. Let Ω\underline{\Omega} denote the corresponding observed entries of YY^{\star}, and consider minimization of the squared error

Note that YY^{\star} is not the unique minimizer of (6), nor is it the only possible positive semidefinite lifting of XX^{\star}. For example, let PP be an r×rr\times r nonsingular matrix, and form the matrices

Since Ω\underline{\Omega} does not contain any entry in the top-left or bottom-right block, YY^{\prime} is also a minimizer of (6). Thus, the solution set of the lifted problem is much larger than the set S\mathcal{S} of actual interest. For the sake of simple analysis, we shall focus on exact recovery of YY^{\star} only, and thus impose an additional regularizer to align the column spaces of ZUZ_{U} and ZVZ_{V}, as in . The regularized loss is

While this apparently introduces an extra tuning parameter, our analysis establishes linear convergence of the projected gradient descent algorithm when λ=12\lambda=\frac{1}{2}, and thus one may treat λ\lambda as a fixed number.

It is discussed in that one needs to ensure the iterates stay incoherent. Let C\mathcal{C} be the set of incoherent matrices

where we assume μ\mu is known and Z0Z^{0} will be determined.

Our algorithm is simply gradient descent on f(Z)f(Z), with projection onto C\mathcal{C}. Let M=p1PΩ(UVX)M=p^{-1}\mathcal{P}_{\Omega}(UV^{\top}-X^{\star}). Then the gradient of ff is given by

The projection PC\mathcal{P}_{\mathcal{C}} to the feasible set C\mathcal{C} has closed form solution, given by row-wise clipping:

Note that X0p1PΩ(X)X^{0}\equiv p^{-1}\mathcal{P}_{\Omega}(X^{\star}) is an unbiased estimator of XX^{\star} under the Bernoulli model. To initialize, we thus construct Z0Z^{0} from the top rank-rr factors of X0X^{0}. This leads to the following algorithm.

Remarks. (i) The step size η\eta is normalized by Z02\left\|Z^{0}\right\|^{2}. Our analysis will establish linear convergence when taking step sizes of the form η/σ1\eta/\sigma^{\star}_{1}, where η\eta is a sufficiently small constant. We replace σ1\sigma^{\star}_{1} by Z02\left\|Z^{0}\right\|^{2} in the actual algorithm since it is unknown in practice. (ii) The feasible set (9) depends on Z0\left\|Z^{0}\right\| as well. Under the above spectral initialization, our analysis shows that when pO(μκ2r2logn/n1n2)p\geq O({\mu\kappa^{2}r^{2}\log n}/{n_{1}\land n_{2}}), the term 2μrn1n2Z0\sqrt{\frac{2\mu r}{n_{1}\land n_{2}}}\left\|Z^{0}\right\| is an upper bound of Z2,\left\|Z^{\star}\right\|_{2,\infty} with high probability (see Corollary 1 below). This means S\mathcal{S} is a subset of C\mathcal{C}. Note that this does not change the global optimality of ZZ^{\star} and its equivalent elements, since f(Z)=0f(Z^{\star})=0. In practice, we find that the iterates of our algorithm remain incoherent, so that one may drop the projection step. (iii) The column space regularizer (8) is needed in our analysis. We also found that when λ=0\lambda=0, our algorithm typically converges to another PSD lifted matrix of XX^{\star}, with minor difference from YY^{\star} in the top-left and bottom-right blocks.

In the following section we state and sketch a proof of our main convergence result for this algorithm.

Main Result: Convergence Analysis

Suppose that XX^{\star} is of rank rr, with condition number κ=σ1/σr\kappa=\sigma^{\star}_{1}/\sigma^{\star}_{r}, and μ\mu-incoherent as defined in Definition 1. Suppose further that we observe mm entries of XX^{\star} chosen uniformly at random. Let Y=ZZY^{\star}=Z^{\star}{Z^{\star}}^{\top} be the lifted matrix as in (4) and write n=max(n1,n2)n=\max(n_{1},n_{2}). Then there exist universal constants c0,c1,c2,c3c_{0},c_{1},c_{2},c_{3} such that if

then with probability at least 1c1nc21-c_{1}n^{-c_{2}} the iterates of Algorithm 1 converge to ZZ^{\star} geometrically, when using regularization parameter λ=1/2\lambda=1/2, correctly specified input rank rr, and constant step size η/σ1\eta/\sigma^{\star}_{1} with ηc3/μ2r2κ\eta\leq\displaystyle{c_{3}}/{\mu^{2}r^{2}\kappa}.

We shall analyze the Bernoulli sampling model, as justified in Section 2. Let us define the distance to ZZ^{\star} in terms of the solution set S\mathcal{S}.

Define the distance between ZZ and ZZ^{\star} as

The next theorem establishes the global convergence of Algorithm 1, assuming that the input rank is correctly specified. The proof sketch is given in the next subsection.

There exist universal constants c0,c1,c2c_{0},c_{1},c_{2} such that if pc0μr2κ2lognn1n2p\geq\dfrac{c_{0}\mu r^{2}\kappa^{2}\log n}{n_{1}\land n_{2}}, with probability at least 1c1nc21-c_{1}n^{-c_{2}}, the initialization Z1CZ^{1}\in\mathcal{C} satisfies

Moreover, there exist universal constants c3,c4,c5,c6c_{3},c_{4},c_{5},c_{6} such that if pc3max(μr2κ2,μrlogn)n1n2p\geq\dfrac{c_{3}\max(\mu r^{2}\kappa^{2},\mu r\log n)}{n_{1}\land n_{2}}, when using constant step size η/σ1\eta/\sigma^{\star}_{1} with ηc4μ2r2κ\eta\leq\dfrac{c_{4}}{\mu^{2}r^{2}\kappa} and initial value Z1CZ^{1}\in\mathcal{C} obeying (13), the kkth step of Algorithm 1 with λ=1/2\lambda=1/2 satisfies

with probability at least 1c5nc61-c_{5}n^{-c_{6}}.

After each update, the distance of our iterates to ZZ^{\star} is reduced by at least a factor of 1O(1/μ2r2κ2)1-O(1/\mu^{2}r^{2}\kappa^{2}).

Hence, the output Z^\widehat{Z} satisfies d(Z^,Z)εd(\widehat{Z},Z^{\star})\leq\varepsilon after at most 2log1(1/(199256ηκ))log(σr/4ε)\left\lceil 2\log^{-1}\left({1}/{(1-\frac{99}{256}\cdot\frac{\eta}{\kappa})}\right)\log\left({\sqrt{\sigma^{\star}_{r}}}/{4\varepsilon}\right)\right\rceil iterations.

Our proof idea is of the same nature as the analysis in . We show two appealing properties when sufficient entries are observed. First, our spectral initialization produces a starting point within the O(σr)O(\sigma^{\star}_{r}) neighborhood of the solution set.

There exist universal constants c,c1,c2c,c_{1},c_{2}, such that if pcμr2κ2lognn1n2p\geq\dfrac{c\mu r^{2}\kappa^{2}\log n}{n_{1}\land n_{2}} then with probability at least 1c1nc21-c_{1}n^{-c_{2}},

To demonstrate this, we exploit the concentration around the mean of p1PΩ(X)p^{-1}\mathcal{P}_{\Omega}(X^{\star}). See Appendix B for the proof. Using this lemma, we can immediately show that ZZ^{\star} and all other elements of S\mathcal{S} are contained in the feasible set (9).

With probability at least 1c1nc21-c_{1}n^{-c_{2}}, Z2,2μrn1n2Z0.\left\|Z^{\star}\right\|_{2,\infty}\leq\sqrt{\frac{2\mu r}{n_{1}\land n_{2}}}\left\|Z^{0}\right\|.

The second crucial property is that f(Z)f(Z) is well-behaved within the O(σr)O(\sqrt{\sigma^{\star}_{r}}) neighborhood, so that the iterates move closer to the optima in every iteration. The key step is to set up a local regularity condition similar to Nesterov’s conditions .

Let \mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F} denote the matrix closest to ZZ in the solution set. We say that ff satisfies the regularity condition RC(ε,α,β)RC(\varepsilon,\alpha,\beta) if there exist constants α\alpha, β\beta such that for any ZCZ\in\mathcal{C} satisfying d(Z,Z)εd(Z,Z^{\star})\leq\varepsilon, we have

Using this condition, one can show the iterates converge linearly to the optima if we start close enough to ZZ^{\star}.

Consider the update Zk+1=PC(Zkμσ1f(Zk))Z^{k+1}=\mathcal{P}_{\mathcal{C}}\left(Z^{k}-\dfrac{\mu}{\sigma^{\star}_{1}}\nabla f(Z^{k})\right). If ff satisfies RC(ε,α,β)RC(\varepsilon,\alpha,\beta), d(Zk,Z)εd(Z^{k},Z^{\star})\leq\varepsilon and 0<μmin(α/2,2/β)0<\mu\leq\min(\alpha/2,2/\beta), then

The following lemma illustrates the local regularity of f(Z)f(Z). Nesterov’s criterion is established upon strong convexity and strong smoothness of the objective. Here we show analogous curvature and smoothness conditions holds for f(Z)f(Z) locally – within the O(σr)O(\sqrt{\sigma^{\star}_{r}}) neighborhood – with high probability. Interestingly, we found that to show the local curvature condition holds, it suffices to set λ=12\lambda=\frac{1}{2}. The proof can be found in Appendix C, for which we have generalized some technical lemmas of .

Let the regularization constant be set to λ=12\lambda=\frac{1}{2}. There exists universal constant c,c1,c2c,c_{1},c_{2}, such that if pcmax(μ2r2κ2,μrlogn)n1n2p\geq\dfrac{c\max(\mu^{2}r^{2}\kappa^{2},\mu r\log n)}{n_{1}\land n_{2}}, then ff satisfies RC(14σr,512/99,13196μ2r2κ)RC(\frac{1}{4}\sqrt{\sigma^{\star}_{r}},512/99,13196\mu^{2}r^{2}\kappa), with probability at least 1c1nc21-c_{1}n^{-c_{2}}.

Related Work

Matrix completion is one instance of the general low rank linear inverse problem

where A\mathcal{A} is an affine transformation and b=A(X)b=\mathcal{A}(X^{\star}) is the measurement of the ground truth XX^{\star}. Considerable progress has been made towards algorithms for recovering XX^{\star} including both convex and nonconvex approaches. One of the most popular methods is nuclear norm minimization, a convenient convex relaxation of rank minimization. It was first proposed in , and analyzed under a certain restricted isometry property (RIP). Subsequent work clarified the conditions for reconstruction, and studied recovery guarantees for both exact and approximately low rank matrices, with or without noise . One significant advantage for this approach is its near-optimal sample complexity. Under the same incoherence assumption as ours, Chen establishes the currently best-known lower bound of O(μrnlog2n)O(\mu rn\log^{2}n) samples. Using a closely related notion of incoherence, Negahban and Wainwright show that if XX^{\star} is “α\alpha-nonspiky” with XXFαn1n2\frac{\left\|X^{\star}\right\|_{\infty}}{\left\|X^{\star}\right\|_{F}}\leq\frac{\alpha}{\sqrt{n_{1}n_{2}}}, then O(α2rnlogn)O(\alpha^{2}rn\log n) samples are sufficient for exact recovery. However, convexity and low sample complexity aside, in practice the power of nuclear norm relaxation is limited due to high computational cost. The popular algorithms for nuclear norm minimization are proximal methods that perform iterative singular value thresholding . However, such algorithms don’t scale to large instances because the per-iteration SVD is expensive.

In the same year, Jain et al. suggested minimizing the squared residual A(X)b2\left\|\mathcal{A}(X)-b\right\|^{2} under a rank constraint rank(X)r\operatorname{\text{rank}}(X)\leq r. While this constraint is nonconvex, projection onto the feasible set can be computed using low rank SVD. Under certain RIP assumption on A\mathcal{A}, Jain et al. establish the global convergence of projected gradient descent for (14). This algorithm is named Singular Value Projection (SVP). Yet in the setting of completion, only experimental support for the effectiveness of SVP is provided. More importantly, SVP also suffers from expensive per-iteration SVD for large scale problems.

Another theoretical disadvantage of the resampling scheme is that the sample complexity depends on the desired accuracy ε\varepsilon, as established by . As the accuracy goes to zero, the sample complexity increases. In contrast, our algorithm doesn’t require resampling, and the sample complexity is independent of ε\varepsilon.

In 2014, Candès et al. proposed Wirtinger flow for phase retrieval. Wirtinger flow is a fast first-order algorithm that minimizes a fourth order (nonconvex) objective, geometrically converging to the global optimum. While previous work lifts the phase retrieval problem into an SDP where the solution is rank one, this work bridges SDP and first-order algorithms via the Burer-Monteiro technique. It has inspired further research on related topics; last year, the authors of considered factorizations for (14), assuming XX^{\star} is semidefinite, and proved global optimality of first-order algorithms under appropriate initializations. Tu et al. have extended this algorithm to handle rectangular matrix via asymmetric factorization, and have shown exact recovery of XX^{\star}, assuming A\mathcal{A} satisfies a certain RIP. They use lifting implicitly, factorizing X=ZUZVX=Z_{U}Z_{V}^{\top} and applying gradient updates on both factors ZUZ_{U} and ZVZ_{V} simultaneously, with the nonconvex objective function

Their proof strategy also shows convergence of ZZ in the lifted space. For the specific case of matrix completion, Chen and Wainwright obtained guarantees when XX^{\star} is semidefinite. Our work generalizes the results obtained in , extending the recent literature on first-order algorithms for factorized models.

After completing this work we learned of independent research of Sun and Luo , who also analysed a gradient algorithm for rectangular matrix completion. Their formulation is similar to ours, with additional Frobenius norm constraints on the factors. The authors established a sample complexity of O(r7κ6)O(r^{7}\kappa^{6}) observations; in comparison our bound scales as O(r2κ2)O(r^{2}\kappa^{2}). The authors also analyzed block coordinate descent type alternating minimization, which cyclically updates the rows of UU and then the rows of VV, showing exact recovery of this algorithm without resampling. Recent independent work of Yi et al. analyzes a gradient scheme for Robust PCA. Under the setting of partial observation without corruption, this is the standard matrix completion problem. In other related work, also study nonconvex optimization methods for matrix completion, using algorithms that still require low rank SVD in each iteration.

Experiments

where λ=0\lambda=0 will enforce the minimizer fitting the observed values exactly. We use ADMM to solve (16). It is based on the algorithm for the matrix approach in , and can neatly handle the case λ=0\lambda=0. We emphasize there is no computational difference between cases whether λ\lambda is zero or not. All methods are implemented in MATLAB. We use the toolbox Manopt for trustRegion and the implementation of OptSpace from the authors. For AltMin, we use the same sample sets in every iteration. The experiments were run on a Linux machine with a 3.4GHz Intel Core i7 processor and 8 GB memory.

Runtime Comparison

We randomly generated a true matrix XX^{\star} of size 4000×20004000\times 2000 and rank 3. It is constructed from the rank-33 SVD of a random 4000×20004000\times 2000 matrix with i.i.d normal entries. We sampled m=199057m=199057 entries of XX^{\star} uniformly at random, where mm is roughly equal to 2nrlogn2nr\log n with n=4000n=4000 and r=3r=3. For simplicity, we feed SVP, OptSpace and GD with the true rank. For all these methods, we use the randomized algorithm of Halko et al. to compute the low rank SVD, which is approximately 15 times faster than MATLAB built-in SVD on instances of such size. We report relative error measured in the Frobenius norm, defined as X^XF/XF\|\widehat{X}-X^{\star}\|_{F}/\left\|X^{\star}\right\|_{F}. For nuclear, we set λ=0\lambda=0 to enforce exact fitting. The convergence speed of ADMM mildly depends on the choice of penalty parameter. We tested 55 values 0.1,0.2,0.5,1,1.50.1,0.2,0.5,1,1.5 and selected 0.20.2, which leads to fastest convergence. Similarly, for SVP, we would like to choose the largest step size for which the algorithm is converging. We evaluated 15,20,30,35,4015,20,30,35,40 and selected 3030. The step size is chosen for GD in the same way. Five values 20,50,70,75,8020,50,70,75,80 are tested for η\eta and we picked 7070. For OptSpace, we compared fixed step sizes 0.50.10.050.010.0050.50.10.050.010.005 with line search, and found the algorithm converged fastest under line search. Figure 1a shows the results. GD is slightly slower than trustRegion and faster than competing approaches.

To further illustrate how runtime scales as the dimension increases, we run larger instances of size 10000×500010000\times 5000 and 20000×500020000\times 5000, where the true rank is 4040. The parameters are selected in the same manner, and we terminate the computation once the relative error is below 1e91e^{-9}. We report the results of AltMin GD, SVP and trustRegion in Figure 2a; nuclear, OptSpace do not scale well to such sizes so that we didn’t include them. The runtime of AltMin scales the slowest, while the runtimes of GD and trustRegion increase slower than SVP.

Sample Complexity

We evaluate the number of observations required by GD for exact recovery. For simplicity, we consider square but asymmetric XX^{\star}. We conducted experiments in 4 cases, where the randomly generated XX^{\star} is of size 500×500500\times 500 or 1000×10001000\times 1000, and of rank 1010 or 2020. In each case, we compute the solutions of GD given mm random observations, and a solution with relative error below 1e61e^{-6} is considered to be successful. We run 20 trials and compute the empirical probability of successful recovery. The results are shown in Figure 2b. For all four cases, the phase transitions occur around m3.5nrm\approx 3.5nr. This suggests that the actual sample complexity of GD may scale linearly with both the dimension nn and the rank rr.

Conclusion

We propose a lifting procedure together with Burer-Monteiro factorization and a first-order algorithm to carry out rectangular matrix completion. While optimizing a nonconvex objective, we establish linear convergence of our method to the global optimum with O(μr2κ2nmax(μ,logn))O(\mu r^{2}\kappa^{2}n\max(\mu,\log n)) random observations. We conjecture that O(nr)O(nr) observations are sufficient for exact recovery, and that the column space regularizer can be dropped. We provide empirical evidence showing this simple algorithm is fast and scalable, suggesting that lifting techniques may be promising for much more general classes of problems.

Acknowledgements

The authors thank Rina Foygel Barber, Yudong Chen and Ruoyu Sun for helpful comments. Research supported in part by ONR grant N00014-15-1-2379 and NSF grant DMS-1513594.

Appendix A Technical Lemmas

Another way of writing the objective function is

where ll is an index of Ω\underline{\Omega}, AlA_{l} is a matrix with 11 at the corresponding observed entry and elsewhere. Let H=ZZ ⁣H=Z-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu, the gradient can be written as

We will use the following facts throughout the proof:

Inequality (17) is a direct result of Definition 1. To see (18), note that H2,Z2,+Z ⁣2,2μrn1n2σ1+μrn1n2σ1\left\|H\right\|_{2,\infty}\leq\left\|Z\right\|_{2,\infty}+\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|_{2,\infty}\leq\sqrt{\frac{2\mu r}{n_{1}\land n_{2}}\sigma_{1}}+\sqrt{\frac{\mu r}{n_{1}\land n_{2}}\sigma^{\star}_{1}}, and σ1σ1116σ1|\sigma_{1}-\sigma^{\star}_{1}|\leq\frac{1}{16}\sigma^{\star}_{1} by the discussion of initialization in Appendix B. For (20), it holds that

where AΛBA\Lambda B^{\top} is the SVD of ZZ{Z^{\star}}^{\top}Z. Clearly, ZZ ⁣Z^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu is positive semidefinite, and HZ ⁣=ZZ ⁣Z ⁣Z ⁣=BΛBZ ⁣Z ⁣H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=Z^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu-{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=B\Lambda B^{\top}-{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu is symmetric.

Next, we list several technical lemmas that are utilized later. We will use cc to denote a numerical constant, whose value may vary from line to line.

For any ZZ of the form Z=[ZUZV]=[UΣ12RVΣ12R]Z=\begin{bmatrix}Z_{U}\\ Z_{V}\end{bmatrix}=\begin{bmatrix}U\Sigma^{\frac{1}{2}}R\\ V\Sigma^{\frac{1}{2}}R\end{bmatrix}, where U,V,RU,V,R are unitary matrices and Σ0\Sigma\succeq 0 is a diagonal matrix, we have

where X=UΣVX^{\star}=U^{\star}\Sigma^{\star}{V^{\star}}^{\top}. We have

Plugging (24) back into (LABEL:eq:ZZ_to_X_helper0), we obtain the lemma. ∎

Recall that n=max(n1,n2)n=\max(n_{1},n_{2}). We will exploit the following two known concentration results.

Let PT\mathcal{P}_{T} be the Euclidean projection onto TT. There is a numerical constant cc such that for any δ(0,1]\delta\in(0,1], if pcδ2μrlognn1n2p\geq\dfrac{c}{\delta^{2}}\dfrac{\mu r\log n}{n_{1}\land n_{2}}, then with probability 13n31-3n^{-3}, we have

Lemma 7 upper bounds the spectral norm of the adjacency matrix of a random Erdős-Rényi graph. It is a variant of Lemma 7.1 of Keshavan et al. , which uses known results of Feige and Ofek .

We refer readers to for a complete proof, in particular noticing that one can choose pp large enough so that the constant factor in the first term in (26) is only 1+δ1+\delta.

Lemma 8, 9 and 10 are direct generalizations of Lemma 4 and 5 of .

There exists a constant cc such that, for any δ(0,1]\delta\in(0,1], if pcδ2max(log(n1+n2)n1+n2,μ2r2κ2n1n2)p\geq\frac{c}{\delta^{2}}\max\left(\frac{\log(n_{1}+n_{2})}{n_{1}+n_{2}},\frac{\mu^{2}r^{2}\kappa^{2}}{n_{1}\land n_{2}}\right), then with probability at least 112(n1+n2)41-\frac{1}{2}(n_{1}+n_{2})^{-4}, uniformly for all HH such that H2,3μrn1n2σ1\left\|H\right\|_{2,\infty}\leq 3\sqrt{\frac{\mu r}{n_{1}\land n_{2}}\sigma^{\star}_{1}}, we have

where (a)(a) follows from Lemma 7 and (b)(b) follows from H2,3μrn1n2σ1\left\|H\right\|_{2,\infty}\leq 3\sqrt{\frac{\mu r}{n_{1}\land n_{2}}\sigma^{\star}_{1}}.

Let us further assume p162c22μ2r2κ2γδ2(n1n2)p\geq\frac{162c^{2}_{2}\mu^{2}r^{2}\kappa^{2}\gamma}{\delta^{2}(n_{1}\land n_{2})}, where γ=n/(n1n2)\gamma=n/(n_{1}\land n_{2}) is a fixed constant, then we can bound

The final threshold we obtain is thus pcδ2max(log(n1+n2)n1+n2,μ2r2κ2n1n2)p\geq\frac{c}{\delta^{2}}\max\left(\frac{\log(n_{1}+n_{2})}{n_{1}+n_{2}},\frac{\mu^{2}r^{2}\kappa^{2}}{n_{1}\land n_{2}}\right) for some constant cc. ∎

There exists a constant cc, if pclognn1n2p\geq\dfrac{c\log n}{n_{1}\land n_{2}}, then with probability at least 12n142n241-2n_{1}^{-4}-2n_{2}^{-4}, uniformly for all matrices AA, BB such that ABAB^{\top} is of size (n1+n2)×(n1+n2)(n_{1}+n_{2})\times(n_{1}+n_{2}),

Let ΩYi={j:(i,j)Ω}\Omega_{Y_{i}}=\left\{j:(i,j)\in\underline{\Omega}\right\} denote the set of entries sampled in the iith row of ABAB^{\top}. Note that because of the structure of Ω\underline{\Omega}, at most n2n_{2} entries are sampled at the frist n1n_{1} rows, and at most n1n_{1} entries are sampled at the rest n2n_{2} rows.

Using a binomial tail bound, if pclogn2n2p\geq\dfrac{c\log n_{2}}{n_{2}} for sufficiently large cc, the event maxi[n1]ΩYi2pn2\max_{i\in[n_{1}]}|\Omega_{Y_{i}}|\leq 2pn_{2} holds with probability at least 1n241-n_{2}^{-4}. Similarly for the rest n2n_{2} rows. Hence, if pclognn1n2p\geq\dfrac{c\log n}{n_{1}\land n_{2}} for some constant cc, with probability at least 1n14n241-n_{1}^{-4}-n_{2}^{-4}, we have maxi[n1+n2]ΩYi2pn\max_{i\in[n_{1}+n_{2}]}|\Omega_{Y_{i}}|\leq 2pn.

Conditioning on this event, we then have for all A,BA,B of proper size,

Similarly we can prove with probability at least 1n14n241-n_{1}^{-4}-n_{2}^{-4},

The following lemma establishes restricted strong convexity and smoothness of the observation operator for matrices in TT.

Let TT be the subspace defined in (25). There exists a universal constant cc such that, if pcδ2μrlognn1n2p\geq\dfrac{c}{\delta^{2}}\dfrac{\mu r\log n}{n_{1}\land n_{2}}, with probability at least 13n31-3n^{-3}, uniformly for all ATA\in T, we have

Consequently, uniformly for all A,BTA,B\in T,

Let AA be a matrix in TT. Rewriting PΩ(A)F2=PΩPT(A),PΩPT(A)=A,PTPΩPT(A)\left\|\mathcal{P}_{\Omega}(A)\right\|^{2}_{F}=\langle\mathcal{P}_{\Omega}\mathcal{P}_{T}(A),\mathcal{P}_{\Omega}\mathcal{P}_{T}(A)\rangle=\langle A,\mathcal{P}_{T}\mathcal{P}_{\Omega}\mathcal{P}_{T}(A)\rangle, and using the Cauchy-Schwarz inequality and (32) we can bound

where (a)(a) follows from Lemma 6. Combining (33) and (34) proves (30). To show (31), let A=AAFA^{\prime}=\frac{A}{\left\|A\right\|_{F}} and B=BBFB^{\prime}=\frac{B}{\left\|B\right\|_{F}}. Both A+BA^{\prime}+B^{\prime} and ABA^{\prime}-B^{\prime} are in TT. We have

where (b)(b) follows from (30). Thus, we have

Last, we want to show the projection onto feasible set C\mathcal{C} is a contraction.

If x2θ\left\|x\right\|_{2}\leq\theta, then P2θ(x)=x\mathcal{P}_{\left\|\cdot\right\|_{2}\leq\theta}(x)=x. Otherwise P2θ(x)=θxˉ\mathcal{P}_{\left\|\cdot\right\|_{2}\leq\theta}(x)=\theta\bar{x}, where xˉ=xx2\bar{x}=\frac{x}{\left\|x\right\|_{2}}. Write y=(yxˉ)xˉ+Px(y)y=(y^{\top}\bar{x})\bar{x}+\mathcal{P}^{\perp}_{x}(y), we have

If yxˉ0y^{\top}\bar{x}\leq 0, then (39) holds because x>θ\left\|x\right\|>\theta. If yxˉ>0y^{\top}\bar{x}>0, (39) still holds since x>θyyxˉ\left\|x\right\|>\theta\geq\left\|y\right\|\geq y^{\top}\bar{x}. ∎

Appendix B Initialization

Let δ\delta denote the upper bound of p1PΩ(X)X\left\|p^{-1}\mathcal{P}_{\Omega}(X^{\star})-X^{\star}\right\| as in Lemma 5, and let σ1σn\sigma_{1}\geq\ldots\geq\sigma_{n} denote the singular values of p1PΩ(X)p^{-1}\mathcal{P}_{\Omega}(X^{\star}). By Weyl’s theorem, we have

Note this implies σr+1δ\sigma_{r+1}\leq\delta, as σr+1=0\sigma^{\star}_{r+1}=0.

By definition, Z0=[U;V]Σ12Z^{0}=[U;V]\Sigma^{\frac{1}{2}}, where UΣVU\Sigma V^{\top} is the rank-rr SVD of p1PΩ(X)p^{-1}\mathcal{P}_{\Omega}(X^{\star}). According to Lemma 4, one has

where (a)(a) holds because rank(UΣVX)2r\operatorname{\text{rank}}(U\Sigma V^{\top}-X^{\star})\leq 2r, (b)(b) holds since UΣVp1PΩ(X)=σr+1δ\left\|U\Sigma V^{\top}-p^{-1}\mathcal{P}_{\Omega}(X^{\star})\right\|=\sigma_{r+1}\leq\delta.

Let H=Z0Z ⁣0H=Z^{0}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}. We want to bound d(Z0,Z)2=HF2d(Z^{0},Z^{\star})^{2}=\left\|H\right\|^{2}_{F}. According to (20), HZ ⁣0H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0} is symmetric and Z0Z ⁣0{Z^{0}}^{\top}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}} is positive semidefinite. Hence we can write

where in the second line we used that HZ ⁣0H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0} is symmetric. Besides, as Z0Z ⁣0{Z^{0}}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0} is positive semidefinite, (42)tr((HH)(Z0Z ⁣0))(4-\sqrt{2})\operatorname{\text{tr}}((H^{\top}H)({Z^{0}}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0})) is nonnegative. Therefore,

where in (a)(a) we replaced δ\delta using Lemma 5, and (b)(b) holds since by our incoherence assumption (3) we have

Note that for (c)(c) we used AB2,A2,B\left\|AB^{\top}\right\|_{2,\infty}\leq\left\|A\right\|_{2,\infty}\left\|B\right\|.

Hence, to obtain d(Z0,Z)2116σrd(Z^{0},Z^{\star})^{2}\leq\frac{1}{16}\sigma^{\star}_{r}, it suffices to have

Since PC\mathcal{P}_{\mathcal{C}} is just row-wise clipping, by Lemma 11 we have

B.2 Proof of Corollary 1

By the incoherence assumption, we have Z2,μrn1n2σ1\left\|Z^{\star}\right\|_{2,\infty}\leq\sqrt{\frac{\mu r}{n_{1}\land n_{2}}\sigma^{\star}_{1}}, see (17). It suffices to show 2σ1σ12\sigma_{1}\geq\sigma^{\star}_{1}. From the above discussion, we can see that

By Wely’s theorem, we have σ1σ1116σr|\sigma_{1}-\sigma^{\star}_{1}|\leq\frac{1}{16}\sigma^{\star}_{r}. As a result, 2σ1σ12\sigma_{1}\geq\sigma^{\star}_{1}.

Appendix C Regularity Condition

Analogous to the restricted strong convexity (RSC) and restricted strong smoothness (RSS), we show that with high probability our objective function ff satisfies the local curvature and local smoothness conditions defined below.

There exists constant c1,c2c_{1},c_{2} such that for any ZCZ\in\mathcal{C} satisfying d(Z,Z)14σrd(Z,Z^{\star})\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}},

There exist constants c3,c4c_{3},c_{4} such that for any ZCZ\in\mathcal{C} satisfying d(Z,Z)14σrd(Z,Z^{\star})\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}},

where we used equation (19) for (i)(i), the Cauchy-Schwarz inequality for (ii)(ii), inequality (ab)2a22b2(a-b)^{2}\geq\frac{a^{2}}{2}-b^{2} for (iii)(iii). Finally, in the last line we used l=12mAl,M2=PΩ(M)F2\sum_{l=1}^{2m}\langle A_{l},M\rangle^{2}=\left\|\mathcal{P}_{\underline{\Omega}}(M)\right\|^{2}_{F}.

We first lower bound 12p1PΩ(HZ ⁣+Z ⁣H)F2\frac{1}{2}p^{-1}\left\|\mathcal{P}_{\underline{\Omega}}(H{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}+\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top})\right\|^{2}_{F}. By the symmetry of Ω\underline{\Omega}, it is equal to p1PΩ(HUZ ⁣V+Z ⁣UHV)F2p^{-1}\left\|\mathcal{P}_{\Omega}(H_{U}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}_{V}+\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu_{U}H^{\top}_{V})\right\|^{2}_{F}, which expands to

As both HUZ ⁣VH_{U}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}_{V} and Z ⁣UHV\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu_{U}H^{\top}_{V} belong to TT, we use Lemma 10 to lower bound above three terms, respectively. This gives us

where we used HUZ ⁣VF2σrHUF2\left\|H_{U}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}_{V}\right\|^{2}_{F}\geq\sigma^{\star}_{r}\left\|H_{U}\right\|^{2}_{F} and Z ⁣UHVF2σrHVF2\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu_{U}H^{\top}_{V}\right\|^{2}_{F}\geq\sigma^{\star}_{r}\left\|H_{V}\right\|^{2}_{F} for (iv)(iv).

Next, we lower bound 2HUZ ⁣V,Z ⁣UHV+λtr(HΓ)2\langle H_{U}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}_{V},\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu_{U}H^{\top}_{V}\rangle+\lambda\operatorname{\text{tr}}(H^{\top}\Gamma) together. Rewriting

and plugging in Γ=DZZDZ\Gamma=DZZ^{\top}DZ, we then have

Equality (a)(a) holds because Z ⁣DZ ⁣=0{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}D\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=0. We plug in (53) in (b)(b). For (c)(c), we use Z ⁣DZ ⁣=0{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}D\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=0 and that HZ ⁣H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu is symmetric. Finally, we take λ=12\lambda=\frac{1}{2} and use Lemma 8 to upper bound p1PΩ(HH)F2p^{-1}\left\|\mathcal{P}_{\underline{\Omega}}(HH^{\top})\right\|^{2}_{F}:

For simplicity, we take δ=116\delta=\frac{1}{16}. We also have HF2116σr\left\|H\right\|^{2}_{F}\leq\frac{1}{16}\sigma^{\star}_{r}. This leads to

Note that this lower bound holds with high probability uniformly for all ZCZ\in\mathcal{C} such that d(Z,Z)14σrd(Z,Z^{\star})\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}}, since Lemma 8 and 10 hold uniformly.

When the ground truth XX^{\star} is positive semidefinite, we don’t need to do lifitng nor impose the regularizer. Using Lemma 10, we can lower bound 12p1PΩ(HZ ⁣+Z ⁣H)F2(1δ)σrHF2\frac{1}{2}p^{-1}\left\|\mathcal{P}_{\underline{\Omega}}(H{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}+\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top})\right\|^{2}_{F}\gtrsim(1-\delta)\sigma^{\star}_{r}\left\|H\right\|^{2}_{F} directly. Taking proper constants, we can obtain the standard restricted strong convexity condition:

C.2 Proof of the Local Smoothness Condition

To upper bound f(Z)F2=maxWF=1f(Z),W2\left\|\nabla f(Z)\right\|^{2}_{F}=\max_{\left\|W\right\|_{F}=1}|\langle\nabla f(Z),W\rangle|^{2}, it suffices to show that for any n×rn\times r WW of unit Frobenius norm, f(Z),W2|\langle\nabla f(Z),W\rangle|^{2} is upper bounded. We first write

where we used (19) for (i)(i). Since (a+b+c+d+e)25(a2+b2+c2+d2+e2)(a+b+c+d+e)^{2}\leq 5(a^{2}+b^{2}+c^{2}+d^{2}+e^{2}), we have

where we used the Cauchy-Schwarz inequality for (ii)(ii), and (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) for (iii)(iii). We then use Lemma 9 to upper bound 1, 2, 4, 5, 6, 7, and Lemma 8 for 3. Also since WF=1\left\|W\right\|_{F}=1, one has

where in the last line we plugged in Z ⁣2,μrnσ1\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|_{2,\infty}\leq\sqrt{\dfrac{\mu r}{n}\sigma^{\star}_{1}} and H2,3μrnσ1\left\|H\right\|_{2,\infty}\leq 3\sqrt{\dfrac{\mu r}{n}\sigma^{\star}_{1}}, i.e. (17) and (18).

Inequality (a)(a) holds because ABFABF\left\|AB\right\|_{F}\leq\left\|A\right\|\left\|B\right\|_{F} and D=1\left\|D\right\|=1. To get (b)(b), for the first term in the 3rd line we expand ZZZ ⁣Z ⁣ZZ^{\top}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}, for the second term we expand Z=Z ⁣+HZ=\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu+H and use Z ⁣DZ ⁣=0{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}D\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=0. For (c)(c), we use ABFABFAFBF\left\|AB\right\|_{F}\leq\left\|A\right\|\left\|B\right\|_{F}\leq\left\|A\right\|_{F}\left\|B\right\|_{F}. Last, (d)(d) holds because Z ⁣2=2σ1\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}=2\sigma^{\star}_{1}.

Finally, we combine (59) and (60). As before, take λ=12\lambda=\frac{1}{2}, δ=116\delta=\frac{1}{16}, and HF2116σr\left\|H\right\|^{2}_{F}\leq\frac{1}{16}\sigma^{\star}_{r}, we obtain

where for (a)(a) we used ZH+Z ⁣14σr+2σ174σ1\left\|Z\right\|\leq\left\|H\right\|+\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}}+\sqrt{2\sigma^{\star}_{1}}\leq\frac{7}{4}\sqrt{\sigma^{\star}_{1}}, for (b)(b) we used μ,r1\mu,r\geq 1.

As before, this condition holds uniformly for all ZZ such that d(Z,Z)14σrd(Z,Z^{\star})\leq\frac{1}{4}\sqrt{\sigma^{\star}_{r}} and satisfying the incoherence condition.

For the case XX^{\star} is positive semidefinite, as we don’t need to impose the regularizer, standard restricted strong smoothness condition follows:

C.3 Proof of Lemma 3

Rearranging the terms in the smoothness condition (61), we can further bound

Combining equation (56) and (62), it follows that

Finally, by upper bounding the probability that Lemma 8, 9, or 10 fails, and the sample probability pp these lemmas require, we conclude that once

regularity condition (63) holds with probability at least 1c1nc21-c_{1}n^{-c_{2}}, where c,c1,c2c,c_{1},c_{2} are constants.

Appendix D Linear Convergence

Let Hk=ZkZ ⁣kH^{k}=Z^{k}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{k}. Our iterate is Zk+1=PC(Zkηf(Zk))Z^{k+1}=\mathcal{P}_{\mathcal{C}}(Z^{k}-\eta\nabla f(Z^{k})). Since PC\mathcal{P}_{\mathcal{C}} is just row-wise clipping, by Lemma 11 we have

where we use the definition of RC(ε,α,β)RC(\varepsilon,\alpha,\beta) for (a)(a) and 0<ηmin{α/2,2/β}0<\eta\leq\min\left\{\alpha/2,2/\beta\right\} for (b)(b). Therefore,

References