Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent
Qinqing Zheng, John Lafferty
Introduction
A growing body of recent research is shedding new light on the role of nonconvex optimization for tackling large scale problems in machine learning, signal processing, and convex programming. This work is developing techniques that help to explain the surprising effectiveness of relatively simple first-order algorithms for certain nonconvex optimizations.
When applied to problems that can be formulated as semidefinite programs, these techniques can often be viewed as part of a framework proposed by Burer and Monteiro . The Burer-Monteiro technique is based on factoring the semidefinite variable, and applying classical optimization techniques to the resulting nonconvex objective over the factor. While worst-case complexity considerations imply that such an approach cannot succeed in general, a series of recent papers has shown the strategy to be remarkably effective for a number of problems of practical interest, with analytical convergence guarantees and strong empirical performance.
In order for this problem to be well-posed, it is important to understand when is identifiable and, in particular, the unique minimizer of (1). Moreover, because the problem is in general NP-hard, it is essential to identify tractable families of instances, together with efficient algorithms having global convergence guarantees.
In the following section we give a full description of our approach. Our theoretical results are presented in Section 3, with detailed proofs contained in the appendix. Our analysis subsumes the case where is positive semidefinite. In Section 4 we briefly review related work. The experimental results are presented in Section 5, and we conclude with a brief discussion of future work in Section 6.
Semidefinite Lifting, Factorization, and Gradient Descent
In this paper, we focus on completing an incoherent or “non-spiky” matrix . With denoting the rank- SVD of , we assume is -incoherent, as defined below.
The matrix is -incoherent with respect to the canonical basis if its singular vectors satisfy
where is a constant.Note that , since .
Our main interest is the uniform model where entries of are observed uniformly at random, though we shall analyze a Bernoulli sampling model, where each entry of is observed with probability . One can transfer the results back to the uniform model, as the probability of failure under the uniform model is at most twice that under the Bernoulli model; see .
Using the rank- SVD of , we can lift to
The symmetric decomposition of is not unique; our goal is to find a matrix in the set
since for any we have . Let denote the corresponding observed entries of , and consider minimization of the squared error
Note that is not the unique minimizer of (6), nor is it the only possible positive semidefinite lifting of . For example, let be an nonsingular matrix, and form the matrices
Since does not contain any entry in the top-left or bottom-right block, is also a minimizer of (6). Thus, the solution set of the lifted problem is much larger than the set of actual interest. For the sake of simple analysis, we shall focus on exact recovery of only, and thus impose an additional regularizer to align the column spaces of and , as in . The regularized loss is
While this apparently introduces an extra tuning parameter, our analysis establishes linear convergence of the projected gradient descent algorithm when , and thus one may treat as a fixed number.
It is discussed in that one needs to ensure the iterates stay incoherent. Let be the set of incoherent matrices
where we assume is known and will be determined.
Our algorithm is simply gradient descent on , with projection onto . Let . Then the gradient of is given by
The projection to the feasible set has closed form solution, given by row-wise clipping:
Note that is an unbiased estimator of under the Bernoulli model. To initialize, we thus construct from the top rank- factors of . This leads to the following algorithm.
Remarks. (i) The step size is normalized by . Our analysis will establish linear convergence when taking step sizes of the form , where is a sufficiently small constant. We replace by in the actual algorithm since it is unknown in practice. (ii) The feasible set (9) depends on as well. Under the above spectral initialization, our analysis shows that when , the term is an upper bound of with high probability (see Corollary 1 below). This means is a subset of . Note that this does not change the global optimality of and its equivalent elements, since . In practice, we find that the iterates of our algorithm remain incoherent, so that one may drop the projection step. (iii) The column space regularizer (8) is needed in our analysis. We also found that when , our algorithm typically converges to another PSD lifted matrix of , with minor difference from in the top-left and bottom-right blocks.
In the following section we state and sketch a proof of our main convergence result for this algorithm.
Main Result: Convergence Analysis
Suppose that is of rank , with condition number , and -incoherent as defined in Definition 1. Suppose further that we observe entries of chosen uniformly at random. Let be the lifted matrix as in (4) and write . Then there exist universal constants such that if
then with probability at least the iterates of Algorithm 1 converge to geometrically, when using regularization parameter , correctly specified input rank , and constant step size with .
We shall analyze the Bernoulli sampling model, as justified in Section 2. Let us define the distance to in terms of the solution set .
Define the distance between and as
The next theorem establishes the global convergence of Algorithm 1, assuming that the input rank is correctly specified. The proof sketch is given in the next subsection.
There exist universal constants such that if , with probability at least , the initialization satisfies
Moreover, there exist universal constants such that if , when using constant step size with and initial value obeying (13), the th step of Algorithm 1 with satisfies
with probability at least .
After each update, the distance of our iterates to is reduced by at least a factor of .
Hence, the output satisfies after at most iterations.
Our proof idea is of the same nature as the analysis in . We show two appealing properties when sufficient entries are observed. First, our spectral initialization produces a starting point within the neighborhood of the solution set.
There exist universal constants , such that if then with probability at least ,
To demonstrate this, we exploit the concentration around the mean of . See Appendix B for the proof. Using this lemma, we can immediately show that and all other elements of are contained in the feasible set (9).
With probability at least ,
The second crucial property is that is well-behaved within the neighborhood, so that the iterates move closer to the optima in every iteration. The key step is to set up a local regularity condition similar to Nesterov’s conditions .
Let \mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F} denote the matrix closest to in the solution set. We say that satisfies the regularity condition if there exist constants , such that for any satisfying , we have
Using this condition, one can show the iterates converge linearly to the optima if we start close enough to .
Consider the update . If satisfies , and , then
The following lemma illustrates the local regularity of . Nesterov’s criterion is established upon strong convexity and strong smoothness of the objective. Here we show analogous curvature and smoothness conditions holds for locally – within the neighborhood – with high probability. Interestingly, we found that to show the local curvature condition holds, it suffices to set . The proof can be found in Appendix C, for which we have generalized some technical lemmas of .
Let the regularization constant be set to . There exists universal constant , such that if , then satisfies , with probability at least .
Related Work
Matrix completion is one instance of the general low rank linear inverse problem
where is an affine transformation and is the measurement of the ground truth . Considerable progress has been made towards algorithms for recovering including both convex and nonconvex approaches. One of the most popular methods is nuclear norm minimization, a convenient convex relaxation of rank minimization. It was first proposed in , and analyzed under a certain restricted isometry property (RIP). Subsequent work clarified the conditions for reconstruction, and studied recovery guarantees for both exact and approximately low rank matrices, with or without noise . One significant advantage for this approach is its near-optimal sample complexity. Under the same incoherence assumption as ours, Chen establishes the currently best-known lower bound of samples. Using a closely related notion of incoherence, Negahban and Wainwright show that if is “-nonspiky” with , then samples are sufficient for exact recovery. However, convexity and low sample complexity aside, in practice the power of nuclear norm relaxation is limited due to high computational cost. The popular algorithms for nuclear norm minimization are proximal methods that perform iterative singular value thresholding . However, such algorithms don’t scale to large instances because the per-iteration SVD is expensive.
In the same year, Jain et al. suggested minimizing the squared residual under a rank constraint . While this constraint is nonconvex, projection onto the feasible set can be computed using low rank SVD. Under certain RIP assumption on , Jain et al. establish the global convergence of projected gradient descent for (14). This algorithm is named Singular Value Projection (SVP). Yet in the setting of completion, only experimental support for the effectiveness of SVP is provided. More importantly, SVP also suffers from expensive per-iteration SVD for large scale problems.
Another theoretical disadvantage of the resampling scheme is that the sample complexity depends on the desired accuracy , as established by . As the accuracy goes to zero, the sample complexity increases. In contrast, our algorithm doesn’t require resampling, and the sample complexity is independent of .
In 2014, Candès et al. proposed Wirtinger flow for phase retrieval. Wirtinger flow is a fast first-order algorithm that minimizes a fourth order (nonconvex) objective, geometrically converging to the global optimum. While previous work lifts the phase retrieval problem into an SDP where the solution is rank one, this work bridges SDP and first-order algorithms via the Burer-Monteiro technique. It has inspired further research on related topics; last year, the authors of considered factorizations for (14), assuming is semidefinite, and proved global optimality of first-order algorithms under appropriate initializations. Tu et al. have extended this algorithm to handle rectangular matrix via asymmetric factorization, and have shown exact recovery of , assuming satisfies a certain RIP. They use lifting implicitly, factorizing and applying gradient updates on both factors and simultaneously, with the nonconvex objective function
Their proof strategy also shows convergence of in the lifted space. For the specific case of matrix completion, Chen and Wainwright obtained guarantees when is semidefinite. Our work generalizes the results obtained in , extending the recent literature on first-order algorithms for factorized models.
After completing this work we learned of independent research of Sun and Luo , who also analysed a gradient algorithm for rectangular matrix completion. Their formulation is similar to ours, with additional Frobenius norm constraints on the factors. The authors established a sample complexity of observations; in comparison our bound scales as . The authors also analyzed block coordinate descent type alternating minimization, which cyclically updates the rows of and then the rows of , showing exact recovery of this algorithm without resampling. Recent independent work of Yi et al. analyzes a gradient scheme for Robust PCA. Under the setting of partial observation without corruption, this is the standard matrix completion problem. In other related work, also study nonconvex optimization methods for matrix completion, using algorithms that still require low rank SVD in each iteration.
Experiments
where will enforce the minimizer fitting the observed values exactly. We use ADMM to solve (16). It is based on the algorithm for the matrix approach in , and can neatly handle the case . We emphasize there is no computational difference between cases whether is zero or not. All methods are implemented in MATLAB. We use the toolbox Manopt for trustRegion and the implementation of OptSpace from the authors. For AltMin, we use the same sample sets in every iteration. The experiments were run on a Linux machine with a 3.4GHz Intel Core i7 processor and 8 GB memory.
Runtime Comparison
We randomly generated a true matrix of size and rank 3. It is constructed from the rank- SVD of a random matrix with i.i.d normal entries. We sampled entries of uniformly at random, where is roughly equal to with and . For simplicity, we feed SVP, OptSpace and GD with the true rank. For all these methods, we use the randomized algorithm of Halko et al. to compute the low rank SVD, which is approximately 15 times faster than MATLAB built-in SVD on instances of such size. We report relative error measured in the Frobenius norm, defined as . For nuclear, we set to enforce exact fitting. The convergence speed of ADMM mildly depends on the choice of penalty parameter. We tested values and selected , which leads to fastest convergence. Similarly, for SVP, we would like to choose the largest step size for which the algorithm is converging. We evaluated and selected . The step size is chosen for GD in the same way. Five values are tested for and we picked . For OptSpace, we compared fixed step sizes with line search, and found the algorithm converged fastest under line search. Figure 1a shows the results. GD is slightly slower than trustRegion and faster than competing approaches.
To further illustrate how runtime scales as the dimension increases, we run larger instances of size and , where the true rank is . The parameters are selected in the same manner, and we terminate the computation once the relative error is below . We report the results of AltMin GD, SVP and trustRegion in Figure 2a; nuclear, OptSpace do not scale well to such sizes so that we didn’t include them. The runtime of AltMin scales the slowest, while the runtimes of GD and trustRegion increase slower than SVP.
Sample Complexity
We evaluate the number of observations required by GD for exact recovery. For simplicity, we consider square but asymmetric . We conducted experiments in 4 cases, where the randomly generated is of size or , and of rank or . In each case, we compute the solutions of GD given random observations, and a solution with relative error below is considered to be successful. We run 20 trials and compute the empirical probability of successful recovery. The results are shown in Figure 2b. For all four cases, the phase transitions occur around . This suggests that the actual sample complexity of GD may scale linearly with both the dimension and the rank .
Conclusion
We propose a lifting procedure together with Burer-Monteiro factorization and a first-order algorithm to carry out rectangular matrix completion. While optimizing a nonconvex objective, we establish linear convergence of our method to the global optimum with random observations. We conjecture that observations are sufficient for exact recovery, and that the column space regularizer can be dropped. We provide empirical evidence showing this simple algorithm is fast and scalable, suggesting that lifting techniques may be promising for much more general classes of problems.
Acknowledgements
The authors thank Rina Foygel Barber, Yudong Chen and Ruoyu Sun for helpful comments. Research supported in part by ONR grant N00014-15-1-2379 and NSF grant DMS-1513594.
Appendix A Technical Lemmas
Another way of writing the objective function is
where is an index of , is a matrix with at the corresponding observed entry and elsewhere. Let , the gradient can be written as
We will use the following facts throughout the proof:
Inequality (17) is a direct result of Definition 1. To see (18), note that , and by the discussion of initialization in Appendix B. For (20), it holds that
where is the SVD of . Clearly, is positive semidefinite, and is symmetric.
Next, we list several technical lemmas that are utilized later. We will use to denote a numerical constant, whose value may vary from line to line.
For any of the form , where are unitary matrices and is a diagonal matrix, we have
where . We have
Plugging (24) back into (LABEL:eq:ZZ_to_X_helper0), we obtain the lemma. ∎
Recall that . We will exploit the following two known concentration results.
Let be the Euclidean projection onto . There is a numerical constant such that for any , if , then with probability , we have
Lemma 7 upper bounds the spectral norm of the adjacency matrix of a random Erdős-Rényi graph. It is a variant of Lemma 7.1 of Keshavan et al. , which uses known results of Feige and Ofek .
We refer readers to for a complete proof, in particular noticing that one can choose large enough so that the constant factor in the first term in (26) is only .
Lemma 8, 9 and 10 are direct generalizations of Lemma 4 and 5 of .
There exists a constant such that, for any , if , then with probability at least , uniformly for all such that , we have
where follows from Lemma 7 and follows from .
Let us further assume , where is a fixed constant, then we can bound
The final threshold we obtain is thus for some constant . ∎
There exists a constant , if , then with probability at least , uniformly for all matrices , such that is of size ,
Let denote the set of entries sampled in the th row of . Note that because of the structure of , at most entries are sampled at the frist rows, and at most entries are sampled at the rest rows.
Using a binomial tail bound, if for sufficiently large , the event holds with probability at least . Similarly for the rest rows. Hence, if for some constant , with probability at least , we have .
Conditioning on this event, we then have for all of proper size,
Similarly we can prove with probability at least ,
The following lemma establishes restricted strong convexity and smoothness of the observation operator for matrices in .
Let be the subspace defined in (25). There exists a universal constant such that, if , with probability at least , uniformly for all , we have
Consequently, uniformly for all ,
Let be a matrix in . Rewriting , and using the Cauchy-Schwarz inequality and (32) we can bound
where follows from Lemma 6. Combining (33) and (34) proves (30). To show (31), let and . Both and are in . We have
where follows from (30). Thus, we have
Last, we want to show the projection onto feasible set is a contraction.
If , then . Otherwise , where . Write , we have
If , then (39) holds because . If , (39) still holds since . ∎
Appendix B Initialization
Let denote the upper bound of as in Lemma 5, and let denote the singular values of . By Weyl’s theorem, we have
Note this implies , as .
By definition, , where is the rank- SVD of . According to Lemma 4, one has
where holds because , holds since .
Let . We want to bound . According to (20), is symmetric and is positive semidefinite. Hence we can write
where in the second line we used that is symmetric. Besides, as is positive semidefinite, is nonnegative. Therefore,
where in we replaced using Lemma 5, and holds since by our incoherence assumption (3) we have
Note that for we used .
Hence, to obtain , it suffices to have
Since is just row-wise clipping, by Lemma 11 we have
B.2 Proof of Corollary 1
By the incoherence assumption, we have , see (17). It suffices to show . From the above discussion, we can see that
By Wely’s theorem, we have . As a result, .
Appendix C Regularity Condition
Analogous to the restricted strong convexity (RSC) and restricted strong smoothness (RSS), we show that with high probability our objective function satisfies the local curvature and local smoothness conditions defined below.
There exists constant such that for any satisfying ,
There exist constants such that for any satisfying ,
where we used equation (19) for , the Cauchy-Schwarz inequality for , inequality for . Finally, in the last line we used .
We first lower bound . By the symmetry of , it is equal to , which expands to
As both and belong to , we use Lemma 10 to lower bound above three terms, respectively. This gives us
where we used and for .
Next, we lower bound together. Rewriting
and plugging in , we then have
Equality holds because . We plug in (53) in . For , we use and that is symmetric. Finally, we take and use Lemma 8 to upper bound :
For simplicity, we take . We also have . This leads to
Note that this lower bound holds with high probability uniformly for all such that , since Lemma 8 and 10 hold uniformly.
When the ground truth is positive semidefinite, we don’t need to do lifitng nor impose the regularizer. Using Lemma 10, we can lower bound directly. Taking proper constants, we can obtain the standard restricted strong convexity condition:
C.2 Proof of the Local Smoothness Condition
To upper bound , it suffices to show that for any of unit Frobenius norm, is upper bounded. We first write
where we used (19) for . Since , we have
where we used the Cauchy-Schwarz inequality for , and for . We then use Lemma 9 to upper bound 1, 2, 4, 5, 6, 7, and Lemma 8 for 3. Also since , one has
where in the last line we plugged in and , i.e. (17) and (18).
Inequality holds because and . To get , for the first term in the 3rd line we expand , for the second term we expand and use . For , we use . Last, holds because .
Finally, we combine (59) and (60). As before, take , , and , we obtain
where for we used , for we used .
As before, this condition holds uniformly for all such that and satisfying the incoherence condition.
For the case is positive semidefinite, as we don’t need to impose the regularizer, standard restricted strong smoothness condition follows:
C.3 Proof of Lemma 3
Rearranging the terms in the smoothness condition (61), we can further bound
Combining equation (56) and (62), it follows that
Finally, by upper bounding the probability that Lemma 8, 9, or 10 fails, and the sample probability these lemmas require, we conclude that once
regularity condition (63) holds with probability at least , where are constants.
Appendix D Linear Convergence
Let . Our iterate is . Since is just row-wise clipping, by Lemma 11 we have
where we use the definition of for and for . Therefore,