A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements
Qinqing Zheng, John Lafferty
Introduction
Semidefinite programming has become a key optimization tool in many areas of applied mathematics, signal processing and machine learning. SDPs often arise naturally from the problem structure, or are derived as surrogate optimizations that are relaxations of difficult combinatorial problems . In spite of the importance of SDPs in principle—promising efficient algorithms with polynomial runtime guarantees—it is widely recognized that current optimization algorithms based on interior point methods can handle only relatively small problems. Thus, a considerable gap exists between the theory and applicability of SDP formulations. Scalable algorithms for semidefinite programming, and closely related families of nonconvex programs more generally, are greatly needed.
A parallel development is the surprising effectiveness of simple classical procedures such as gradient descent for large scale problems, as explored in the recent machine learning literature. In many areas of machine learning and signal processing such as classification, deep learning, and phase retrieval, gradient descent methods, in particular first order stochastic optimization, have led to remarkably efficient algorithms that can attack very large scale problems . In this paper we build on this work to develop first-order algorithms for solving the rank minimization problem under random measurements and a closely related family of semidefinite programs. Our algorithms are efficient and scalable, and we prove that they attain linear convergence to the global optimum under natural assumptions.
This problem is a direct generalization of compressed sensing, and subsumes many machine learning problems such as image compression, low rank matrix completion and low-dimensional metric embedding . While the problem is natural and has many applications, the optimization is nonconvex and challenging to solve. Without conditions on the transformation or the minimum rank solution , it is generally NP hard .
In addition to the wide applicability of affine rank minimization, the problem is also closely connected to a class of semidefinite programs. In Section 2, we show that the minimizer of a particular class of SDP can be obtained by a linear transformation of . Thus, efficient algorithms for problem (2) can be applied in this setting as well.
While this is a nonconvex function, we take motivation from recent work for phase retrieval by Candès et al. , and develop a gradient descent algorithm for optimizing , using a carefully constructed initialization and step size. Our main contributions concerning this algorithm are as follows.
We prove that with constraints our gradient descent scheme can exactly recover with high probability. Empirical experiments show that this bound may potentially be improved to .
We show that our method converges linearly, and has lower computational cost compared with previous methods.
We carry out a detailed comparison of rank minimization algorithms, and demonstrate that when the measurement matrices are sparse, our gradient method significantly outperforms alternative approaches.
In Section 3 we briefly review related work. In Section 4 we discuss the gradient scheme in detail. Our main analytical results are presented in Section 5, with detailed proofs contained in the supplementary material. Our experimental results are presented in Section 6, and we conclude with a brief discussion of future work in Section 7.
Semidefinite Programming and Rank Minimization
Before reviewing related work and presenting our algorithm, we pause to explain the connection between semidefinite programming and rank minimization. This connection enables our scalable gradient descent algorithm to be applied and analyzed for certain classes of SDPs.
Consider a standard form semidefinite program
where . In particular, minimizers of (3) are obtained from minimizers of (4) via the transformation
Since is positive semidefinite, is equal to . Hence, problem (4) is the nuclear norm relaxation of problem (2). Next, we characterize the specific cases where , so that the SDP and rank minimization solutions coincide. The following result is from Recht et al. .
holds for any matrix of rank at most . Suppose that there exists a rank matrix such that . If , then is the only matrix of rank at most satisfying . Furthermore, if , then can be attained by minimizing over the affine subset.
In other words, since , if holds for the transformation and one finds a matrix of rank satisfying the affine constraint, then must be positive semidefinite. Hence, one can ignore the semidefinite constraint when solving the rank minimization (2). The resulting problem then can be exactly solved by nuclear norm relaxation. Since the minimum rank solution is positive semidefinite, it then coincides with the solution of the SDP (4), which is a constrained nuclear norm optimization.
The observation that one can ignore the semidefinite constraint justifies our experimental comparison with methods such as nuclear norm relaxation, SVP, and AltMinSense, described in the following section.
Related Work
Burer and Monteiro proposed a general approach for solving semidefinite programs using factored, nonconvex optimization, giving mostly experimental support for the convergence of the algorithms. The first nontrivial guarantee for solving affine rank minimization problem is given by Recht et al. , based on replacing the rank function by the convex surrogate nuclear norm, as already mentioned in the previous section. While this is a convex problem, solving it in practice is nontrivial, and a variety of methods have been developed for efficient nuclear norm minimization. The most popular algorithms are proximal methods that perform singular value thresholding at every iteration. While effective for small problem instances, the computational expense of the SVD prevents the method from being useful for large scale problems.
Recently, Jain et al. proposed a projected gradient descent algorithm SVP (Singular Value Projection) that solves
As described above, considerable progress has been made on algorithms for rank minimization and certain semidefinite programming problems. Yet truly efficient, scalable and provably convergent algorithms have not yet been obtained. In the specific setting that is positive semidefinite, our algorithm exploits this structure to achieve these goals. We note that recent and independent work of Tu et al. proposes a hybrid algorithm called Procrustes Flow (PF), which uses a few iterations of SVP as initialization, and then applies gradient descent.
A Gradient Descent Algorithm for Rank Minimization
The authors establish the convergence of WF to the global optimum—given sufficient measurements, the iterates of WF converge linearly to up to a global phase, with high probability.
If and the s are real-valued, the function can be expressed as
which is a special case of where and each of and are rank one. See Figure 1a for an illustration; Figure 1b shows the convergence rate of our method. Our methods and results are thus generalizations of Wirtinger flow for phase retrieval.
Before turning to the presentation of our technical results in the following section, we present some intuition and remarks about how and why this algorithm works. For simplicity, let us assume that the rank is specified correctly.
Initialization is of course crucial in nonconvex optimization, as many local minima may be present. To obtain a sufficiently accurate initialization, we use a spectral method, similar to those used in . The starting point is the observation that a linear combination of the constraint values and matrices yields an unbiased estimate of the solution.
Certain key properties of will be seen to yield a linear rate of convergence. In the analysis of convex functions, Nesterov shows that for unconstrained optimization, the gradient descent scheme with sufficiently small step size will converge linearly to the optimum if the objective function is strongly convex and has a Lipschitz continuous gradient. However, these two properties are global and do not hold for our objective function . Nevertheless, we expect that similar conditions hold for the local area near . If so, then if we start close enough to , we can achieve the global optimum.
In our subsequent analysis, we establish the convergence of Algorithm 1 with a constant step size of the form , where is a small constant. Since is unknown, we replace it by .
Convergence Analysis
In this section we present our main result analyzing the gradient descent algorithm, and give a sketch of the proof. To begin, note that the symmetric decomposition of is not unique, since for any orthonormal matrix . Thus, the solution set is
Note that for any . We define the distance to the optimal solution in terms of this set.
Define the distance between and as
Our main result for exact recovery is stated below, assuming that the rank is correctly specified. Since the true rank is typically unknown in practice, one can start from a very low rank and gradually increase it.
Let the condition number denote the ratio of the largest to the smallest nonzero eigenvalues of . There exists a universal constant such that if , with high probability the initialization satisfies
Moreover, there exists a universal constant such that when using constant step size with and initial value obeying (6), the th step of Algorithm 1 satisfies
We now outline the proof, giving full details in the supplementary material. The proof has four main steps. The first step is to give a regularity condition under which the algorithm converges linearly if we start close enough to . This provides a local regularity property that is similar to the Nesterov criteria that the objective function is strongly convex and has a Lipschitz continuous gradient.
Let \mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F} denote the matrix closest to in the solution set. We say that satisfies the regularity condition if there exist constants , such that for any satisfying , we have
Using this regularity condition, we show that the iterative step of the algorithm moves closer to the optimum, if the current iterate is sufficiently close.
Consider the update . If satisfies , , and , then
In the next step of the proof, we condition on two events that will be shown to hold with high probability using concentration results. Let denote a small value to be specified later.
Here the expectations are with respect to the random measurement matrices. Under these assumptions, we can show that the objective satisfies the regularity condition with high probability.
Suppose that A1 and A2 hold. If , then satisfies the regularity condition with probability at least , where , are universal constants.
Next we show that under A1, a good initialization can be found.
Suppose that A1 holds. Let be the top eigenpairs of such that . Let where , . If , then
Finally, we show that conditioning on A1 and A2 is valid since these events have high probability as long as is sufficiently large.
holds with probability at least , where and are universal constants.
with probability at least .
Note that since we need , we have , and the number of measurements required by our algorithm scales as , while only samples are required by the regularity condition. We conjecture this bound could be further improved to be ; this is supported by the experimental results presented below.
Recently, Tu et al. establish a tighter bound overall. Specifically, when only one single SVP step is used in preprocessing, the initialization of PF is also the spectral decomposition of . The authors show that measurements are sufficient for the initial solution to satisfy with high probability, and demonstrate an sample complexity for the regularity condition.
Experiments
In this section we report the results of experiments on synthetic datasets. We compare our gradient descent algorithm with nuclear norm relaxation, SVP and AltMinSense for which we drop the positive semidefiniteness constraint, as justified by the observation in Section 2. We use ADMM for the nuclear norm minimization, based on the algorithm for the mixture approach in Tomioka et al. ; see Appendix G. For simplicity, we assume that AltMinSense, SVP and the gradient scheme know the true rank. Krylov subspace techniques such as the Lanczos method could be used compute the partial eigendecomposition; we use the randomized algorithm of Halko et al. to compute the low rank SVD. All methods are implemented in MATLAB and the experiments were run on a MacBook Pro with a 2.5GHz Intel Core i7 processor and 16 GB memory.
It is instructive to compare the per-iteration cost of the different approaches; see Table 1. Suppose that the density (fraction of nonzero entries) of each is . For AltMinSense, the cost of solving the least squares problem is . The other three methods have cost to compute the affine transformation. For the nuclear norm approach, the cost is from the SVD and the cost is due to the update of the dual variables. The gradient scheme requires operations to compute and to multiply by matrix to obtain the gradient. SVP needs operations to compute the top singular vectors. However, in practice this partial SVD is more expensive than the cost required for the matrix multiplies in the gradient scheme.
Clearly, AltMinSense is the least efficient. For the other approaches, in the dense case ( large), the affine transformation dominates the computation. Our method removes the overhead caused by the SVD. In the sparse case ( small), the other parts dominate and our method enjoys a low cost.
2 Runtime Comparison
We conduct experiments for both dense and sparse measurement matrices. AltMinSense is indeed slow, so we do not include it here.
In the first scenario, we randomly generate a rank- matrix where . We also generate matrices from the GOE, and then take . We report the relative error measured in the Frobenius norm defined as . For the nuclear norm approach, we set the regularization parameter to . We test three values for the penalty parameter and select as it leads to the fastest convergence. Similarly, for SVP we evaluate the three values for the step size, and select as the largest for which SVP converges. For our approach, we test the three values for and select in the same way.
where we use . For all the methods we use the same strategies as before to select parameters. For the nuclear norm approach, we try three values and select . For SVP, we test the three values for the step size and select . For the gradient algorithm, we check the three values for and choose .
The results are shown in Figures 2a and 2b. In the dense case, our method is faster than the nuclear norm approach and slightly outperforms SVP. In the sparse case, it is significantly faster than the other approaches.
3 Sample Complexity
We consider cases where or and is of rank one or two. The results are shown in Figure 2c. For SVP and our approach, the phase transitions happen around when is rank- and when is rank-. This scaling is close to the number of degrees of freedom in each case; this confirms that the sample complexity scales linearly with the rank . The phase transition for the nuclear norm approach occurs later. The results suggest that the sample complexity of our method should also scale as as for SVP and the nuclear norm approach .
Conclusion
We connect a special case of affine rank minimization to a class of semidefinite programs with random constraints. Building on a recently proposed first-order algorithm for phase retrieval , we develop a gradient descent procedure for rank minimization and establish convergence to the optimal solution with measurements. We conjecture that measurements are sufficient for the method to converge, and that the conditions on the sampling matrices can be significantly weakened. More broadly, the technique used in this paper—factoring the semidefinite matrix variable, recasting the convex optimization as a nonconvex optimization, and applying first-order algorithms—first proposed by Burer and Monteiro , may be effective for a much wider class of SDPs, and deserves further study.
Acknowledgements
Research supported in part by NSF grant IIS-1116730 and ONR grant N00014-12-1-0762. The authors thank Afonso Bandeira, Ryota Tomioka and the authors of Tu et al. for helpful comments on this work.
References
Appendix A Proof of Lemma 1
Appendix B Ingredients
We first present some technical lemmas that will be needed later. Recall Definition 2 that for any , \mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F}. Let . The th column of , , , are denoted by , , , respectively. We shall use the following formulas for the gradient and second order partial derivatives:
The next ingredient we need is the expectation of the second order partial derivatives with respect to the random measurement matrices.
The expectation of entry of is
The case where is a direct result of Lemma 2. For the other case, let be a GOE distributed random matrix. It follows from Lemma 1 that
Substituting this back into the above equation, we obtain the lemma. ∎
We next recall a concentration result for the operator (spectral) norm of the random measurement matrices.
(Ledoux and Rider [14, Theorem 1]) There exists two absolute constants and such that with probability at least ,
A tighter upper bound is actually given in the Tracy-Widow law: w.h.p. .
With probability at least , the average of the squared operator norm of the random measurement matrices is upper bounded by .
The following two technical lemmas are important tools for us. Define the set
Suppose that A1 holds: , for any such that . If , then for any it holds that
Let be the th column of . Since , it follows from the assumption of the lemma that
where we replace by and by . Taking the sum of the above inequalities, we obtain
Note that . Therefore,
Suppose that A2 holds: for any such that we have
Our goal is to bound . This can be expanded as
We first bound the sum of the quadratic terms. For any , we have
It follows from assumption (7) that for any ,
Taking the sum of above inequalities, we obtain
Similarly, we bound the sum of the cross terms. For any fixed such that , we have
Note that , and
By Lemma 7, . Replacing those terms in equation (10) gives us
Finally, we obtain the claim by noticing that
where are the singular values of . ∎
.
Let . Note that for any matrices that are of the same size. The equality holds when where is the SVD of . Hence, where is the SVD of ; . Therefore, is symmetric and positive semidefinite. Thus, is also symmetric. This implies that . ∎
Appendix C Linear Convergence
Let . Then we have that
where we use the definition of in the third line, in the third to last line and in the second to last line. Therefore,
Appendix D Regularity Condition
As mentioned before, Nesterov [16, Theorem 2.1.11] shows that the gradient scheme converges linearly under a condition similar to the regularity condition, which is satisfied if the function is strongly convex and has a Lipschitz continuous gradient (strongly smooth). In order to prove Theorem 4, we show that with high probability the function satisfies the local curvature condition, which is analogous to strong convexity, and the local smoothness condition, which is analogous to strong smoothness.
There exists a constant such that for any satisfying ,
There exist constants such that for any satisfying ,
where we use Cauchy-Schwarz inequality in the 2nd line, the inequality in the 5th line, Lemma 5 and 6 in the 7th line, and the fact that in the 8th line. Since and , we have
D.2 Proof of the Local Smoothness Condition
Since , we have
The first term in the righthand side can be upper bounded as
where we use the Cauchy-Schwarz inequality in the first and second line, Lemma 5 and in the third line and Corollary 1 in the last line.
The other three terms are bounded similarly. For the second term, we have
where we use Lemma 6 and 1. The third term is bounded as
Putting these inequalities together, we have
Since and , we have
D.3 Proof of the Regularity Condition
Now we combine the curvature and the smoothness conditions. For any , it holds that
Combining equation (11) and (12), we obtain
where we use . Thus we have
for and .
Appendix E Initialization
Let be the eigenvalues of . By Weyl’s theorem, we have
Since , it is easy to see and . Hence, , , and is the best rank approximation of . Therefore,
where we used in first line, the fact and inequality (13) in the last line.
Let . We want to bound . According to the discussion in Lemma 7, is symmetric and is positive semidefinite.
The following step closely follows . It holds that
where in the fourth line we used the property that the trace is invariant under cyclic permutations and .
Since is positive semidefinite, is nonnegative. Hence,
If , then
Appendix F Sample Complexity
In this section, we verify that our assumptions hold with high probability if , where is a constant that depends on , , and . Our proof relies on the following concentration inequality.
We first give a technical lemma that we will use later.
Let be a random matrix drawn from GOE. Let . There exist absolute constants , such that with probability at least , we have
Let . . Note that and are independent, hence . Besides, since , we can see that is distributed.
First we bound the operator norm of . We rewrite as
where , . As is GOE distributed, by Lemma 4,
where and are absolute constants.
Using the Gaussian tail inequality, we have
Combining inequalities (14) and (15), we have
where the last inequality follows from the union bound.
Next we bound the deviation of the term. By the corollary of Lemma 1 in Laurent and Massart , we have
Finally, the statement is obtained by choosing proper , , and using . ∎
It is equivalent to show that for any unit vector , with high probability,
where in the second line we use unitary invariance of the operator norm, and in the last line we denote by . Since the GOE is invariant under orthogonal conjugation, and are identically distributed. Hence, it suffices to prove the claim when , i.e.
where is the entry of and .
To show this, we apply Theorem 8, where . This requires that the operator norm of is bounded, for each . We address this by noticing that with high probability , . To be precise, by Lemma 8 there exist constants , such that
Taking the union bound over all the s leads to
By Theorem 8, if , then
Combining inequalities (18) and (19), we conclude that
F.2 Proof of Theorem 7
The formulation of the second order partial derivatives and their expectations is given in Appendix B.
It is easy to see that for any , . Thus it is sufficient to prove that for any two unitary vector and with high probability it holds that
We can decompose as for a certain unit vector that is orthogonal to , where . Let . It suffices to prove the following two claims.
For any unitary vector , with high probability
For any two orthogonal unit vectors and , with high probability
where and have the same distribution. Hence we only need to prove the case where :
where is the first column of .
Let denote , , and respectively. It is easy to see that
where . As , for , we can see that and are distributed with degrees of freedom and , respectively. Using the tail bound, we have
As , for ,
If , then by applying Theorem 8 we can see
Combining inequalities (21) and (20) leads to
Proof of (ii)
We only need to prove the case where and due to the same reason above. That is,
where and are the first and second columns of .
As before, let and let denote , , and respectively. From the proof of (i), we can see that with probability at least both and are no larger than . Since , we have
and . If , then by applying Theorem 8 we have
Appendix G ADMM for Nuclear Norm Minimization
We reformulate the nuclear norm minimizing problem as
where is the regularization parameter. will enforce the minimizer satisfying the affine constraint .
We apply ADMM to the dual problem of (23):
where we introduce an auxiliary variable to make this problem equality constrained.
The augmented Lagrangian of problem (24) can be written as
where is the multiplier, is the penalty parameter, and is the indicator function of the unit spectral norm ball i.e. equals if and otherwise.
Let denote the vectorization of a matrix, whose inverse mapping is denoted by . We can rewrite the transformations as and , where is a matrix whose th row is .
The ADMM starts from initialization and updates the three variables alternately. The updates can be computed in close forms:
where is the projection onto the unit spectral norm ball. Let be the singular value decomposition of ,
In fact, the update of can be combined with other steps without being computed explicitly. One only has to iterate the following two steps:
where is the singular value soft-thresholding operator defined as
The sequence of multipliers converges to the primal solution of (23). To speed up the update of , the Cholesky decomposition of is precomputed in our implementation.