A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements

Qinqing Zheng, John Lafferty

Introduction

Semidefinite programming has become a key optimization tool in many areas of applied mathematics, signal processing and machine learning. SDPs often arise naturally from the problem structure, or are derived as surrogate optimizations that are relaxations of difficult combinatorial problems . In spite of the importance of SDPs in principle—promising efficient algorithms with polynomial runtime guarantees—it is widely recognized that current optimization algorithms based on interior point methods can handle only relatively small problems. Thus, a considerable gap exists between the theory and applicability of SDP formulations. Scalable algorithms for semidefinite programming, and closely related families of nonconvex programs more generally, are greatly needed.

A parallel development is the surprising effectiveness of simple classical procedures such as gradient descent for large scale problems, as explored in the recent machine learning literature. In many areas of machine learning and signal processing such as classification, deep learning, and phase retrieval, gradient descent methods, in particular first order stochastic optimization, have led to remarkably efficient algorithms that can attack very large scale problems . In this paper we build on this work to develop first-order algorithms for solving the rank minimization problem under random measurements and a closely related family of semidefinite programs. Our algorithms are efficient and scalable, and we prove that they attain linear convergence to the global optimum under natural assumptions.

This problem is a direct generalization of compressed sensing, and subsumes many machine learning problems such as image compression, low rank matrix completion and low-dimensional metric embedding . While the problem is natural and has many applications, the optimization is nonconvex and challenging to solve. Without conditions on the transformation $\mathcal{A}$ or the minimum rank solution $X^{\star}$ , it is generally NP hard .

In addition to the wide applicability of affine rank minimization, the problem is also closely connected to a class of semidefinite programs. In Section 2, we show that the minimizer of a particular class of SDP can be obtained by a linear transformation of $X^{\star}$ . Thus, efficient algorithms for problem (2) can be applied in this setting as well.

While this is a nonconvex function, we take motivation from recent work for phase retrieval by Candès et al. , and develop a gradient descent algorithm for optimizing $f(Z)$ , using a carefully constructed initialization and step size. Our main contributions concerning this algorithm are as follows.

We prove that with $O(r^{3}n\log n)$ constraints our gradient descent scheme can exactly recover $X^{\star}$ with high probability. Empirical experiments show that this bound may potentially be improved to $O(rn\log n)$ .

We show that our method converges linearly, and has lower computational cost compared with previous methods.

We carry out a detailed comparison of rank minimization algorithms, and demonstrate that when the measurement matrices $A_{i}$ are sparse, our gradient method significantly outperforms alternative approaches.

In Section 3 we briefly review related work. In Section 4 we discuss the gradient scheme in detail. Our main analytical results are presented in Section 5, with detailed proofs contained in the supplementary material. Our experimental results are presented in Section 6, and we conclude with a brief discussion of future work in Section 7.

Semidefinite Programming and Rank Minimization

Before reviewing related work and presenting our algorithm, we pause to explain the connection between semidefinite programming and rank minimization. This connection enables our scalable gradient descent algorithm to be applied and analyzed for certain classes of SDPs.

Consider a standard form semidefinite program

where $A_{i}=L^{-1}\widetilde{A}_{i}{L^{-1}}^{\top}$ . In particular, minimizers $\widetilde{X}^{*}$ of (3) are obtained from minimizers $X^{*}$ of (4) via the transformation

Since $X$ is positive semidefinite, $\operatorname{\text{tr}}(X)$ is equal to $\left\|X\right\|_{*}$ . Hence, problem (4) is the nuclear norm relaxation of problem (2). Next, we characterize the specific cases where $X^{*}=X^{\star}$ , so that the SDP and rank minimization solutions coincide. The following result is from Recht et al. .

holds for any matrix $X$ of rank at most $k$ . Suppose that there exists a rank $r$ matrix $X^{\star}$ such that $\mathcal{A}(X^{\star})=b$ . If $\delta_{2r}<1$ , then $X^{\star}$ is the only matrix of rank at most $r$ satisfying $\mathcal{A}(X)=b$ . Furthermore, if $\delta_{5r}<1/10$ , then $X^{\star}$ can be attained by minimizing $\left\|X\right\|_{*}$ over the affine subset.

In other words, since $\delta_{2r}\leq\delta_{5r}$ , if $\delta_{5r}<1/10$ holds for the transformation $\mathcal{A}$ and one finds a matrix $X$ of rank $r$ satisfying the affine constraint, then $X$ must be positive semidefinite. Hence, one can ignore the semidefinite constraint $X\succeq 0$ when solving the rank minimization (2). The resulting problem then can be exactly solved by nuclear norm relaxation. Since the minimum rank solution is positive semidefinite, it then coincides with the solution of the SDP (4), which is a constrained nuclear norm optimization.

The observation that one can ignore the semidefinite constraint justifies our experimental comparison with methods such as nuclear norm relaxation, SVP, and AltMinSense, described in the following section.

Related Work

Burer and Monteiro proposed a general approach for solving semidefinite programs using factored, nonconvex optimization, giving mostly experimental support for the convergence of the algorithms. The first nontrivial guarantee for solving affine rank minimization problem is given by Recht et al. , based on replacing the rank function by the convex surrogate nuclear norm, as already mentioned in the previous section. While this is a convex problem, solving it in practice is nontrivial, and a variety of methods have been developed for efficient nuclear norm minimization. The most popular algorithms are proximal methods that perform singular value thresholding at every iteration. While effective for small problem instances, the computational expense of the SVD prevents the method from being useful for large scale problems.

Recently, Jain et al. proposed a projected gradient descent algorithm SVP (Singular Value Projection) that solves

As described above, considerable progress has been made on algorithms for rank minimization and certain semidefinite programming problems. Yet truly efficient, scalable and provably convergent algorithms have not yet been obtained. In the specific setting that $X^{\star}$ is positive semidefinite, our algorithm exploits this structure to achieve these goals. We note that recent and independent work of Tu et al. proposes a hybrid algorithm called Procrustes Flow (PF), which uses a few iterations of SVP as initialization, and then applies gradient descent.

A Gradient Descent Algorithm for Rank Minimization

The authors establish the convergence of WF to the global optimum—given sufficient measurements, the iterates of WF converge linearly to $x$ up to a global phase, with high probability.

If $z$ and the $a_{i}$ s are real-valued, the function $f_{\texttt{WF}}(z)$ can be expressed as

which is a special case of $f(Z)$ where $A_{i}=a_{i}a_{i}^{\top}$ and each of $Z$ and $X^{\star}$ are rank one. See Figure 1a for an illustration; Figure 1b shows the convergence rate of our method. Our methods and results are thus generalizations of Wirtinger flow for phase retrieval.

Before turning to the presentation of our technical results in the following section, we present some intuition and remarks about how and why this algorithm works. For simplicity, let us assume that the rank is specified correctly.

Initialization is of course crucial in nonconvex optimization, as many local minima may be present. To obtain a sufficiently accurate initialization, we use a spectral method, similar to those used in . The starting point is the observation that a linear combination of the constraint values and matrices yields an unbiased estimate of the solution.

Certain key properties of $f(Z)$ will be seen to yield a linear rate of convergence. In the analysis of convex functions, Nesterov shows that for unconstrained optimization, the gradient descent scheme with sufficiently small step size will converge linearly to the optimum if the objective function is strongly convex and has a Lipschitz continuous gradient. However, these two properties are global and do not hold for our objective function $f(Z)$ . Nevertheless, we expect that similar conditions hold for the local area near $Z^{\star}$ . If so, then if we start close enough to $Z^{\star}$ , we can achieve the global optimum.

In our subsequent analysis, we establish the convergence of Algorithm 1 with a constant step size of the form $\mu/\left\|Z^{\star}\right\|^{2}_{F}$ , where $\mu$ is a small constant. Since $\left\|Z^{\star}\right\|_{F}$ is unknown, we replace it by $\left\|Z^{0}\right\|_{F}$ .

Convergence Analysis

In this section we present our main result analyzing the gradient descent algorithm, and give a sketch of the proof. To begin, note that the symmetric decomposition of $X^{\star}$ is not unique, since $X^{\star}=(Z^{\star}U)(Z^{\star}U)^{\top}$ for any $r\times r$ orthonormal matrix $U$ . Thus, the solution set is

Note that $\|\widetilde{Z}\|^{2}_{F}=\left\|X^{\star}\right\|_{*}$ for any $\widetilde{Z}\in\mathcal{S}$ . We define the distance to the optimal solution in terms of this set.

Define the distance between $Z$ and $Z^{\star}$ as

Our main result for exact recovery is stated below, assuming that the rank is correctly specified. Since the true rank is typically unknown in practice, one can start from a very low rank and gradually increase it.

Let the condition number $\kappa=\sigma_{1}/\sigma_{r}$ denote the ratio of the largest to the smallest nonzero eigenvalues of $X^{\star}$ . There exists a universal constant $c_{0}$ such that if $m\geq c_{0}\kappa^{2}r^{3}n\log n$ , with high probability the initialization $Z^{0}$ satisfies

Moreover, there exists a universal constant $c_{1}$ such that when using constant step size $\mu/\left\|Z^{\star}\right\|^{2}_{F}$ with $\mu\leq\dfrac{c_{1}}{\kappa n}$ and initial value $Z^{0}$ obeying (6), the $k$ th step of Algorithm 1 satisfies

We now outline the proof, giving full details in the supplementary material. The proof has four main steps. The first step is to give a regularity condition under which the algorithm converges linearly if we start close enough to $Z^{\star}$ . This provides a local regularity property that is similar to the Nesterov criteria that the objective function is strongly convex and has a Lipschitz continuous gradient.

Let $\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F}$ denote the matrix closest to $Z$ in the solution set. We say that $f$ satisfies the regularity condition $RC(\varepsilon,\alpha,\beta)$ if there exist constants $\alpha$ , $\beta$ such that for any $Z$ satisfying $d(Z,Z^{\star})\leq\varepsilon$ , we have

Using this regularity condition, we show that the iterative step of the algorithm moves closer to the optimum, if the current iterate is sufficiently close.

Consider the update $Z^{k+1}=Z^{k}-\dfrac{\mu}{\left\|Z^{\star}\right\|^{2}_{F}}\nabla f(Z^{k})$ . If $f$ satisfies $RC(\varepsilon,\alpha,\beta)$ , $d(Z^{k},Z^{\star})\leq\varepsilon$ , and $0<\mu<\min(\alpha/2,2/\beta)$ , then

In the next step of the proof, we condition on two events that will be shown to hold with high probability using concentration results. Let $\delta$ denote a small value to be specified later.

Here the expectations are with respect to the random measurement matrices. Under these assumptions, we can show that the objective satisfies the regularity condition with high probability.

Suppose that A1 and A2 hold. If $\delta\leq\frac{1}{16}\sigma_{r}$ , then $f$ satisfies the regularity condition $RC(\sqrt{\tfrac{3}{16}\sigma_{r}},24,513\kappa n)$ with probability at least $1-mCe^{-\rho n}$ , where $C$ , $\rho$ are universal constants.

Next we show that under A1, a good initialization can be found.

Suppose that A1 holds. Let $\left\{v_{s},\lambda_{s}\right\}_{s=1}^{r}$ be the top $r$ eigenpairs of $M=\frac{1}{m}\sum\limits_{i=1}^{m}b_{i}A_{i}$ such that $|\lambda_{1}|\geq\cdots\geq|\lambda_{r}|$ . Let $Z^{0}=[z_{1},\ldots,z_{r}]$ where $z_{s}=\sqrt{\tfrac{|\lambda_{s}|}{2}}\cdot v_{s}$ , $s\in[r]$ . If $\delta\leq\tfrac{\sigma_{r}}{4\sqrt{r}}$ , then

Finally, we show that conditioning on A1 and A2 is valid since these events have high probability as long as $m$ is sufficiently large.

holds with probability at least $1-mCe^{-\rho n}-\frac{2}{n^{2}}$ , where $C$ and $\rho$ are universal constants.

with probability at least $1-6me^{-n}-\frac{4}{n^{2}}$ .

Note that since we need $\delta\leq\min\left(\frac{1}{16},\frac{1}{4\sqrt{r}}\right)\sigma_{r}$ , we have $\frac{\delta}{r\sigma_{1}}\leq 1$ , and the number of measurements required by our algorithm scales as $O(r^{3}\kappa^{2}n\log n)$ , while only $O(r^{2}\kappa^{2}n\log n)$ samples are required by the regularity condition. We conjecture this bound could be further improved to be $O(rn\log n)$ ; this is supported by the experimental results presented below.

Recently, Tu et al. establish a tighter $O(r^{2}\kappa^{2}n)$ bound overall. Specifically, when only one single SVP step is used in preprocessing, the initialization of PF is also the spectral decomposition of $\frac{1}{2}M$ . The authors show that $O(r^{2}\kappa^{2}n)$ measurements are sufficient for the initial solution to satisfy $d(Z^{0},Z^{\star})\leq O(\sqrt{\sigma_{r}})$ with high probability, and demonstrate an $O(rn)$ sample complexity for the regularity condition.

Experiments

In this section we report the results of experiments on synthetic datasets. We compare our gradient descent algorithm with nuclear norm relaxation, SVP and AltMinSense for which we drop the positive semidefiniteness constraint, as justified by the observation in Section 2. We use ADMM for the nuclear norm minimization, based on the algorithm for the mixture approach in Tomioka et al. ; see Appendix G. For simplicity, we assume that AltMinSense, SVP and the gradient scheme know the true rank. Krylov subspace techniques such as the Lanczos method could be used compute the partial eigendecomposition; we use the randomized algorithm of Halko et al. to compute the low rank SVD. All methods are implemented in MATLAB and the experiments were run on a MacBook Pro with a 2.5GHz Intel Core i7 processor and 16 GB memory.

It is instructive to compare the per-iteration cost of the different approaches; see Table 1. Suppose that the density (fraction of nonzero entries) of each $A_{i}$ is $\rho$ . For AltMinSense, the cost of solving the least squares problem is $O(mn^{2}r^{2}+n^{3}r^{3}+mn^{2}r\rho)$ . The other three methods have $O(mn^{2}\rho)$ cost to compute the affine transformation. For the nuclear norm approach, the $O(n^{3})$ cost is from the SVD and the $O(m^{2})$ cost is due to the update of the dual variables. The gradient scheme requires $2n^{2}r$ operations to compute $Z^{k}{Z^{k}}^{\top}$ and to multiply $Z^{k}$ by $n\times n$ matrix to obtain the gradient. SVP needs $O(n^{2}r)$ operations to compute the top $r$ singular vectors. However, in practice this partial SVD is more expensive than the $2n^{2}r$ cost required for the matrix multiplies in the gradient scheme.

Clearly, AltMinSense is the least efficient. For the other approaches, in the dense case ( $\rho$ large), the affine transformation dominates the computation. Our method removes the overhead caused by the SVD. In the sparse case ( $\rho$ small), the other parts dominate and our method enjoys a low cost.

2 Runtime Comparison

We conduct experiments for both dense and sparse measurement matrices. AltMinSense is indeed slow, so we do not include it here.

In the first scenario, we randomly generate a $400\times 400$ rank- $2$ matrix $X^{\star}=xx^{\top}+yy^{\top}$ where $x,y\sim\mathcal{N}(0,I)$ . We also generate $m=6n$ matrices $A_{1},\ldots,A_{m}$ from the GOE, and then take $b=\mathcal{A}(X^{\star})$ . We report the relative error measured in the Frobenius norm defined as $\|\widehat{X}-X^{\star}\|_{F}/\|X^{\star}\|_{F}$ . For the nuclear norm approach, we set the regularization parameter to $\lambda=10^{-5}$ . We test three values $\eta=10,100,200$ for the penalty parameter and select $\eta=100$ as it leads to the fastest convergence. Similarly, for SVP we evaluate the three values $5\times 10^{-5},10^{-4},2\times 10^{-4}$ for the step size, and select $10^{-4}$ as the largest for which SVP converges. For our approach, we test the three values $0.6,0.8,1.0$ for $\mu$ and select $0.8$ in the same way.

where we use $\rho=0.001$ . For all the methods we use the same strategies as before to select parameters. For the nuclear norm approach, we try three values $\eta=10,100,200$ and select $\eta=100$ . For SVP, we test the three values $5\times 10^{-3},2\times 10^{-3},10^{-3}$ for the step size and select $10^{-3}$ . For the gradient algorithm, we check the three values $0.8,1,1.5$ for $\mu$ and choose $1$ .

The results are shown in Figures 2a and 2b. In the dense case, our method is faster than the nuclear norm approach and slightly outperforms SVP. In the sparse case, it is significantly faster than the other approaches.

3 Sample Complexity

We consider cases where $n=60$ or $100$ and $X^{\star}$ is of rank one or two. The results are shown in Figure 2c. For SVP and our approach, the phase transitions happen around $m=1.5n$ when $X^{\star}$ is rank- $1$ and $m=2.5n$ when $X^{\star}$ is rank- $2$ . This scaling is close to the number of degrees of freedom in each case; this confirms that the sample complexity scales linearly with the rank $r$ . The phase transition for the nuclear norm approach occurs later. The results suggest that the sample complexity of our method should also scale as $O(rn\log n)$ as for SVP and the nuclear norm approach .

Conclusion

We connect a special case of affine rank minimization to a class of semidefinite programs with random constraints. Building on a recently proposed first-order algorithm for phase retrieval , we develop a gradient descent procedure for rank minimization and establish convergence to the optimal solution with $O(r^{3}n\log n)$ measurements. We conjecture that $O(rn\log n)$ measurements are sufficient for the method to converge, and that the conditions on the sampling matrices $A_{i}$ can be significantly weakened. More broadly, the technique used in this paper—factoring the semidefinite matrix variable, recasting the convex optimization as a nonconvex optimization, and applying first-order algorithms—first proposed by Burer and Monteiro , may be effective for a much wider class of SDPs, and deserves further study.

Acknowledgements

Research supported in part by NSF grant IIS-1116730 and ONR grant N00014-12-1-0762. The authors thank Afonso Bandeira, Ryota Tomioka and the authors of Tu et al. for helpful comments on this work.

References

Appendix A Proof of Lemma 1

Appendix B Ingredients

We first present some technical lemmas that will be needed later. Recall Definition 2 that for any $Z$ , $\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F}$ . Let $H=Z-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu$ . The $s$ th column of $Z$ , $\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu$ , $Z^{\star}$ , $H$ are denoted by $z_{s}$ , $\bar{z}_{s}$ , $z^{\star}_{s}$ , $h_{s}$ respectively. We shall use the following formulas for the gradient and second order partial derivatives:

The next ingredient we need is the expectation of the second order partial derivatives with respect to the random measurement matrices.

The expectation of $(i,j)$ entry of $Axy^{\top}A$ is

The case where $k\neq s$ is a direct result of Lemma 2. For the other case, let $A=(a_{ij})$ be a GOE distributed random matrix. It follows from Lemma 1 that

Substituting this back into the above equation, we obtain the lemma. ∎

We next recall a concentration result for the operator (spectral) norm of the random measurement matrices.

(Ledoux and Rider [14, Theorem 1]) There exists two absolute constants $C$ and $\rho=\frac{1}{\sqrt{8}C}$ such that with probability at least $1-Ce^{-\rho n}$ ,

A tighter upper bound is actually given in the Tracy-Widow law: w.h.p. $\left\|A_{i}\right\|=O(2\sqrt{n}+n^{1/6})$ .

With probability at least $1-mCe^{-\rho n}$ , the average of the squared operator norm of the random measurement matrices is upper bounded by $9n$ .

The following two technical lemmas are important tools for us. Define the set

Suppose that A1 holds: $\left\|\frac{1}{m}\sum_{i=1}^{m}(u^{\top}A_{i}u)A_{i}-2uu^{\top}\right\|\leq\frac{\delta}{r}$ , for any $u$ such that $\left\|u\right\|\leq\sqrt{\sigma_{1}}$ . If $\delta\leq\frac{1}{16}\sigma_{r}$ , then for any $Z\in E\left(\sqrt{\frac{3}{16}\sigma_{r}}\right)$ it holds that

Let $h_{s}$ be the $s$ th column of $H$ . Since $\max_{s\in[r]}\left\|h_{s}\right\|_{2}\leq\left\|H\right\|_{F}\leq\sqrt{\frac{3}{16}\sigma_{r}}\leq\sqrt{\sigma_{1}}$ , it follows from the assumption of the lemma that

where we replace $\sum\limits_{s=1}^{r}h^{\top}_{s}A_{i}h_{s}$ by $\operatorname{\text{tr}}(H^{\top}A_{i}H)$ and $\sum_{s=1}^{r}h_{s}h_{s}^{\top}$ by $HH^{\top}$ . Taking the sum of the above inequalities, we obtain

Note that $\operatorname{\text{tr}}(H^{\top}HH^{\top}H)=\left\|HH^{\top}\right\|^{2}_{F}$ . Therefore,

Suppose that A2 holds: for any $\widetilde{Z}$ such that $\widetilde{Z}\widetilde{Z}^{\top}=X^{\star}$ we have

Our goal is to bound $\frac{1}{m}\sum\limits_{i=1}^{m}\operatorname{\text{tr}}(H^{\top}A_{i}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)^{2}$ . This can be expanded as

We first bound the sum of the quadratic terms. For any $s\in[r]$ , we have

It follows from assumption (7) that for any $s\in[r]$ ,

Taking the sum of above inequalities, we obtain

Similarly, we bound the sum of the cross terms. For any fixed $s,k$ such that $s\neq k$ , we have

Note that $\sum_{sk}h^{\top}_{s}\bar{z}_{k}\bar{z}_{s}^{\top}h_{k}=\operatorname{\text{tr}}(H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)$ , $\sum_{sk}\bar{z}_{s}^{\top}\bar{z}_{k}h_{s}^{\top}h_{k}=\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\right\|^{2}_{F}$ and

By Lemma 7, $\operatorname{\text{tr}}(H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)=\left\|H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}_{F}$ . Replacing those terms in equation (10) gives us

Finally, we obtain the claim by noticing that

where $\sqrt{\sigma_{1}}=\sigma_{\max}(\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)\geq\cdots\geq\sigma_{\min}(\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)=\sqrt{\sigma_{r}}$ are the singular values of $\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu$ . ∎

$\operatorname{\text{tr}}(H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)=\left\|H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}_{F}$ .

Let $\bar{U}=\operatorname*{arg\,min}_{UU^{\top}=U^{\top}U=I}\left\|Z-Z^{\star}U\right\|^{2}_{F}=\operatorname*{arg\,max}_{UU^{\top}=U^{\top}U=I}\langle U,{Z^{\star}}^{\top}Z\rangle$ . Note that $\langle A,B\rangle\leq\left\|A\right\|_{*}\left\|B\right\|$ for any matrices $A,B$ that are of the same size. The equality holds when $B=U_{A}V_{A}^{\top}$ where $A=U_{A}\Sigma_{A}V_{A}^{\top}$ is the SVD of $A$ . Hence, $\bar{U}=\widetilde{U}\widetilde{V}^{\top}$ where $\widetilde{U}\widetilde{S}\widetilde{V}^{\top}$ is the SVD of ${Z^{\star}}^{\top}Z$ ; $\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=Z^{\star}\bar{U}$ . Therefore, $Z^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=Z^{\top}Z^{\star}\bar{U}=\widetilde{V}\widetilde{S}\widetilde{V}^{\top}$ is symmetric and positive semidefinite. Thus, $H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=Z^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu-{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu$ is also symmetric. This implies that $\operatorname{\text{tr}}(H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)=\left\|H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}_{F}$ . ∎

Appendix C Linear Convergence

Let $H^{k}=Z^{k}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{k}$ . Then we have that

where we use the definition of $RC(\varepsilon,\alpha,\beta)$ in the third line, $\left\|Z^{\star}\right\|^{2}_{F}=\left\|X^{\star}\right\|_{*}=\sum_{s=1}^{r}\sigma_{s}$ in the third to last line and $0<\mu<\min\left\{\alpha/2,2/\beta\right\}$ in the second to last line. Therefore,

Appendix D Regularity Condition

As mentioned before, Nesterov [16, Theorem 2.1.11] shows that the gradient scheme converges linearly under a condition similar to the regularity condition, which is satisfied if the function is strongly convex and has a Lipschitz continuous gradient (strongly smooth). In order to prove Theorem 4, we show that with high probability the function $f$ satisfies the local curvature condition, which is analogous to strong convexity, and the local smoothness condition, which is analogous to strong smoothness.

There exists a constant $C_{1}$ such that for any $Z$ satisfying $d(Z,Z^{\star})\leq\sqrt{\frac{3}{16}\sigma_{r}}$ ,

There exist constants $C_{2},C_{3}$ such that for any $Z$ satisfying $d(Z,Z^{\star})\leq\sqrt{\frac{3}{16}\sigma_{r}}$ ,

where we use Cauchy-Schwarz inequality in the 2nd line, the inequality $(a-b)^{2}\geq\frac{a^{2}}{2}-b^{2}$ in the 5th line, Lemma 5 and 6 in the 7th line, and the fact that $\left\|HH^{\top}\right\|_{F}\leq\left\|H\right\|^{2}_{F}$ in the 8th line. Since $\left\|H\right\|_{F}\leq\sqrt{\frac{3}{16}\sigma_{r}}$ and $\delta\leq\frac{1}{16}\sigma_{r}$ , we have

D.2 Proof of the Local Smoothness Condition

Since $(a+b+c+d)^{2}\leq 4(a^{2}+b^{2}+c^{2}+d^{2})$ , we have

The first term in the righthand side can be upper bounded as

where we use the Cauchy-Schwarz inequality in the first and second line, Lemma 5 and $\left\|HH^{\top}\right\|_{F}\leq\left\|H\right\|^{2}_{F}$ in the third line and Corollary 1 in the last line.

The other three terms are bounded similarly. For the second term, we have

where we use Lemma 6 and 1. The third term is bounded as

Putting these inequalities together, we have

Since $\left\|H\right\|_{F}\leq\sqrt{\frac{3}{16}\sigma_{r}}$ and $\delta\leq\frac{1}{16}\sigma_{r}$ , we have

D.3 Proof of the Regularity Condition

Now we combine the curvature and the smoothness conditions. For any $\gamma\in\left(0,\frac{\sigma_{1}}{\sigma_{r}}\right)$ , it holds that

Combining equation (11) and (12), we obtain

where we use $\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}_{F}=\left\|Z^{\star}\right\|^{2}_{F}=\left\|X^{\star}\right\|_{*}\geq\sigma_{r}$ . Thus we have

for $\alpha\geq 24$ and $\beta\geq\frac{\sigma_{1}}{\sigma_{r}}\cdot 513n$ .

Appendix E Initialization

Let $\lambda^{\prime}_{1}\geq\cdots\geq\lambda^{\prime}_{n}$ be the eigenvalues of $M$ . By Weyl’s theorem, we have

Since $\delta<\sigma_{r}$ , it is easy to see $\lambda^{\prime}_{1}\geq\cdots\geq\lambda^{\prime}_{r}>\delta$ and $|\lambda^{\prime}_{s}|\leq\delta,s=r+1,\ldots,n$ . Hence, $\lambda_{s}=\lambda^{\prime}_{s}$ , $s\in[r]$ , and $Z^{0}{Z^{0}}^{\top}$ is the best rank $r$ approximation of $\frac{1}{2}M$ . Therefore,

where we used $\left\|A\right\|_{F}\leq\sqrt{\operatorname{\text{rank}}(A)}\left\|A\right\|$ in first line, the fact $\left\|Z^{0}{Z^{0}}^{\top}-\frac{1}{2}M\right\|=\frac{1}{2}|\lambda_{r+1}|\leq\frac{1}{2}\delta$ and inequality (13) in the last line.

Let $H=Z^{0}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}$ . We want to bound $d(Z^{0},Z^{\star})^{2}=\left\|H\right\|^{2}_{F}$ . According to the discussion in Lemma 7, $H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}$ is symmetric and ${Z^{0}}^{\top}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}}$ is positive semidefinite.

The following step closely follows . It holds that

where in the fourth line we used the property that the trace is invariant under cyclic permutations and $H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}={\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}}^{\top}H$ .

Since ${Z^{0}}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}$ is positive semidefinite, $\operatorname{\text{tr}}((H^{\top}H)({Z^{0}}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}))$ is nonnegative. Hence,

If $\delta\leq\frac{\sigma_{r}}{4\sqrt{r}}$ , then

Appendix F Sample Complexity

In this section, we verify that our assumptions hold with high probability if $m\geq cn\log n$ , where $c$ is a constant that depends on $\delta$ , $r$ , and $\kappa$ . Our proof relies on the following concentration inequality.

We first give a technical lemma that we will use later.

Let $A=(a_{ij})$ be a random matrix drawn from GOE. Let $S=a_{11}A-2e_{1}e_{1}^{\top}$ . There exist absolute constants $C$ , $\rho$ such that with probability at least $1-Ce^{-\rho n}$ , we have

Let $\widetilde{A}=A-a_{11}e_{1}e_{1}^{\top}$ . $S=a_{11}\widetilde{A}+(a_{11}^{2}-2)e_{1}e_{1}^{\top}$ . Note that $a_{11}$ and $\widetilde{A}$ are independent, hence $\left\|S\right\|\leq|a_{11}|\|\widetilde{A}\|+|a^{2}_{11}-2|$ . Besides, since $a_{11}\sim\mathcal{N}(0,2)$ , we can see that $a^{2}_{11}/2$ is $\chi^{2}$ distributed.

First we bound the operator norm of $\widetilde{A}$ . We rewrite $\|\widetilde{A}\|$ as

where $D=\widetilde{A}+de_{1}e_{1}^{\top}$ , $d\sim\mathcal{N}(0,2)$ . As $D$ is GOE distributed, by Lemma 4,

where $C^{\prime}$ and $\rho^{\prime}$ are absolute constants.

Using the Gaussian tail inequality, we have

Combining inequalities (14) and (15), we have

where the last inequality follows from the union bound.

Next we bound the deviation of the $\chi^{2}$ term. By the corollary of Lemma 1 in Laurent and Massart , we have

Finally, the statement is obtained by choosing proper $C$ , $\rho$ , and using $\sqrt{n}\leq n$ . ∎

It is equivalent to show that for any unit vector $u$ , with high probability,

where in the second line we use unitary invariance of the operator norm, and in the last line we denote $P^{\top}A_{i}P$ by $\widetilde{A}_{i}$ . Since the GOE is invariant under orthogonal conjugation, $\widetilde{A}_{i}$ and $A_{i}$ are identically distributed. Hence, it suffices to prove the claim when $u=e_{1}$ , i.e.

where $a^{(i)}_{11}$ is the $(1,1)$ entry of $A_{i}$ and $\delta_{0}=\frac{\delta}{r\sigma_{1}}$ .

To show this, we apply Theorem 8, where $S_{i}=a^{(i)}_{11}A_{i}-2e_{1}e^{\top}_{1}$ . This requires that the operator norm of $S_{i}$ is bounded, for each $i$ . We address this by noticing that with high probability $\left\|S_{i}\right\|\leq 18n$ , $\forall i$ . To be precise, by Lemma 8 there exist constants $C,\rho$ , such that

Taking the union bound over all the $S_{i}$ s leads to

By Theorem 8, if $m\geq\frac{42}{\min(\delta^{2}_{0},\delta_{0})}\cdot n\log n$ , then

Combining inequalities (18) and (19), we conclude that

F.2 Proof of Theorem 7

The formulation of the second order partial derivatives and their expectations is given in Appendix B.

It is easy to see that for any $\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\in\mathcal{S}$ , $\max_{s\in[r]}\left\|\bar{z}_{r}\right\|\leq\sqrt{\sigma_{1}}$ . Thus it is sufficient to prove that for any two unitary vector $u$ and $y$ with high probability it holds that

We can decompose $y$ as $y=\beta u+\beta_{\perp}u_{\perp}$ for a certain unit vector $u_{\perp}$ that is orthogonal to $u$ , where $\beta^{2}+\beta^{2}_{\perp}=1$ . Let $\delta_{0}=\dfrac{\delta}{2r\sigma_{1}}$ . It suffices to prove the following two claims.

For any unitary vector $u$ , with high probability

For any two orthogonal unit vectors $u$ and $u_{\perp}$ , with high probability

where $\widetilde{A}_{i}$ and $A_{i}$ have the same distribution. Hence we only need to prove the case where $u=e_{1}$ :

where $v^{(i)}=A_{i}e_{1}$ is the first column of $A_{i}$ .

Let $S,v,A$ denote $S_{1}$ , $v^{(1)}$ , and $A^{(1)}$ respectively. It is easy to see that

where $w=\sum_{k=2}^{n}a^{2}_{1k}$ . As $a_{11}\sim\mathcal{N}(0,2)$ , $a_{1k}\sim~{}\mathcal{N}(0,1)$ for $k\neq 1$ , we can see that $a^{2}_{11}/2$ and $w$ are $\chi^{2}$ distributed with degrees of freedom $1$ and $n-1$ , respectively. Using the $\chi^{2}$ tail bound, we have

As $v_{1}\sim\mathcal{N}(0,2)$ , $v_{j}\sim\mathcal{N}(0,1)$ for $j\neq 1$ ,

If $m\geq(128/\min(\delta^{2}_{0},\delta_{0}))n\log n$ , then by applying Theorem 8 we can see

Combining inequalities (21) and (20) leads to

Proof of (ii)

We only need to prove the case where $u=e_{1}$ and $u_{\perp}=e_{2}$ due to the same reason above. That is,

where $v^{(i)}$ and $q^{(i)}$ are the first and second columns of $A_{i}$ .

As before, let $S_{i}=2(v^{(i)}{q^{(i)}}^{\top}-e_{2}e_{1}^{\top})$ and let $S,v,q,A$ denote $S_{1}$ , $v^{(1)}$ , $q^{(1)}$ and $A^{(1)}$ respectively. From the proof of (i), we can see that with probability at least $1-4e^{-n}$ both $\left\|v\right\|$ and $\left\|q\right\|$ are no larger than $\sqrt{13n+1}$ . Since $\left\|S\right\|\leq 2\left\|v\right\|\left\|q\right\|+2$ , we have

and $\nu^{2}=8(n+1)m$ . If $m\geq\frac{78}{\min(\delta_{0}^{2},\delta_{0})}n\log n$ , then by applying Theorem 8 we have

Appendix G ADMM for Nuclear Norm Minimization

We reformulate the nuclear norm minimizing problem as

where $\lambda>0$ is the regularization parameter. $\lambda\rightarrow 0$ will enforce the minimizer $X^{*}_{\text{nuc}}$ satisfying the affine constraint $\mathcal{A}(X^{*}_{\text{nuc}})=b$ .

We apply ADMM to the dual problem of (23):

where we introduce an auxiliary variable $V$ to make this problem equality constrained.

The augmented Lagrangian of problem (24) can be written as

where $X$ is the multiplier, $\eta$ is the penalty parameter, and $\mathbf{1}_{\left\|\cdot\right\|\leq 1}$ is the indicator function of the unit spectral norm ball i.e. $\mathbf{1}_{\left\|\cdot\right\|\leq 1}(V)$ equals if $\left\|V\right\|\leq 1$ and $+\infty$ otherwise.

Let $\text{vec}(\cdot)$ denote the vectorization of a matrix, whose inverse mapping is denoted by $\text{mat}(\cdot)$ . We can rewrite the transformations as $\mathcal{A}(X)=\boldsymbol{A}\text{vec}(X)$ and $\mathcal{A}^{\top}(\alpha)=\text{mat}(\boldsymbol{A}^{\top}\alpha)=\sum_{i=1}^{m}\alpha_{i}A_{i}$ , where $\boldsymbol{A}$ is a $m\times n^{2}$ matrix whose $i$ th row is $\text{vec}(A_{i})^{\top}$ .

The ADMM starts from initialization $(\alpha^{0},V^{0},X^{0})$ and updates the three variables alternately. The updates can be computed in close forms:

where $\text{proj}(\cdot)$ is the projection onto the unit spectral norm ball. Let $X=U\Sigma V^{\top}$ be the singular value decomposition of $X$ ,

In fact, the update of $V$ can be combined with other steps without being computed explicitly. One only has to iterate the following two steps:

where $\text{prox}_{\eta}(\cdot)$ is the singular value soft-thresholding operator defined as

The sequence of multipliers $\left\{X^{k}\right\}$ converges to the primal solution of (23). To speed up the update of $\alpha$ , the Cholesky decomposition of $\lambda I+\eta\boldsymbol{A}\boldsymbol{A}^{\top}$ is precomputed in our implementation.