A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements

Qinqing Zheng, John Lafferty

Introduction

Semidefinite programming has become a key optimization tool in many areas of applied mathematics, signal processing and machine learning. SDPs often arise naturally from the problem structure, or are derived as surrogate optimizations that are relaxations of difficult combinatorial problems . In spite of the importance of SDPs in principle—promising efficient algorithms with polynomial runtime guarantees—it is widely recognized that current optimization algorithms based on interior point methods can handle only relatively small problems. Thus, a considerable gap exists between the theory and applicability of SDP formulations. Scalable algorithms for semidefinite programming, and closely related families of nonconvex programs more generally, are greatly needed.

A parallel development is the surprising effectiveness of simple classical procedures such as gradient descent for large scale problems, as explored in the recent machine learning literature. In many areas of machine learning and signal processing such as classification, deep learning, and phase retrieval, gradient descent methods, in particular first order stochastic optimization, have led to remarkably efficient algorithms that can attack very large scale problems . In this paper we build on this work to develop first-order algorithms for solving the rank minimization problem under random measurements and a closely related family of semidefinite programs. Our algorithms are efficient and scalable, and we prove that they attain linear convergence to the global optimum under natural assumptions.

This problem is a direct generalization of compressed sensing, and subsumes many machine learning problems such as image compression, low rank matrix completion and low-dimensional metric embedding . While the problem is natural and has many applications, the optimization is nonconvex and challenging to solve. Without conditions on the transformation A\mathcal{A} or the minimum rank solution XX^{\star}, it is generally NP hard .

In addition to the wide applicability of affine rank minimization, the problem is also closely connected to a class of semidefinite programs. In Section 2, we show that the minimizer of a particular class of SDP can be obtained by a linear transformation of XX^{\star}. Thus, efficient algorithms for problem (2) can be applied in this setting as well.

While this is a nonconvex function, we take motivation from recent work for phase retrieval by Candès et al. , and develop a gradient descent algorithm for optimizing f(Z)f(Z), using a carefully constructed initialization and step size. Our main contributions concerning this algorithm are as follows.

We prove that with O(r3nlogn)O(r^{3}n\log n) constraints our gradient descent scheme can exactly recover XX^{\star} with high probability. Empirical experiments show that this bound may potentially be improved to O(rnlogn)O(rn\log n).

We show that our method converges linearly, and has lower computational cost compared with previous methods.

We carry out a detailed comparison of rank minimization algorithms, and demonstrate that when the measurement matrices AiA_{i} are sparse, our gradient method significantly outperforms alternative approaches.

In Section 3 we briefly review related work. In Section 4 we discuss the gradient scheme in detail. Our main analytical results are presented in Section 5, with detailed proofs contained in the supplementary material. Our experimental results are presented in Section 6, and we conclude with a brief discussion of future work in Section 7.

Semidefinite Programming and Rank Minimization

Before reviewing related work and presenting our algorithm, we pause to explain the connection between semidefinite programming and rank minimization. This connection enables our scalable gradient descent algorithm to be applied and analyzed for certain classes of SDPs.

Consider a standard form semidefinite program

where Ai=L1A~iL1A_{i}=L^{-1}\widetilde{A}_{i}{L^{-1}}^{\top}. In particular, minimizers X~\widetilde{X}^{*} of (3) are obtained from minimizers XX^{*} of (4) via the transformation

Since XX is positive semidefinite, tr(X)\operatorname{\text{tr}}(X) is equal to X\left\|X\right\|_{*}. Hence, problem (4) is the nuclear norm relaxation of problem (2). Next, we characterize the specific cases where X=XX^{*}=X^{\star}, so that the SDP and rank minimization solutions coincide. The following result is from Recht et al. .

holds for any matrix XX of rank at most kk. Suppose that there exists a rank rr matrix XX^{\star} such that A(X)=b\mathcal{A}(X^{\star})=b. If δ2r<1\delta_{2r}<1, then XX^{\star} is the only matrix of rank at most rr satisfying A(X)=b\mathcal{A}(X)=b. Furthermore, if δ5r<1/10\delta_{5r}<1/10, then XX^{\star} can be attained by minimizing X\left\|X\right\|_{*} over the affine subset.

In other words, since δ2rδ5r\delta_{2r}\leq\delta_{5r}, if δ5r<1/10\delta_{5r}<1/10 holds for the transformation A\mathcal{A} and one finds a matrix XX of rank rr satisfying the affine constraint, then XX must be positive semidefinite. Hence, one can ignore the semidefinite constraint X0X\succeq 0 when solving the rank minimization (2). The resulting problem then can be exactly solved by nuclear norm relaxation. Since the minimum rank solution is positive semidefinite, it then coincides with the solution of the SDP (4), which is a constrained nuclear norm optimization.

The observation that one can ignore the semidefinite constraint justifies our experimental comparison with methods such as nuclear norm relaxation, SVP, and AltMinSense, described in the following section.

Related Work

Burer and Monteiro proposed a general approach for solving semidefinite programs using factored, nonconvex optimization, giving mostly experimental support for the convergence of the algorithms. The first nontrivial guarantee for solving affine rank minimization problem is given by Recht et al. , based on replacing the rank function by the convex surrogate nuclear norm, as already mentioned in the previous section. While this is a convex problem, solving it in practice is nontrivial, and a variety of methods have been developed for efficient nuclear norm minimization. The most popular algorithms are proximal methods that perform singular value thresholding at every iteration. While effective for small problem instances, the computational expense of the SVD prevents the method from being useful for large scale problems.

Recently, Jain et al. proposed a projected gradient descent algorithm SVP (Singular Value Projection) that solves

As described above, considerable progress has been made on algorithms for rank minimization and certain semidefinite programming problems. Yet truly efficient, scalable and provably convergent algorithms have not yet been obtained. In the specific setting that XX^{\star} is positive semidefinite, our algorithm exploits this structure to achieve these goals. We note that recent and independent work of Tu et al. proposes a hybrid algorithm called Procrustes Flow (PF), which uses a few iterations of SVP as initialization, and then applies gradient descent.

A Gradient Descent Algorithm for Rank Minimization

The authors establish the convergence of WF to the global optimum—given sufficient measurements, the iterates of WF converge linearly to xx up to a global phase, with high probability.

If zz and the aia_{i}s are real-valued, the function fWF(z)f_{\texttt{WF}}(z) can be expressed as

which is a special case of f(Z)f(Z) where Ai=aiaiA_{i}=a_{i}a_{i}^{\top} and each of ZZ and XX^{\star} are rank one. See Figure 1a for an illustration; Figure 1b shows the convergence rate of our method. Our methods and results are thus generalizations of Wirtinger flow for phase retrieval.

Before turning to the presentation of our technical results in the following section, we present some intuition and remarks about how and why this algorithm works. For simplicity, let us assume that the rank is specified correctly.

Initialization is of course crucial in nonconvex optimization, as many local minima may be present. To obtain a sufficiently accurate initialization, we use a spectral method, similar to those used in . The starting point is the observation that a linear combination of the constraint values and matrices yields an unbiased estimate of the solution.

Certain key properties of f(Z)f(Z) will be seen to yield a linear rate of convergence. In the analysis of convex functions, Nesterov shows that for unconstrained optimization, the gradient descent scheme with sufficiently small step size will converge linearly to the optimum if the objective function is strongly convex and has a Lipschitz continuous gradient. However, these two properties are global and do not hold for our objective function f(Z)f(Z). Nevertheless, we expect that similar conditions hold for the local area near ZZ^{\star}. If so, then if we start close enough to ZZ^{\star}, we can achieve the global optimum.

In our subsequent analysis, we establish the convergence of Algorithm 1 with a constant step size of the form μ/ZF2\mu/\left\|Z^{\star}\right\|^{2}_{F}, where μ\mu is a small constant. Since ZF\left\|Z^{\star}\right\|_{F} is unknown, we replace it by Z0F\left\|Z^{0}\right\|_{F}.

Convergence Analysis

In this section we present our main result analyzing the gradient descent algorithm, and give a sketch of the proof. To begin, note that the symmetric decomposition of XX^{\star} is not unique, since X=(ZU)(ZU)X^{\star}=(Z^{\star}U)(Z^{\star}U)^{\top} for any r×rr\times r orthonormal matrix UU. Thus, the solution set is

Note that Z~F2=X\|\widetilde{Z}\|^{2}_{F}=\left\|X^{\star}\right\|_{*} for any Z~S\widetilde{Z}\in\mathcal{S}. We define the distance to the optimal solution in terms of this set.

Define the distance between ZZ and ZZ^{\star} as

Our main result for exact recovery is stated below, assuming that the rank is correctly specified. Since the true rank is typically unknown in practice, one can start from a very low rank and gradually increase it.

Let the condition number κ=σ1/σr\kappa=\sigma_{1}/\sigma_{r} denote the ratio of the largest to the smallest nonzero eigenvalues of XX^{\star}. There exists a universal constant c0c_{0} such that if mc0κ2r3nlognm\geq c_{0}\kappa^{2}r^{3}n\log n, with high probability the initialization Z0Z^{0} satisfies

Moreover, there exists a universal constant c1c_{1} such that when using constant step size μ/ZF2\mu/\left\|Z^{\star}\right\|^{2}_{F} with μc1κn\mu\leq\dfrac{c_{1}}{\kappa n} and initial value Z0Z^{0} obeying (6), the kkth step of Algorithm 1 satisfies

We now outline the proof, giving full details in the supplementary material. The proof has four main steps. The first step is to give a regularity condition under which the algorithm converges linearly if we start close enough to ZZ^{\star}. This provides a local regularity property that is similar to the Nesterov criteria that the objective function is strongly convex and has a Lipschitz continuous gradient.

Let \mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F} denote the matrix closest to ZZ in the solution set. We say that ff satisfies the regularity condition RC(ε,α,β)RC(\varepsilon,\alpha,\beta) if there exist constants α\alpha, β\beta such that for any ZZ satisfying d(Z,Z)εd(Z,Z^{\star})\leq\varepsilon, we have

Using this regularity condition, we show that the iterative step of the algorithm moves closer to the optimum, if the current iterate is sufficiently close.

Consider the update Zk+1=ZkμZF2f(Zk)Z^{k+1}=Z^{k}-\dfrac{\mu}{\left\|Z^{\star}\right\|^{2}_{F}}\nabla f(Z^{k}). If ff satisfies RC(ε,α,β)RC(\varepsilon,\alpha,\beta), d(Zk,Z)εd(Z^{k},Z^{\star})\leq\varepsilon, and 0<μ<min(α/2,2/β)0<\mu<\min(\alpha/2,2/\beta), then

In the next step of the proof, we condition on two events that will be shown to hold with high probability using concentration results. Let δ\delta denote a small value to be specified later.

Here the expectations are with respect to the random measurement matrices. Under these assumptions, we can show that the objective satisfies the regularity condition with high probability.

Suppose that A1 and A2 hold. If δ116σr\delta\leq\frac{1}{16}\sigma_{r}, then ff satisfies the regularity condition RC(316σr,24,513κn)RC(\sqrt{\tfrac{3}{16}\sigma_{r}},24,513\kappa n) with probability at least 1mCeρn1-mCe^{-\rho n}, where CC, ρ\rho are universal constants.

Next we show that under A1, a good initialization can be found.

Suppose that A1 holds. Let {vs,λs}s=1r\left\{v_{s},\lambda_{s}\right\}_{s=1}^{r} be the top rr eigenpairs of M=1mi=1mbiAiM=\frac{1}{m}\sum\limits_{i=1}^{m}b_{i}A_{i} such that λ1λr|\lambda_{1}|\geq\cdots\geq|\lambda_{r}|. Let Z0=[z1,,zr]Z^{0}=[z_{1},\ldots,z_{r}] where zs=λs2vsz_{s}=\sqrt{\tfrac{|\lambda_{s}|}{2}}\cdot v_{s}, s[r]s\in[r]. If δσr4r\delta\leq\tfrac{\sigma_{r}}{4\sqrt{r}}, then

Finally, we show that conditioning on A1 and A2 is valid since these events have high probability as long as mm is sufficiently large.

holds with probability at least 1mCeρn2n21-mCe^{-\rho n}-\frac{2}{n^{2}}, where CC and ρ\rho are universal constants.

with probability at least 16men4n21-6me^{-n}-\frac{4}{n^{2}}.

Note that since we need δmin(116,14r)σr\delta\leq\min\left(\frac{1}{16},\frac{1}{4\sqrt{r}}\right)\sigma_{r}, we have δrσ11\frac{\delta}{r\sigma_{1}}\leq 1, and the number of measurements required by our algorithm scales as O(r3κ2nlogn)O(r^{3}\kappa^{2}n\log n), while only O(r2κ2nlogn)O(r^{2}\kappa^{2}n\log n) samples are required by the regularity condition. We conjecture this bound could be further improved to be O(rnlogn)O(rn\log n); this is supported by the experimental results presented below.

Recently, Tu et al. establish a tighter O(r2κ2n)O(r^{2}\kappa^{2}n) bound overall. Specifically, when only one single SVP step is used in preprocessing, the initialization of PF is also the spectral decomposition of 12M\frac{1}{2}M. The authors show that O(r2κ2n)O(r^{2}\kappa^{2}n) measurements are sufficient for the initial solution to satisfy d(Z0,Z)O(σr)d(Z^{0},Z^{\star})\leq O(\sqrt{\sigma_{r}}) with high probability, and demonstrate an O(rn)O(rn) sample complexity for the regularity condition.

Experiments

In this section we report the results of experiments on synthetic datasets. We compare our gradient descent algorithm with nuclear norm relaxation, SVP and AltMinSense for which we drop the positive semidefiniteness constraint, as justified by the observation in Section 2. We use ADMM for the nuclear norm minimization, based on the algorithm for the mixture approach in Tomioka et al. ; see Appendix G. For simplicity, we assume that AltMinSense, SVP and the gradient scheme know the true rank. Krylov subspace techniques such as the Lanczos method could be used compute the partial eigendecomposition; we use the randomized algorithm of Halko et al. to compute the low rank SVD. All methods are implemented in MATLAB and the experiments were run on a MacBook Pro with a 2.5GHz Intel Core i7 processor and 16 GB memory.

It is instructive to compare the per-iteration cost of the different approaches; see Table 1. Suppose that the density (fraction of nonzero entries) of each AiA_{i} is ρ\rho. For AltMinSense, the cost of solving the least squares problem is O(mn2r2+n3r3+mn2rρ)O(mn^{2}r^{2}+n^{3}r^{3}+mn^{2}r\rho). The other three methods have O(mn2ρ)O(mn^{2}\rho) cost to compute the affine transformation. For the nuclear norm approach, the O(n3)O(n^{3}) cost is from the SVD and the O(m2)O(m^{2}) cost is due to the update of the dual variables. The gradient scheme requires 2n2r2n^{2}r operations to compute ZkZkZ^{k}{Z^{k}}^{\top} and to multiply ZkZ^{k} by n×nn\times n matrix to obtain the gradient. SVP needs O(n2r)O(n^{2}r) operations to compute the top rr singular vectors. However, in practice this partial SVD is more expensive than the 2n2r2n^{2}r cost required for the matrix multiplies in the gradient scheme.

Clearly, AltMinSense is the least efficient. For the other approaches, in the dense case (ρ\rho large), the affine transformation dominates the computation. Our method removes the overhead caused by the SVD. In the sparse case (ρ\rho small), the other parts dominate and our method enjoys a low cost.

2 Runtime Comparison

We conduct experiments for both dense and sparse measurement matrices. AltMinSense is indeed slow, so we do not include it here.

In the first scenario, we randomly generate a 400×400400\times 400 rank-22 matrix X=xx+yyX^{\star}=xx^{\top}+yy^{\top} where x,yN(0,I)x,y\sim\mathcal{N}(0,I). We also generate m=6nm=6n matrices A1,,AmA_{1},\ldots,A_{m} from the GOE, and then take b=A(X)b=\mathcal{A}(X^{\star}). We report the relative error measured in the Frobenius norm defined as X^XF/XF\|\widehat{X}-X^{\star}\|_{F}/\|X^{\star}\|_{F}. For the nuclear norm approach, we set the regularization parameter to λ=105\lambda=10^{-5}. We test three values η=10,100,200\eta=10,100,200 for the penalty parameter and select η=100\eta=100 as it leads to the fastest convergence. Similarly, for SVP we evaluate the three values 5×105,104,2×1045\times 10^{-5},10^{-4},2\times 10^{-4} for the step size, and select 10410^{-4} as the largest for which SVP converges. For our approach, we test the three values 0.6,0.8,1.00.6,0.8,1.0 for μ\mu and select 0.80.8 in the same way.

where we use ρ=0.001\rho=0.001. For all the methods we use the same strategies as before to select parameters. For the nuclear norm approach, we try three values η=10,100,200\eta=10,100,200 and select η=100\eta=100. For SVP, we test the three values 5×103,2×103,1035\times 10^{-3},2\times 10^{-3},10^{-3} for the step size and select 10310^{-3}. For the gradient algorithm, we check the three values 0.8,1,1.50.8,1,1.5 for μ\mu and choose 11.

The results are shown in Figures 2a and 2b. In the dense case, our method is faster than the nuclear norm approach and slightly outperforms SVP. In the sparse case, it is significantly faster than the other approaches.

3 Sample Complexity

We consider cases where n=60n=60 or 100100 and XX^{\star} is of rank one or two. The results are shown in Figure 2c. For SVP and our approach, the phase transitions happen around m=1.5nm=1.5n when XX^{\star} is rank-11 and m=2.5nm=2.5n when XX^{\star} is rank-22. This scaling is close to the number of degrees of freedom in each case; this confirms that the sample complexity scales linearly with the rank rr. The phase transition for the nuclear norm approach occurs later. The results suggest that the sample complexity of our method should also scale as O(rnlogn)O(rn\log n) as for SVP and the nuclear norm approach .

Conclusion

We connect a special case of affine rank minimization to a class of semidefinite programs with random constraints. Building on a recently proposed first-order algorithm for phase retrieval , we develop a gradient descent procedure for rank minimization and establish convergence to the optimal solution with O(r3nlogn)O(r^{3}n\log n) measurements. We conjecture that O(rnlogn)O(rn\log n) measurements are sufficient for the method to converge, and that the conditions on the sampling matrices AiA_{i} can be significantly weakened. More broadly, the technique used in this paper—factoring the semidefinite matrix variable, recasting the convex optimization as a nonconvex optimization, and applying first-order algorithms—first proposed by Burer and Monteiro , may be effective for a much wider class of SDPs, and deserves further study.

Acknowledgements

Research supported in part by NSF grant IIS-1116730 and ONR grant N00014-12-1-0762. The authors thank Afonso Bandeira, Ryota Tomioka and the authors of Tu et al. for helpful comments on this work.

References

Appendix A Proof of Lemma 1

Appendix B Ingredients

We first present some technical lemmas that will be needed later. Recall Definition 2 that for any ZZ, \mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=\operatorname*{arg\,min}_{\widetilde{Z}\in\mathcal{S}}\big{\|}Z-\widetilde{Z}\big{\|}_{F}. Let H=ZZ ⁣H=Z-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu. The ssth column of ZZ, Z ⁣\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu, ZZ^{\star}, HH are denoted by zsz_{s}, zˉs\bar{z}_{s}, zsz^{\star}_{s}, hsh_{s} respectively. We shall use the following formulas for the gradient and second order partial derivatives:

The next ingredient we need is the expectation of the second order partial derivatives with respect to the random measurement matrices.

The expectation of (i,j)(i,j) entry of AxyAAxy^{\top}A is

The case where ksk\neq s is a direct result of Lemma 2. For the other case, let A=(aij)A=(a_{ij}) be a GOE distributed random matrix. It follows from Lemma 1 that

Substituting this back into the above equation, we obtain the lemma. ∎

We next recall a concentration result for the operator (spectral) norm of the random measurement matrices.

(Ledoux and Rider [14, Theorem 1]) There exists two absolute constants CC and ρ=18C\rho=\frac{1}{\sqrt{8}C} such that with probability at least 1Ceρn1-Ce^{-\rho n},

A tighter upper bound is actually given in the Tracy-Widow law: w.h.p. Ai=O(2n+n1/6)\left\|A_{i}\right\|=O(2\sqrt{n}+n^{1/6}).

With probability at least 1mCeρn1-mCe^{-\rho n}, the average of the squared operator norm of the random measurement matrices is upper bounded by 9n9n.

The following two technical lemmas are important tools for us. Define the set

Suppose that A1 holds: 1mi=1m(uAiu)Ai2uuδr\left\|\frac{1}{m}\sum_{i=1}^{m}(u^{\top}A_{i}u)A_{i}-2uu^{\top}\right\|\leq\frac{\delta}{r}, for any uu such that uσ1\left\|u\right\|\leq\sqrt{\sigma_{1}}. If δ116σr\delta\leq\frac{1}{16}\sigma_{r}, then for any ZE(316σr)Z\in E\left(\sqrt{\frac{3}{16}\sigma_{r}}\right) it holds that

Let hsh_{s} be the ssth column of HH. Since maxs[r]hs2HF316σrσ1\max_{s\in[r]}\left\|h_{s}\right\|_{2}\leq\left\|H\right\|_{F}\leq\sqrt{\frac{3}{16}\sigma_{r}}\leq\sqrt{\sigma_{1}}, it follows from the assumption of the lemma that

where we replace s=1rhsAihs\sum\limits_{s=1}^{r}h^{\top}_{s}A_{i}h_{s} by tr(HAiH)\operatorname{\text{tr}}(H^{\top}A_{i}H) and s=1rhshs\sum_{s=1}^{r}h_{s}h_{s}^{\top} by HHHH^{\top}. Taking the sum of the above inequalities, we obtain

Note that tr(HHHH)=HHF2\operatorname{\text{tr}}(H^{\top}HH^{\top}H)=\left\|HH^{\top}\right\|^{2}_{F}. Therefore,

Suppose that A2 holds: for any Z~\widetilde{Z} such that Z~Z~=X\widetilde{Z}\widetilde{Z}^{\top}=X^{\star} we have

Our goal is to bound 1mi=1mtr(HAiZ ⁣)2\frac{1}{m}\sum\limits_{i=1}^{m}\operatorname{\text{tr}}(H^{\top}A_{i}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)^{2}. This can be expanded as

We first bound the sum of the quadratic terms. For any s[r]s\in[r], we have

It follows from assumption (7) that for any s[r]s\in[r],

Taking the sum of above inequalities, we obtain

Similarly, we bound the sum of the cross terms. For any fixed s,ks,k such that sks\neq k, we have

Note that skhszˉkzˉshk=tr(HZ ⁣HZ ⁣)\sum_{sk}h^{\top}_{s}\bar{z}_{k}\bar{z}_{s}^{\top}h_{k}=\operatorname{\text{tr}}(H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu), skzˉszˉkhshk=Z ⁣HF2\sum_{sk}\bar{z}_{s}^{\top}\bar{z}_{k}h_{s}^{\top}h_{k}=\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\right\|^{2}_{F} and

By Lemma 7, tr(HZ ⁣HZ ⁣)=HZ ⁣F2\operatorname{\text{tr}}(H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)=\left\|H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}_{F}. Replacing those terms in equation (10) gives us

Finally, we obtain the claim by noticing that

where σ1=σmax(Z ⁣)σmin(Z ⁣)=σr\sqrt{\sigma_{1}}=\sigma_{\max}(\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)\geq\cdots\geq\sigma_{\min}(\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)=\sqrt{\sigma_{r}} are the singular values of Z ⁣\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu. ∎

tr(HZ ⁣HZ ⁣)=HZ ⁣F2\operatorname{\text{tr}}(H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)=\left\|H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}_{F}.

Let Uˉ=arg minUU=UU=IZZUF2=arg maxUU=UU=IU,ZZ\bar{U}=\operatorname*{arg\,min}_{UU^{\top}=U^{\top}U=I}\left\|Z-Z^{\star}U\right\|^{2}_{F}=\operatorname*{arg\,max}_{UU^{\top}=U^{\top}U=I}\langle U,{Z^{\star}}^{\top}Z\rangle. Note that A,BAB\langle A,B\rangle\leq\left\|A\right\|_{*}\left\|B\right\| for any matrices A,BA,B that are of the same size. The equality holds when B=UAVAB=U_{A}V_{A}^{\top} where A=UAΣAVAA=U_{A}\Sigma_{A}V_{A}^{\top} is the SVD of AA. Hence, Uˉ=U~V~\bar{U}=\widetilde{U}\widetilde{V}^{\top} where U~S~V~\widetilde{U}\widetilde{S}\widetilde{V}^{\top} is the SVD of ZZ{Z^{\star}}^{\top}Z; Z ⁣=ZUˉ\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=Z^{\star}\bar{U}. Therefore, ZZ ⁣=ZZUˉ=V~S~V~Z^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=Z^{\top}Z^{\star}\bar{U}=\widetilde{V}\widetilde{S}\widetilde{V}^{\top} is symmetric and positive semidefinite. Thus, HZ ⁣=ZZ ⁣Z ⁣Z ⁣H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu=Z^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu-{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu is also symmetric. This implies that tr(HZ ⁣HZ ⁣)=HZ ⁣F2\operatorname{\text{tr}}(H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5muH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu)=\left\|H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}_{F}. ∎

Appendix C Linear Convergence

Let Hk=ZkZ ⁣kH^{k}=Z^{k}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{k}. Then we have that

where we use the definition of RC(ε,α,β)RC(\varepsilon,\alpha,\beta) in the third line, ZF2=X=s=1rσs\left\|Z^{\star}\right\|^{2}_{F}=\left\|X^{\star}\right\|_{*}=\sum_{s=1}^{r}\sigma_{s} in the third to last line and 0<μ<min{α/2,2/β}0<\mu<\min\left\{\alpha/2,2/\beta\right\} in the second to last line. Therefore,

Appendix D Regularity Condition

As mentioned before, Nesterov [16, Theorem 2.1.11] shows that the gradient scheme converges linearly under a condition similar to the regularity condition, which is satisfied if the function is strongly convex and has a Lipschitz continuous gradient (strongly smooth). In order to prove Theorem 4, we show that with high probability the function ff satisfies the local curvature condition, which is analogous to strong convexity, and the local smoothness condition, which is analogous to strong smoothness.

There exists a constant C1C_{1} such that for any ZZ satisfying d(Z,Z)316σrd(Z,Z^{\star})\leq\sqrt{\frac{3}{16}\sigma_{r}},

There exist constants C2,C3C_{2},C_{3} such that for any ZZ satisfying d(Z,Z)316σrd(Z,Z^{\star})\leq\sqrt{\frac{3}{16}\sigma_{r}},

where we use Cauchy-Schwarz inequality in the 2nd line, the inequality (ab)2a22b2(a-b)^{2}\geq\frac{a^{2}}{2}-b^{2} in the 5th line, Lemma 5 and 6 in the 7th line, and the fact that HHFHF2\left\|HH^{\top}\right\|_{F}\leq\left\|H\right\|^{2}_{F} in the 8th line. Since HF316σr\left\|H\right\|_{F}\leq\sqrt{\frac{3}{16}\sigma_{r}} and δ116σr\delta\leq\frac{1}{16}\sigma_{r}, we have

D.2 Proof of the Local Smoothness Condition

Since (a+b+c+d)24(a2+b2+c2+d2)(a+b+c+d)^{2}\leq 4(a^{2}+b^{2}+c^{2}+d^{2}), we have

The first term in the righthand side can be upper bounded as

where we use the Cauchy-Schwarz inequality in the first and second line, Lemma 5 and HHFHF2\left\|HH^{\top}\right\|_{F}\leq\left\|H\right\|^{2}_{F} in the third line and Corollary 1 in the last line.

The other three terms are bounded similarly. For the second term, we have

where we use Lemma 6 and 1. The third term is bounded as

Putting these inequalities together, we have

Since HF316σr\left\|H\right\|_{F}\leq\sqrt{\frac{3}{16}\sigma_{r}} and δ116σr\delta\leq\frac{1}{16}\sigma_{r}, we have

D.3 Proof of the Regularity Condition

Now we combine the curvature and the smoothness conditions. For any γ(0,σ1σr)\gamma\in\left(0,\frac{\sigma_{1}}{\sigma_{r}}\right), it holds that

Combining equation (11) and (12), we obtain

where we use Z ⁣F2=ZF2=Xσr\left\|\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\right\|^{2}_{F}=\left\|Z^{\star}\right\|^{2}_{F}=\left\|X^{\star}\right\|_{*}\geq\sigma_{r}. Thus we have

for α24\alpha\geq 24 and βσ1σr513n\beta\geq\frac{\sigma_{1}}{\sigma_{r}}\cdot 513n.

Appendix E Initialization

Let λ1λn\lambda^{\prime}_{1}\geq\cdots\geq\lambda^{\prime}_{n} be the eigenvalues of MM. By Weyl’s theorem, we have

Since δ<σr\delta<\sigma_{r}, it is easy to see λ1λr>δ\lambda^{\prime}_{1}\geq\cdots\geq\lambda^{\prime}_{r}>\delta and λsδ,s=r+1,,n|\lambda^{\prime}_{s}|\leq\delta,s=r+1,\ldots,n. Hence, λs=λs\lambda_{s}=\lambda^{\prime}_{s}, s[r]s\in[r], and Z0Z0Z^{0}{Z^{0}}^{\top} is the best rank rr approximation of 12M\frac{1}{2}M. Therefore,

where we used AFrank(A)A\left\|A\right\|_{F}\leq\sqrt{\operatorname{\text{rank}}(A)}\left\|A\right\| in first line, the fact Z0Z012M=12λr+112δ\left\|Z^{0}{Z^{0}}^{\top}-\frac{1}{2}M\right\|=\frac{1}{2}|\lambda_{r+1}|\leq\frac{1}{2}\delta and inequality (13) in the last line.

Let H=Z0Z ⁣0H=Z^{0}-\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}. We want to bound d(Z0,Z)2=HF2d(Z^{0},Z^{\star})^{2}=\left\|H\right\|^{2}_{F}. According to the discussion in Lemma 7, HZ ⁣0H^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0} is symmetric and Z0Z ⁣0{Z^{0}}^{\top}{\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}} is positive semidefinite.

The following step closely follows . It holds that

where in the fourth line we used the property that the trace is invariant under cyclic permutations and HZ ⁣0=Z ⁣0HH^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}={\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0}}^{\top}H.

Since Z0Z ⁣0{Z^{0}}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0} is positive semidefinite, tr((HH)(Z0Z ⁣0))\operatorname{\text{tr}}((H^{\top}H)({Z^{0}}^{\top}\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu^{0})) is nonnegative. Hence,

If δσr4r\delta\leq\frac{\sigma_{r}}{4\sqrt{r}}, then

Appendix F Sample Complexity

In this section, we verify that our assumptions hold with high probability if mcnlognm\geq cn\log n, where cc is a constant that depends on δ\delta, rr, and κ\kappa. Our proof relies on the following concentration inequality.

We first give a technical lemma that we will use later.

Let A=(aij)A=(a_{ij}) be a random matrix drawn from GOE. Let S=a11A2e1e1S=a_{11}A-2e_{1}e_{1}^{\top}. There exist absolute constants CC, ρ\rho such that with probability at least 1Ceρn1-Ce^{-\rho n}, we have

Let A~=Aa11e1e1\widetilde{A}=A-a_{11}e_{1}e_{1}^{\top}. S=a11A~+(a1122)e1e1S=a_{11}\widetilde{A}+(a_{11}^{2}-2)e_{1}e_{1}^{\top}. Note that a11a_{11} and A~\widetilde{A} are independent, hence Sa11A~+a1122\left\|S\right\|\leq|a_{11}|\|\widetilde{A}\|+|a^{2}_{11}-2|. Besides, since a11N(0,2)a_{11}\sim\mathcal{N}(0,2), we can see that a112/2a^{2}_{11}/2 is χ2\chi^{2} distributed.

First we bound the operator norm of A~\widetilde{A}. We rewrite A~\|\widetilde{A}\| as

where D=A~+de1e1D=\widetilde{A}+de_{1}e_{1}^{\top}, dN(0,2)d\sim\mathcal{N}(0,2). As DD is GOE distributed, by Lemma 4,

where CC^{\prime} and ρ\rho^{\prime} are absolute constants.

Using the Gaussian tail inequality, we have

Combining inequalities (14) and (15), we have

where the last inequality follows from the union bound.

Next we bound the deviation of the χ2\chi^{2} term. By the corollary of Lemma 1 in Laurent and Massart , we have

Finally, the statement is obtained by choosing proper CC, ρ\rho, and using nn\sqrt{n}\leq n. ∎

It is equivalent to show that for any unit vector uu, with high probability,

where in the second line we use unitary invariance of the operator norm, and in the last line we denote PAiPP^{\top}A_{i}P by A~i\widetilde{A}_{i}. Since the GOE is invariant under orthogonal conjugation, A~i\widetilde{A}_{i} and AiA_{i} are identically distributed. Hence, it suffices to prove the claim when u=e1u=e_{1}, i.e.

where a11(i)a^{(i)}_{11} is the (1,1)(1,1) entry of AiA_{i} and δ0=δrσ1\delta_{0}=\frac{\delta}{r\sigma_{1}}.

To show this, we apply Theorem 8, where Si=a11(i)Ai2e1e1S_{i}=a^{(i)}_{11}A_{i}-2e_{1}e^{\top}_{1}. This requires that the operator norm of SiS_{i} is bounded, for each ii. We address this by noticing that with high probability Si18n\left\|S_{i}\right\|\leq 18n, i\forall i. To be precise, by Lemma 8 there exist constants C,ρC,\rho, such that

Taking the union bound over all the SiS_{i}s leads to

By Theorem 8, if m42min(δ02,δ0)nlognm\geq\frac{42}{\min(\delta^{2}_{0},\delta_{0})}\cdot n\log n, then

Combining inequalities (18) and (19), we conclude that

F.2 Proof of Theorem 7

The formulation of the second order partial derivatives and their expectations is given in Appendix B.

It is easy to see that for any Z ⁣S\mkern 1.5mu\overline{\mkern-2.5muZ\mkern-1.0mu}\mkern 1.5mu\in\mathcal{S}, maxs[r]zˉrσ1\max_{s\in[r]}\left\|\bar{z}_{r}\right\|\leq\sqrt{\sigma_{1}}. Thus it is sufficient to prove that for any two unitary vector uu and yy with high probability it holds that

We can decompose yy as y=βu+βuy=\beta u+\beta_{\perp}u_{\perp} for a certain unit vector uu_{\perp} that is orthogonal to uu, where β2+β2=1\beta^{2}+\beta^{2}_{\perp}=1. Let δ0=δ2rσ1\delta_{0}=\dfrac{\delta}{2r\sigma_{1}}. It suffices to prove the following two claims.

For any unitary vector uu, with high probability

For any two orthogonal unit vectors uu and uu_{\perp}, with high probability

where A~i\widetilde{A}_{i} and AiA_{i} have the same distribution. Hence we only need to prove the case where u=e1u=e_{1}:

where v(i)=Aie1v^{(i)}=A_{i}e_{1} is the first column of AiA_{i}.

Let S,v,AS,v,A denote S1S_{1}, v(1)v^{(1)}, and A(1)A^{(1)} respectively. It is easy to see that

where w=k=2na1k2w=\sum_{k=2}^{n}a^{2}_{1k}. As a11N(0,2)a_{11}\sim\mathcal{N}(0,2), a1k N(0,1)a_{1k}\sim~{}\mathcal{N}(0,1) for k1k\neq 1, we can see that a112/2a^{2}_{11}/2 and ww are χ2\chi^{2} distributed with degrees of freedom 11 and n1n-1, respectively. Using the χ2\chi^{2} tail bound, we have

As v1N(0,2)v_{1}\sim\mathcal{N}(0,2), vjN(0,1)v_{j}\sim\mathcal{N}(0,1) for j1j\neq 1,

If m(128/min(δ02,δ0))nlognm\geq(128/\min(\delta^{2}_{0},\delta_{0}))n\log n, then by applying Theorem 8 we can see

Combining inequalities (21) and (20) leads to

Proof of (ii)

We only need to prove the case where u=e1u=e_{1} and u=e2u_{\perp}=e_{2} due to the same reason above. That is,

where v(i)v^{(i)} and q(i)q^{(i)} are the first and second columns of AiA_{i}.

As before, let Si=2(v(i)q(i)e2e1)S_{i}=2(v^{(i)}{q^{(i)}}^{\top}-e_{2}e_{1}^{\top}) and let S,v,q,AS,v,q,A denote S1S_{1}, v(1)v^{(1)}, q(1)q^{(1)} and A(1)A^{(1)} respectively. From the proof of (i), we can see that with probability at least 14en1-4e^{-n} both v\left\|v\right\| and q\left\|q\right\| are no larger than 13n+1\sqrt{13n+1}. Since S2vq+2\left\|S\right\|\leq 2\left\|v\right\|\left\|q\right\|+2, we have

and ν2=8(n+1)m\nu^{2}=8(n+1)m. If m78min(δ02,δ0)nlognm\geq\frac{78}{\min(\delta_{0}^{2},\delta_{0})}n\log n, then by applying Theorem 8 we have

Appendix G ADMM for Nuclear Norm Minimization

We reformulate the nuclear norm minimizing problem as

where λ>0\lambda>0 is the regularization parameter. λ0\lambda\rightarrow 0 will enforce the minimizer XnucX^{*}_{\text{nuc}} satisfying the affine constraint A(Xnuc)=b\mathcal{A}(X^{*}_{\text{nuc}})=b.

We apply ADMM to the dual problem of (23):

where we introduce an auxiliary variable VV to make this problem equality constrained.

The augmented Lagrangian of problem (24) can be written as

where XX is the multiplier, η\eta is the penalty parameter, and 11\mathbf{1}_{\left\|\cdot\right\|\leq 1} is the indicator function of the unit spectral norm ball i.e. 11(V)\mathbf{1}_{\left\|\cdot\right\|\leq 1}(V) equals if V1\left\|V\right\|\leq 1 and ++\infty otherwise.

Let vec()\text{vec}(\cdot) denote the vectorization of a matrix, whose inverse mapping is denoted by mat()\text{mat}(\cdot). We can rewrite the transformations as A(X)=Avec(X)\mathcal{A}(X)=\boldsymbol{A}\text{vec}(X) and A(α)=mat(Aα)=i=1mαiAi\mathcal{A}^{\top}(\alpha)=\text{mat}(\boldsymbol{A}^{\top}\alpha)=\sum_{i=1}^{m}\alpha_{i}A_{i}, where A\boldsymbol{A} is a m×n2m\times n^{2} matrix whose iith row is vec(Ai)\text{vec}(A_{i})^{\top}.

The ADMM starts from initialization (α0,V0,X0)(\alpha^{0},V^{0},X^{0}) and updates the three variables alternately. The updates can be computed in close forms:

where proj()\text{proj}(\cdot) is the projection onto the unit spectral norm ball. Let X=UΣVX=U\Sigma V^{\top} be the singular value decomposition of XX,

In fact, the update of VV can be combined with other steps without being computed explicitly. One only has to iterate the following two steps:

where proxη()\text{prox}_{\eta}(\cdot) is the singular value soft-thresholding operator defined as

The sequence of multipliers {Xk}\left\{X^{k}\right\} converges to the primal solution of (23). To speed up the update of α\alpha, the Cholesky decomposition of λI+ηAA\lambda I+\eta\boldsymbol{A}\boldsymbol{A}^{\top} is precomputed in our implementation.