Guaranteed Rank Minimization via Singular Value Projection

Raghu Meka, Prateek Jain, Inderjit S. Dhillon

Introduction

In this paper we study the general affine rank minimization problem (ARMP),

The general affine rank minimization problem is of considerable practical interest and many important machine learning problems such as matrix completion, low-dimensional metric embedding, low-rank kernel learning can be viewed as instances of the above problem. Unfortunately, ARMP is NP-hard in general and is also NP-hard to approximate ([MJCD08]).

Until recently, most known methods for ARMP\mathsf{ARMP} were heuristic in nature with few known rigorous guarantees. The most commonly used heuristic for the problem is to assume a factorization of XX and optimize the resulting non-convex problem by alternating minimization [Bra03, Kor08, MB07], alternative projections [GB00] or alternating LMIs [SIG97]. Another common approach is to relax the rank constraint to a convex function such as the trace-norm or the log determinant [FHB01], [FHB03]. However, most of these methods do not have any optimality guarantees. Recently, Meka et al. [MJCD08] proposed online learning based methods for ARMP. However, their methods can only guarantee at best a logarithmic approximation for the minimum rank.

Recht et al. show that for affine constraints with bounded isometry constants (specifically, δ5k<1/10\delta_{5k}<1/10), finding the minimum trace-norm solution recovers the minimum rank solution. Their results were later extended to noisy measurements and isometry constants up to δ3k<1/43\delta_{3k}<1/4\sqrt{3} by Lee and Bresler [LB09b]. However, even the best existing optimization algorithms for the trace-norm relaxation are relatively inefficient in practice and their results are hard to analyze.

In another recent work, Lee and Bresler [LB09a] obtained exact-recovery guarantees for ARMP\mathsf{ARMP} satisfying RIP\mathsf{RIP} using a different approach. Lee and Bresler propose an algorithm (ADMiRA) motivated by the orthogonal matching pursuit line of work in compressed sensing, and show that for affine constraints with isometry constant δ4k0.04\delta_{4k}\leq 0.04 their algorithm recovers the optimal solution. They also prove similar guarantees for noisy measurements and provide a geometric convergence rate for their algorithm. However, their method is not very efficient for large datasets and is hard to analyze.

In this paper we propose a simple and fast algorithm SVP\mathsf{SVP} (Singular Value Projection) based on the classical projected gradient algorithm. We present a simple analysis showing that SVP\mathsf{SVP} recovers the minimum rank solution for affine constraints that satisfy RIP\mathsf{RIP} even in the presence of noise and prove the following guarantees. Independent of our work, Goldfarb and Ma [GM09] proposed an algorithm similar to our algorithm. However, their analysis and formulation is different from ours. In particular, their analysis builds on the analysis of Lee and Bresler and they require stronger isometry assumptions, δ3k<1/30\delta_{3k}<1/\sqrt{30}, than we do. In addition, we make partial progress on analyzing SVP\mathsf{SVP} for the matrix completion problem and proving exact recovery.

Suppose the isometry constant of A\mathcal{A} satisfies δ2k1/3\delta_{2k}\leq 1/3 and let b=A(X)b=\mathcal{A}(X^{*}) for a rank-kk matrix XX^{*}. Then, SVP\mathsf{SVP} (Algorithm 1) with step-size ηt=1/(1+δ2k)\eta_{t}=1/(1+\delta_{2k}) converges to XX^{*}. Furthermore, SVP\mathsf{SVP} outputs a matrix XX of rank at most kk such that A(X)b22ϵ\|\mathcal{A}(X)-b\|_{2}^{2}\leq\epsilon in at most 1log((1δ2k)/2δ2k)logb22ϵ\left\lceil\frac{1}{\log((1-\delta_{2k})/2\delta_{2k})}\log\frac{\|b\|^{2}}{2\epsilon}\right\rceil iterations.

Our analysis of SVP\mathsf{SVP} is motivated by the recent work in the field of compressed sensing by Blumensath and Davies [BD09], Garg and Khandekar [GK09]. Our results improve the results of Recht et al. and Lee and Bresler as follows.

SVP\mathsf{SVP} is considerably simpler to analyze than the methods of Recht et al. and Lee and Bresler. Further, we need weaker isometry assumptions on A\mathcal{A}: we only require δ2k<1/3\delta_{2k}<1/3 as opposed to δ5k<1/10\delta_{5k}<1/10 required by Recht et al., δ3k<1/43\delta_{3k}<1/4\sqrt{3} required by Lee and Bresler [LB09b] and δ4k0.04\delta_{4k}\leq 0.04 required by Lee and Bresler [LB09a].

SVP\mathsf{SVP} has a strong geometric convergence rate and is faster than using the best trace-norm optimization algorithms and the methods of Lee and Bresler by an order of magnitude.

Although restricted isometry property is natural in settings where the affine constraints contain information about all the entries of the unknown matrix, in several cases of considerable practical interest the affine constraints only contain local information and may not satisfy RIP\mathsf{RIP} directly.

One such important problem where RIP\mathsf{RIP} does not hold directly is the low-rank matrix completion problem. In the matrix completion problem we are given the entries of an unknown low-rank matrix XX^{*} for ordered pairs (i,j)Ω[m]×[n](i,j)\in\Omega\subseteq[m]\times[n] and the goal is to complete the missing entries of XX^{*}. A highly popular application of the matrix completion problem is in the field of collaborative filtering, where typically the task is to predict user ratings given past ratings of the users. Recently, a lot of attention has been given to the problem due to the Netflix Challenge [Net]. Other applications of matrix completion include triangulation from incomplete data, link prediction in social networks etc.

Similar to ARMP\mathsf{ARMP}, the low-rank matrix completion is also NP-hard in general and most methods are heuristic in nature with no theoretical guarantees. The alternating least squares minimization heuristic and its variants [Kor08, MB07] perform the best in practice but are notoriously hard to analyze.

Recently, Candes and Recht [CR08], Candes and Tao [CT09] and Keshavan et al. [KOM09] obtained the first non-trivial results for low-rank matrix completion under a few additional assumptions. Broadly, these papers give exact-recovery guarantees when the optimal solution XX^{*} is μ\mu-incoherent (see Definition 4.1), and the entries Ω\Omega are chosen uniformly at random with ΩC(μ,k)npolylogn|\Omega|\geq C(\mu,k)\,n\,poly\log n, where C(μ,k)C(\mu,k) depends only on μ,k\mu,k. However, the algorithms of the above papers, even when using methods tailored specifically for matrix-completion such as those of Cai et al. [CCS08], are quite expensive in practice and not very tolerant to noise.

As low-rank matrix completion is a special case of ARMP\mathsf{ARMP}, we can naturally adapt our algorithm SVP\mathsf{SVP} for matrix completion. We demonstrate empirically that for a suitable step-size, SVP\mathsf{SVP} significantly outperforms the methods of [CR08], [CT09], [CCS08], [KOM09] in accuracy, computational time and tolerance to noise. Furthermore, our experiments strongly suggest (see Figure 1) that guarantees similar to those of [CT09], [KOM09] hold for SVP\mathsf{SVP}, achieving exact recovery for incoherent matrices from an almost optimal number of entriesIt follows from a coupon collector argument that exact-recovery from random samples requires nklognnk\log n samples..

Although we do not provide a rigorous proof of exact-recovery for SVP\mathsf{SVP} applied to matrix completion, we make partial progress in this direction and give strong intuition for the performance of SVP\mathsf{SVP}. We prove that though the affine constraints defining the matrix-completion problems do not obey the restricted isometry property, they obey the restricted isometry property over incoherent matrices. This weaker RIP\mathsf{RIP} condition along with a hypothesis bounding the incoherence of the iterates of SVP\mathsf{SVP} imply exact-recovery of a low-rank incoherent matrix from an almost optimal number of entries. We also provide strong empirical evidence supporting our hypothesis bounding the incoherence of the iterates of SVP\mathsf{SVP} (see Figure 2).

We first present our algorithm SVP\mathsf{SVP} in Section 2 and present its analysis for affine constraints satisfying RIP\mathsf{RIP} in Section 3. In Section 4, we specialize our algorithm SVP\mathsf{SVP} to the task of low-rank matrix completion and prove a more restricted isometry property for the matrix completion problem. In Section 6, we give empirical results for SVP\mathsf{SVP} applied to ARMP\mathsf{ARMP} and matrix-completion on real-world and synthetic problems.

Singular Value Projection (SVP)

Consider the following robust formulation of ARMP\mathsf{ARMP} (RARMP\mathsf{RARMP}),

The hardness of the above problem mainly comes from the non-convexity of the set of low-rank matrices C(k)\mathcal{C}(k). However, in spite of the hardness of the rank constraint, the Euclidean projection onto the non-convex set C(k)\mathcal{C}(k) can be computed efficiently using singular value decomposition. Our algorithm uses this observation along with the projected gradient method for efficiently minimizing the objective function specified in problem (RARMP).

In SVP\mathsf{SVP}, a candidate solution to ARMP\mathsf{ARMP} is computed iteratively by starting from the all-zero matrix and adapting the classical projected gradient descent update as follows (Observe that ψ(X)=AT(A(X)b)\nabla\psi(X)=\mathcal{A}^{T}(\mathcal{A}(X)-b)) :

Algorithm 1 presents our SVP\mathsf{SVP} algorithm. Note that the iterates XtX^{t} are always low-rank, facilitating faster computation of the SVD. See Section 5 for a more detailed discussion of the computational issues.

Analysis for Affine Constraints Satisfying 𝖱𝖨𝖯𝖱𝖨𝖯\mathsf{RIP}

We now show that SVP\mathsf{SVP} solves exact rank minimization for affine constraints that satisfy RIP\mathsf{RIP} and prove our main results, Theorems 1.1 and 1.2. We first present a lemma that bounds the error at the (t+1)(t+1)-st iteration (ψ(Xt+1)\psi(X^{t+1})) with respect to the error incurred by the optimal solution (ψ(X)\psi(X^{*})) and the tt-th iteration.

Let XX^{*} be an optimal solution of (RARMP) and let XtX^{t} be the iterate obtained by SVP algorithm at tt-th iteration. Then,

where δ2k\delta_{2k} is the rank 2k2k isometry constant of A\mathcal{A}.

Recall that ψ(X)=12A(X)b22\psi(X)=\frac{1}{2}\|\mathcal{A}(X)-b\|_{2}^{2}. Since ψ()\psi(\cdot) is a quadratic function, we have

where inequality (3) follows from RIP\mathsf{RIP} applied to the matrix Xt+1XtX^{t+1}-X^{t} of rank at most 2k2k. Let Yt+1=Xt11+δ2kAT(A(Xt)b)Y^{t+1}=X^{t}-\frac{1}{1+\delta_{2k}}\mathcal{A}^{T}(\mathcal{A}(X^{t})-b) and

Now, by definition, Pk(Yt+1)=Xt+1\mathcal{P}_{k}(Y^{t+1})=X^{t+1} is the minimizer of ft(X)f_{t}(X) over all matrices XC(k)X\in\mathcal{C}(k) (of rank at most kk). In particular, ft(Xt+1)ft(X)f_{t}(X^{t+1})\leq f_{t}(X^{*}). Thus,

where inequality (4) follows from RIP\mathsf{RIP} applied to XXtX^{*}-X^{t}. ∎

We now prove that SVP\mathsf{SVP} obtains the optimal solution for ARMP with restricted isometry property.

Using Lemma 3.1 and the fact that ψ(X)=0\psi(X^{*})=0 for the noise-less case,

Also, note that for δ2k<1/3\delta_{2k}<1/3, 2δ2k(1δ2k)<1\frac{2\delta_{2k}}{(1-\delta_{2k})}<1. Hence, ψ(Xτ)ϵ\psi(X^{\tau})\leq\epsilon where τ=1log((1δ2k)/2δ2k)logψ(X0)ϵ\tau=\left\lceil\frac{1}{\log((1-\delta_{2k})/2\delta_{2k})}\log\frac{\psi(X^{0})}{\epsilon}\right\rceil. Now, the SVP algorithm is initialized using X0=0X^{0}=0, i.e., ψ(X0)=b22\psi(X^{0})=\frac{\|b\|^{2}}{2}. Hence, τ=1log((1δ2k)/2δ2k)logb22ϵ\tau=\left\lceil\frac{1}{\log((1-\delta_{2k})/2\delta_{2k})}\log\frac{\|b\|^{2}}{2\epsilon}\right\rceil. ∎

Next, we prove the noisy version of Theorem 1.1.

Let the current solution XtX^{t} satisfy ψ(Xt)C2e2/2\psi(X^{t})\geq C^{2}\|e\|^{2}/2, where C0C\geq 0 is a universal constant. Using Lemma 3.1 and the fact that bA(X)=eb-\mathcal{A}(X^{*})=e,

where D=(1C2+2δ2k(1δ2k)(1+1C)2)D=\left(\frac{1}{C^{2}}+\frac{2\delta_{2k}}{(1-\delta_{2k})}\left(1+\frac{1}{C}\right)^{2}\right). Recall that δ2k<1/3\delta_{2k}<1/3. Hence, selecting C>(1+δ2k)/(13δ2k)C>(1+\delta_{2k})/(1-3\delta_{2k}), we get D<1D<1. Also, ψ(X0)=ψ(0)=b2/2\psi(X^{0})=\psi(0)=\|b\|^{2}/2. Hence, ψ(Xτ)(C2+ϵ)e2/2\psi(X^{\tau})\leq(C^{2}+\epsilon)\|e\|^{2}/2 where τ=1log(1/D)logb2(C2+ϵ)e2\tau=\left\lceil\frac{1}{\log(1/D)}\log\frac{\|b\|^{2}}{(C^{2}+\epsilon)\|e\|^{2}}\right\rceil. ∎

Matrix Completion

Observe that the matrix completion problem is a special case of ARMP\mathsf{ARMP}. However, the affine constraints that define MCP\mathsf{MCP}, PΩ\mathcal{P}_{\Omega}, do not satisfy RIP\mathsf{RIP} in general. Thus Theorems 1.1, 1.2 above and the results of Recht et al. [RFP07] do not directly apply to MCP\mathsf{MCP}. The first non-trivial results for MCP\mathsf{MCP} were obtained recently by Candes and Recht [CR08], Keshavan et al. [KOM09] and Candes and Tao [CT09]. These works show exact recovery of the unknown matrix XX^{*} when the observed entries are sampled uniformly and XX^{*} is incoherent in the sense defined below.

Intuitively, high incoherence (i.e., μ\mu is small) implies that the non-zero entries of XX are not concentrated in a small number of entries. Hence, a random sampling of the matrix should provide enough information to reconstruct the entire matrix.

As matrix completion is a special case of ARMP\mathsf{ARMP}, we can apply SVP\mathsf{SVP} for matrix completion. We apply SVP\mathsf{SVP} to matrix-completion with step-size ηt=1/(1+δ)p\eta_{t}=1/(1+\delta)p, where pp is the density of sampled entries and 0<δ<1/30<\delta<1/3 is a parameter depending on how large pp is, leading to the update

We now provide some intuition for our choice of step-size ηt\eta_{t} and make partial progress towards proving that SVP\mathsf{SVP} achieves exact recovery for low-rank incoherent matrices. We show that though the affine constraints defining MCP\mathsf{MCP}, PΩ\mathcal{P}_{\Omega}, do not satisfy RIP\mathsf{RIP} for all low-rank matrices, they satisfy RIP\mathsf{RIP} for all low-rank incoherent matrices. Thus, if the iterates appearing in SVP\mathsf{SVP} remain incoherent throughout the execution of the algorithm, then Theorem 1.1 would imply recovery of the unknown entries of the matrix. Empirical evidence strongly supports our hypothesis that the incoherence of the iterates arising in SVP\mathsf{SVP} remains bounded.

Figure 1 plots the threshold sampling density pp beyond which matrix completion for randomly generated matrices is solved exactly by SVP\mathsf{SVP} for fixed kk and varying matrix sizes nn. Note that the density threshold matches the optimal bound of O(klogn/n)O(k\log n/n) with the constant being C=1.28C=1.28. Figure 2 plots the maximum incoherence maxtμ(Xt)=nmaxt,i,jUijt\max_{t}\mu(X^{t})=\sqrt{n}\,\max_{t,i,j}|U^{t}_{ij}|, where UtU^{t} are the left singular vectors of the intermediate iterates XtX^{t} computed by SVP\mathsf{SVP}. The figure clearly shows that the incoherence μ(Xt)\mu(X^{t}) of the iterates is bounded by a constant independent of the matrix size nn and density pp throughout the execution of SVP\mathsf{SVP}.

Combining the above Chernoff bound estimate with a union bound over low-rank incoherent matrices, we obtain the following restricted isometry property for the projection operator PΩ\mathcal{P}_{\Omega} restricted to low-rank incoherent matrices. See Section 4.1 for a detailed proof.

There exists a constant C0C\geq 0 such that the following holds for all 0<δ<10<\delta<1, μ1\mu\geq 1, nm3n\geq m\geq 3: For Ω[m]×[n]\Omega\subseteq[m]\times[n] chosen according to the Bernoulli model with density pCμ2k2logn/δ2mp\geq C\mu^{2}k^{2}\log n/\delta^{2}m, with probability at least 1exp(nlogn)1-\exp(-n\log n), the restricted isometry property in (6) holds for all μ\mu-incoherent matrices XX of rank at most kk.

Motivated by the above theorem and supported by empirical evidence (Figures 1, 2) we hypothesize that SVP\mathsf{SVP} achieves exact recovery from an almost optimal number of samples.

Fix μ,k\mu,k and δ1/3\delta\leq 1/3. Then, there exists a constant CC such that for a μ\mu-incoherent matrix XX^{*} of rank at most kk and Ω\Omega sampled from the Bernoulli model with density pCμ2k2logn/δ2mp\geq C\mu^{2}k^{2}\log n/\delta^{2}m, SVP\mathsf{SVP} with step-size ηt=1/(1+δ)p\eta_{t}=1/(1+\delta)p converges to XX^{*} with high probability. Moreover, SVP\mathsf{SVP} outputs a matrix XX of rank at most kk such that PΩ(X)PΩ(X)F2ϵ\|\mathcal{P}_{\Omega}(X)-\mathcal{P}_{\Omega}(X^{*})\|_{F}^{2}\leq\epsilon after Oμ,k(log(1ϵ))O_{\mu,k}\left(\left\lceil\log\left(\frac{1}{\epsilon}\right)\right\rceil\right) iterations.

We now prove the RIP\mathsf{RIP} property of Theorem 4.2 for the projection operator PΩ\mathcal{P}_{\Omega}. To prove Theorem 4.2 we first show the theorem for a discrete collection of matrices using Chernoff type large-deviation bounds and use standard quantization arguments to generalize to the continuous case. We first introduce some notation.

We need Bernstein’s inequality [Wik09] stated below.

Let X1,X2,,XnX_{1},X_{2},\dots,X_{n} be independent random variables with E[Xi]=0,iE[X_{i}]=0,\forall i. Furthermore, let XiM|X_{i}|\leq M. Then,

For (i,j)[m]×[n](i,j)\in[m]\times[n], let ωij\omega_{ij} be the indicator variables with ωij=1\omega_{ij}=1 if (i,j)Ω(i,j)\in\Omega and otherwise. Then, ωij\omega_{ij} are independent random variables with Pr[ωij=1]=p\operatorname*{\mathsf{Pr}}[\omega_{ij}=1]=p. Let random variable Zij=ωijXij2Z_{ij}=\omega_{ij}X_{ij}^{2}. Note that,

Observe that ZijE[Zij]Xij2(α2/mn)XF2|Z_{ij}-E[Z_{ij}]|\leq|X_{ij}|^{2}\leq(\alpha^{2}/mn)\cdot\|X\|_{F}^{2}. Thus,

Now, define random variable S=i,jZij=i,jωijXij2=PΩ(X)F2S=\sum_{i,j}Z_{ij}=\sum_{i,j}\omega_{ij}X_{ij}^{2}=\|\mathcal{P}_{\Omega}(X)\|_{F}^{2}. Note that, E[S]=pXF2E[S]=p\|X\|_{F}^{2}. Since, ZijZ_{ij} are independent random variables,

Using Bernstein’s inequality (Lemma 4.5) for SS with t=δpXF2t=\delta p\|X\|_{F}^{2} and Equations (7) and (8) we get,

We now discretize the space of low-rank incoherent matrices so as to be able to use the above lemma with a union bound. We need the following simple lemmas.

Let X=UΣVTX=U\Sigma V^{T} be the singular value decomposition of XX. Then, Xij=UiΣVjTX_{ij}=U_{i}\Sigma V_{j}^{T}, where Ui,VjU_{i},V_{j} are the ii’th and jj’th rows of U,VU,V respectively. Now,

The following lemma shows that the space of low-rank μ\mu-incoherent matrices can be discretized into a reasonably small set of regular matrices such that every low-rank μ\mu-incoherent matrix is close to a matrix from the set.

We will show that S(μ,ϵ)S(\mu,\epsilon) satisfies the conditions of the Lemma. Observe that D(ρ)<2/ρ|D(\rho)|<2/\rho. Thus,

Hence, S(μ,ϵ)<(2/ρ)mk+nk+k<(mnk/ϵ)3(m+n)k|S(\mu,\epsilon)|<(2/\rho)^{mk+nk+k}<(mnk/\epsilon)^{3(m+n)k}.

Now, since Uilμ/m|U_{il}|\leq\sqrt{\mu}/\sqrt{m}, it follows that U1U(ρ)U_{1}\in U(\rho). Further, for all i[m],l[k]i\in[m],l\in[k],

Similarly, define V1,Σ1V_{1},\Sigma_{1} by rounding entries of V,ΣV,\Sigma to integer multiples of μρ/n\sqrt{\mu}\,\rho/\sqrt{n} and ρ\rho respectively. Then, V1V(ρ)V_{1}\in V(\rho), Σ1Σ(ρ)\Sigma_{1}\in\Sigma(\rho) and for (j,l)[n]×[k](j,l)\in[n]\times[k],

Let X(ρ)=U1Σ1V1TX(\rho)=U_{1}\Sigma_{1}V_{1}^{T}. Then, by the above equations and Lemma 4.8, for i[m],l[k],j[n]i\in[m],l\in[k],j\in[n],

Furthermore, using triangular inequality, X(ρ)F>XFϵ>XF/2\|X(\rho)\|_{F}>\|X\|_{F}-\epsilon>\|X\|_{F}/2. Since, ϵ<1\epsilon<1 and μkXF1\mu\sqrt{k}\|X\|_{F}\geq 1,

Thus, X(ρ)X(\rho) is 4μk4\mu\sqrt{k}-regular. The lemma now follows by taking Y=X(ρ)Y=X(\rho). ∎

We now prove Theorem 4.2 by combining Lemmas 4.6 and 4.9.

Let mnm\leq n, ϵ=δ/9mnk\epsilon=\delta/9mnk and

where S(μ,ϵ)S(\mu,\epsilon) is as in Lemma 4.9. Then, by Lemma 4.2 and union bound,

where C10C_{1}\geq 0 is a constant independent of m,n,km,n,k.

Thus, if p>Cμ2k2logn/δ2mp>C\mu^{2}k^{2}\log n/\delta^{2}m, where C=16(C1+1)C=16(C_{1}+1), with probability at least 1exp(nlogn)1-\exp(-n\log n), the following holds

As the statement of the theorem is invariant under scaling, it is enough to show the statement for all μ\mu-incoherent matrices XX of rank at most kk and X2=1\|X\|_{2}=1. Fix such a XX and suppose that (10) holds. Now, by Lemma 4.9 there exists YS(μ,ϵ)Y\in S^{\prime}(\mu,\epsilon) such that YXFϵ\|Y-X\|_{F}\leq\epsilon. Moreover,

Further, starting with PΩ(YX)FYXFϵ\|\mathcal{P}_{\Omega}(Y-X)\|_{F}\leq\|Y-X\|_{F}\leq\epsilon and arguing as above we get that

Combining inequalities (11), (12) above, we have

Computational Issues and Related Work

The affine rank minimization problem is a natural generalization to matrices of the following compressed sensing problem for vectors:

However, a number of methods have been proposed recently to solve the problem for restricted families of sensing matrices. Most of the methods with provable theoretical guarantees assume that the sensing matrix AA satisfies restricted isometry properties similar to those in (1). Broadly speaking, existing compressed sensing approaches can be divided into three categories:

l1l_{1} relaxation: These methods relax the non-convex l0l_{0} objective function to the convex l1l_{1} objective function [CT05, CR07, Fuc05, DET06]. At a high level these results show that if the sensing matrix AA obeys RIP\mathsf{RIP} or other RIP\mathsf{RIP} like properties, then l1l_{1} relaxation recovers the optimal sparse solution from an almost optimal O(klogn)O(k\log n) measurements.

Basis pursuit: These methods greedily search for the subset of columns of AA that would span the optimal solution. Specifically, in each iteration, columns of the sensing matrix that have the highest correlation with the current residual measurement vector are greedily added to the basis. Assuming RIP\mathsf{RIP}, basis pursuit methods also guarantee recovery of the optimal solution from a near optimal number of measurements [TN08, NTV08].

Iterative Hard Thresholding (IHT): IHT based methods try to minimize l0l_{0} norm directly by hard thresholding [BD09, GK09] the current candidate solution to a small support vector. Here again, exact-recovery guarantees are known assuming RIP\mathsf{RIP}. Recently, Garg and Khandekar [GK09] demonstrated that their GradeS method outperforms most of the existing compressed sensing algorithms empirically.

As ARMP\mathsf{ARMP} is a generalization of problem (13), it is natural to ask if the above compressed sensing algorithms can be generalized to solve ARMP\mathsf{ARMP}. Interestingly, the answer is yes. Trace-norm relaxation approaches [RFP07] can be seen as a direct generalization of the l1l_{1} relaxation approach. Similarly, the ADMiRA algorithm of Lee and Bresler [LB09a] generalizes the CoSAMP algorithm of Tropp and Needell [TN08]. Finally, our approach is a generalization of the IHT approach. Table 1 summarizes these three approaches and compares them in terms of a few desirable characteristics an algorithm for ARMP\mathsf{ARMP} should have.

Minimizing the trace-norm of a matrix subject to affine constraints can be cast as a semi-definite programming problem. However, algorithms for semi-definite programming, as used by most methods for minimizing trace-norm, are prohibitively expensive even for moderately large datasets. Recently, a variety of methods mostly based on iterative soft-thresholding have been proposed to solve the trace-norm minimization problem efficiently. For instance, Cai et al. [CCS08] proposed a Singular Value Thresholding (SVT) algorithm which is based on Uzawa’s algorithm[AHU58]. A related approach based on linearized Bregman iteration was proposed by Ma et al. [MGC09]. Toh and Yun [TY09], while Ji and Ye [JY09] proposed Nesterov’s projected gradient based methods for optimizing the trace-norm.

While the soft-thresholding based methods for trace-norm minimization are significantly faster than semi-definite programming approaches they suffer from an important bottleneck: though the final solution to the trace-norm minimization is a low-rank matrix, the rank of the iterate in intermediate iterations can be large. In contrast, the rank of the iterates in our method is always equal to the rank of the optimal solution.

Also, though minimizing the trace-norm approximates the low-rank solution even in the presence of noise (see [CP09], [LB09b] for instance), noise poses considerable computational challenges for trace-norm optimization. Cai et al. propose a variant of SVT for handling noise that performs moderately well for uniformly bounded noise. However, the performance of SVT worsens considerably in the presence of outlier noise. SVP\mathsf{SVP} on the other hand is robust to both outlier and uniformly bounded noise as it minimizes the cumulative loss function A(X)b22\|\mathcal{A}(X)-b\|_{2}^{2}.

For the case of low-rank matrix completion, Candes and Recht [CR08] obtained the first non-trivial results for the problem obtaining guaranteed completion for incoherent matrices XX^{*} and randomly sampled entries Ω\Omega. Candes and Recht show that for XX^{*} μ\mu-incoherent and Ω\Omega chosen at random with ΩC(μ)k2n1.2|\Omega|\geq C(\mu)\,k^{2}n^{1.2}, trace-norm relaxation recovers the optimal solution. Building on the work of Candes and Recht, Candes and Tao [CT09] obtained the near-optimal bound of Ωmin(Cμ4k2nlog2n,Cμ2knlog6n)|\Omega|\geq\min(C\mu^{4}k^{2}n\log^{2}n,C\mu^{2}kn\log^{6}n) for exact-recovery via trace-norm minimization. However, the analysis of Candes and Recht, Candes and Tao is considerably complicated and minimizing trace-norm, even when using methods tailored for matrix-completion such as those of Cai et al. is relatively expensive in practice.

For the case of matrix completion, SVT has the important property that the intermediate iterations of the algorithm only require computing the singular value decomposition of a sparse matrix. This facilitates the use of fast SVD computing package such as PROPACK [Lar] that only require subroutines that compute matrix-vector products.

Our SVP\mathsf{SVP} algorithm has a similar property facilitating fast computation of the update in equation (5); each iteration of SVP\mathsf{SVP} involves computing the SVD of the matrix Y=Xt+PΩ(XtX)Y=X^{t}+\mathcal{P}_{\Omega}(X^{t}-X^{*}), where XtX^{t} is a matrix of rank at most kk whose SVD we know and PΩ(XtX)\mathcal{P}_{\Omega}(X^{t}-X^{*}) is a sparse matrix. Thus, we can compute matrix-vector products of the form YxYx in time O((m+n)k+Ω)O((m+n)k+|\Omega|).

In a different line of work, Keshavan et al. [KOM09] obtained exact-recovery from uniformly sampled Ω\Omega with ΩC(μ,k)nlogn|\Omega|\geq C(\mu,k)\,n\log n using different techniques. The first iteration of SVP\mathsf{SVP} is similar to the first step of Keshavan et al. However, after the first iteration, Keshavan et al. use a sophisticated alternating minimization algorithm based on gradient descent on the Grassmannian manifold of low-rank matrices. However, convergence of their alternating minimization algorithm is slow. The simplicity of the updates in SVP\mathsf{SVP} makes it both easier to implement and significantly less computationally intensive than the alternating minimization algorithm of Keshavan et al.

A related problem to the matrix completion problem is the problem of low-rank plus sparse decomposition of a matrix addressed by Chandrasekaran et al. [CSPW09] and Wright et al. [WGRM09]. Interestingly, Wright et al. [WGRM09] show that the low-rank matrix completion problem can be reduced to the low-rank plus sparse decomposition problem. Here again, their method relies on the trace-norm relaxation and is significantly more computationally intensive than our algorithm.

A drawback of our SVP method it requires rank kk of the optimal solution to be known beforehand. For ARMP, we propose using the following heuristic: run SVP with some initial guess kk and increment it by a fixed number (e.g, 10) until error AXb2\|{\mathcal{A}}X-b\|^{2}\| incurred by SVP doesn’t change.

For the matrix completion problem, in the first step of our SVP method, we compute singular values incrementally till we find a significant gap between singular values. Our heuristic is justified because: Keshavan et al. [KOM09] show that the top kk (kk being rank of optimal solution) singular values of the sampled matrix approximate the underlying matrix well, i.e., there should be a gap between kk-th and k+1k+1-th singular value.

Experimental Results

In this section, we empirically evaluate our SVP\mathsf{SVP} method for the affine rank minimization and low-rank matrix completion problems. For both problems we present empirical results on synthetic as well as real-world datasets. For ARMP\mathsf{ARMP} we compare our method against the trace-norm based singular value thresholding (SVT) method [CCS08]. Note that although Cai et al. present the SVT algorithm in the context of matrix completion problem, it can be easily adapted for ARMP\mathsf{ARMP}. For matrix completion we compare against SVT, ADMiRA [LB09a], the spectral matrix completion (SMC) method of Keshavan et al. [KOM09], and regularized alternating least squares minimization (ALS). We use our own implementation of ALS and SVT for ARMP, while for matrix completion we use the code provided by the respective authors for SVT, ADMiRA and SMC. We report results averaged over 2020 runs. All the methods are implemented in Matlab and use mex files.

Next we evaluate our method for the problem of matrix reconstruction from random measurements. As in Recht et al. [RFP07], we use the MIT logo as the test image for reconstruction. the MIT logo we use is a 38×7338\times 73 image and has rank four. For reconstruction, we generate random measurement matrices AiA_{i} and measure bi=Tr(AiX)b_{i}=Tr(A_{i}X). Figure 3 (b) shows that our method incurs significantly smaller reconstruction error than SVT with lower number of iterations.

2 Matrix Completion

Finally, we study the behavior of our method in presence of noise. For this experiment, we generate random matrices of different size and add approximately 5%5\% Gaussian noise. Figure 6 plots error incurred and time required by various methods as nn increases from 10001000 to 50005000. Note that SVT is particularly sensitive to noise and incurs high RMSE.

Matrix Completion: Movie-Lens Dataset Finally, we evaluate our method on the Movie-Lens dataset [Mov], which contains 1 million ratings for 39003900 movies by 60406040 users. For SVP\mathsf{SVP} and ALS, we fix the rank of the matrix to be k=15k=15. For SVP\mathsf{SVP}, we set the step size ηt\eta_{t} to be 5/t5/\sqrt{t}. SVP\mathsf{SVP} incurs RMSE of 1.011.01 in 64.8564.85 seconds, while SVT incurs RMSE of 1.211.21 in 1214.781214.78 seconds. In contrast, ALS achieves RMSE of 0.900.90 in 195.34195.34 seconds. We attribute the relatively poor performance of SVP\mathsf{SVP} and SVT as compared with ALS to the fact that the ratings matrix is not sampled uniformly, thus violating a crucial assumption of both our method and SVT. Similar to Figure 6 (b), SVT converges much slower than SVP on the Movie-Lens data.

Conclusion and Future Work

There has been a significant amount of work recently in the area of low-rank approximations. Examples include minimizing rank subject to affine constraints, low-rank matrix completion, low-rank plus sparse decomposition. Most of this research, with the exception of Keshavan et al. [KOM09], relies on relaxing the rank constraint with trace-norm and gives guarantees for recovering the optimal solution under certain additional assumptions. However, trace-norm relaxation based methods are typically hard to analyze and are relatively expensive in practice.

In this paper, we proposed a simple and natural algorithm based on iterative hard-thresholding. We give a simple analysis of our algorithm for the affine rank minimization problem satisfying the restricted isometry property and give geometric convergence guarantees even in the presence of noise. The intermediate steps in our algorithm are less computationally demanding than those of current state-of-the-art methods. We empirically demonstrate that our method is significantly faster and more robust to both uniformly bounded and outlier noise than most existing methods.

An immediate question arising out of our work is to prove our hypothesis bounding the incoherence of the iterates of SVP\mathsf{SVP} for low-rank matrix completion, or otherwise directly prove Conjecture 4.3. Other directions include application of our methods to other problems of similar flavor such as the low-rank plus sparse matrix decomposition [CSPW09], or other matrix completion type problems like minimum dimensionality embedding using partial distance observations [FHB03] and low-rank kernel learning [MJCD08].

Acknowledgments

This research was supported by NSF grant CCF-0431257, NSF grant CCF-0916309 and NSF grant CCF-0728879. We thank the reviewers of NIPS 2009 for their useful comments and for pointing out a mistake in an earlier proof of Theorem 4.2.

References