Guaranteed Rank Minimization via Singular Value Projection

Raghu Meka, Prateek Jain, Inderjit S. Dhillon

Introduction

In this paper we study the general affine rank minimization problem (ARMP),

The general affine rank minimization problem is of considerable practical interest and many important machine learning problems such as matrix completion, low-dimensional metric embedding, low-rank kernel learning can be viewed as instances of the above problem. Unfortunately, ARMP is NP-hard in general and is also NP-hard to approximate ([MJCD08]).

Until recently, most known methods for $\mathsf{ARMP}$ were heuristic in nature with few known rigorous guarantees. The most commonly used heuristic for the problem is to assume a factorization of $X$ and optimize the resulting non-convex problem by alternating minimization [Bra03, Kor08, MB07], alternative projections [GB00] or alternating LMIs [SIG97]. Another common approach is to relax the rank constraint to a convex function such as the trace-norm or the log determinant [FHB01], [FHB03]. However, most of these methods do not have any optimality guarantees. Recently, Meka et al. [MJCD08] proposed online learning based methods for ARMP. However, their methods can only guarantee at best a logarithmic approximation for the minimum rank.

Recht et al. show that for affine constraints with bounded isometry constants (specifically, $\delta_{5k}<1/10$ ), finding the minimum trace-norm solution recovers the minimum rank solution. Their results were later extended to noisy measurements and isometry constants up to $\delta_{3k}<1/4\sqrt{3}$ by Lee and Bresler [LB09b]. However, even the best existing optimization algorithms for the trace-norm relaxation are relatively inefficient in practice and their results are hard to analyze.

In another recent work, Lee and Bresler [LB09a] obtained exact-recovery guarantees for $\mathsf{ARMP}$ satisfying $\mathsf{RIP}$ using a different approach. Lee and Bresler propose an algorithm (ADMiRA) motivated by the orthogonal matching pursuit line of work in compressed sensing, and show that for affine constraints with isometry constant $\delta_{4k}\leq 0.04$ their algorithm recovers the optimal solution. They also prove similar guarantees for noisy measurements and provide a geometric convergence rate for their algorithm. However, their method is not very efficient for large datasets and is hard to analyze.

In this paper we propose a simple and fast algorithm $\mathsf{SVP}$ (Singular Value Projection) based on the classical projected gradient algorithm. We present a simple analysis showing that $\mathsf{SVP}$ recovers the minimum rank solution for affine constraints that satisfy $\mathsf{RIP}$ even in the presence of noise and prove the following guarantees. Independent of our work, Goldfarb and Ma [GM09] proposed an algorithm similar to our algorithm. However, their analysis and formulation is different from ours. In particular, their analysis builds on the analysis of Lee and Bresler and they require stronger isometry assumptions, $\delta_{3k}<1/\sqrt{30}$ , than we do. In addition, we make partial progress on analyzing $\mathsf{SVP}$ for the matrix completion problem and proving exact recovery.

Suppose the isometry constant of $\mathcal{A}$ satisfies $\delta_{2k}\leq 1/3$ and let $b=\mathcal{A}(X^{*})$ for a rank- $k$ matrix $X^{*}$ . Then, $\mathsf{SVP}$ (Algorithm 1) with step-size $\eta_{t}=1/(1+\delta_{2k})$ converges to $X^{*}$ . Furthermore, $\mathsf{SVP}$ outputs a matrix $X$ of rank at most $k$ such that $\|\mathcal{A}(X)-b\|_{2}^{2}\leq\epsilon$ in at most $\left\lceil\frac{1}{\log((1-\delta_{2k})/2\delta_{2k})}\log\frac{\|b\|^{2}}{2\epsilon}\right\rceil$ iterations.

Our analysis of $\mathsf{SVP}$ is motivated by the recent work in the field of compressed sensing by Blumensath and Davies [BD09], Garg and Khandekar [GK09]. Our results improve the results of Recht et al. and Lee and Bresler as follows.

$\mathsf{SVP}$ is considerably simpler to analyze than the methods of Recht et al. and Lee and Bresler. Further, we need weaker isometry assumptions on $\mathcal{A}$ : we only require $\delta_{2k}<1/3$ as opposed to $\delta_{5k}<1/10$ required by Recht et al., $\delta_{3k}<1/4\sqrt{3}$ required by Lee and Bresler [LB09b] and $\delta_{4k}\leq 0.04$ required by Lee and Bresler [LB09a].

$\mathsf{SVP}$ has a strong geometric convergence rate and is faster than using the best trace-norm optimization algorithms and the methods of Lee and Bresler by an order of magnitude.

Although restricted isometry property is natural in settings where the affine constraints contain information about all the entries of the unknown matrix, in several cases of considerable practical interest the affine constraints only contain local information and may not satisfy $\mathsf{RIP}$ directly.

One such important problem where $\mathsf{RIP}$ does not hold directly is the low-rank matrix completion problem. In the matrix completion problem we are given the entries of an unknown low-rank matrix $X^{*}$ for ordered pairs $(i,j)\in\Omega\subseteq[m]\times[n]$ and the goal is to complete the missing entries of $X^{*}$ . A highly popular application of the matrix completion problem is in the field of collaborative filtering, where typically the task is to predict user ratings given past ratings of the users. Recently, a lot of attention has been given to the problem due to the Netflix Challenge [Net]. Other applications of matrix completion include triangulation from incomplete data, link prediction in social networks etc.

Similar to $\mathsf{ARMP}$ , the low-rank matrix completion is also NP-hard in general and most methods are heuristic in nature with no theoretical guarantees. The alternating least squares minimization heuristic and its variants [Kor08, MB07] perform the best in practice but are notoriously hard to analyze.

Recently, Candes and Recht [CR08], Candes and Tao [CT09] and Keshavan et al. [KOM09] obtained the first non-trivial results for low-rank matrix completion under a few additional assumptions. Broadly, these papers give exact-recovery guarantees when the optimal solution $X^{*}$ is $\mu$ -incoherent (see Definition 4.1), and the entries $\Omega$ are chosen uniformly at random with $|\Omega|\geq C(\mu,k)\,n\,poly\log n$ , where $C(\mu,k)$ depends only on $\mu,k$ . However, the algorithms of the above papers, even when using methods tailored specifically for matrix-completion such as those of Cai et al. [CCS08], are quite expensive in practice and not very tolerant to noise.

As low-rank matrix completion is a special case of $\mathsf{ARMP}$ , we can naturally adapt our algorithm $\mathsf{SVP}$ for matrix completion. We demonstrate empirically that for a suitable step-size, $\mathsf{SVP}$ significantly outperforms the methods of [CR08], [CT09], [CCS08], [KOM09] in accuracy, computational time and tolerance to noise. Furthermore, our experiments strongly suggest (see Figure 1) that guarantees similar to those of [CT09], [KOM09] hold for $\mathsf{SVP}$ , achieving exact recovery for incoherent matrices from an almost optimal number of entriesIt follows from a coupon collector argument that exact-recovery from random samples requires $nk\log n$ samples..

Although we do not provide a rigorous proof of exact-recovery for $\mathsf{SVP}$ applied to matrix completion, we make partial progress in this direction and give strong intuition for the performance of $\mathsf{SVP}$ . We prove that though the affine constraints defining the matrix-completion problems do not obey the restricted isometry property, they obey the restricted isometry property over incoherent matrices. This weaker $\mathsf{RIP}$ condition along with a hypothesis bounding the incoherence of the iterates of $\mathsf{SVP}$ imply exact-recovery of a low-rank incoherent matrix from an almost optimal number of entries. We also provide strong empirical evidence supporting our hypothesis bounding the incoherence of the iterates of $\mathsf{SVP}$ (see Figure 2).

We first present our algorithm $\mathsf{SVP}$ in Section 2 and present its analysis for affine constraints satisfying $\mathsf{RIP}$ in Section 3. In Section 4, we specialize our algorithm $\mathsf{SVP}$ to the task of low-rank matrix completion and prove a more restricted isometry property for the matrix completion problem. In Section 6, we give empirical results for $\mathsf{SVP}$ applied to $\mathsf{ARMP}$ and matrix-completion on real-world and synthetic problems.

Singular Value Projection (SVP)

Consider the following robust formulation of $\mathsf{ARMP}$ ( $\mathsf{RARMP}$ ),

The hardness of the above problem mainly comes from the non-convexity of the set of low-rank matrices $\mathcal{C}(k)$ . However, in spite of the hardness of the rank constraint, the Euclidean projection onto the non-convex set $\mathcal{C}(k)$ can be computed efficiently using singular value decomposition. Our algorithm uses this observation along with the projected gradient method for efficiently minimizing the objective function specified in problem (RARMP).

In $\mathsf{SVP}$ , a candidate solution to $\mathsf{ARMP}$ is computed iteratively by starting from the all-zero matrix and adapting the classical projected gradient descent update as follows (Observe that $\nabla\psi(X)=\mathcal{A}^{T}(\mathcal{A}(X)-b)$ ) :

Algorithm 1 presents our $\mathsf{SVP}$ algorithm. Note that the iterates $X^{t}$ are always low-rank, facilitating faster computation of the SVD. See Section 5 for a more detailed discussion of the computational issues.

Analysis for Affine Constraints Satisfying 𝖱𝖨𝖯𝖱𝖨𝖯\mathsf{RIP}

We now show that $\mathsf{SVP}$ solves exact rank minimization for affine constraints that satisfy $\mathsf{RIP}$ and prove our main results, Theorems 1.1 and 1.2. We first present a lemma that bounds the error at the $(t+1)$ -st iteration ( $\psi(X^{t+1})$ ) with respect to the error incurred by the optimal solution ( $\psi(X^{*})$ ) and the $t$ -th iteration.

Let $X^{*}$ be an optimal solution of (RARMP) and let $X^{t}$ be the iterate obtained by SVP algorithm at $t$ -th iteration. Then,

where $\delta_{2k}$ is the rank $2k$ isometry constant of $\mathcal{A}$ .

Recall that $\psi(X)=\frac{1}{2}\|\mathcal{A}(X)-b\|_{2}^{2}$ . Since $\psi(\cdot)$ is a quadratic function, we have

where inequality (3) follows from $\mathsf{RIP}$ applied to the matrix $X^{t+1}-X^{t}$ of rank at most $2k$ . Let $Y^{t+1}=X^{t}-\frac{1}{1+\delta_{2k}}\mathcal{A}^{T}(\mathcal{A}(X^{t})-b)$ and

Now, by definition, $\mathcal{P}_{k}(Y^{t+1})=X^{t+1}$ is the minimizer of $f_{t}(X)$ over all matrices $X\in\mathcal{C}(k)$ (of rank at most $k$ ). In particular, $f_{t}(X^{t+1})\leq f_{t}(X^{*})$ . Thus,

where inequality (4) follows from $\mathsf{RIP}$ applied to $X^{*}-X^{t}$ . ∎

We now prove that $\mathsf{SVP}$ obtains the optimal solution for ARMP with restricted isometry property.

Using Lemma 3.1 and the fact that $\psi(X^{*})=0$ for the noise-less case,

Also, note that for $\delta_{2k}<1/3$ , $\frac{2\delta_{2k}}{(1-\delta_{2k})}<1$ . Hence, $\psi(X^{\tau})\leq\epsilon$ where $\tau=\left\lceil\frac{1}{\log((1-\delta_{2k})/2\delta_{2k})}\log\frac{\psi(X^{0})}{\epsilon}\right\rceil$ . Now, the SVP algorithm is initialized using $X^{0}=0$ , i.e., $\psi(X^{0})=\frac{\|b\|^{2}}{2}$ . Hence, $\tau=\left\lceil\frac{1}{\log((1-\delta_{2k})/2\delta_{2k})}\log\frac{\|b\|^{2}}{2\epsilon}\right\rceil$ . ∎

Next, we prove the noisy version of Theorem 1.1.

Let the current solution $X^{t}$ satisfy $\psi(X^{t})\geq C^{2}\|e\|^{2}/2$ , where $C\geq 0$ is a universal constant. Using Lemma 3.1 and the fact that $b-\mathcal{A}(X^{*})=e$ ,

where $D=\left(\frac{1}{C^{2}}+\frac{2\delta_{2k}}{(1-\delta_{2k})}\left(1+\frac{1}{C}\right)^{2}\right)$ . Recall that $\delta_{2k}<1/3$ . Hence, selecting $C>(1+\delta_{2k})/(1-3\delta_{2k})$ , we get $D<1$ . Also, $\psi(X^{0})=\psi(0)=\|b\|^{2}/2$ . Hence, $\psi(X^{\tau})\leq(C^{2}+\epsilon)\|e\|^{2}/2$ where $\tau=\left\lceil\frac{1}{\log(1/D)}\log\frac{\|b\|^{2}}{(C^{2}+\epsilon)\|e\|^{2}}\right\rceil$ . ∎

Matrix Completion

Observe that the matrix completion problem is a special case of $\mathsf{ARMP}$ . However, the affine constraints that define $\mathsf{MCP}$ , $\mathcal{P}_{\Omega}$ , do not satisfy $\mathsf{RIP}$ in general. Thus Theorems 1.1, 1.2 above and the results of Recht et al. [RFP07] do not directly apply to $\mathsf{MCP}$ . The first non-trivial results for $\mathsf{MCP}$ were obtained recently by Candes and Recht [CR08], Keshavan et al. [KOM09] and Candes and Tao [CT09]. These works show exact recovery of the unknown matrix $X^{*}$ when the observed entries are sampled uniformly and $X^{*}$ is incoherent in the sense defined below.

Intuitively, high incoherence (i.e., $\mu$ is small) implies that the non-zero entries of $X$ are not concentrated in a small number of entries. Hence, a random sampling of the matrix should provide enough information to reconstruct the entire matrix.

As matrix completion is a special case of $\mathsf{ARMP}$ , we can apply $\mathsf{SVP}$ for matrix completion. We apply $\mathsf{SVP}$ to matrix-completion with step-size $\eta_{t}=1/(1+\delta)p$ , where $p$ is the density of sampled entries and $0<\delta<1/3$ is a parameter depending on how large $p$ is, leading to the update

We now provide some intuition for our choice of step-size $\eta_{t}$ and make partial progress towards proving that $\mathsf{SVP}$ achieves exact recovery for low-rank incoherent matrices. We show that though the affine constraints defining $\mathsf{MCP}$ , $\mathcal{P}_{\Omega}$ , do not satisfy $\mathsf{RIP}$ for all low-rank matrices, they satisfy $\mathsf{RIP}$ for all low-rank incoherent matrices. Thus, if the iterates appearing in $\mathsf{SVP}$ remain incoherent throughout the execution of the algorithm, then Theorem 1.1 would imply recovery of the unknown entries of the matrix. Empirical evidence strongly supports our hypothesis that the incoherence of the iterates arising in $\mathsf{SVP}$ remains bounded.

Figure 1 plots the threshold sampling density $p$ beyond which matrix completion for randomly generated matrices is solved exactly by $\mathsf{SVP}$ for fixed $k$ and varying matrix sizes $n$ . Note that the density threshold matches the optimal bound of $O(k\log n/n)$ with the constant being $C=1.28$ . Figure 2 plots the maximum incoherence $\max_{t}\mu(X^{t})=\sqrt{n}\,\max_{t,i,j}|U^{t}_{ij}|$ , where $U^{t}$ are the left singular vectors of the intermediate iterates $X^{t}$ computed by $\mathsf{SVP}$ . The figure clearly shows that the incoherence $\mu(X^{t})$ of the iterates is bounded by a constant independent of the matrix size $n$ and density $p$ throughout the execution of $\mathsf{SVP}$ .

Combining the above Chernoff bound estimate with a union bound over low-rank incoherent matrices, we obtain the following restricted isometry property for the projection operator $\mathcal{P}_{\Omega}$ restricted to low-rank incoherent matrices. See Section 4.1 for a detailed proof.

There exists a constant $C\geq 0$ such that the following holds for all $0<\delta<1$ , $\mu\geq 1$ , $n\geq m\geq 3$ : For $\Omega\subseteq[m]\times[n]$ chosen according to the Bernoulli model with density $p\geq C\mu^{2}k^{2}\log n/\delta^{2}m$ , with probability at least $1-\exp(-n\log n)$ , the restricted isometry property in (6) holds for all $\mu$ -incoherent matrices $X$ of rank at most $k$ .

Motivated by the above theorem and supported by empirical evidence (Figures 1, 2) we hypothesize that $\mathsf{SVP}$ achieves exact recovery from an almost optimal number of samples.

Fix $\mu,k$ and $\delta\leq 1/3$ . Then, there exists a constant $C$ such that for a $\mu$ -incoherent matrix $X^{*}$ of rank at most $k$ and $\Omega$ sampled from the Bernoulli model with density $p\geq C\mu^{2}k^{2}\log n/\delta^{2}m$ , $\mathsf{SVP}$ with step-size $\eta_{t}=1/(1+\delta)p$ converges to $X^{*}$ with high probability. Moreover, $\mathsf{SVP}$ outputs a matrix $X$ of rank at most $k$ such that $\|\mathcal{P}_{\Omega}(X)-\mathcal{P}_{\Omega}(X^{*})\|_{F}^{2}\leq\epsilon$ after $O_{\mu,k}\left(\left\lceil\log\left(\frac{1}{\epsilon}\right)\right\rceil\right)$ iterations.

We now prove the $\mathsf{RIP}$ property of Theorem 4.2 for the projection operator $\mathcal{P}_{\Omega}$ . To prove Theorem 4.2 we first show the theorem for a discrete collection of matrices using Chernoff type large-deviation bounds and use standard quantization arguments to generalize to the continuous case. We first introduce some notation.

We need Bernstein’s inequality [Wik09] stated below.

Let $X_{1},X_{2},\dots,X_{n}$ be independent random variables with $E[X_{i}]=0,\forall i$ . Furthermore, let $|X_{i}|\leq M$ . Then,

For $(i,j)\in[m]\times[n]$ , let $\omega_{ij}$ be the indicator variables with $\omega_{ij}=1$ if $(i,j)\in\Omega$ and otherwise. Then, $\omega_{ij}$ are independent random variables with $\operatorname*{\mathsf{Pr}}[\omega_{ij}=1]=p$ . Let random variable $Z_{ij}=\omega_{ij}X_{ij}^{2}$ . Note that,

Observe that $|Z_{ij}-E[Z_{ij}]|\leq|X_{ij}|^{2}\leq(\alpha^{2}/mn)\cdot\|X\|_{F}^{2}$ . Thus,

Now, define random variable $S=\sum_{i,j}Z_{ij}=\sum_{i,j}\omega_{ij}X_{ij}^{2}=\|\mathcal{P}_{\Omega}(X)\|_{F}^{2}$ . Note that, $E[S]=p\|X\|_{F}^{2}$ . Since, $Z_{ij}$ are independent random variables,

Using Bernstein’s inequality (Lemma 4.5) for $S$ with $t=\delta p\|X\|_{F}^{2}$ and Equations (7) and (8) we get,

We now discretize the space of low-rank incoherent matrices so as to be able to use the above lemma with a union bound. We need the following simple lemmas.

Let $X=U\Sigma V^{T}$ be the singular value decomposition of $X$ . Then, $X_{ij}=U_{i}\Sigma V_{j}^{T}$ , where $U_{i},V_{j}$ are the $i$ ’th and $j$ ’th rows of $U,V$ respectively. Now,

The following lemma shows that the space of low-rank $\mu$ -incoherent matrices can be discretized into a reasonably small set of regular matrices such that every low-rank $\mu$ -incoherent matrix is close to a matrix from the set.

We will show that $S(\mu,\epsilon)$ satisfies the conditions of the Lemma. Observe that $|D(\rho)|<2/\rho$ . Thus,

Hence, $|S(\mu,\epsilon)|<(2/\rho)^{mk+nk+k}<(mnk/\epsilon)^{3(m+n)k}$ .

Now, since $|U_{il}|\leq\sqrt{\mu}/\sqrt{m}$ , it follows that $U_{1}\in U(\rho)$ . Further, for all $i\in[m],l\in[k]$ ,

Similarly, define $V_{1},\Sigma_{1}$ by rounding entries of $V,\Sigma$ to integer multiples of $\sqrt{\mu}\,\rho/\sqrt{n}$ and $\rho$ respectively. Then, $V_{1}\in V(\rho)$ , $\Sigma_{1}\in\Sigma(\rho)$ and for $(j,l)\in[n]\times[k]$ ,

Let $X(\rho)=U_{1}\Sigma_{1}V_{1}^{T}$ . Then, by the above equations and Lemma 4.8, for $i\in[m],l\in[k],j\in[n]$ ,

Furthermore, using triangular inequality, $\|X(\rho)\|_{F}>\|X\|_{F}-\epsilon>\|X\|_{F}/2$ . Since, $\epsilon<1$ and $\mu\sqrt{k}\|X\|_{F}\geq 1$ ,

Thus, $X(\rho)$ is $4\mu\sqrt{k}$ -regular. The lemma now follows by taking $Y=X(\rho)$ . ∎

We now prove Theorem 4.2 by combining Lemmas 4.6 and 4.9.

Let $m\leq n$ , $\epsilon=\delta/9mnk$ and

where $S(\mu,\epsilon)$ is as in Lemma 4.9. Then, by Lemma 4.2 and union bound,

where $C_{1}\geq 0$ is a constant independent of $m,n,k$ .

Thus, if $p>C\mu^{2}k^{2}\log n/\delta^{2}m$ , where $C=16(C_{1}+1)$ , with probability at least $1-\exp(-n\log n)$ , the following holds

As the statement of the theorem is invariant under scaling, it is enough to show the statement for all $\mu$ -incoherent matrices $X$ of rank at most $k$ and $\|X\|_{2}=1$ . Fix such a $X$ and suppose that (10) holds. Now, by Lemma 4.9 there exists $Y\in S^{\prime}(\mu,\epsilon)$ such that $\|Y-X\|_{F}\leq\epsilon$ . Moreover,

Further, starting with $\|\mathcal{P}_{\Omega}(Y-X)\|_{F}\leq\|Y-X\|_{F}\leq\epsilon$ and arguing as above we get that

Combining inequalities (11), (12) above, we have

Computational Issues and Related Work

The affine rank minimization problem is a natural generalization to matrices of the following compressed sensing problem for vectors:

However, a number of methods have been proposed recently to solve the problem for restricted families of sensing matrices. Most of the methods with provable theoretical guarantees assume that the sensing matrix $A$ satisfies restricted isometry properties similar to those in (1). Broadly speaking, existing compressed sensing approaches can be divided into three categories:

$l_{1}$ relaxation: These methods relax the non-convex $l_{0}$ objective function to the convex $l_{1}$ objective function [CT05, CR07, Fuc05, DET06]. At a high level these results show that if the sensing matrix $A$ obeys $\mathsf{RIP}$ or other $\mathsf{RIP}$ like properties, then $l_{1}$ relaxation recovers the optimal sparse solution from an almost optimal $O(k\log n)$ measurements.

Basis pursuit: These methods greedily search for the subset of columns of $A$ that would span the optimal solution. Specifically, in each iteration, columns of the sensing matrix that have the highest correlation with the current residual measurement vector are greedily added to the basis. Assuming $\mathsf{RIP}$ , basis pursuit methods also guarantee recovery of the optimal solution from a near optimal number of measurements [TN08, NTV08].

Iterative Hard Thresholding (IHT): IHT based methods try to minimize $l_{0}$ norm directly by hard thresholding [BD09, GK09] the current candidate solution to a small support vector. Here again, exact-recovery guarantees are known assuming $\mathsf{RIP}$ . Recently, Garg and Khandekar [GK09] demonstrated that their GradeS method outperforms most of the existing compressed sensing algorithms empirically.

As $\mathsf{ARMP}$ is a generalization of problem (13), it is natural to ask if the above compressed sensing algorithms can be generalized to solve $\mathsf{ARMP}$ . Interestingly, the answer is yes. Trace-norm relaxation approaches [RFP07] can be seen as a direct generalization of the $l_{1}$ relaxation approach. Similarly, the ADMiRA algorithm of Lee and Bresler [LB09a] generalizes the CoSAMP algorithm of Tropp and Needell [TN08]. Finally, our approach is a generalization of the IHT approach. Table 1 summarizes these three approaches and compares them in terms of a few desirable characteristics an algorithm for $\mathsf{ARMP}$ should have.

Minimizing the trace-norm of a matrix subject to affine constraints can be cast as a semi-definite programming problem. However, algorithms for semi-definite programming, as used by most methods for minimizing trace-norm, are prohibitively expensive even for moderately large datasets. Recently, a variety of methods mostly based on iterative soft-thresholding have been proposed to solve the trace-norm minimization problem efficiently. For instance, Cai et al. [CCS08] proposed a Singular Value Thresholding (SVT) algorithm which is based on Uzawa’s algorithm[AHU58]. A related approach based on linearized Bregman iteration was proposed by Ma et al. [MGC09]. Toh and Yun [TY09], while Ji and Ye [JY09] proposed Nesterov’s projected gradient based methods for optimizing the trace-norm.

While the soft-thresholding based methods for trace-norm minimization are significantly faster than semi-definite programming approaches they suffer from an important bottleneck: though the final solution to the trace-norm minimization is a low-rank matrix, the rank of the iterate in intermediate iterations can be large. In contrast, the rank of the iterates in our method is always equal to the rank of the optimal solution.

Also, though minimizing the trace-norm approximates the low-rank solution even in the presence of noise (see [CP09], [LB09b] for instance), noise poses considerable computational challenges for trace-norm optimization. Cai et al. propose a variant of SVT for handling noise that performs moderately well for uniformly bounded noise. However, the performance of SVT worsens considerably in the presence of outlier noise. $\mathsf{SVP}$ on the other hand is robust to both outlier and uniformly bounded noise as it minimizes the cumulative loss function $\|\mathcal{A}(X)-b\|_{2}^{2}$ .

For the case of low-rank matrix completion, Candes and Recht [CR08] obtained the first non-trivial results for the problem obtaining guaranteed completion for incoherent matrices $X^{*}$ and randomly sampled entries $\Omega$ . Candes and Recht show that for $X^{*}$ $\mu$ -incoherent and $\Omega$ chosen at random with $|\Omega|\geq C(\mu)\,k^{2}n^{1.2}$ , trace-norm relaxation recovers the optimal solution. Building on the work of Candes and Recht, Candes and Tao [CT09] obtained the near-optimal bound of $|\Omega|\geq\min(C\mu^{4}k^{2}n\log^{2}n,C\mu^{2}kn\log^{6}n)$ for exact-recovery via trace-norm minimization. However, the analysis of Candes and Recht, Candes and Tao is considerably complicated and minimizing trace-norm, even when using methods tailored for matrix-completion such as those of Cai et al. is relatively expensive in practice.

For the case of matrix completion, SVT has the important property that the intermediate iterations of the algorithm only require computing the singular value decomposition of a sparse matrix. This facilitates the use of fast SVD computing package such as PROPACK [Lar] that only require subroutines that compute matrix-vector products.

Our $\mathsf{SVP}$ algorithm has a similar property facilitating fast computation of the update in equation (5); each iteration of $\mathsf{SVP}$ involves computing the SVD of the matrix $Y=X^{t}+\mathcal{P}_{\Omega}(X^{t}-X^{*})$ , where $X^{t}$ is a matrix of rank at most $k$ whose SVD we know and $\mathcal{P}_{\Omega}(X^{t}-X^{*})$ is a sparse matrix. Thus, we can compute matrix-vector products of the form $Yx$ in time $O((m+n)k+|\Omega|)$ .

In a different line of work, Keshavan et al. [KOM09] obtained exact-recovery from uniformly sampled $\Omega$ with $|\Omega|\geq C(\mu,k)\,n\log n$ using different techniques. The first iteration of $\mathsf{SVP}$ is similar to the first step of Keshavan et al. However, after the first iteration, Keshavan et al. use a sophisticated alternating minimization algorithm based on gradient descent on the Grassmannian manifold of low-rank matrices. However, convergence of their alternating minimization algorithm is slow. The simplicity of the updates in $\mathsf{SVP}$ makes it both easier to implement and significantly less computationally intensive than the alternating minimization algorithm of Keshavan et al.

A related problem to the matrix completion problem is the problem of low-rank plus sparse decomposition of a matrix addressed by Chandrasekaran et al. [CSPW09] and Wright et al. [WGRM09]. Interestingly, Wright et al. [WGRM09] show that the low-rank matrix completion problem can be reduced to the low-rank plus sparse decomposition problem. Here again, their method relies on the trace-norm relaxation and is significantly more computationally intensive than our algorithm.

A drawback of our SVP method it requires rank $k$ of the optimal solution to be known beforehand. For ARMP, we propose using the following heuristic: run SVP with some initial guess $k$ and increment it by a fixed number (e.g, 10) until error $\|{\mathcal{A}}X-b\|^{2}\|$ incurred by SVP doesn’t change.

For the matrix completion problem, in the first step of our SVP method, we compute singular values incrementally till we find a significant gap between singular values. Our heuristic is justified because: Keshavan et al. [KOM09] show that the top $k$ ( $k$ being rank of optimal solution) singular values of the sampled matrix approximate the underlying matrix well, i.e., there should be a gap between $k$ -th and $k+1$ -th singular value.

Experimental Results

In this section, we empirically evaluate our $\mathsf{SVP}$ method for the affine rank minimization and low-rank matrix completion problems. For both problems we present empirical results on synthetic as well as real-world datasets. For $\mathsf{ARMP}$ we compare our method against the trace-norm based singular value thresholding (SVT) method [CCS08]. Note that although Cai et al. present the SVT algorithm in the context of matrix completion problem, it can be easily adapted for $\mathsf{ARMP}$ . For matrix completion we compare against SVT, ADMiRA [LB09a], the spectral matrix completion (SMC) method of Keshavan et al. [KOM09], and regularized alternating least squares minimization (ALS). We use our own implementation of ALS and SVT for ARMP, while for matrix completion we use the code provided by the respective authors for SVT, ADMiRA and SMC. We report results averaged over $20$ runs. All the methods are implemented in Matlab and use mex files.

Next we evaluate our method for the problem of matrix reconstruction from random measurements. As in Recht et al. [RFP07], we use the MIT logo as the test image for reconstruction. the MIT logo we use is a $38\times 73$ image and has rank four. For reconstruction, we generate random measurement matrices $A_{i}$ and measure $b_{i}=Tr(A_{i}X)$ . Figure 3 (b) shows that our method incurs significantly smaller reconstruction error than SVT with lower number of iterations.

2 Matrix Completion

Finally, we study the behavior of our method in presence of noise. For this experiment, we generate random matrices of different size and add approximately $5\%$ Gaussian noise. Figure 6 plots error incurred and time required by various methods as $n$ increases from $1000$ to $5000$ . Note that SVT is particularly sensitive to noise and incurs high RMSE.

Matrix Completion: Movie-Lens Dataset Finally, we evaluate our method on the Movie-Lens dataset [Mov], which contains 1 million ratings for $3900$ movies by $6040$ users. For $\mathsf{SVP}$ and ALS, we fix the rank of the matrix to be $k=15$ . For $\mathsf{SVP}$ , we set the step size $\eta_{t}$ to be $5/\sqrt{t}$ . $\mathsf{SVP}$ incurs RMSE of $1.01$ in $64.85$ seconds, while SVT incurs RMSE of $1.21$ in $1214.78$ seconds. In contrast, ALS achieves RMSE of $0.90$ in $195.34$ seconds. We attribute the relatively poor performance of $\mathsf{SVP}$ and SVT as compared with ALS to the fact that the ratings matrix is not sampled uniformly, thus violating a crucial assumption of both our method and SVT. Similar to Figure 6 (b), SVT converges much slower than SVP on the Movie-Lens data.

Conclusion and Future Work

There has been a significant amount of work recently in the area of low-rank approximations. Examples include minimizing rank subject to affine constraints, low-rank matrix completion, low-rank plus sparse decomposition. Most of this research, with the exception of Keshavan et al. [KOM09], relies on relaxing the rank constraint with trace-norm and gives guarantees for recovering the optimal solution under certain additional assumptions. However, trace-norm relaxation based methods are typically hard to analyze and are relatively expensive in practice.

In this paper, we proposed a simple and natural algorithm based on iterative hard-thresholding. We give a simple analysis of our algorithm for the affine rank minimization problem satisfying the restricted isometry property and give geometric convergence guarantees even in the presence of noise. The intermediate steps in our algorithm are less computationally demanding than those of current state-of-the-art methods. We empirically demonstrate that our method is significantly faster and more robust to both uniformly bounded and outlier noise than most existing methods.

An immediate question arising out of our work is to prove our hypothesis bounding the incoherence of the iterates of $\mathsf{SVP}$ for low-rank matrix completion, or otherwise directly prove Conjecture 4.3. Other directions include application of our methods to other problems of similar flavor such as the low-rank plus sparse matrix decomposition [CSPW09], or other matrix completion type problems like minimum dimensionality embedding using partial distance observations [FHB03] and low-rank kernel learning [MJCD08].

Acknowledgments

This research was supported by NSF grant CCF-0431257, NSF grant CCF-0916309 and NSF grant CCF-0728879. We thank the reviewers of NIPS 2009 for their useful comments and for pointing out a mistake in an earlier proof of Theorem 4.2.