Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization

Benjamin Recht, Maryam Fazel, Pablo A. Parrilo

Introduction

Notions such as order, complexity, or dimensionality can often be expressed by means of the rank of an appropriate matrix. For example, a low-rank matrix could correspond to a low-degree statistical model for a random process (e.g., factor analysis), a low-order realization of a linear system , a low-order controller for a plant , or a low-dimensional embedding of data in Euclidean space . If the set of feasible models or designs is affine in the matrix variable, choosing the simplest model can be cast as an affine rank minimization problem,

A recent heuristic introduced in minimizes the nuclear norm, or the sum of the singular values of the matrix, over the affine subset. The nuclear norm is a convex function, can be optimized efficiently, and is the best convex approximation of the rank function over the unit ball of matrices with norm less than one. When the matrix variable is symmetric and positive semidefinite, this heuristic is equivalent to the trace heuristic often used by the control community (see, e.g., ). The nuclear norm heuristic has been observed to produce very low-rank solutions in practice, but a theoretical characterization of when it produces the minimum rank solution has not been previously available. This paper provides the first such mathematical characterization.

As in the case of compressed sensing, the conditions we derive to guarantee properties about the nuclear norm heuristic are deterministic, but they are at least as difficult to check as solving the rank minimization problem itself. We are only able to guarantee that the nuclear norm heuristic recovers the minimum rank solution of A(X)=b\mathcal{A}(X)=b when A\mathcal{A} is sampled from specific ensembles of random maps. The constraints appearing in many of the applications mentioned above, such as low-order control system design, are typically not random at all and have structured demands according to the specifics of the design problem. It thus behooves us to present several examples where random constraints manifest themselves in practical scenarios for which no practical solution procedure is known.

Rank minimization forms the basis of many model reduction and low-order system identification problems for linear time-invariant (LTI) systems. The following example illustrates how random constraints might arise in this context. Consider the problem of finding the minimum order discrete-time LTI system that is consistent with a set of time-domain observations. In particular, suppose our observations are the system output sampled at a fixed time NN, after a random Gaussian input signal is applied from t=0t=0 to t=Nt=N. Suppose we make such measurements for pp different input signals, that is, we observe yi(N)=t=0Nai(Nt)h(t)y_{i}(N)=\sum_{t=0}^{N}a_{i}(N-t)h(t) for i=1,,pi=1,\ldots,p, where aia_{i}, the iith input signal, is a zero-mean Gaussian random variable with the same variance for t=0,Nt=0,\ldots N, and h(t)h(t) denotes the impulse response. We can write this compactly as y=Ahy=Ah, where h=[h(0),,h(N)]h=[h(0),\ldots,h(N)]^{\prime}, and Aij=ai(Nj)A_{ij}=a_{i}(N-j).

From linear system theory, the order of the minimal realization for such a system is given by the rank of the following Hankel matrix (see, e.g., )

Therefore the problem can be expressed as

where the optimization variables are h(0),,h(2N)h(0),\ldots,h(2N), and the matrix AA consists of i.i.d. zero-mean Gaussian entries.

Low-Rank Matrix Completion

In the matrix completion problem where we are given random subset of entries of a matrix, we would like to fill in the missing entries such that the resulting matrix has the lowest possible rank. This problem arises in machine learning scenarios where we are given partially observed examples of a process with a low-rank covariance matrix and would like to estimate the missing data. A typical situation where the hidden matrix is low-rank is when the columns are i.i.d. samples of a random process with low-rank covariance. Such models are ubiquitous in Factor Analysis, Collaborative Filtering, and Latent Semantic Indexing . In many of these settings, some prior probability distribution (such as a Bernoulli model or uniform distribution on subsets) is assumed to generate the set of available entries.

Suppose we are presented with a set of triples (I(i),J(i),S(i))(I(i),J(i),S(i)) for i=1,,pi=1,\ldots,p and wish to find the matrix with S(i)S(i) in the entry corresponding to row I(i)I(i) and column J(i)J(i) for all ii. The matrix completion problem seeks to solve

which is a special case of the affine rank minimization problem.

Low-dimensional Euclidean embedding problems

A problem that arises in a variety of fields is the determination of configurations of points in low-dimensional Euclidean spaces, subject to some given distance information. In Multi-Dimensional Scaling (MDS), such problems occur in extracting the underlying geometric structure of distance data. In psychometrics, the information about inter-point distances is usually gathered through a set of experiments where subjects are asked to make quantitative (in metric MDS) or qualitative (in non-metric MDS) comparisons of objects. In computational chemistry, they come up in inferring the three-dimensional structure of a molecule (molecular conformation) from information about interatomic distances .

This problem involves a Linear Matrix Inequality (LMI) and appears to be more general than the equality constrained rank minimization problem. However, general LMIs can equivalently be expressed as rank constraints on an appropriately defined block matrix. The rank of a block symmetric matrix is equal to the rank of a diagonal block plus the rank of its Schur complement (see, e.g., [33, §2.2]). Given a function ff that maps matrices into q×qq\times q symmetric matrices, that f(X)f(X) is positive semidefinite can be equivalently expressed through a rank constraint as

That is, if there exists a matrix BB satisfying the inequality above, then f(X)=BB0f(X)=B^{\prime}B\succeq 0. Using this equivalent representation allows us to rewrite (1.1) as an affine rank minimization problem.

Image Compression

A simple and well-known method to compress two-dimensional images can be obtained by using the singular value decomposition (e.g., ). The basic idea is to associate to the given grayscale image a rectangular matrix MM, with the entries MijM_{ij} corresponding to the gray level of the (i,j)(i,j) pixel. The best rank-kk approximation of MM is given by

where ||\cdot|| is any unitarily invariant norm. By the classical Eckart-Young-Mirsky theorem (), the optimal approximant is given by a truncated singular value decomposition of MM, i.e., if M=UΣVTM=U\Sigma V^{T}, then X=UΣkVTX^{*}=U\Sigma_{k}V^{T}, where the first kk diagonal entries of Σk\Sigma_{k} are the largest kk singular values, and the rest of the entries are zero. If for a given rank kk, the approximation error MX||M-X^{*}|| is small enough, then the amount of data needed to encode the information about the image is k(m+nk)k(m+n-k) real numbers, which can be much smaller than the mnmn required to transmit the values of all the entries.

Consider a given image, whose associated matrix MM has low-rank, or can be well-approximated by a low-rank matrix. As proposed by Wakin et al. , a single-pixel camera would ideally produce measurements that are random linear combinations of all the pixels of the given image. Under this situation, the image reconstruction problem boils down exactly to affine rank minimization, where the constraints are given by the random linear functionals.

It should be remarked that the simple SVD image compression scheme described has certain deficiencies that more sophisticated techniques do not share (in particular, the lack of invariance of the description length under rotations). Nevertheless, due to its simplicity and relatively good practical performance, this method is particularly popular in introductory treatments and numerical linear algebra textbooks.

From Compressed Sensing to Rank Minimization

As discussed above, when the matrix variable is constrained to be diagonal, the affine rank minimization problem (1.1) reduces to the cardinality minimization problem of finding the element in the affine space with the fewest number of nonzero components. In this section we will establish a dictionary between the concepts of rank and cardinality minimization. The main elements of this correspondence are outlined in Table 1. With these elements in place, the existing proofs of sparsity recovery provide a template for the more general case of low-rank recovery.

In establishing our dictionary, we will provide a review of useful facts regarding matrix norms and their characterization as convex optimization problems. We will show how computing both the operator norm and the nuclear norm of a matrix can be cast as semidefinite programming problems. We also establish the suitable optimality conditions for the minimization of the nuclear norm under affine equality constraints, the main convex optimization problem studied in this article. Our discussion of matrix norms will mostly follow the discussion in where extensive lists of references are provided.

The nuclear norm of a matrix is equal to the sum of its singular values, i.e.,

Dual norms

For any given norm \|\cdot\| in an inner product space, there exists a dual norm d\|\cdot\|_{d} defined as

Furthermore, the norm dual to the norm d||\cdot||_{d} is again the original norm ||\cdot||.

is equal to XF\|X\|_{F}, with the maximizing YY being equal to X/XFX/\|X\|_{F}. Similarly, as shown below, the dual norm of the operator norm is the nuclear norm. The proof of this fact will also allow us to present variational characterizations of each of these norms as semidefinite programs.

Proof First consider an m×nm\times n matrix ZZ. The fact that ZZ has operator norm less than tt can be expressed as a linear matrix inequality:

where the last implication follows from a Schur complement argument. As a consequence, we can give a semidefinite optimization characterization of the operator norm, namely

Now let X=UΣVX=U\Sigma V^{\prime} be a singular value decomposition of an m×nm\times n matrix XX, where UU is an m×rm\times r matrix, VV is an n×rn\times r matrix, Σ\Sigma is a r×rr\times r diagonal matrix and rr is the rank of XX. Let Y:=UVY:=U\,V^{\prime}. Then Y=1\|Y\|=1 and Tr(XY)=i=1rσi(X)=X\operatorname{Tr}(XY^{\prime})=\sum_{i=1}^{r}\sigma_{i}(X)=||X||_{*}, and hence the dual norm is greater than or equal to the nuclear norm.

To provide an upper bound on the dual norm, we appeal to semidefinite programming duality. From the characterization in (2.3), the optimization problem

is equivalent to the semidefinite program

The dual of this SDP (after an inconsequential rescaling) is given by

Set W1:=UΣUW_{1}:=U\Sigma U^{\prime} and W2:=VΣVW_{2}:=V\Sigma V^{\prime}. Then the triple (W1,W2,X)(W_{1},W_{2},X) is feasible for (2.6) since

Furthermore, we have Tr(W1)=Tr(W2)=Tr(Σ)\operatorname{Tr}(W_{1})=\operatorname{Tr}(W_{2})=\operatorname{Tr}(\Sigma), and thus the objective function satisfies (Tr(W1)+Tr(W2))/2=TrΣ=X(\operatorname{Tr}(W_{1})+\operatorname{Tr}(W_{2}))/2=\operatorname{Tr}\Sigma=\|X\|_{*}. Since any feasible solution of (2.6) provides an upper bound for (2.5), we have that the dual norm is less than or equal to the nuclear norm of XX, thus proving the proposition.

Notice that the argument given in the proof above further shows that the nuclear norm X||X||_{*} can be computed using either the SDP (2.5) or its dual (2.6), since there is no duality gap between them. Alternatively, this could have also been proven using a Slater-type interior point condition since both (2.5) and (2.6) admit strictly feasible solutions.

Convex envelopes of rank and cardinality functions

By the chain of inequalities in (2.1), we have that rank(X)X/X\operatorname{rank}(X)\geq\|X\|_{*}/\|X\| for all XX. For all matrices with X1\|X\|\leq 1, we must have that rank(X)X\operatorname{rank}(X)\geq\|X\|_{*}, so the nuclear norm is a convex lower bound of the rank function on the unit ball in the operator norm. In fact, it can be shown that this is the tightest convex lower bound.

The proof is given in and uses a basic result from convex analysis that establishes that (under some technical conditions) the biconjugate of a function is its convex envelope .

providing an upper and lower bound on the optimal rank when the norm of the optimal solution is known. Furthermore, this is the tightest lower bound among all convex lower bounds of the rank function on the set C\mathcal{C}.

Additivity of rank and nuclear norm

Let AA and BB be matrices of the same dimensions. If AB=0AB^{\prime}=0 and AB=0A^{\prime}B=0 then A+B=A+B\|A+B\|_{*}=\|A\|_{*}+\|B\|_{*}.

Proof Partition the singular value decompositions of AA and BB to reflect the zero and non-zero singular vectors

The condition AB=0AB^{\prime}=0 implies that VA1VB1=0V_{A1}^{\prime}V_{B1}=0, and similarly, AB=0A^{\prime}B=0 implies that UA1UB1=0U_{A1}^{\prime}U_{B1}=0. Hence, there exist matrices UCU_{C} and VCV_{C} such that [UA1UB1UC][U_{A1}\,U_{B1}\,U_{C}] and [VA1VB1VC][V_{A1}\,V_{B1}\,V_{C}] are orthogonal matrices. Thus, the following are valid singular value decompositions for AA and BB:

This shows that the singular values of A+BA+B are equal to the union (with repetition) of the singular values of AA and BB. Hence, A+B=A+B\|A+B\|_{*}=\|A\|_{*}+\|B\|_{*} as desired.

Let AA and BB be matrices of the same dimensions. If the row and column spaces of AA and BB are orthogonal, then A+B=A+B\|A+B\|_{*}=\|A\|_{*}+\|B\|_{*}.

Proof It suffices to show that if the row and column spaces of AA and BB are orthogonal, then AB=0AB^{\prime}=0 and AB=0A^{\prime}B=0. But this is immediate: if the columns of AA are orthogonal to the columns of BB, we have AB=0A^{\prime}B=0. Similarly, orthogonal row spaces imply that AB=0AB^{\prime}=0 as well.

Nuclear norm minimization

Using the SDP characterizations of the nuclear and operator norms given in (2.5)-(2.6) above allows us to rewrite (2.7) as the primal-dual pair of semidefinite programs

Optimality conditions

The similarity between (2.9) and (2.10) is particularly transparent if we recall the polar decomposition of a matrix into a product of orthogonal and positive semidefinite matrices (see, e.g., ). The “angular” component of the matrix XX is exactly given by UVUV^{\prime}. Thus, these subgradients always have the form of an “angle” (or sign), plus possibly a contraction in an orthogonal direction if the norm is not differentiable at the current point.

The first condition in (2.11) requires feasibility of the linear equations, and the second one guarantees that there is no feasible direction of improvement. Indeed, since A(z)\mathcal{A}^{*}(z) is in the subdifferential at XX, for any YY in the primal feasible set of (2.7) we have

Restricted Isometry and Recovery of Low-Rank Matrices

In this section, we will characterize specific cases when we can a priori guarantee that X=X0X^{*}=X_{0}. The key conditions will be determined by the values of a sequence of parameters δr\delta_{r} that quantify the behavior of the linear map A\mathcal{A} when restricted to the subvariety of matrices of rank rr. The following definition is the natural generalization of the Restricted Isometry Property from vectors to matrices.

holds for all matrices XX of rank at most rr.

Note that by definition, δr(A)δr(A)\delta_{r}(\mathcal{A})\leq\delta_{r^{\prime}}(\mathcal{A}) for rrr\leq r^{\prime}.

The Restricted Isometry Property for sparse vectors was developed by Candès and Tao in , and requires (3.2) to hold with the Euclidean norm replacing the Frobenius norm and rank being replaced by cardinality. Since for diagonal matrices, the Frobenius norm is equal to the Euclidean norm of the diagonal, this definition reduces to the original Restricted Isometry Property of in the diagonal case.In , the authors define the restricted isometry properties with squared norms. We note here that the analysis is identical modulo some algebraic rescaling of constants. We choose to drop the squares as it greatly simplifies the analysis in Section 4.

Unlike the case of “standard” compressed sensing, our RIP condition for low-rank matrices cannot be interpreted as guaranteeing all sub-matrices of the linear transform A\mathcal{A} of a certain size are well conditioned. Indeed, the set of matrices XX for which (3.2) must hold is not a finite union of subspaces, but rather a certain “generalized Stiefel manifold,” which is also an algebraic variety (in fact, it is the rrth-secant variety of the variety of rank-one matrices). Surprisingly, we are still able to derive analogous recovery results for low-rank solutions of equations when A\mathcal{A} obeys this RIP condition. Furthermore, we will see in Section 4 that many ensembles of random matrices have the Restricted Isometry Property with δr\delta_{r} quite small with high probability for reasonable values of mm,nn, and pp.

The following two recovery theorems will characterize the power of the restricted isometry constants. Both theorems are more or less immediate generalizations from the sparse case to the low-rank case and use only minimal properties of the rank of matrices and the nuclear norm. The first theorem generalizes Lemma 1.3 in to low-rank recovery.

Suppose that δ2r<1\delta_{2r}<1 for some integer r1r\geq 1. Then X0X_{0} is the only matrix of rank at most rr satisfying A(X)=b\mathcal{A}(X)=b.

Proof Assume, on the contrary, that there exists a rank rr matrix XX satisfying A(X)=b\mathcal{A}(X)=b and XX0X\neq X_{0}. Then Z:=X0XZ:=X_{0}-X is a nonzero matrix of rank at most 2r2r, and A(Z)=0\mathcal{A}(Z)=0. But then we would have 0=A(Z)(1δ2r)ZF>00=\|\mathcal{A}(Z)\|\geq(1-\delta_{2r})\|Z\|_{F}>0 which is a contradiction.

The proof of the preceding theorem is identical to the argument given by Candès and Tao and is an immediate consequence of our definition of the constant δr\delta_{r}. No adjustment is necessary in the transition from sparse vectors to low-rank matrices. The key property used is the sub-additivity of the rank.

Suppose that r1r\geq 1 is such that δ5r<1/10\delta_{5r}<1/10. Then X=X0X^{*}=X_{0}.

We will need the following technical lemma that shows for any two matrices AA and BB, we can decompose BB as the sum of two matrices B1B_{1} and B2B_{2} such that rank(B1)\operatorname{rank}(B_{1}) is not too large and B2B_{2} satisfies the conditions of Lemma 2.3. This will be the key decomposition for proving Theorem 3.3.

Let AA and BB be matrices of the same dimensions. Then there exist matrices B1B_{1} and B2B_{2} such that

rank(B1)2rank(A)\operatorname{rank}(B_{1})\leq 2\operatorname{rank}(A)

AB2=0AB_{2}^{\prime}=0 and AB2=0A^{\prime}B_{2}=0

Proof Consider a full singular value decomposition of AA

and let B^:=UBV\hat{B}:=U^{\prime}BV. Partition B^\hat{B} as

it can be easily verified that B1B_{1} and B2B_{2} satisfy the conditions (1)–(4).

We now proceed to a proof of Theorem 3.3.

Proof [of Theorem 3.3] By optimality of XX^{*}, we have X0X\|X_{0}\|_{*}\geq\|X^{*}\|_{*}. Let R:=XX0R:=X^{*}-X_{0}. Applying Lemma 3.4 to the matrices X0X_{0} and RR, there exist matrices R0R_{0} and RcR_{c} such that R=R0+RcR=R_{0}+R_{c}, rank(R0)2rank(X0)\operatorname{rank}(R_{0})\leq 2\operatorname{rank}(X_{0}), and X0Rc=0X_{0}R_{c}^{\prime}=0 and X0Rc=0X_{0}^{\prime}R_{c}=0. Then,

where the middle assertion follows from the triangle inequality and the last one from Lemma 2.3. Rearranging terms, we can conclude that

Next we partition RcR_{c} into a sum of matrices R1,R2,R_{1},R_{2},\ldots, each of rank at most 3r3r. Let Rc=Udiag(σ)VR_{c}=U\operatorname{diag}(\sigma)V^{\prime} be the singular value decomposition of RcR_{c}. For each i1i\geq 1 define the index set Ii={3r(i1)+1,,3ri}I_{i}=\{3r(i-1)+1,\ldots,3ri\}, and let Ri:=UIidiag(σIi)VIiR_{i}:=U_{I_{i}}\operatorname{diag}(\sigma_{I_{i}})V_{I_{i}}^{\prime} (notice that Ri,Rj=0\langle R_{i},R_{j}\rangle=0 if iji\not=j). By construction, we have

which implies Ri+1F213rRi2\|R_{i+1}\|_{F}^{2}\leq\frac{1}{3r}\|R_{i}\|_{*}^{2}. We can then compute the following bound

where the last inequality follows from (2.1) and the fact that rank(R0)2r\operatorname{rank}(R_{0})\leq 2r. Finally, note that the rank of R0+R1R_{0}+R_{1} is at most 5r5r, so we may put this all together as

By assumption A(R)=A(XX0)=0\mathcal{A}(R)=\mathcal{A}(X^{*}-X_{0})=0, so if the factor on the right-hand side is strictly positive, R0=0R_{0}=0, which further implies Rc=0R_{c}=0 by (3.4), and thus X=X0X^{*}=X_{0}. Simple algebra reveals that the right-hand side is positive when 9δ3r+11δ5r<29\delta_{3r}+11\delta_{5r}<2. Since δ3rδ5r\delta_{3r}\leq\delta_{5r}, we immediately have that X=X0X^{*}=X_{0} if δ5r<1/10\delta_{5r}<1/10.

The rational number (9/119/11) in the proof of the theorem is chosen for notational simplicity and is clearly not optimal. A slightly tighter bound can be achieved working directly with the second to last line in (3.7). The most important point is that our recovery condition on δ5r\delta_{5r} is an absolute constant, independent of mm, nn, rr, and pp.

We have yet to demonstrate any specific linear mappings A\mathcal{A} for which δr<1\delta_{r}<1. We shall show in the next section that linear transformations sampled from several families of random matrices with appropriately chosen dimensions have this property with overwhelming probability. The analysis is again similar to the compressive sampling literature, but several details specific to the rank recovery problem need to be employed.

Nearly Isometric Families

In this section, we will demonstrate that when we sample linear maps from a class of probability distributions obeying certain tail bounds, then they will obey the Restricted Isometry Property (3.2) as pp, mm, and nn tend to infinity at appropriate rates. The following definition characterizes this family of random linear transformation.

There are two ingredients for a random linear map to be nearly isometric. First, it must be isometric in expectation. Second, the probability of large distortions of length must be exponentially small. The exponential bound in (4.2) guarantees union bounds will be small even for rather large sets. This concentration is the typical ingredient required to prove the Johnson-Lindenstrauss Lemma (cf ).

where vec(X)\operatorname{vec}(X) denotes the vector of XX with its columns stacked in order on top of one another, and A\mathbf{A} is a p×mnp\times mn matrix. We now give several examples of nearly isometric random variables in this matrix representation. The most well known is the ensemble with independent, identically distributed (i.i.d.) Gaussian entries

We also mention the two following ensembles of matrices, described in . One has entries sampled from an i.i.d. symmetric Bernoulli distribution

and the other has zeros in two-thirds of the entries

The fact that the top singular value of the matrix A\mathbf{A} is concentrated around 1+D/p1+\sqrt{D/p} for all of these ensembles follows from the work of Yin, Bai, and Krishnaiah, who showed that whenever the entries AijA_{ij} are i.i.d. with zero mean and finite fourth moment, then the maximum singular value of A\mathbf{A} is almost surely 1+D/p1+\sqrt{D/p} for DD sufficiently large . El Karoui uses this result to prove the concentration inequality (4.3) for all such distributions . The result for Gaussians is rather tight with γ=1/2\gamma=1/2 (see, e.g., ).

Finally, note that a random projection also obeys all of the necessary concentration inequalities. Indeed, since the norm of a random projection is exactly D/p\sqrt{D/p}, (4.3) holds trivially. The concentration inequality (4.2) is proven in .

The main result of this section is the following:

Fix 0<δ<10<\delta<1. If A\mathcal{A} is a nearly isometric random variable, then for every 1rm1\leq r\leq m, there exist constants c0,c1>0c_{0},c_{1}>0 depending only on δ\delta such that, with probability at least 1exp(c1p)1-\exp(-c_{1}p), δr(A)δ\delta_{r}(\mathcal{A})\leq\delta whenever pc0r(m+n)log(mn)p\geq c_{0}r(m+n)\log(mn).

The proof will make use of standard techniques in concentration of measure. We first extend the concentration results of to subspaces of matrices. We will show that the distortion of a subspace by a linear map is robust to perturbations of the subspace. Finally, we will provide an epsilon net over the set of all subspaces and, using a union bound, will show that with overwhelming probability, nearly isometric random variables will obey the Restricted Isometry Property (3.2) as the size of the matrices tend to infinity.

The following lemma characterizes the behavior of a nearly isometric random mapping A\mathcal{A} when restricted to an arbitrary subspace of matrices UU of dimension dd.

Let A\mathcal{A} be a nearly isometric linear map and let UU be an arbitrary subspace of m×nm\times n matrices with d=dim(U)pd=\dim(U)\leq p. Then for any 0<δ<10<\delta<1 we have

Proof The proof of this theorem is identical to the argument in where the authors restricted their attention to subspaces aligned with the coordinate axes. We will sketch the proof here as the argument is straightforward.

There exists a finite set Ω\Omega of at most (12/δ)d(12/\delta)^{d} points such that for every XUX\in U with XF1\|X\|_{F}\leq 1, there exists a QΩQ\in\Omega such that XQFδ/4\|X-Q\|_{F}\leq\delta/4. By the standard union bound, the concentration inequality (4.2) holds for all QΩQ\in\Omega with ϵ=δ/2\epsilon=\delta/2 with probability at least (4.9). If (4.2) holds for all QΩQ\in\Omega, then we immediately have that (1δ/2)QFA(Q)(1+δ/2)QF(1-\delta/2)\|Q\|_{F}\leq\|\mathcal{A}(Q)\|\leq(1+\delta/2)\|Q\|_{F} for all QΩQ\in\Omega as well.

Let XX be in {XU:XF1}\{X\in U:\|X\|_{F}\leq 1\}, and MM be the maximum of A(X)\|\mathcal{A}(X)\| on this set. Then there exists a QΩQ\in\Omega such that XQFδ/4\|X-Q\|_{F}\leq\delta/4. We then have

and since M1+δ/2+Mδ/4M\leq 1+\delta/2+M\delta/4 by definition, we have M1+δM\leq 1+\delta. The lower bound is proven by the following chain of inequalities

The proof of preceding lemma revealed that the near isometry of a linear map is robust to small perturbations of the matrix on which the map is acting. We will now show that this behavior is robust with respect to small perturbations of the subspace UU as well. This perturbation will be measured in the natural distance between two subspaces

where T1T_{1} and T2T_{2} are subspaces and PTiP_{T_{i}} is the orthogonal projection associated with each subspace. This distance measures the operator norm of the difference between the corresponding projections, and is equal to the sine of the largest principal angle between T1T_{1} and T2T_{2} .

for some constant 0<δ<10<\delta<1. Then for all YU2Y\in U_{2}

We now characterize how many subspaces are necessary to cover this set to arbitrary resolution. The covering number N(ϵ)\mathfrak{N}(\epsilon) of Σmnr\Sigma_{mnr} at resolution ϵ\epsilon is defined to be the smallest number of subspaces (Vi,Wi)(V_{i},W_{i}) such that for any pair of subspaces (V,W)(V,W), there is an ii with ρ(Σ(V,W),Σ(Vi,Wi))ϵ\rho(\Sigma(V,W),\Sigma(V_{i},W_{i}))\leq\epsilon. That is, the covering number is the smallest cardinality of an ϵ\epsilon-net. The following Lemma characterizes the cardinality of such a set.

The covering number N(ϵ)\mathfrak{N}(\epsilon) of Σmnr\Sigma_{mnr} is bounded above by

where C0C_{0} is a constant independent of ϵ\epsilon, mm, nn, and rr.

Proof Note that the projection operator onto Σ(V,W)\Sigma(V,W) can be written as PΣ(V,W)=PVPWP_{\Sigma(V,W)}=P_{V}\otimes P_{W}, so for a pair of subspaces (V1,W1)(V_{1},W_{1}) and (V2,W2)(V_{2},W_{2}), we have

The exact value of the universal constant C0C_{0} is not provided by Szarek in . It takes the same value for any homogeneous space whose automorphism group is a subgroup of the orthogonal group, and is independent of the dimension of the homogeneous space. Hence, one might expect this constant to be quite large. However, it is known that for the sphere C03C_{0}\leq 3 , and there is no indication that this constant is not similarly small for the Grassmannian.

We now proceed to the proof of the main result in this section. For this, we use a union bound to combine the probabilistic guarantees of Lemma 4.3 with the estimates of the covering number of Σ(U,V)\Sigma(U,V).

Let Ω={(Vi,Wi)}\Omega=\{(V_{i},W_{i})\} be a finite set of subspaces that satisfies the conditions of Lemma 4.5 for ϵ>0\epsilon>0, so ΩN(ϵ)|\Omega|\leq\mathfrak{N}(\epsilon). For each pair (Vi,Wi)(V_{i},W_{i}), define the set of matrices

Since Ω\Omega is an ϵ\epsilon-net, we have that the union of all the Bi\mathcal{B}_{i} is equal to Σmnr\Sigma_{mnr}. Therefore, if for all ii, (1δ)XFA(X)(1+δ)XF(1-\delta)\|X\|_{F}\leq\|\mathcal{A}(X)\|\leq(1+\delta)\|X\|_{F} for all XBiX\in\mathcal{B}_{i}, we must have that δr(A)δ\delta_{r}(\mathcal{A})\leq\delta proving that

Now note that if we have (1+A)ϵδ/2(1+\|\mathcal{A}\|)\epsilon\leq\delta/2 and, for all YΣ(Vi,Wi)Y\in\Sigma(V_{i},W_{i}), (1δ/2)YFA(Y)(1+δ/2)YF(1-\delta/2)\|Y\|_{F}\leq\|\mathcal{A}(Y)\|\leq(1+\delta/2)\|Y\|_{F}, Lemma 4.3 implies that (1δ)XFA(X)(1+δ)XF(1-\delta)\|X\|_{F}\leq\|\mathcal{A}(X)\|\leq(1+\delta)\|X\|_{F} for all XBiX\in\mathcal{B}_{i}. Therefore, using a union bound, (4.20) is greater than or equal to

We can bound these quantities separately. First we have by Lemmas 4.3 and 4.5

Secondly, since A\mathcal{A} is nearly isometric, there exists a constant γ\gamma such that

We now must pick a suitable resolution ϵ\epsilon to guarantee that this probability is less than exp(c1p)\exp(-c_{1}p) for a suitably chosen constant c1c_{1}. First note that if we choose ϵ<(δ/4)(mn/p+1)1\epsilon<(\delta/4)(\sqrt{mn/p}+1)^{-1},

which achieves the desired scaling because mn>pmn>p. With this choice of ϵ\epsilon, the quantity in Equation (4.22) is less than or equal to

where a(δ)=δ2/64δ3/192a(\delta)=\delta^{2}/64-\delta^{3}/192. Since mn/p<mnmn/p<mn for all p>1p>1, there exists a constant c0c_{0} independent of mm,nn,pp, and rr, such that the sum of the last three terms in the exponent are bounded above by (c0/a(δ))r(m+n)log(mn)(c_{0}/a(\delta))r(m+n)\log(mn). It follows that there exists a constant c1c_{1} independent of mm,nn,pp, and rr such that pc0r(m+n)log(mn)p\geq c_{0}r(m+n)\log(mn) observations are sufficient to yield an RIP of δ\delta with probability greater than 1ec1p1-e^{-c_{1}p}.

Heuristically, the scaling p=O(r(m+n)log(mn))p=O\left(r(m+n)\log(mn)\right) is very reasonable, since a rank rr matrix has r(m+nr)r(m+n-r) degrees of freedom. This coarse tail bound only provides asymptotic estimates for recovery, and is quite conservative in practice. As we demonstrate in Section 6, minimum rank solutions can be determined from between 2r(m+nr)2r(m+n-r) to 4r(m+nr)4r(m+n-r) observations for many practical problems.

Algorithms for nuclear norm minimization

A variety of methods can be developed for the effective minimization of the nuclear norm over an affine subspace of matrices, and we do not have room for a comprehensive treatment here. Instead, we focus on three methods highlighting the trade-offs between computational speed and guarantees on accuracy of the resulting solution. Directly solving the semidefinite characterization of the nuclear norm problem using primal-dual interior point methods is a numerically efficient method for small problems and can be used to yield accuracy up to floating-point precision.

Since interior point methods use second order information, the memory requirements for computing descent directions quickly becomes too large as the problem size increases. Moreover, for larger problem sizes it is preferable to use methods that exploit, at least partially, the structure of the problem. This can be done at several levels, either by taking into account further information that may be available about the linear map A\mathcal{A} (e.g., the case of partially observed Fourier measurements) or by formulating algorithms that are specific to the nuclear norm problem. For the latter, we show how to apply subgradient methods to minimize the nuclear norm over an affine set. Such first-order methods cannot yield as high numerical precision as interior point methods, but much larger problems can be solved because no second-order information needs to be stored. For even larger problems, we discuss a low-rank semidefinite programming that explicitly works with a factorization of the decision variable. This method can be applied even when the matrix decision variable cannot fit into memory, but convergence guarantees are much less satisfactory than in the other two cases.

For small problems where a high-degree of numerical precision is required, interior point methods for semidefinite programming can be directly applied to solve affine nuclear minimization problems. As we have seen in earlier sections, the nuclear norm minimization problem can be directly posed as a semidefinite programming problem via the standard form primal-dual pair (2.8). As written, the primal problem has one (n+m)×(n+m)(n+m)\times(n+m) semidefinite constraint and pp affine constraints. Conversely, the dual problem has one (n+m)×(n+m)(n+m)\times(n+m) semidefinite constraint and pp scalar decision variables. Thus, the total number of decision variables (primal and dual) is equal to (n+m+12)+p\binom{n+m+1}{2}+p.

Modern interior point solvers for semidefinite programming generally use primal-dual methods, and compute an update direction for the current solution by solving a suitable Newton system. Depending on the structure of the linear mapping A\mathcal{A}, this may entail solving a potentially large, dense linear system.

If the matrix dimensions nn and mm are not too large, then any good interior point SDP solver, such as SeDuMi or SDPT3 , will quickly produce accurate solutions. In fact, as we will see in the next section, problems with nn and mm around 5050 can be solved to machine precision in minutes on a desktop computer to machine precision. However, solving such a primal-dual pair of programs with traditional interior point methods can prove to be quite challenging when the dimensions of the matrix XX are much bigger than 100×100100\times 100, since in this case the corresponding Newton systems become quite large. In the absence of any specific additional structure, the memory requirements of such dense systems quickly limit the size of problems that can be solved.

Perhaps the most important drawback of the direct SDP approach is that it completely ignores the possibility of efficiently computing the nuclear norm via a singular value decomposition, instead of the less efficient eigenvalue decomposition of a bigger matrix. The method we discuss next will circumvent this obstacle, by directly working with subgradients of the nuclear norm.

2 Projected subgradient methods

The nuclear norm minimization (3.1) is a linearly constrained nondifferentiable convex problem. There are numerous techniques to approach this kind of problems, depending on the specific nature of the constraints (e.g., dense vs. sparse), and the possibility of using first- or second-order information.

In this section we describe a simple, easy to implement, subgradient projection approach to the solution of (3.1). This first-order method will proceed by computing a sequence of feasible points {Xk}\{X_{k}\}, with iterates satisfying the update rule

where Π\Pi is the orthogonal projection onto the affine subspace defined by the linear constraints A(X)=b\mathcal{A}(X)=b, and sk>0s_{k}>0 is a stepsize parameter. In other words, the method updates the current iterate XkX_{k} by taking a step along the direction of a subgradient at the current point and then projecting back onto the feasible set. Alternatively, since XkX_{k} is feasible, we can rewrite this as

where ΠA\Pi_{\mathcal{A}} is the orthogonal projection onto the kernel of A\mathcal{A}. Since the feasible set is an affine subspace, there are several options for the projection ΠA\Pi_{\mathcal{A}}. For small problems, one can precompute it using, for example, a QR decomposition of the matrix representation of A\mathcal{A} and store it. Alternatively, one can solve a least squares problem at each step by iterative methods such as conjugate gradients.

The subgradient-based method described above is extremely simple to implement, since only a subgradient evaluation is required at every step. The computation of the subgradient can be done using the formula given in (2.9) earlier, thus requiring only a singular value decomposition of the current point XkX_{k}.

A possible alternative here to the use of the SVD for the subgradient computation is to directly focus on the “angular” factor of the polar decomposition of XkX_{k}, using for instance the Newton-like methods developed by Gander in . Specifically, for a given matrix XkX_{k}, the Halley-like iteration

converges globally and quadratically to the polar factor of XX, and thus yields an element of the subdifferential of the nuclear norm. This iteration method (suitable scaled) can be faster than a direct SVD computation, particularly if the singular values of the initial matrix are close to 1. This could be appealing since presumably only a very small number of iterations would be needed to update the polar factor of XkX_{k}, although the nonsmoothness of the subdifferential is bound to cause some additional difficulties.

Regarding convergence, for general nonsmooth problems, subgradient methods do not guarantee a decrease of the cost function at every iteration, even for arbitrarily small step sizes (see, e.g., [7, §6.3.1]), unless the minimum-norm subgradient is used. Instead, convergence is usually shown through the decrease (for small stepsize) of the distance from the iterates XkX_{k} to any optimal point. There are several possibilities for the choice of stepsize sks_{k}. The simplest choice that can guarantee convergence is to use a diminishing stepsize with an infinite travel condition (i.e., such that limksk=0\lim_{k\rightarrow\infty}s_{k}=0 and k>0sk\sum_{k>0}{s_{k}} diverging).

Often times, even the computation of a singular value decomposition or Halley-like iteration can be too computationally expensive. The next section proposes a reduction of the size of the search space to alleviate such demands. We must give up guarantees of convergence for this convenience, but this may be an acceptable trade-off for very large-scale problems.

3 Low-rank parametrization

We now turn to a method that works with an explicit low-rank factorization of XX. This algorithm not only requires less storage capacity and computational overhead than the previous methods, but for many problems does not even require one to be able to store the decision variable XX in memory. This is the case, for example, in the matrix completion problem where A(X)\mathcal{A}(X) is a subset of the entries of XX.

Given observations of the form A(X)=b\mathcal{A}(X)=b of an m×nm\times n matrix XX of rank rr, a possible search algorithm to find a suitable XX would be to find a factorization X=LRX=LR^{\prime}, where LL is an m×rm\times r matrix and RR an n×rn\times r matrix, such that the equality constraints are satisfied. Since there are many possible such factorizations, we could search for one where the matrices LL and RR have Frobenius norm as small as possible, i.e., the solution of the optimization problem

Even though the cost function is convex, the constraint is not. Such a problem is a nonconvex quadratic program, and it is not evidently easy to optimize. We show below that the minimization of the nuclear norm subject to equality constraints is in fact equivalent to this rather natural heuristic optimization, as long as rr is chosen to be sufficiently larger than the rank of the optimum of the nuclear norm problem.

Assume rrank(X0)r\geq\operatorname{rank}(X_{0}). The nonconvex quadratic optimization problem (5.1) is equivalent to the minimum nuclear norm relaxation (3.1).

Proof Consider any feasible solution (L,R)(L,R) of (5.1). Then, defining W1:=LLW_{1}:=LL^{\prime}, W2:=RRW_{2}:=RR^{\prime}, and X:=LRX:=LR^{\prime} yields a feasible solution of the primal SDP problem (2.8) that achieves the same cost. Since the SDP formulation is equivalent to the nuclear norm problem, we have that the optimal value of (5.1) is always greater than or equal to the nuclear norm heuristic.

For the converse, we can use an argument similar to the proof of Proposition 2.1. From the SVD decomposition X=UΣVX^{*}=U\Sigma V^{\prime} of the optimal solution of the nuclear norm relaxation (3.1), we can explicitly construct matrices L:=UΣ12L:=U\Sigma^{\frac{1}{2}} and R:=VΣ12R:=V\Sigma^{\frac{1}{2}} for (5.1) that yield exactly the same value of the objective.

The main advantage of this reformulation is to substantially decrease the number of primal decision variables from nmnm to (n+m)r(n+m)r. For large problems, this is quite a significant reduction that allows us to search for matrices of rank smaller than the order of 100100, and n+mn+m in the hundreds of thousands on a desktop computer. However, this problem is nonconvex and potentially subject to local minima. This is not as much of a problem as it could be, for two reasons. First recall from Theorem 3.2 that if δ2r(A)<1\delta_{2r}(\mathcal{A})<1, there is a unique XX^{*} with rank at most rr such that A(X)=b\mathcal{A}(X^{*})=b. Since any local minima (L,R)(L^{*},R^{*}) of (5.1) is feasible, we would have X=LRX^{*}=L^{*}{R^{*}}^{\prime} and we would have found the minimum rank solution. Second, we now present an algorithm that is guaranteed to converge to a local minima for a judiciously selected rr. We will also provide a sufficient condition for when we can construct an optimal solution of (2.8) from the solution computed by the method of multipliers.

For general semidefinite programming problems, Burer and Monteiro have developed in a nonlinear programming approach that relies on a low-rank factorization of the matrix decision variable. We will adapt this idea to our problem, to provide a first-order Lagrangian minimization algorithm that efficiently finds a local minima of (5.1). As a consequence of the work in , it will follow that for values of rr larger than the rank of the true optimal solution, the local minima of (5.1) can be transformed into global minima of (2.8) under the identification W1=LLW_{1}=LL^{\prime}, W2=RRW_{2}=RR^{\prime} and Y=LRY=LR^{\prime}. We summarize below the details of this approach.

The algorithm employed is called the method of multipliers, a standard approach for solving equality constrained optimization problems . The method of multipliers works with an augmented Lagrangian for (5.1)

where the yiy_{i} are arbitrarily signed Lagrange multipliers and σ\sigma is a positive constant. A somewhat similar algorithm was proposed by Rennie et al in in the collaborative filtering. In this work, the authors minimize La\mathcal{L}_{a} with σ\sigma fixed and y=0y=0 to serve as a regularized algorithm for matrix completion. Remarkably, by deterministically varying σ\sigma and yy, this method can be adapted into an algorithm for solving linearly constrained nuclear-norm minimization.

In the method of multipliers, one alternately minimizes the augmented Lagrangian with respect to the decision variables LL and RR, and then increases the value of the penalty coefficient σ\sigma and updates yy. The augmented Lagrangian can be minimized using any local search technique, and the partial derivatives are particularly simple to compute. Let y^:=yσ(A(LR)b)\hat{y}:=y-\sigma(\mathcal{A}(LR^{\prime})-b). Then we have

To calculate the gradients, we first compute the constraint violations A(LR)b\mathcal{A}(LR^{\prime})-b, then form y^\hat{y}, and finally use the above equations to compute the gradients.

As the number of iterations tends to infinity, only feasible points will have finite values of La\mathcal{L}_{a}, and for any feasible point, La(L,R)\mathcal{L}_{a}(L,R) is equal to the original cost function (LF2+RF2)/2(\|L\|_{F}^{2}+\|R\|_{F}^{2})/2. The method terminates when LL and RR are feasible, as in this case the Lagrangian is stationary and we are at a local minima of (5.1). Including the yy multipliers improves the conditioning of each subproblem where La\mathcal{L}_{a} is minimized and enhances the rate of convergence. The following theorem shows that when the method of multipliers converges, it converges to a local minimum of (5.1).

Suppose we have a sequence (L(k),R(k),y(k))(L^{(k)},R^{(k)},y^{(k)}) of local minima of the augmented Lagrangian at each step of the method of multipliers. Assume that our sequence of σ(k)\sigma^{(k)}\rightarrow\infty and that the sequence of y(k)y^{(k)} is bounded. If (L(k),R(k))(L^{(k)},R^{(k)}) converges to (L,R)(L^{*},R^{*}) and the linear map

has kernel equal to the zero vector for all kk, then there exists a vector yy^{*} such that

La(L,R;y)=0\nabla\mathcal{L}_{a}(L^{*},R^{*};y^{*})=0

Proof This proof is standard and follows the approach in . As above, we define y^(k):=y(k)σ(k)(A(L(k)R(k))b)\hat{y}^{(k)}:=y^{(k)}-\sigma^{(k)}(\mathcal{A}(L^{(k)}{R^{(k)}}^{\prime})-b) for all kk. Since (L(k),R(k))(L^{(k)},R^{(k)}) minimize the augmented Lagrangian at iteration kk, we have

Since we have assumed that there is no non-zero yy with Λ(k)(y)=0\Lambda^{(k)}(y)=0, there exists a left-inverse and we can solve for y^(k)\hat{y}^{(k)}.

Everything on the right-hand side is bounded, and L(k)L^{(k)} and R(k)R^{(k)} converge. Therefore, we must have that y^(k)\hat{y}^{(k)} converges to some y^\hat{y}^{*}. Taking the limit of (5.4) proves (i). To prove (ii), note that we must have y^(k){\hat{y}^{(k)}} is bounded. Since y(k)y^{(k)} is also bounded, we find that σ(k)(A(L(k)R(k))b)\sigma^{(k)}(\mathcal{A}(L^{(k)}{R^{(k)}}^{\prime})-b) is also bounded. But σ(k)\sigma^{(k)}\rightarrow\infty implies that A(LR)=b\mathcal{A}(L^{*}{R^{*}}^{\prime})=b, completing the proof.

Suppose the decision variables are chosen to be of size m×rdm\times r_{d} and n×rdn\times r_{d}. A necessary condition for Λk(y)\Lambda^{k}(y) to be full rank is for the number of decision variables rd(m+n)r_{d}(m+n) to be greater than the number of equalities pp. In particular, this means that we must choose rdp/(m+n)r_{d}\geq p/(m+n) in order to have any hopes of satisfying the conditions of Theorem 5.2.

We close this section by relating the solution found by the method of multipliers to the optimal solution of the nuclear norm minimization problem. We have already shown that when the low-rank algorithm converges, it converges to a low-rank solution of A(X)=b\mathcal{A}(X)=b. If we additionally find that A(y)\mathcal{A}^{*}(y^{*}) has norm less than or equal to one, then it is dual feasible. One can check using straightforward algebra that (LR,LL,RR)(L^{*}{R^{*}}^{\prime},L^{*}{L^{*}}^{\prime},R^{*}{R^{*}}^{\prime}) and yy^{*} form an optimal primal-dual pair for (2.8). This analysis proves the following theorem.

Let (L,R,y)(L^{*},R^{*},y^{*}) satisfy (i)-(ii) in Theorem 5.2 and suppose A(y)1\|\mathcal{A}^{*}(y^{*})\|\leq 1. Then (LR,LL,RR)(L^{*}{R^{*}}^{\prime},L^{*}{L^{*}}^{\prime},R^{*}{R^{*}}^{\prime}) is an optimal primal solution and yy^{*} is an optimal dual solution of (2.8).

Numerical Experiments

To illustrate the scaling of low-rank recovery for a particular matrix MM, consider the MIT logo presented in Figure 1. The image has a total of 4646 rows and 8181 columns (total 3726 elements), and 3 distinct non-zero numerical values corresponding to the colors white, red, and grey. Since the logo only has 55 distinct rows, it has rank 55. For each of the ensembles discussed in Section 4, we sampled measurement matrices with pp ranging between 700700 and 15001500, and solved the semidefinite program (2.6) using the freely available software SeDuMi . On a 2.0 GHz Laptop, each semidefinite program could be solved in less than four minutes. We chose to use this interior point method because it yielded the highest accuracy in the shortest amount of time, and we were interested in characterizing precisely when the nuclear norm heuristic succeeded and failed.

Figure 2 plots the Frobenius norm of the difference between the optimal point of the semidefinite program and the true image in Figure 2. We observe a sharp transition to perfect recovery near 1200 measurements which is approximately equal to 2r(m+nr)2r(m+n-r). In Figure 3, we graphically plot the recovered solutions for various values of pp under the Gaussian ensemble.

To demonstrate the average behavior of low-rank recovery, we conducted a series of experiments for a variety of the matrix sizes nn, ranks rr, and numbers of measurements pp. For a fixed nn, we constructed random recovery scenarios for low-rank n×nn\times n matrices. For each nn, we varied pp between and n2n^{2} where the matrix is completely discovered. For a fixed nn and pp, we generated all possible ranks such that r(2nr)pr(2n-r)\leq p. This cutoff was chosen because beyond that point there would be an infinite set of matrices of rank rr satisfying the pp equations.

For each (n,p,r)(n,p,r) triple, we repeated the following procedure 1010 times. A matrix of rank rr was generated by choosing two random n×rn\times r factors YLY_{L} and YRY_{R} with i.i.d. random entries and setting Y0=YLYRY_{0}=Y_{L}Y_{R}^{\prime}. A matrix A\mathbf{A} was sampled from the Gaussian ensemble with pp rows and n2n^{2} columns. Then the nuclear norm minimization

was solved using the SDP solver SeDuMi on the formulation (2.6). Again, we chose to use SeDuMi because we wanted to precisely distinguish between success and failure of the heuristic. We declared Y0Y_{0} to be recovered if XY0F/Y0F<103\|X-Y_{0}\|_{F}/\|Y_{0}\|_{F}<10^{-3}. Figure 4 shows the results of these experiments for n=30n=30 and 4040. The color of the cell in the figures reflects the empirical recovery rate of the 1010 runs (scaled between and 11). White denotes perfect recovery in all experiments, and black denotes failure for all experiments.

These experiments demonstrate that the logarithmic factors and constants present in our scaling results are somewhat conservative. For example, as one might expect, low-rank matrices are perfectly recovered by nuclear norm minimization when p=n2p=n^{2} as the matrix is uniquely determined. Moreover, as pp is reduced slightly away from this value, low-rank matrices are still recovered 100100 percent of the time for most values of rr. Finally, we note that despite the asymptotic nature of our analysis, our experiments demonstrate excellent performance with low-rank matrices of size 30×3030\times 30 and 40×4040\times 40 matrices, showing that the heuristic is practical even in low-dimensional settings.

Intriguingly, Figure 4 also demonstrates a “phase transition” between perfect recovery and failure. As observed in several recent papers by Donoho and his collaborators (See e.g. ), the random sparsity recovery problem has two distinct connected regions of parameter space: one where the sparsity pattern is perfectly recovered, and one where no sparse solution is found. Not surprisingly, Figure 4 illustrates an analogous phenomenon in rank recovery. Computing explicit formulas for the transition between perfect recovery and failure is left for future work.

Discussion and future developments

All of the measurement ensembles require the storage of O(mnp)O(mnp) numbers. For large problems this is wholly impractical. There are many promising alternative measurement ensembles that seem to obey the same scaling laws as those presented in Section 4. For example, “factored” measurements, of the form Ai:XuiTXviA_{i}:X\mapsto u_{i}^{T}Xv_{i}, where ui,viu_{i},v_{i} are Gaussian random vectors empirically yield the same performance as the Gaussian ensemble. This factored ensemble only requires storage of O((m+n)p)O((m+n)p) numbers, which is a rather significant savings for very large problems. The proof in Section 4 does not seem to extend to this ensemble, thus new machinery must be developed to guarantee properties about such low-rank measurements.

Noisy Measurements and Low-rank approximation

Incoherent Ensembles and Partially Observed Transforms

Alternative numerical methods

Besides the techniques described in Section 5, there are a number of interesting additional possibilities to solve the nuclear norm minimization problem. An appealing suggestion is to combine the strength of second-order methods (as in the SDP approach) with the known geometry of the nuclear norm (as in the subgradient approach), and develop a customized interior point method, possibly yielding faster convergence rates, while still being relatively memory-efficient.

Geometric interpretations

In the case of rank minimization, the direct application of these concepts fails, since the unit ball of the nuclear norm is not a polyhedral set. Nevertheless, it seems likely that a similar explanation could be developed, where the key feature would be the preservation under a linear map of the extremality of the components of the boundary of the nuclear norm unit ball defined by low-rank conditions.

Jordan algebras

As we have seen, our results for the rank minimization problem closely parallel the earlier developments in cardinality minimization. A convenient mathematical framework that allows the simultaneous consideration of these cases as well as a few new ones, is that of Jordan algebras and the related symmetric cones . In the Jordan-algebraic setting, there is an intrinsic notion of rank that agrees with the cardinality of the support in the case of the nonnegative orthant or the rank of a matrix in the case of the positive semidefinite cone. Besides mathematical elegance, a direct Jordan-algebraic approach would transparently yield similar results for the case of second-order (or Lorentz) cone constraints.

As specific examples of the power and elegance of this approach, we mention the work of Faybusovich and Schmieta and Alizadeh that provide a unified development of interior point methods for symmetric cones, as well as Faybusovich’s work on convexity theorems for quadratic mappings .

Parsimonious models and optimization

Sparsity and low-rank are two specific classes of parsimonious (or low-complexity) descriptions. Are there other kinds of easy-to-describe parametric models that are amenable to exact solutions via convex optimizations techniques? Given the intimate connections between linear and semidefinite programming and the Jordan algebraic approaches described earlier, it is likely that this will require alternative tractable convex optimization formulations.

Acknowledgements

We thank Stephen Boyd, Emmanuel Candès, José Costa, John Doyle, Ali Jadbabaie, Ali Rahimi, and Michael Wakin for their useful comments and suggestions. We also thank the IMA in Minneapolis for hosting us during the initial stages of our collaboration.

References