Rank-Sparsity Incoherence for Matrix Decomposition

Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, Alan S. Willsky

Introduction

Complex systems and models arise in a variety of problems in science and engineering. In many applications such complex systems and models are often composed of multiple simpler systems and models. Therefore, in order to better understand the behavior and properties of a complex system a natural approach is to decompose the system into its simpler components. In this paper we consider matrix representations of systems and statistical models in which our matrices are formed by adding together sparse and low-rank matrices. We study the problem of recovering the sparse and low-rank components given no prior knowledge about the sparsity pattern of the sparse matrix, or the rank of the low-rank matrix. We propose a tractable convex program to recover these components, and provide sufficient conditions under which our procedure recovers the sparse and low-rank matrices exactly.

Such a decomposition problem arises in a number of settings, with the sparse and low-rank matrices having different interpretations depending on the application. In a statistical model selection setting, the sparse matrix can correspond to a Gaussian graphical model and the low-rank matrix can summarize the effect of latent, unobserved variables. Decomposing a given model into these simpler components is useful for developing efficient estimation and inference algorithms. In computational complexity, the notion of matrix rigidity captures the smallest number of entries of a matrix that must be changed in order to reduce the rank of the matrix below a specified level (the changes can be of arbitrary magnitude). Bounds on the rigidity of a matrix have several implications in complexity theory . Similarly, in a system identification setting the low-rank matrix represents a system with a small model order while the sparse matrix represents a system with a sparse impulse response. Decomposing a system into such simpler components can be used to provide a simpler, more efficient description.

Formally the decomposition problem we are interested can be defined as follows:

Given C=A+BC=A^{\star}+B^{\star} where AA^{\star} is an unknown sparse matrix and BB^{\star} is an unknown low-rank matrix, recover AA^{\star} and BB^{\star} from CC using no additional information on the sparsity pattern and/or the rank of the components.

In the absence of any further assumptions, this decomposition problem is fundamentally ill-posed. Indeed, there are a number of scenarios in which a unique splitting of CC into “low-rank” and “sparse” parts may not exist; for example, the low-rank matrix may itself be very sparse leading to identifiability issues. In order to characterize when such a decomposition is possible we develop a notion of rank-sparsity incoherence, an uncertainty principle between the sparsity pattern of a matrix and its row/column spaces. This condition is based on quantities involving the tangent spaces to the algebraic variety of sparse matrices and the algebraic variety of low-rank matrices .

Here \|\cdot\| is the spectral norm (i.e., the largest singular value), and \|\cdot\|_{\infty} denotes the largest entry in magnitude. Thus ξ(M)\xi(M) being small implies that (appropriately scaled) elements of the tangent space T(M)T(M) are “diffuse”, i.e., these elements are not too sparse; as a result MM cannot be very sparse. As shown in Proposition 4 (see Section 4.3) a low-rank matrix MM with row/column spaces that are not closely aligned with the coordinate axes has small ξ(M)\xi(M).

The quantity μ(M)\mu(M) being small for a matrix implies that the spectrum of any element of the tangent space Ω(M)\Omega(M) is “diffuse”, i.e., the singular values of these elements are not too large. We show in Proposition 3 (see Section 4.3) that a sparse matrix MM with “bounded degree” (a small number of non-zeros per row/column) has small μ(M)\mu(M).

For a given matrix MM, it is impossible for both quantities ξ(M)\xi(M) and μ(M)\mu(M) to be simultaneously small. Indeed, we prove that for any matrix M0M\neq 0 we must have that ξ(M)μ(M)1\xi(M)\mu(M)\geq 1 (see Theorem 1 in Section 3.3). Thus, this uncertainty principle asserts that there is no non-zero matrix MM with all elements in T(M)T(M) being diffuse and all elements in Ω(M)\Omega(M) having diffuse spectra. As we describe later, the quantities ξ\xi and μ\mu are also used to characterize fundamental identifiability in the decomposition problem.

and the nuclear norm, which is the sum of the singular values, is given by

Here γ\gamma is a parameter that provides a trade-off between the low-rank and sparse components. This optimization problem is convex, and can in fact be rewritten as a semidefinite program (SDP) (see Appendix A).

In the sequel we discuss concrete classes of sparse and low-rank matrices that have small μ\mu and ξ\xi respectively. We also show that when the sparse and low-rank matrices AA^{\star} and BB^{\star} are drawn from certain natural random ensembles, then the sufficient conditions of Theorem 2 are satisfied with high probability; consequently, (3) provides exact recovery with high probability for such matrices.

2 Previous work using incoherence

3 Outline

In Section 2 we elaborate on the applications mentioned previously, and discuss the implications of our results for each of these applications. Section 3 formally describes conditions for fundamental identifiability in the decomposition problem based on the quantities ξ\xi and μ\mu defined in (1) and (2). We also provide a proof of the rank-sparsity uncertainty principle of Theorem 1. We prove Theorem 2 in Section 4, and also provide concrete classes of sparse and low-rank matrices that satisfy the sufficient conditions of Theorem 2. Section 5 describes the results of simulations of our approach applied to synthetic matrix decomposition problems. We conclude with a discussion in Section 6. The Appendix provides additional details and proofs.

Applications

In this section we describe several applications that involve decomposing a matrix into sparse and low-rank components.

We begin with a problem in statistical model selection. In many applications large covariance matrices are approximated as low-rank matrices based on the assumption that a small number of latent factors explain most of the observed statistics (e.g., principal component analysis). Another well-studied class of models are those described by graphical models in which the inverse of the covariance matrix (also called the precision or concentration or information matrix) is assumed to be sparse (typically this sparsity is with respect to some graph). We describe a model selection problem involving graphical models with latent variables. Let the covariance matrix of a collection of jointly Gaussian variables be denoted by Σ(o h)\Sigma_{(o~{}h)}, where oo represents observed variables and hh represents unobserved, hidden variables. The marginal statistics corresponding to the observed variables oo are given by the marginal covariance matrix Σo\Sigma_{o}, which is simply a submatrix of the full covariance matrix Σ(o h)\Sigma_{(o~{}h)}. Suppose, however, that we parameterize our model by the information matrix given by K(o h)=Σ(o h)1K_{(o~{}h)}=\Sigma_{(o~{}h)}^{-1} (such a parameterization reveals the connection to graphical models). In such a parameterization, the marginal information matrix corresponding to the inverse Σo1\Sigma_{o}^{-1} is given by the Schur complement with respect to the block KhK_{h}:

Thus if we only observe the variables oo, we only have access to Σo\Sigma_{o} (or K^o\hat{K}_{o}). A simple explanation of the statistical structure underlying these variables involves recognizing the presence of the latent, unobserved variables hh. However (4) has the interesting structure that KoK_{o} is often sparse due to graphical structure amongst the observed variables oo, while Ko,hKh1Kh,oK_{o,h}K_{h}^{-1}K_{h,o} has low-rank if the number of latent, unobserved variables hh is small relative to the number of observed variables oo (the rank is equal to the number of latent variables hh). Therefore, decomposing K^o\hat{K}_{o} into these sparse and low-rank components reveals the graphical structure in the observed variables as well as the effect due to (and the number of) the unobserved latent variables. We discuss this application in more detail in a separate report .

2 Matrix rigidity

3 Composite system identification

A decomposition problem can also be posed in the system identification setting. Linear time-invariant (LTI) systems can be represented by Hankel matrices, where the matrix represents the input-output relationship of the system . Thus, a sparse Hankel matrix corresponds to an LTI system with a sparse impulse response. A low-rank Hankel matrix corresponds to a system with small model order, and provides a minimal realization for a system . Given an LTI system HH as follows

where HsH_{s} is sparse and HlrH_{lr} is low-rank, obtaining a simple description of HH requires decomposing it into its simpler sparse and low-rank components. One can obtain these components by solving our rank-sparsity decomposition problem. Note that in practice one can impose in (3) the additional constraint that the sparse and low-rank matrices have Hankel structure.

4 Partially coherent decomposition in optical systems

We outline an optics application that is described in greater detail in . Optical imaging systems are commonly modeled using the Hopkins integral , which gives the output intensity at a point as a function of the input transmission via a quadratic form. In many applications the operator in this quadratic form can be well-approximated by a (finite) positive semi-definite matrix. Optical systems described by a low-pass filter are called coherent imaging systems, and the corresponding system matrices have small rank. For systems that are not perfectly coherent various methods have been proposed to find an optimal coherent decomposition , and these essentially identify the best approximation of the system matrix by a matrix of lower rank. At the other end are incoherent optical systems that allow some high frequencies, and are characterized by system matrices that are diagonal. As most real-world imaging systems are some combination of coherent and incoherent, it was suggested in that optical systems are better described by a sum of coherent and incoherent systems rather than by the best coherent (i.e., low-rank) approximation as in . Thus, decomposing an imaging system into coherent and incoherent components involves splitting the optical system matrix into low-rank and diagonal components. Identifying these simpler components has important applications in tasks such as optical microlithography .

Rank-Sparsity Incoherence

Throughout this paper, we restrict ourselves to square n×nn\times n matrices to avoid cluttered notation. All our analysis extends to rectangular n1×n2n_{1}\times n_{2} matrices, if we simply replace nn by max(n1,n2)\max(n_{1},n_{2}).

As described in the introduction, the matrix decomposition problem can be fundamentally ill-posed. We describe two situations in which identifiability issues arise. These examples suggest the kinds of additional conditions that are required in order to ensure that there exists a unique decomposition into sparse and low-rank matrices.

First, let AA^{\star} be any sparse matrix and let B=eiejTB^{\star}=e_{i}e_{j}^{T}, where eie_{i} represents the ii-th standard basis vector. In this case, the low-rank matrix BB^{\star} is also very sparse, and a valid sparse-plus-low-rank decomposition might be A^=A+eiejT\hat{A}=A^{\star}+e_{i}e_{j}^{T} and B^=0\hat{B}=0. Thus, we need conditions that ensure that the low-rank matrix is not too sparse. One way to accomplish this is to require that the quantity ξ(B)\xi(B^{\star}) be small. As will be discussed in Section 4.3), if the row and column spaces of BB^{\star} are “incoherent” with respect to the standard basis, i.e., the row/column spaces are not aligned closely with any of the coordinate axes, then ξ(B)\xi(B^{\star}) is small.

2 Tangent-space identifiability

We begin by describing the sets of sparse and low-rank matrices. These sets can be considered either as differentiable manifolds (away from their singularities) or as algebraic varieties; we emphasize the latter viewpoint here. Recall that an algebraic variety is defined as the zero set of a system of polynomial equations . The variety of rank-constrained matrices is defined as:

Next we consider the set of all matrices that are constrained by the size of their support. Such sparse matrices can also be viewed as algebraic varieties:

Before analyzing whether (A,B)(A^{\star},B^{\star}) can be recovered in general (for example, using the SDP (3)), we ask a simpler question. Suppose that we had prior information about the tangent spaces Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}), in addition to being given C=A+BC=A^{\star}+B^{\star}. Can we then uniquely recover (A,B)(A^{\star},B^{\star}) from CC? Assuming such prior knowledge of the tangent spaces is unrealistic in practice; however, we obtain useful insight into the kinds of conditions required on sparse and low-rank matrices for exact decomposition. Given this knowledge of the tangent spaces, a necessary and sufficient condition for unique recovery is that the tangent spaces Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}) intersect transversally:

That is, the subspaces Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}) have a trivial intersection. The sufficiency of this condition for unique decomposition is easily seen. For the necessity part, suppose for the sake of a contradiction that a non-zero matrix MM belongs to Ω(A)T(B)\Omega(A^{\star})\cap T(B^{\star}); one can add and subtract MM from AA^{\star} and BB^{\star} respectively while still having a valid decomposition, which violates the uniqueness requirement. The following proposition, proved in Appendix B, provides a simple condition in terms of the quantities μ(A)\mu(A^{\star}) and ξ(B)\xi(B^{\star}) for the tangent spaces Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}) to intersect transversally.

Given any two matrices AA^{\star} and BB^{\star}, we have that

where ξ(B)\xi(B^{\star}) and μ(A)\mu(A^{\star}) are defined in (1) and (2), and the tangent spaces Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}) are defined in (8) and (6).

Thus, both μ(A)\mu(A^{\star}) and ξ(B)\xi(B^{\star}) being small implies that the tangent spaces Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}) intersect transversally; consequently, we can exactly recover (A,B)(A^{\star},B^{\star}) given Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}). As we shall see, the condition required in Theorem 2 (see Section 4.2) for exact recovery using the convex program (3) will be simply a mild tightening of the condition required above for unique decomposition given the tangent spaces.

3 Rank-sparsity uncertainty principle

Another important consequence of Proposition 1 is that we have an elementary proof of the following rank-sparsity uncertainty principle.

where ξ(M)\xi(M) and μ(M)\mu(M) are as defined in (1) and (2) respectively.

Proof: Given any M0M\neq 0 it is clear that MΩ(M)T(M)M\in\Omega(M)\cap T(M), i.e., MM is an element of both tangent spaces. However μ(M)ξ(M)<1\mu(M)\xi(M)<1 would imply from Proposition 1 that Ω(M)T(M)={0}\Omega(M)\cap T(M)=\{0\}, which is a contradiction. Consequently, we must have that μ(M)ξ(M)1\mu(M)\xi(M)\geq 1. \square

Hence, for any matrix M0M\neq 0 both μ(M)\mu(M) and ξ(M)\xi(M) cannot be simultaneously small. Note that Proposition 1 is an assertion involving μ\mu and ξ\xi for (in general) different matrices, while Theorem 1 is a statement about μ\mu and ξ\xi for the same matrix. Essentially the uncertainty principle asserts that no matrix can be too sparse while having “diffuse” row and column spaces. An extreme example is the matrix eiejTe_{i}e_{j}^{T}, which has the property that μ(eiejT)ξ(eiejT)=1\mu(e_{i}e_{j}^{T})\xi(e_{i}e_{j}^{T})=1.

Exact Decomposition Using Semidefinite Programming

We begin this section by studying the optimality conditions of the convex program (3), after which we provide a proof of Theorem 2 with simple conditions that guarantee exact decomposition. Next we discuss concrete classes of sparse and low-rank matrices that satisfy the conditions of Theorem 2, and can thus be uniquely decomposed using (3).

Similarly the orthogonal projection onto the space T(B)T(B^{\star}) is denoted PT(B)P_{T(B^{\star})}. Letting B=UΣVTB^{\star}=U\Sigma V^{T} be the SVD of BB^{\star}, we have the following explicit relation for PT(B)P_{T(B^{\star})}:

Here PU=UUTP_{U}=UU^{T} and PV=VVTP_{V}=VV^{T}. The space orthogonal to T(B)T(B^{\star}) is denoted T(B)T(B^{\star})^{\bot}, and the corresponding projection is denoted PT(B)(M)P_{T(B^{\star})^{\bot}}(M). The space T(B)T(B^{\star})^{\bot} consists of matrices with row-space orthogonal to the row-space of BB^{\star} and column-space orthogonal to the column-space of BB^{\star}. We have that

where In×nI_{n\times n} is the n×nn\times n identity matrix.

Following standard notation in convex analysis , we denote the subgradient of a convex function ff at a point x^\hat{x} in its domain by f(x^)\partial f(\hat{x}). The subgradient f(x^)\partial f(\hat{x}) consists of all yy such that

Note that these are necessary and sufficient conditions for (A,B)(A^{\star},B^{\star}) to be an optimum of (3). The following proposition provides sufficient conditions for (A,B)(A^{\star},B^{\star}) to be the unique optimum of (3), and it involves a slight tightening of the conditions (11), (12), and (13).

Suppose that C=A+BC=A^{\star}+B^{\star}. Then (A^,B^)=(A,B)(\hat{A},\hat{B})=(A^{\star},B^{\star}) is the unique optimizer of (3) if the following conditions are satisfied:

Ω(A)T(B)={0}\Omega(A^{\star})\cap T(B^{\star})=\{0\}.

PΩ(A)c(Q)<γ\|P_{\Omega(A^{\star})^{c}}(Q)\|_{\infty}<\gamma

The proof of the proposition can be found in Appendix B. Figure 1 provides a visual representation of these conditions. In particular, we see that the spaces Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}) intersect transversely (part (1)(1) of Proposition 2). One can also intuitively see that guaranteeing the existence of a dual QQ with the requisite conditions (part (2)(2) of Proposition 2) is perhaps easier if the intersection between Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}) is more transverse. Note that condition (1)(1) of this proposition essentially requires identifiability with respect to the tangent spaces, as discussed in Section 3.2.

Next we provide simple sufficient conditions on AA^{\star} and BB^{\star} that guarantee the existence of an appropriate dual QQ (as required by Proposition 2). Given matrices AA^{\star} and BB^{\star} with μ(A)ξ(B)<1\mu(A^{\star})\xi(B^{\star})<1, we have from Proposition 1 that Ω(A)T(B)={0}\Omega(A^{\star})\cap T(B^{\star})=\{0\}, i.e., condition (1)(1) of Proposition 2 is satisfied. We prove that if a slightly stronger condition holds, there exists a dual QQ that satisfies the requirements of condition (2)(2) of Proposition 2.

the unique optimum (A^,B^)(\hat{A},\hat{B}) of (3) is (A,B)(A^{\star},B^{\star}) for the following range of γ\gamma:

Specifically γ=3ξ(B)2μ(A)\gamma=\sqrt{\frac{3\xi(B^{\star})}{2\mu(A^{\star})}} is always inside the above range, and thus guarantees exact recovery of (A,B)(A^{\star},B^{\star}).

One consequence of Theorem 2 is that if μ(A)ξ(B)<16\mu(A^{\star})\xi(B^{\star})<\frac{1}{6}, then there exists no other (A,B)(A,B) such that A+B=A+BA+B=A^{\star}+B^{\star} with μ(A)ξ(B)<16\mu(A)\xi(B)<\frac{1}{6}. We consider this implication locally around (A,B)(A^{\star},B^{\star}). Recall that the quantities μ(A)\mu(A^{\star}) and ξ(B)\xi(B^{\star}) are defined with respect to the tangent spaces Ω(A)\Omega(A^{\star}) and T(B)T(B^{\star}). Suppose BB^{\star} is slightly perturbed along the variety of rank-constrained matrices to some BB. This ensures that the tangent space varies smoothly from T(B)T(B^{\star}) to T(B)T(B), and consequently that ξ(B)ξ(B)\xi(B)\approx\xi(B^{\star}). However, compensating for this by changing AA^{\star} to A+(BB)A^{\star}+(B^{\star}-B) moves AA^{\star} outside the variety of sparse matrices. This is because BBB^{\star}-B is not sparse. Thus the dimension of the tangent space Ω(A+BB)\Omega(A^{\star}+B^{\star}-B) is much greater than that of the tangent space Ω(A)\Omega(A^{\star}), as a result of which μ(A+BB)μ(A)\mu(A^{\star}+B^{\star}-B)\gg\mu(A^{\star}); therefore we have that ξ(B)μ(A+BB)16\xi(B)\mu(A^{\star}+B^{\star}-B)\gg\frac{1}{6}. The same reasoning holds in the opposite scenario. Consider perturbing AA^{\star} slightly along the variety of sparse matrices to some AA. While this ensures that μ(A)μ(A)\mu(A)\approx\mu(A^{\star}), changing BB^{\star} to B+(AA)B^{\star}+(A^{\star}-A) moves BB^{\star} outside the variety of rank-constrained matrices. Therefore the dimension of the tangent space T(B+AA)T(B^{\star}+A^{\star}-A) is greater than that of T(B)T(B^{\star}), resulting in ξ(B+AA)ξ(B)\xi(B^{\star}+A^{\star}-A)\gg\xi(B^{\star}); consequently we have that μ(A)ξ(B+AA)16\mu(A)\xi(B^{\star}+A^{\star}-A)\gg\frac{1}{6}.

We discuss concrete classes of sparse and low-rank matrices that satisfy the sufficient condition of Theorem 2 for exact decomposition. We begin by showing that sparse matrices with “bounded degree”, i.e., bounded number of non-zeros per row/column, have small μ\mu.

then the unique optimum of the convex program (3) is (A^,B^)=(A,B)(\hat{A},\hat{B})=(A^{\star},B^{\star}) for a range of values of γ\gamma:

We emphasize that this is a result with deterministic sufficient conditions on exact decomposability.

4 Decomposing random sparse and low-rank matrices

Next we show that sparse and low-rank matrices drawn from certain natural random ensembles satisfy the sufficient conditions of Corollary 3 with high probability. We first consider random sparse matrices with a fixed number of non-zero entries.

The proof of this lemma follows from a standard balls and bins argument, and can be found in several references (see for example ).

Next we consider low-rank matrices in which the singular vectors are chosen uniformly at random from the set of all partial isometries. Such a model was considered in recent work on the matrix completion problem , which aims to recover a low-rank matrix given observations of a subset of entries of the matrix.

Random orthogonal model [4]

As shown in , low-rank matrices drawn from such a model have incoherent row/column spaces.

Applying these two results in conjunction with Corollary 3, we have that sparse and low-rank matrices drawn from the random sparsity model and the random orthogonal model can be uniquely decomposed with high probability.

Thus, for matrices BB^{\star} with rank kk smaller than nn the SDP (3) yields exact recovery with high probability even when the size of the support of AA^{\star} is super-linear in nn. During final preparation of this manuscript we learned of related contemporaneous work that specifically studies the problem of decomposing random sparse and low-rank matrices. In addition to the assumptions of our random sparsity and random orthogonal models, also requires that the non-zero entries of AA^{\star} have independently chosen signs that are ±1\pm 1 with equal probability, while the left and right singular vectors of BB^{\star} are chosen independent of each other. For this particular specialization of our more general framework, the results in improve upon our bound in Corollary 4.

Implications for the matrix rigidity problem

which satisfies the sufficient condition of Corollary 4 for exact recovery. Therefore, while the rigidity of a matrix is NP-hard to compute in general , for such low-rigidity matrices MM one can compute the rigidity RM(ϵn)R_{M}(\epsilon n); in fact the SDP (3) provides a certificate of the sparse and low-rank matrices that form the low rigidity matrix MM.

Simulation Results

We confirm the theoretical predictions in this paper with some simple experimental results. We also present a heuristic to choose the trade-off parameter γ\gamma. All our simulations were performed using YALMIP and the SDPT3 software for solving SDPs.

Next we consider the problem of choosing the trade-off parameter γ\gamma. Based on Theorem 2 we know that exact recovery is possible for a range of γ\gamma. Therefore, one can simply check the stability of the solution (A^,B^)(\hat{A},\hat{B}) as γ\gamma is varied without knowing the appropriate range for γ\gamma in advance. To formalize this scheme we consider the following SDP for tt\in, which is a slightly modified version of (3):

There is a one-to-one correspondence between (3) and (18) given by t=γ1+γt=\frac{\gamma}{1+\gamma}. The benefit in looking at (18) is that the range of valid parameters is compact, i.e., tt\in, as opposed to the situation in (3) where γ[0,)\gamma\in[0,\infty). We compute the difference between solutions for some tt and tϵt-\epsilon as follows:

Discussion

We have studied the problem of exactly decomposing a given matrix C=A+BC=A^{\star}+B^{\star} into its sparse and low-rank components AA^{\star} and BB^{\star}. This problem arises in a number of applications in model selection, system identification, complexity theory, and optics. We characterized fundamental identifiability in the decomposition problem based on a notion of rank-sparsity incoherence, which relates the sparsity pattern of a matrix and its row/column spaces via an uncertainty principle. As the general decomposition problem is NP-hard we propose a natural SDP relaxation (3) to solve the problem, and provide sufficient conditions on sparse and low-rank matrices so that the SDP exactly recovers such matrices. Our sufficient conditions are deterministic in nature; they essentially require that the sparse matrix must have support that is not too concentrated in any row/column, while the low-rank matrix must have row/column spaces that are not closely aligned with the coordinate axes. Our analysis centers around studying the tangent spaces with respect to the algebraic varieties of sparse and low-rank matrices. Indeed the sufficient conditions for identifiability and for exact recovery using the SDP can also be viewed as requiring that certain tangent spaces have a transverse intersection. We also demonstrated the implications of our results for the matrix rigidity problem.

An interesting problem for further research is the development of special-purpose algorithms that take advantage of structure in (3) to provide a more efficient solution than a general-purpose SDP solver. Another question that arises in applications such as model selection (due to noise or finite sample effects) is to approximately decompose a matrix into sparse and low-rank components.

Acknowledgments

The authors would like to thank Dr. Benjamin Recht and Prof. Maryam Fazel for helpful discussions.

Appendix A SDP formulation

The problem (3) can be recast as a semidefinite program (SDP). We appeal to the fact that the spectral norm \|\cdot\| is the dual norm of the nuclear norm \|\cdot\|_{\ast}:

Further, the spectral norm admits a simple semidefinite characterization :

From duality, we can obtain the following SDP characterization of the nuclear norm:

Putting these facts together, (3) can be rewritten as

Appendix B Proofs

where PΩ(A)(N)P_{\Omega(A^{\star})}(N) denotes the projection onto the space Ω(A)\Omega(A^{\star}). Assume for the sake of a contradiction that this assertion is not true. Thus, there exists N0N\neq 0 such that NΩ(A)T(B)N\in\Omega(A^{\star})\cap T(B^{\star}). Scale NN appropriately such that N=1\|N\|=1. Thus NT(B)N\in T(B^{\star}) with N=1\|N\|=1, but we also have that PΩ(A)(N)=N=1\|P_{\Omega(A^{\star})}(N)\|=\|N\|=1 as NΩ(A)N\in\Omega(A^{\star}). This leads to a contradiction.

which would allow us to conclude the proof of this proposition. We have the following sequence of inequalities

Here the first inequality follows from the definition (2) of μ(A)\mu(A^{\star}) as PΩ(A)(N)Ω(A)P_{\Omega(A^{\star})}(N)\in\Omega(A^{\star}), the second inequality is due to the fact that PΩ(A)(N)N\|P_{\Omega(A^{\star})}(N)\|_{\infty}\leq\|N\|_{\infty}, and the final inequality follows from the definition (1) of ξ(B)\xi(B^{\star}). \square

Proof of Proposition 2

We first show that (A,B)(A^{\star},B^{\star}) is an optimum of (3), before moving on to showing uniqueness. Based on subgradient optimality conditions applied at (A,B)(A^{\star},B^{\star}), there must exist a dual QQ such that

The second condition in this proposition guarantees the existence of a dual QQ that satisfies both these subgradient conditions simultaneously (see (12) and (13)). Therefore, we have that (A,B)(A^{\star},B^{\star}) is an optimum. Next we show that under the conditions specified in the lemma, (A,B)(A^{\star},B^{\star}) is also a unique optimum. To avoid cluttered notation, in the rest of this proof we let Ω=Ω(A)\Omega=\Omega(A^{\star}), T=T(B)T=T(B^{\star}), Ωc(A)=Ωc\Omega^{c}(A^{\star})=\Omega^{c}, and T(B)=TT^{\bot}(B^{\star})=T^{\bot}.

Suppose that there is another feasible solution (A+NA,B+NB)(A^{\star}+N_{A},B^{\star}+N_{B}) that is also a minimizer. We must have that NA+NB=0N_{A}+N_{B}=0 because A+B=C=(A+NA)+(B+NB)A^{\star}+B^{\star}=C=(A^{\star}+N_{A})+(B^{\star}+N_{B}). Applying the subgradient property at (A,B)(A^{\star},B^{\star}), we have that for any subgradient (QA,QB)(Q_{A},Q_{B}) of the function γA1+B\gamma\|A\|_{1}+\|B\|_{\ast} (at (A,B)(A^{\star},B^{\star}))

Since (QA,QB)(Q_{A},Q_{B}) is a subgradient of the function γA1+B\gamma\|A\|_{1}+\|B\|_{\ast} at (A,B)(A^{\star},B^{\star}), we must have from (12) and (13) that

QB=UV+PT(QB)Q_{B}=UV^{\prime}+P_{T^{\bot}}(Q_{B}), with PT(QB)1\|P_{T^{\bot}}(Q_{B})\|\leq 1.

Using these conditions we rewrite QA,NA\langle Q_{A},N_{A}\rangle and QB,NB\langle Q_{B},N_{B}\rangle. Based on the existence of the dual QQ as described in the lemma, we have that

where we have used the fact that Q=UV+PT(Q)Q=UV^{\prime}+P_{T^{\bot}}(Q). Putting (24) and (25) together, we have that

In the second equality, we used the fact that NA+NB=0N_{A}+N_{B}=0.

Since PΩc(Q)<γ\|P_{\Omega^{c}}(Q)\|_{\infty}<\gamma and PT(Q)<1\|P_{T^{\bot}}(Q)\|<1, we have that QA,NA+QB,NB\langle Q_{A},N_{A}\rangle+\langle Q_{B},N_{B}\rangle is strictly positive unless PΩc(NA)=0P_{\Omega^{c}}(N_{A})=0 and PT(NB)=0P_{T^{\bot}}(N_{B})=0. (Note that if QA,NA+QB,NB>0\langle Q_{A},N_{A}\rangle+\langle Q_{B},N_{B}\rangle>0 then γA+NA1+B+NB>γA1+B\gamma\|A^{\star}+N_{A}\|_{1}+\|B^{\star}+N_{B}\|_{\ast}>\gamma\|A^{\star}\|_{1}+\|B^{\star}\|_{\ast}.) However, we have that NA+NB=0N_{A}+N_{B}=0. If PΩc(NA)=PT(NB)=0P_{\Omega^{c}}(N_{A})=P_{T^{\bot}}(N_{B})=0, then PΩ(NA)+PT(NB)=0P_{\Omega}(N_{A})+P_{T}(N_{B})=0. In other words, PΩ(NA)=PT(NB)P_{\Omega}(N_{A})=-P_{T}(N_{B}). This can only be possible if PΩ(NA)=PT(NB)=0P_{\Omega}(N_{A})=P_{T}(N_{B})=0 (as ΩT={0}\Omega\cap T=\{0\}), which implies that NA=NB=0N_{A}=N_{B}=0. Therefore, γA+NA1+B+NB>γA1+B\gamma\|A^{\star}+N_{A}\|_{1}+\|B^{\star}+N_{B}\|_{\ast}>\gamma\|A^{\star}\|_{1}+\|B^{\star}\|_{\ast} unless NA=NB=0N_{A}=N_{B}=0. \square

Proof of Theorem 2

As with the previous proof, we avoid cluttered notation by letting Ω=Ω(A)\Omega=\Omega(A^{\star}), T=T(B)T=T(B^{\star}), Ωc(A)=Ωc\Omega^{c}(A^{\star})=\Omega^{c}, and T(B)=TT^{\bot}(B^{\star})=T^{\bot}. One can check that

Thus, we show that if ξ(B)μ(A)<16\xi(B^{\star})\mu(A^{\star})<\frac{1}{6} then there exists a range of γ\gamma for which a dual QQ with the requisite properties exists. Also note that plugging in ξ(B)μ(A)=16\xi(B^{\star})\mu(A^{\star})=\tfrac{1}{6} in the above range gives the smaller range (3ξ(B),12μ(A))(3\xi(B^{\star}),\frac{1}{2\mu(A^{\star})}) for γ\gamma; the geometric mean of the extreme values gives γ=3ξ(B)2μ(A)\gamma=\sqrt{\tfrac{3\xi(B^{\star})}{2\mu(A^{\star})}}, which is always within the above range.

Next, we obtain the following bound on PΩc(Q^)\|P_{\Omega^{c}}(\hat{Q})\|_{\infty}:

where we obtain the second inequality based on the definition of ξ(B)\xi(B^{\star}) (since UV+ϵTTUV^{\prime}+\epsilon_{T}\in T). Similarly, we can obtain the following bound on PT(Q^)\|P_{T^{\bot}}(\hat{Q})\|

By definition of ξ(B)\xi(B^{\star}) and using (28),

where the second inequality is obtained because UV+ϵTTUV^{\prime}+\epsilon_{T}\in T. Similarly, by definition of μ(A)\mu(A^{\star}) and using (29)

Similarly, putting (33) in (32), we have that

We now show that PT(Q^)<1\|P_{T^{\bot}}(\hat{Q})\|<1. Combining (35) and (31),

since γ<13ξ(B)μ(A)μ(A)\gamma<\frac{1-3\xi(B^{\star})\mu(A^{\star})}{\mu(A^{\star})} by assumption.

Finally, we show that PΩc(Q^)<γ\|P_{\Omega^{c}}(\hat{Q})\|_{\infty}<\gamma. Combining (34) and (30),

Here, we used the fact that ξ(B)14ξ(B)μ(A)<γ\frac{\xi(B^{\star})}{1-4\xi(B^{\star})\mu(A^{\star})}<\gamma in the second inequality. \square

Proof of Proposition 3

Based on the Perron-Frobenius theorem , one can conclude that PQ\|P\|\geq\|Q\| if Pi,jQi,j,  i,jP_{i,j}\geq|Q_{i,j}|,~{}\forall~{}i,j. Thus, we need only consider the matrix that has 11 in every location in the support set Ω(A)\Omega(A) and everywhere else. Based on the definition of the spectral norm, we can re-write μ(A)\mu(A) as follows:

Without loss of generality we restrict our attention to optima that are achieved by element-wise non-negative vectors x,yx,y.

Since the reformulation of μ(A)\mu(A) above involves the maximization of a continuous function over a compact set, the maximum is achieved at some point in the constraint set. Therefore, we have that any optimal (x^,y^)(\hat{x},\hat{y}) must satisfy the following necessary optimality conditions: There exist Lagrange multipliers λ1,λ2\lambda_{1},\lambda_{2} such that

This reduces to the following system of equations:

Multiplying the first system of equations (37) element-wise by x^\hat{x} and then summing, we have that

Similarly, we have that (i,j)Ω(A)x^iy^j=2λ2\sum_{(i,j)\in\Omega(A)}\hat{x}_{i}\hat{y}_{j}=2\lambda_{2}, which implies that the Lagrange multipliers are equal to each other and to one-half of the optimal value attained

We recall here that the optimal points x^,y^\hat{x},\hat{y} are element-wise non-negative. Let σ\sigma denote the element-wise sum of the optimal points x^,y^\hat{x},\hat{y}:

Summing over all ii in (37) and all jj in (38), we have that

Lower bound

Here we set x=y=1n1,x=y=\frac{1}{\sqrt{n}}\mathbf{1}, with 1\mathbf{1} representing the all-ones vector, as candidates in the optimization problem (36). \square

Proof of Proposition 4

For the second inequality, we have used the fact that the maximum of a convex function over a convex set is achieved at one of the extreme points of the constraint set. The unitary matrices are the extreme points of the set of contractions (i.e., matrices with spectral norm 1\leq 1). We have used PT(B)(M)=PUM+MPVPUMPVP_{T(B)}(M)=P_{U}M+MP_{V}-P_{U}MP_{V} from (9) in the last inequality, where PU=UUTP_{U}=UU^{T} and PV=VVTP_{V}=VV^{T} denote the projections onto the spaces spanned by UU and VV respectively.

We have the following simple bound for PUM\|P_{U}M\|_{\infty} with MM unitary:

Here we used the Cauchy-Schwartz inequality in the second line, and the definition of β\beta from (14) in the last line.

Lower bound

Next we prove a lower bound on ξ(B)\xi(B). Recall the definition of the tangent space T(B)T(B) from (6). We restrict our attention to elements of the tangent space T(B)T(B) of the form PUM=UUTMP_{U}M=UU^{T}M for MM unitary (an analogous argument follows for elements of the form PVMP_{V}M for MM unitary). One can check that

Thus, we only need to show that the inequality in line (2)(2) of (39) is achieved by some unitary matrix MM in order to conclude that ξ(B)β(U)\xi(B)\geq\beta(U). Define the “most aligned” basis vector with the subspace UU as follows:

Let MM be any unitary matrix with one of its columns equal to 1β(U)PUei\frac{1}{\beta(U)}P_{U}e_{i^{\ast}}, i.e., a normalized version of the projection onto UU of the most aligned basis vector. One can check that such a unitary matrix achieves equality in line (2)(2) of (39). Consequently, we have that

By a similar argument with respect to VV, we have the lower bound as claimed in the proposition. \square

References