Relative-Error CUR Matrix Decompositions

Petros Drineas, Michael W. Mahoney, S. Muthukrishnan

Introduction

Large m×nm\times n matrices are common in applications since the data often consist of mm objects, each of which is described by nn features. Examples of object–feature pairs include: documents and words contained in those documents; genomes and environmental conditions under which gene responses are measured; stocks and their associated temporal resolution; hyperspectral images and frequency resolution; and web groups and individual users. In each of these application areas, practitioners spend vast amounts of time analyzing the data in order to understand, interpret, and ultimately use this data for some application-specific task.

Say that AA is the m×nm\times n data matrix. In many cases, an important step in data analysis is to construct a compressed representation of AA that may be easier to analyze and interpret. The most common such representation is obtained by truncating the Singular Value Decomposition (SVD) at some number kmin{m,n}k\ll\min\{m,n\} terms, in large part because this provides the “best” rank-kk approximation to AA when measured with respect to any unitarily invariant matrix norm. Unfortunately, the basis vectors (the so-called eigencolumns and eigenrows) provided by this approximation (and with respect to which every column and row of the original data matrix is expressed) are notoriously difficult to interpret in terms of the underlying data and processes generating that data. For example, the vector [(1/2)(1/2) age - (1/2)(1/\sqrt{2}) height + (1/2)(1/2) income], being one of the significant uncorrelated “factors” from a dataset of people’s features is not particularly informative. It would be highly preferable to have a low-rank approximation that is nearly as good as that provided by the SVD but that is expressed in terms of a small number of actual columns and/or actual rows of a matrix, rather than linear combinations of those columns and rows.

The main contribution of this paper is to provide such decompositions. In particular, we provide what we call a relative-error CUR matrix decomposition: given an m×nm\times n matrix AA, we decompose it as a product of three matrices, CC, UU, and RR, where CC consists of a small number of actual columns of AA, RR consists of a small number of actual rows of AA, and UU is a small carefully constructed matrix that guarantees that the product CURCUR is “close” to AA. In fact, CURCUR will be nearly as good as the best low-rank approximation to AA that is traditionally used and that is obtained by truncating the SVD. Hence, the columns of AA that are included in CC, as well as the rows of AA that are included in RR, can be used in place of the eigencolumns and eigenrows, with the added benefit of improved interpretability in terms of the original data.

Before describing applications of our main results in the next subsection, we would like to emphasize that two research communities, the Numerical Linear Algebra (NLA) community and the Theoretical Computer Science (TCS) community, have provided significant practical and theoretical motivation for studying variants of these matrix decompositions over the last ten years. In Section 3, we provide a detailed treatment of relevant prior work in both the NLA and the TCS literature. The two algorithms presented in this paper are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist.

As an example of this preference for having the data matrix expressed in terms of a small number of actual columns and rows of the original matrix, as opposed to a small number of eigencolumns and eigenrows, consider recent data analysis work in DNA microarray and DNA Single Nucleotide Polymorphism (SNP) analysis [KPS02, LA04, Paschou07a]. DNA SNP data are often modeled as an m×nm\times n matrix AA, where mm is the number of individuals in the study, nn is the number of SNPs being analyzed, and AijA_{ij} is an encoding of the jj-th SNP value for the ii-th individual. Similarly, for DNA microarray data, mm is the number of genes under consideration, nn is the number of arrays or environmental conditions, and AijA_{ij} is the absolute or relative expression level of the ii-th gene in the jj-th environmental condition. Biologists typically have an understanding of a single gene that they fail to have about a linear combination of 60006000 genes (and also similarly for SNPs, individuals, and arrays); thus, recent work in genetics on DNA microarray and DNA SNP data has focused on heuristics to extract actual genes, environmental conditions, individuals, and SNPs from the eigengenes, eigenconditions, eigenpeople, and eigenSNPs computed from the original data matrices [KPS02, LA04].For example, in their review article “Vector algebra in the analysis of genome-wide expression data” [KPS02], which appeared in Genome Biology, Kuruvilla, Park, and Schreiber describe many uses of the vectors provided by the SVD and PCA in DNA microarray analysis. The three biologists then conclude by stating that: “While very efficient basis vectors, the vectors themselves are completely artificial and do not correspond to actual (DNA expression) profiles. … Thus, it would be interesting to try to find basis vectors for all experiment vectors, using actual experiment vectors and not artificial bases that offer little insight.” That is, they explicitly state that they would like decompositions of the form we provide in this paper! Our CUR matrix decomposition is a direct formulation of this problem: determine a small number of actual SNPs that serve as a basis with which to express the remaining SNPs, and a small number of individuals to serve as a basis with which to express the remaining individuals. In fact, motivated in part by this, we have successfully applied a variant of the CUR matrix decomposition presented in this paper to intra- and inter-population genotype reconstruction from tagging SNPs in DNA SNP data from a geographically-diverse set of populations [Paschou07a]. In addition, we have applied a different variant of our CUR matrix decomposition to hyperspectrally-resolved medical imaging data [MMD06]. In this application, a column corresponds to an image at a single physical frequency and a row corresponds to a single spectrally-resolved pixel, and we have shown that data reconstruction and classification tasks can be performed with little loss in quality even after substantial data compression [MMD06].

A quite different motivation for low-rank matrix approximations expressed in terms of a small number of columns and/or rows of the original matrix is to decompose efficiently large low-rank matrices that possess additional structure such as sparsity or non-negativity. This often arises in the analysis of, e.g., large term-document matrices [Ste99, Ste04_TR, BPS04_TR]. Another motivation comes from statistical learning theory, where the data need not even be elements in a vector space, and thus expressing the Gram matrix in terms of a small number of actual data points is of interest [WS01, WRST02_TR, dm_kernel_CONF, dm_kernel_JRNL]. This procedure has been shown empirically to perform well for approximate Gaussian process classification and regression [WS01], to approximate the solution of spectral partitioning for image and video segmentation [FBCM04], and to extend the eigenfunctions of a data-dependent kernel to new data points [BPVDRO03, LafonTH]. Yet another motivation is provided by integral equation applications [GE96, GTZ97, GT01], where large coefficient matrices arise that have blocks corresponding to regions where the kernel is smooth and that are thus well-approximated by low-rank matrices. In these applications, partial SVD algorithms can be expensive, and a description in terms of actual columns and/or rows is of interest [GTZ97, GT01]. A final motivation for studying matrix decompositions of this form is to obtain low-rank matrix approximations to extremely large matrices where a computation of the SVD is too expensive [FKV98, FKV04, dkm_matrix1, dkm_matrix2, dkm_matrix3].

2 Our Main Results

Our main algorithmic results have to do with efficiently computing low-rank matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the input matrix. We start with the following definition.

Let AA be an m×nm\times n matrix. For any given CC, an m×cm\times c matrix whose columns consist of cc columns of the matrix AA, the m×nm\times n matrix A=CXA^{\prime}=CX is a column-based matrix approximation to AA, or CX matrix decomposition, for any c×nc\times n matrix XX.

Here, CC is a matrix consisting of the chosen columns of AA, CC+ACC^{+}A is the projection of AA on the subspace spanned by the chosen columns, and AkA_{k} is the best rank-kk approximation to AA. Both algorithms run in time O(SVD(A,k))O(SVD(A,k)), which is the time required to compute the best rank-kk approximation to the matrix AA [GVL96].

Note that we use c>kc>k and have an ϵ\epsilon error, which allows us to take advantage of linear algebraic structure in order to obtain an efficient algorithm. In general, this would not be the case if, given an m×nm\times n matrix AA, we had specified a parameter kk and asked for the “best” subset of kk columns, where “best” is measured, e.g., by maximizing the Frobenius norm captured by projecting onto those columns or by maximizing the volume of the parallelepiped defined by those columns. Also, it is not clear a priori that CC with properties above even exists; see the discussion in Sections 3.2 and 3.3. Finally, our result does not include any reference to regularization or conditioning, as is common in certain application domains; a discussion of similar work on related problems in numerical linear algebra may be found in Section 3.1.

Our second main result extends the previous result to CUR matrix decompositions.

Let AA be an m×nm\times n matrix. For any given CC, an m×cm\times c matrix whose columns consist of cc columns of the matrix AA, and RR, an r×nr\times n matrix whose rows consist of rr rows of the matrix AA, the m×nm\times n matrix A=CURA^{\prime}=CUR is a column-row-based matrix approximation to AA, or CUR matrix decomposition, for any c×rc\times r matrix UU.

Several things should be noted about this definition. First, a CUR matrix decomposition is a CX matrix decomposition, but one with a very special structure, i.e., every column of AA can be expressed in terms of the basis provided by CC using only the information contained in a small number of rows of AA and a low-dimensional encoding matrix. Second, in terms of its singular value structure, UU must clearly contain “inverse-of-AA” information. For the CUR decomposition described in this paper, UU will be a generalized inverse of the intersection between CC and RR. More precisely, if C=ASCDCC=AS_{C}D_{C} and R=DRSRTAR=D_{R}S_{R}^{T}A then U=(DRSRTASCDC)+U=(D_{R}S_{R}^{T}AS_{C}D_{C})^{+}. (See Section 2 for a review of linear algebra and notation, such as that for SCS_{C}, DCD_{C}, SRS_{R}, and DRD_{R}.) Third, the combined size of CC, UU and RR is O(mc+rn+cr)O(mc+rn+cr), which is an improvement over AA’s size of O(mn)O(mn) when c,rn,mc,r\ll n,m. Finally, note the structural simplicity of a CUR matrix decomposition:

Our main result for CUR matrix decomposition is the following.

Here, the matrix UU is a weighted Moore-Penrose inverse of the intersection between CC and RR, and AkA_{k} is the best rank-kk approximation to AA. Both algorithms run in time O(SVD(A,k))O(SVD(A,k)), which is the time required to compute the best rank-kk approximation to the matrix AA [GVL96].

3 Summary of Main Technical Result

The key technical insight that leads to the relative-error guarantees is that the columns are selected by a novel sampling procedure that we call “subspace sampling.” Rather than sample columns from AA with a probability distribution that depends on the Euclidean norms of the columns of AA (which gives provable additive-error bounds [dkm_matrix1, dkm_matrix2, dkm_matrix3]), in “subspace sampling” we randomly sample columns of AA with a probability distribution that depends on the Euclidean norms of the rows of the top kk right singular vectors of AA. This allows us to capture entirely a certain subspace of interest. Let VA,kV_{A,k} be the n×kn\times k matrix whose columns consist of the top kk right singular vectors of AA. The “subspace sampling” probabilities pi,i[n]p_{i},i\in[n] will satisfy

for some β(0,1]\beta\in(0,1], where (VA,k)(i)\left(V_{A,k}\right)_{(i)} is the ii-th row of VA,kV_{A,k}. That is, we will sample based on the norms of the rows (not the columns) of the truncated matrix of singular vectors. Note that j=1n\mbox(VA,k)(j)22=k\sum_{j=1}^{n}\mbox{}\left|\left(V_{A,k}\right)_{(j)}\right|_{2}^{2}=k and that i[n]pi=1\sum_{i\in[n]}p_{i}=1. To construct sampling probabilities satisfying Condition (4), it is sufficient to spend O(SVD(A,k))O(SVD(A,k)) time to compute (exactly or approximately, in which case β=1\beta=1 or β<1\beta<1, respectively) the top kk right singular vectors of AA. Sampling probabilities of this form will allow us to deconvolute subspace information and “size-of-AA” information in the input matrix AA, which in turn will allow us to obtain the relative-error guarantees we desire. Note that we have used this method previously [DMM06], but in that case the sampling probabilities contained other terms that complicated their interpretation.

That is, fit every column of the matrix BB to the basis provided by the columns of the rank-kk matrix AA. Also of interest is the computation of

4 Outline of the Remainder of the Paper

Review of Linear Algebra

In this section, we provide a review of linear algebra that will be useful throughout the paper; for more details, see [Nashed76, HJ85, Stewart90, GVL96, Bhatia97, BIG03]. We also review a sampling matrix formalism that will be convenient in our discussion [dkm_matrix1].

Relationship with Previous Related Work

In this section, we discuss the relationship between our results and related work in numerical linear algebra and theoretical computer science.

Within the numerical linear algebra community, several groups have studied matrix decompositions with similar structural, if not algorithmic, properties to the CX and CUR matrix decompositions we have defined. Much of this work is related to the QR decomposition, originally used extensively in pivoted form by Golub [Golub65, BG65].

Stewart and collaborators were interested in computing sparse low-rank approximations to large sparse term-document matrices [Ste99, Ste04_TR, BPS04_TR]. He developed the quasi-Gram-Schmidt method. This method is a variant of the QR decomposition which, when given as input an m×nm\times n matrix AA and a rank parameter kk, returns an m×km\times k matrix CC consisting of kk columns of AA whose span approximates the column space of AA and also a nonsingular upper-triangular k×kk\times k matrix TCT_{C} that orthogonalizes these columns (but it does not explicitly compute the nonsparse orthogonal matrix QC=CTC1Q_{C}=CT_{C}^{-1}). This provides a matrix decomposition of the form ACXA\approx CX. By applying this method to AA to obtain CC and to ATA^{T} to obtain an k×nk\times n matrix RR consisting of kk rows of AA, one can show that ACURA\approx CUR, where the matrix UU is computed to minimize \mboxACURF2\mbox{}\left\|A-CUR\right\|_{F}^{2}. Although provable approximation guarantees of the form we present were not provided, backward error analysis was performed and the method was shown to perform well empirically [Ste99, Ste04_TR, BPS04_TR].

Goreinov, Tyrtyshnikov, and Zamarashkin [GTZ97, GT01, Tyr04b] were interested in applications such as scattering, in which large coefficient matrices have blocks that can be easily approximated by low-rank matrices. They show that if the matrix AA is approximated by a rank-kk matrix to within an accuracy ϵ\epsilon then there exists a choice of kk columns and kk rows, i.e., CC and RR, and a low-dimensional k×kk\times k matrix UU constructed from the elements of CC and RR, such that ACURA\approx CUR in the sense that \mboxACUR2ϵf(m,n,k)\mbox{}\left\|A-CUR\right\|_{2}\leq\epsilon f(m,n,k), where f(m,n,k)=1+2km+2knf(m,n,k)=1+2\sqrt{km}+2\sqrt{kn}. In [GTZ97], the choice for these matrices is related to the problem of determining the minimum singular value σk\sigma_{k} of k×kk\times k submatrices of n×kn\times k orthogonal matrices. In addition: in [GT01] the choice for CC and RR is interpreted in terms of the maximum volume concept from interpolation theory, in the sense that columns and rows should be chosen such that their intersection WW defines a parallelepiped of maximum volume among all k×kk\times k submatrices of AA; and in [Tyr04b] an empirically effective deterministic algorithm is presented which ensures that UU is well-conditioned.

Gu and Eisenstat, in their seminal paper [GE96], describe a strong rank-revealing QR factorization that deterministically selects exactly kk columns from an m×nm\times n matrix AA. The algorithms of [GE96] are efficient, in that their running time is O(mn2)O(mn^{2}) (assuming that mnm\geq n), which is essentially the time required to compute the SVD of AA. In addition, Gu and Eisenstat prove that if the m×km\times k matrix CC contains the kk selected columns (without any rescaling), then σmin(C)σk(A)/f(k,n)\sigma_{\min}(C)\geq\sigma_{k}(A)/f(k,n), where f(k,n)=O(k(nk))f(k,n)=O(\sqrt{k(n-k)}). Thus, the columns of CC span a parallelepiped whose volume (equivalently, the product of the singular values of CC) is “large.” Currently, we do not know how to convert this property into a statement similar to that of Theorem 1, although perhaps this can be accomplished by relaxing the number of columns selected by the algorithms of [GE96] to O(poly(k,1/ϵ))O(poly(k,1/\epsilon)). For related work prior to Gu and Eisenstat, see Chan and Hansen [CH90, CH92].

2 Related Work in Theoretical Computer Science

Within the theory of algorithms community, much research has followed the seminal work of Frieze, Kannan, and Vempala [FKV98, FKV04]. Their work may be viewed, in our parlance, as sampling columns from a matrix AA to form a matrix CC such that \mboxACXF\mboxAAkF+ϵ\mboxAF\mbox{}\left\|A-CX\right\|_{F}\leq\mbox{}\left\|A-A_{k}\right\|_{F}+\epsilon\mbox{}\left\|A\right\|_{F}. The matrix CC has poly(k,1/ϵ,1/δ)poly(k,1/\epsilon,1/\delta) columns and is constructed after making only two passes over AA using O(m+n)O(m+n) work space. Under similar resource constraints, a series of papers have followed [FKV98, FKV04] in the past seven years [DFKVV99, dkm_matrix2, RV03], improving the dependency of cc on k,1/ϵk,1/\epsilon, and 1/δ1/\delta, and analyzing the spectral as well as the Frobenius norm, yielding bounds of the form

for ξ=2,F\xi=2,F, and thus providing additive-error guarantees for column-based low-rank matrix approximations.

Additive-error approximation algorithms for CUR matrix decompositions have also been analyzed by Drineas, Kannan, and Mahoney [DK03, dkm_matrix1, dkm_matrix2, dkm_matrix3, dm_kernel_CONF, dm_kernel_JRNL]. In particular, in [dkm_matrix3], they compute an approximation to an m×nm\times n matrix AA by sampling cc columns and rr rows from AA to form m×cm\times c and r×nr\times n matrices CC and RR, respectively. From CC and RR, a c×rc\times r matrix UU is constructed such that under appropriate assumptions

with high probability, for both the spectral and Frobenius norms, ξ=2,F\xi=2,F. In [dm_kernel_CONF, dm_kernel_JRNL], it is further shown that if AA is a symmetric positive semidefinite (SPSD) matrix, then one can choose R=CTR=C^{T} and U=W+U=W^{+}, where WW is the c×cc\times c intersection between CC and R=CTR=C^{T}, thus obtaining an approximation AA=CW+CTA\approx A^{\prime}=CW^{+}C^{T}. This approximation is SPSD and has provable bounds of the form (12), except that the scale of the additional additive error is somewhat larger [dm_kernel_CONF, dm_kernel_JRNL].

Most relevant for our relative-error CX and CUR matrix decomposition algorithms is the recent work of Rademacher, Vempala and Wang [RVW05] and Deshpande, Rademacher, Vempala and Wang [DRVW06]. Using two different methods (in one case iterative sampling in a backwards manner and an induction on kk argument [RVW05] and in the other case an argument which relies on estimating the volume of the simplex formed by each of the kk-sized subsets of the columns [DRVW06]), they reported the existence of a set of O(k2/ϵ2)O(k^{2}/\epsilon^{2}) columns that provide relative-error CX matrix decomposition. No algorithmic result was presented, except for an exhaustive algorithm that ran in Ω(nk)\Omega(n^{k}) time. Note that their results did not apply to columns and rows simultaneously. Thus, ours is the first CUR matrix decomposition algorithm with relative error, and it was previously not even known whether such a relative-error CURCUR representation existed, i.e., it was not previously known whether columns and rows satisfying the conditions of Theorem 2 existed.

Other related work includes that of Rudelson and Vershynin [Rud99, V03, RV06_DRAFT], who provide an algorithm for CX matrix decomposition which has an improved additive error spectral norm bound of the form

Their proof uses an elegant result on random vectors in the isotropic position [Rud99], and since we use a variant of their result, it is described in more detail in Appendix LABEL:sxn:matrix_multiply. Achlioptas and McSherry have computed low-rank matrix approximations using sampling techniques that involve zeroing-out and/or quantizing individual elements [AM01, AM03]. The primary focus of their work was in introducing methods to accelerate orthogonal iteration and Lanczos iteration methods, and their analysis relied heavily on ideas from random matrix theory [AM01, AM03]. Agarwal, Har-Peled, and Varadarajan have analyzed so-called “core sets” as a tool for efficiently approximating various extent measures of a point set [AHV04, AHV06]. The choice of columns and/or rows we present are a “core set” for approximate matrix computations; in fact, our algorithmic solution to Theorem 1 solves an open question in their survey [AHV06]. The choice of columns and rows we present may also be viewed as a set of variables and features chosen from a data matrix [BL97, CGKRS00, GE03]. “Feature selection” is a broad area that addresses the choice of columns explicitly for dimension reduction, but the metrics there are typically optimization-based [CGKRS00] or machine-learning based [BL97]. These formulations tend to have set-cover like solutions and are incomparable with the linear-algebraic structure such as the low-rank criteria we consider here that is common among data analysts.

3 Very Recent Work on Relative-Error Approximation Algorithms

To the best of our knowledge, the first nontrivial algorithmic result for relative-error low-rank matrix approximation was provided by a preliminary version of this paper [DMM06_relerr_110305, DMM06_relerr_TR]. In particular, an earlier version of Theorem 1 provided the first known relative-error column-based low-rank approximation in polynomial time [DMM06_relerr_110305, DMM06_relerr_TR]. The major difference between our Theorem 1 and our result in [DMM06_relerr_110305, DMM06_relerr_TR] is that the sampling probabilities in [DMM06_relerr_110305, DMM06_relerr_TR] are more complicated. (See Section LABEL:sxn:main_l2_alg:discussion for details on this.) The algorithm from [DMM06_relerr_110305, DMM06_relerr_TR] runs in O(SVD(A,k))O(SVD(A,k)) time (although it was originally reported to run in only O(SVD(A))O(SVD(A)) time), and it has a sampling complexity of O(k2log(1/δ)/ϵ2)O(k^{2}\log(1/\delta)/\epsilon^{2}) columns.

Subsequent to the completion of the preliminary version of this paper [DMM06_relerr_110305, DMM06_relerr_TR], several developments have been made on relative-error low-rank matrix approximation algorithms. First, Har-Peled reported an algorithm that takes as input an m×nm\times n matrix AA, and in roughly O(mnk2logk)O(mnk^{2}\log k) time returns as output a rank-kk matrix AA^{\prime} with a relative-error approximation guarantee [HarPeled06_relerr_DRAFT]. His algorithm uses geometric ideas and involves sampling and merging approximately optimal kk-flats; it is not clear if this approximation can be expressed in terms of a small number of columns of AA. Then, Deshpande and Vempala [DV06_relerr_TR] reported an algorithm that takes as input an m×nm\times n matrix AA that also returns a relative-error approximation guarantee. Their algorithm extends ideas from [RVW05, DRVW06], and it leads to a CX matrix decomposition consisting of O(klogk)O(k\log k) columns of AA. The complexity of their algorithm is O(Mk2logk)O(Mk^{2}\log k), where MM is the number of nonzero elements of AA, and their algorithm can be implemented in a data streaming framework with O(klogk)O(k\log k) passes over the data. In light of these developments, we simplified and generalized our preliminary results [DMM06_relerr_110305, DMM06_relerr_TR], and we performed a more refined analysis to improve our sampling complexity to O(klogk)O(k\log k). Most recently, we learned of work by Sarlos [Sarlos06], who used ideas from the recently developed fast Johnson-Lindenstrauss transform of Ailon and Chazelle [AC06] to yield further improvements to a CX matrix decomposition.

Our Main Column-Based Matrix Approximation Algorithm

In this section, we describe an algorithm and a theorem, from which our first main result, Theorem 1, will follow.

Algorithm LABEL:alg:algCX takes as input an m×nm\times n matrix AA, a rank parameter kk, and an error parameter ϵ\epsilon. It returns as output an m×cm\times c matrix CC consisting of a small number of columns of AA. The algorithm is very simple: sample a small number of columns according to a carefully-constructed nonuniform probability distribution. Algorithm LABEL:alg:algCX uses the sampling probabilities

but it will be clear from the analysis of Section LABEL:sxn:generalized_l2_regression that any sampling probabilities such that piβ\mbox(VA,kT)(i)22/kp_{i}\geq\beta\mbox{}\left|\left(V_{A,k}^{T}\right)^{(i)}\right|_{2}^{2}/k, for some β(0,1]\beta\in(0,1], will also work with a small β\beta-dependent loss in accuracy. Note that Algorithm LABEL:alg:algCX actually consists of two related algorithms, depending on how exactly the columns are chosen. The Exactly(cc) algorithm picks exactly cc columns of AA to be included in CC in cc i.i.d. trials, where in each trial the ii-th column of AA is picked with probability pip_{i}. The Expected(cc) algorithm picks in expectation at most cc columns of AA to create CC, by including the ii-th column of AA in CC with probability min{1,cpi}\min\left\{1,cp_{i}\right\}. See Algorithms LABEL:alg:SDconstruct_exact and LABEL:alg:SDconstruct_expected in Appendix LABEL:sxn:matrix_multiply for more details about these two column-sampling procedures.