Relative-Error CUR Matrix Decompositions

Petros Drineas, Michael W. Mahoney, S. Muthukrishnan

Introduction

Large $m\times n$ matrices are common in applications since the data often consist of $m$ objects, each of which is described by $n$ features. Examples of object–feature pairs include: documents and words contained in those documents; genomes and environmental conditions under which gene responses are measured; stocks and their associated temporal resolution; hyperspectral images and frequency resolution; and web groups and individual users. In each of these application areas, practitioners spend vast amounts of time analyzing the data in order to understand, interpret, and ultimately use this data for some application-specific task.

Say that $A$ is the $m\times n$ data matrix. In many cases, an important step in data analysis is to construct a compressed representation of $A$ that may be easier to analyze and interpret. The most common such representation is obtained by truncating the Singular Value Decomposition (SVD) after some number $k\ll\min\{m,n\}$ of terms, in large part because this provides the “best” rank- $k$ approximation to $A$ when measured with respect to any unitarily invariant matrix norm. Unfortunately, the basis vectors (the so-called eigencolumns and eigenrows) provided by this approximation (and with respect to which every column and row of the original data matrix is expressed) are notoriously difficult to interpret in terms of the underlying data and processes generating that data. For example, the vector [ $(1/2)$ age - $(1/\sqrt{2})$ height + $(1/2)$ income], being one of the significant uncorrelated “factors” from a dataset of people’s features, is not particularly informative. It would be highly preferable to have a low-rank approximation that is nearly as good as that provided by the SVD but that is expressed in terms of a small number of actual columns and/or actual rows of a matrix, rather than linear combinations of those columns and rows.
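As a small illustration of the truncated SVD discussed above, the following NumPy sketch forms $A_{k}$ from the top $k$ singular triplets and checks that its Frobenius-norm error equals the energy in the discarded singular values. The helper name `truncated_svd` and the random test matrix are assumptions made for illustration only.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k approximation A_k built from the top k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k left singular vectors by the singular values, then project back.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))   # hypothetical data matrix
k = 2
A_k = truncated_svd(A, k)

s = np.linalg.svd(A, compute_uv=False)
frob_error = np.linalg.norm(A - A_k, "fro")
# For the Frobenius norm, the optimal rank-k error is the root of the
# sum of the squared discarded singular values.
tail_energy = np.sqrt(np.sum(s[k:] ** 2))
```

Each row of `Vt[:k, :]` is a dense linear combination of all features, which is the interpretability problem that motivates column- and row-based decompositions.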

The main contribution of this paper is to provide such decompositions. In particular, we provide what we call a relative-error CUR matrix decomposition: given an $m\times n$ matrix $A$ , we decompose it as a product of three matrices, $C$ , $U$ , and $R$ , where $C$ consists of a small number of actual columns of $A$ , $R$ consists of a small number of actual rows of $A$ , and $U$ is a small carefully constructed matrix that guarantees that the product $CUR$ is “close” to $A$ . In fact, $CUR$ will be nearly as good as the best low-rank approximation to $A$ that is traditionally used and that is obtained by truncating the SVD. Hence, the columns of $A$ that are included in $C$ , as well as the rows of $A$ that are included in $R$ , can be used in place of the eigencolumns and eigenrows, with the added benefit of improved interpretability in terms of the original data.

Before describing applications of our main results in the next subsection, we would like to emphasize that two research communities, the Numerical Linear Algebra (NLA) community and the Theoretical Computer Science (TCS) community, have provided significant practical and theoretical motivation for studying variants of these matrix decompositions over the last ten years. In Section 3, we provide a detailed treatment of relevant prior work in both the NLA and the TCS literature. The two algorithms presented in this paper are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist.

As an example of this preference for having the data matrix expressed in terms of a small number of actual columns and rows of the original matrix, as opposed to a small number of eigencolumns and eigenrows, consider recent data analysis work in DNA microarray and DNA Single Nucleotide Polymorphism (SNP) analysis [KPS02, LA04, Paschou07a]. DNA SNP data are often modeled as an $m\times n$ matrix $A$ , where $m$ is the number of individuals in the study, $n$ is the number of SNPs being analyzed, and $A_{ij}$ is an encoding of the $j$ -th SNP value for the $i$ -th individual. Similarly, for DNA microarray data, $m$ is the number of genes under consideration, $n$ is the number of arrays or environmental conditions, and $A_{ij}$ is the absolute or relative expression level of the $i$ -th gene in the $j$ -th environmental condition. Biologists typically have an understanding of a single gene that they fail to have about a linear combination of $6000$ genes (and also similarly for SNPs, individuals, and arrays); thus, recent work in genetics on DNA microarray and DNA SNP data has focused on heuristics to extract actual genes, environmental conditions, individuals, and SNPs from the eigengenes, eigenconditions, eigenpeople, and eigenSNPs computed from the original data matrices [KPS02, LA04].

For example, in their review article “Vector algebra in the analysis of genome-wide expression data” [KPS02], which appeared in Genome Biology, Kuruvilla, Park, and Schreiber describe many uses of the vectors provided by the SVD and PCA in DNA microarray analysis. The three biologists then conclude by stating that: “While very efficient basis vectors, the vectors themselves are completely artificial and do not correspond to actual (DNA expression) profiles. … Thus, it would be interesting to try to find basis vectors for all experiment vectors, using actual experiment vectors and not artificial bases that offer little insight.” That is, they explicitly state that they would like decompositions of the form we provide in this paper!

Our CUR matrix decomposition is a direct formulation of this problem: determine a small number of actual SNPs that serve as a basis with which to express the remaining SNPs, and a small number of individuals to serve as a basis with which to express the remaining individuals. In fact, motivated in part by this, we have successfully applied a variant of the CUR matrix decomposition presented in this paper to intra- and inter-population genotype reconstruction from tagging SNPs in DNA SNP data from a geographically-diverse set of populations [Paschou07a]. In addition, we have applied a different variant of our CUR matrix decomposition to hyperspectrally-resolved medical imaging data [MMD06]. In this application, a column corresponds to an image at a single physical frequency and a row corresponds to a single spectrally-resolved pixel, and we have shown that data reconstruction and classification tasks can be performed with little loss in quality even after substantial data compression [MMD06].

A quite different motivation for low-rank matrix approximations expressed in terms of a small number of columns and/or rows of the original matrix is to decompose efficiently large low-rank matrices that possess additional structure such as sparsity or non-negativity. This often arises in the analysis of, e.g., large term-document matrices [Ste99, Ste04_TR, BPS04_TR]. Another motivation comes from statistical learning theory, where the data need not even be elements in a vector space, and thus expressing the Gram matrix in terms of a small number of actual data points is of interest [WS01, WRST02_TR, dm_kernel_CONF, dm_kernel_JRNL]. This procedure has been shown empirically to perform well for approximate Gaussian process classification and regression [WS01], to approximate the solution of spectral partitioning for image and video segmentation [FBCM04], and to extend the eigenfunctions of a data-dependent kernel to new data points [BPVDRO03, LafonTH]. Yet another motivation is provided by integral equation applications [GE96, GTZ97, GT01], where large coefficient matrices arise that have blocks corresponding to regions where the kernel is smooth and that are thus well-approximated by low-rank matrices. In these applications, partial SVD algorithms can be expensive, and a description in terms of actual columns and/or rows is of interest [GTZ97, GT01]. A final motivation for studying matrix decompositions of this form is to obtain low-rank matrix approximations to extremely large matrices where a computation of the SVD is too expensive [FKV98, FKV04, dkm_matrix1, dkm_matrix2, dkm_matrix3].

1.2 Our Main Results

Our main algorithmic results have to do with efficiently computing low-rank matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the input matrix. We start with the following definition.

Let $A$ be an $m\times n$ matrix. For any given $C$ , an $m\times c$ matrix whose columns consist of $c$ columns of the matrix $A$ , the $m\times n$ matrix $A^{\prime}=CX$ is a column-based matrix approximation to $A$ , or CX matrix decomposition, for any $c\times n$ matrix $X$ .

Here, $C$ is a matrix consisting of the chosen columns of $A$ , $CC^{+}A$ is the projection of $A$ on the subspace spanned by the chosen columns, and $A_{k}$ is the best rank- $k$ approximation to $A$ . Both algorithms run in time $O(SVD(A,k))$ , which is the time required to compute the best rank- $k$ approximation to the matrix $A$ [GVL96].
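To make the role of $CC^{+}A$ concrete, here is a minimal NumPy sketch of a CX decomposition for a fixed set of columns. The column indices are picked by hand purely for illustration (this is not the sampling procedure of the paper), and `cx_projection` is a hypothetical helper name. The sketch relies on the standard fact that $X=C^{+}A$ minimizes $\left\|A-CX\right\|_{F}$ over all $c\times n$ matrices $X$.

```python
import numpy as np

def cx_projection(A, cols):
    """Return C and C C^+ A, the projection of A onto the span of the chosen columns."""
    C = A[:, cols]
    X = np.linalg.pinv(C) @ A          # X = C^+ A minimizes ||A - C X||_F
    return C, C @ X

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 7))       # hypothetical data matrix
cols = [0, 3, 5]                       # arbitrary columns, for illustration only
C, A_proj = cx_projection(A, cols)

err_cx = np.linalg.norm(A - A_proj, "fro")

# Any other choice of X can only do worse (or equally well).
X_other = rng.standard_normal((len(cols), A.shape[1]))
err_other = np.linalg.norm(A - C @ X_other, "fro")

# C X has rank at most c, so it cannot beat the best rank-c approximation.
s = np.linalg.svd(A, compute_uv=False)
err_best_rank_c = np.sqrt(np.sum(s[len(cols):] ** 2))
```

The chosen columns are reproduced exactly by the projection, while the remaining columns are fit in the least-squares sense.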

Note that we use $c>k$ and have an $\epsilon$ error, which allows us to take advantage of linear algebraic structure in order to obtain an efficient algorithm. In general, this would not be the case if, given an $m\times n$ matrix $A$ , we had specified a parameter $k$ and asked for the “best” subset of $k$ columns, where “best” is measured, e.g., by maximizing the Frobenius norm captured by projecting onto those columns or by maximizing the volume of the parallelepiped defined by those columns. Also, it is not clear a priori that a matrix $C$ with the properties above even exists; see the discussion in Sections 3.2 and 3.3. Finally, our result does not include any reference to regularization or conditioning, as is common in certain application domains; a discussion of similar work on related problems in numerical linear algebra may be found in Section 3.1.

Our second main result extends the previous result to CUR matrix decompositions.

Let $A$ be an $m\times n$ matrix. For any given $C$ , an $m\times c$ matrix whose columns consist of $c$ columns of the matrix $A$ , and $R$ , an $r\times n$ matrix whose rows consist of $r$ rows of the matrix $A$ , the $m\times n$ matrix $A^{\prime}=CUR$ is a column-row-based matrix approximation to $A$ , or CUR matrix decomposition, for any $c\times r$ matrix $U$ .

Several things should be noted about this definition. First, a CUR matrix decomposition is a CX matrix decomposition, but one with a very special structure, i.e., every column of $A$ can be expressed in terms of the basis provided by $C$ using only the information contained in a small number of rows of $A$ and a low-dimensional encoding matrix. Second, in terms of its singular value structure, $U$ must clearly contain “inverse-of- $A$ ” information. For the CUR decomposition described in this paper, $U$ will be a generalized inverse of the intersection between $C$ and $R$ . More precisely, if $C=AS_{C}D_{C}$ and $R=D_{R}S_{R}^{T}A$ then $U=(D_{R}S_{R}^{T}AS_{C}D_{C})^{+}$ . (See Section 2 for a review of linear algebra and notation, such as that for $S_{C}$ , $D_{C}$ , $S_{R}$ , and $D_{R}$ .) Third, the combined size of $C$ , $U$ and $R$ is $O(mc+rn+cr)$ , which is an improvement over $A$ ’s size of $O(mn)$ when $c,r\ll n,m$ . Finally, note the structural simplicity of a CUR matrix decomposition:

$$\underbrace{A}_{m\times n}\;\approx\;\underbrace{C}_{m\times c}\;\underbrace{U}_{c\times r}\;\underbrace{R}_{r\times n}.$$

Our main result for CUR matrix decomposition is the following.

Here, the matrix $U$ is a weighted Moore-Penrose inverse of the intersection between $C$ and $R$ , and $A_{k}$ is the best rank- $k$ approximation to $A$ . Both algorithms run in time $O(SVD(A,k))$ , which is the time required to compute the best rank- $k$ approximation to the matrix $A$ [GVL96].
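The structure of $C$ , $U$ , and $R$ can be sketched in a few lines of NumPy. This sketch makes two simplifying assumptions that are not part of the paper's algorithm: the columns and rows are picked by hand rather than sampled, and the rescaling matrices $D_{C}$ and $D_{R}$ are taken to be identities, so that $U$ is the plain Moore-Penrose inverse of the intersection. The helper name `cur_from_indices` and the `rcond` cutoff are likewise illustrative choices.

```python
import numpy as np

def cur_from_indices(A, cols, rows, rcond=1e-10):
    """Build C, U, R from given index sets, with U the pseudoinverse of the intersection."""
    C = A[:, cols]                       # m x c, actual columns of A
    R = A[rows, :]                       # r x n, actual rows of A
    W = A[np.ix_(rows, cols)]            # r x c intersection of C and R
    U = np.linalg.pinv(W, rcond=rcond)   # c x r; small singular values are truncated
    return C, U, R

rng = np.random.default_rng(2)
m, n, k = 12, 9, 3
# A hypothetical matrix of rank exactly k.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
cols = [0, 2, 4, 6]
rows = [1, 3, 5, 7, 9]
C, U, R = cur_from_indices(A, cols, rows)

# Storage for the three factors versus storage for A.
factor_size = C.size + U.size + R.size
```

When $A$ has rank exactly $k$ and the intersection also has rank $k$ , the product $CUR$ recovers $A$ exactly; for a general matrix the quality of the approximation depends on which columns and rows are chosen, which is the subject of the main results.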

1.3 Summary of Main Technical Result

The key technical insight that leads to the relative-error guarantees is that the columns are selected by a novel sampling procedure that we call “subspace sampling.” Rather than sample columns from $A$ with a probability distribution that depends on the Euclidean norms of the columns of $A$ (which gives provable additive-error bounds [dkm_matrix1, dkm_matrix2, dkm_matrix3]), in “subspace sampling” we randomly sample columns of $A$ with a probability distribution that depends on the Euclidean norms of the rows of the top $k$ right singular vectors of $A$ . This allows us to capture entirely a certain subspace of interest. Let $V_{A,k}$ be the $n\times k$ matrix whose columns consist of the top $k$ right singular vectors of $A$ . The “subspace sampling” probabilities $p_{i},i\in[n]$ will satisfy

$$p_{i}\geq\frac{\beta\left|\left(V_{A,k}\right)_{(i)}\right|_{2}^{2}}{k},\qquad\forall i\in[n],\qquad(4)$$

for some $\beta\in(0,1]$ , where $\left(V_{A,k}\right)_{(i)}$ is the $i$ -th row of $V_{A,k}$ . That is, we will sample based on the norms of the rows (not the columns) of the truncated matrix of singular vectors. Note that $\sum_{j=1}^{n}\mbox{}\left|\left(V_{A,k}\right)_{(j)}\right|_{2}^{2}=k$ and that $\sum_{i\in[n]}p_{i}=1$ . To construct sampling probabilities satisfying Condition (4), it is sufficient to spend $O(SVD(A,k))$ time to compute (exactly or approximately, in which case $\beta=1$ or $\beta<1$ , respectively) the top $k$ right singular vectors of $A$ . Sampling probabilities of this form will allow us to deconvolute subspace information and “size-of- $A$ ” information in the input matrix $A$ , which in turn will allow us to obtain the relative-error guarantees we desire. Note that we have used this method previously [DMM06], but in that case the sampling probabilities contained other terms that complicated their interpretation.
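The exact case ( $\beta=1$ ) of these probabilities can be computed directly from an SVD, as in the following NumPy sketch. The function name `subspace_sampling_probabilities` is a hypothetical label, and a full SVD is used here for simplicity even though only the top $k$ right singular vectors are needed.

```python
import numpy as np

def subspace_sampling_probabilities(A, k):
    """Return p_i = |(V_{A,k})_(i)|_2^2 / k for each column index i of A."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V_k = Vt[:k, :].T                    # n x k, top k right singular vectors as columns
    # Squared Euclidean norm of each row of V_k, normalized by k.
    return np.sum(V_k ** 2, axis=1) / k

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 8))         # hypothetical data matrix
k = 3
p = subspace_sampling_probabilities(A, k)
```

Because the columns of $V_{A,k}$ are orthonormal, the squared row norms sum to $k$ , so `p` sums to one without any further normalization.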

That is, fit every column of the matrix $B$ to the basis provided by the columns of the rank- $k$ matrix $A$ . Also of interest is the computation of

1.4 Outline of the Remainder of the Paper

Review of Linear Algebra

In this section, we provide a review of linear algebra that will be useful throughout the paper; for more details, see [Nashed76, HJ85, Stewart90, GVL96, Bhatia97, BIG03]. We also review a sampling matrix formalism that will be convenient in our discussion [dkm_matrix1].

Relationship with Previous Related Work

In this section, we discuss the relationship between our results and related work in numerical linear algebra and theoretical computer science.

Within the numerical linear algebra community, several groups have studied matrix decompositions with similar structural, if not algorithmic, properties to the CX and CUR matrix decompositions we have defined. Much of this work is related to the QR decomposition, originally used extensively in pivoted form by Golub [Golub65, BG65].

Stewart and collaborators were interested in computing sparse low-rank approximations to large sparse term-document matrices [Ste99, Ste04_TR, BPS04_TR]. Stewart developed the quasi-Gram-Schmidt method. This method is a variant of the QR decomposition which, when given as input an $m\times n$ matrix $A$ and a rank parameter $k$ , returns an $m\times k$ matrix $C$ consisting of $k$ columns of $A$ whose span approximates the column space of $A$ and also a nonsingular upper-triangular $k\times k$ matrix $T_{C}$ that orthogonalizes these columns (but it does not explicitly compute the nonsparse orthogonal matrix $Q_{C}=CT_{C}^{-1}$ ). This provides a matrix decomposition of the form $A\approx CX$ . By applying this method to $A$ to obtain $C$ and to $A^{T}$ to obtain a $k\times n$ matrix $R$ consisting of $k$ rows of $A$ , one can show that $A\approx CUR$ , where the matrix $U$ is computed to minimize $\left\|A-CUR\right\|_{F}^{2}$ . Although provable approximation guarantees of the form we present were not provided, backward error analysis was performed and the method was shown to perform well empirically [Ste99, Ste04_TR, BPS04_TR].
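For fixed $C$ and $R$ , the minimizer of $\left\|A-CUR\right\|_{F}^{2}$ over $U$ has the standard closed form $U=C^{+}AR^{+}$ . The NumPy sketch below illustrates this fact with dense pseudoinverses; it is not the quasi-Gram-Schmidt method itself, which works with the triangular factors and avoids forming dense matrices. The index sets and the helper name `optimal_middle_matrix` are assumptions for illustration.

```python
import numpy as np

def optimal_middle_matrix(A, C, R):
    """Return U = C^+ A R^+, which minimizes ||A - C U R||_F for fixed C and R."""
    return np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

rng = np.random.default_rng(4)
A = rng.standard_normal((9, 7))          # hypothetical data matrix
C = A[:, [0, 2, 5]]                      # three hand-picked columns
R = A[[1, 4, 6], :]                      # three hand-picked rows
U = optimal_middle_matrix(A, C, R)

err_opt = np.linalg.norm(A - C @ U @ R, "fro")

# Perturbing U away from the minimizer should not reduce the error.
U_perturbed = U + 0.1 * rng.standard_normal(U.shape)
err_perturbed = np.linalg.norm(A - C @ U_perturbed @ R, "fro")
```

Note that this choice of $U$ uses all of $A$ , in contrast to a $U$ built only from the intersection of $C$ and $R$ .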

Goreinov, Tyrtyshnikov, and Zamarashkin [GTZ97, GT01, Tyr04b] were interested in applications such as scattering, in which large coefficient matrices have blocks that can be easily approximated by low-rank matrices. They show that if the matrix $A$ is approximated by a rank- $k$ matrix to within an accuracy $\epsilon$ then there exists a choice of $k$ columns and $k$ rows, i.e., $C$ and $R$ , and a low-dimensional $k\times k$ matrix $U$ constructed from the elements of $C$ and $R$ , such that $A\approx CUR$ in the sense that $\mbox{}\left\|A-CUR\right\|_{2}\leq\epsilon f(m,n,k)$ , where $f(m,n,k)=1+2\sqrt{km}+2\sqrt{kn}$ . In [GTZ97], the choice for these matrices is related to the problem of determining the minimum singular value $\sigma_{k}$ of $k\times k$ submatrices of $n\times k$ orthogonal matrices. In addition: in [GT01] the choice for $C$ and $R$ is interpreted in terms of the maximum volume concept from interpolation theory, in the sense that columns and rows should be chosen such that their intersection $W$ defines a parallelepiped of maximum volume among all $k\times k$ submatrices of $A$ ; and in [Tyr04b] an empirically effective deterministic algorithm is presented which ensures that $U$ is well-conditioned.

Gu and Eisenstat, in their seminal paper [GE96], describe a strong rank-revealing QR factorization that deterministically selects exactly $k$ columns from an $m\times n$ matrix $A$ . The algorithms of [GE96] are efficient, in that their running time is $O(mn^{2})$ (assuming that $m\geq n$ ), which is essentially the time required to compute the SVD of $A$ . In addition, Gu and Eisenstat prove that if the $m\times k$ matrix $C$ contains the $k$ selected columns (without any rescaling), then $\sigma_{\min}(C)\geq\sigma_{k}(A)/f(k,n)$ , where $f(k,n)=O(\sqrt{k(n-k)})$ . Thus, the columns of $C$ span a parallelepiped whose volume (equivalently, the product of the singular values of $C$ ) is “large.” Currently, we do not know how to convert this property into a statement similar to that of Theorem 1, although perhaps this can be accomplished by relaxing the number of columns selected by the algorithms of [GE96] to $O(poly(k,1/\epsilon))$ . For related work prior to Gu and Eisenstat, see Chan and Hansen [CH90, CH92].

3.2 Related Work in Theoretical Computer Science

Within the theory of algorithms community, much research has followed the seminal work of Frieze, Kannan, and Vempala [FKV98, FKV04]. Their work may be viewed, in our parlance, as sampling columns from a matrix $A$ to form a matrix $C$ such that $\mbox{}\left\|A-CX\right\|_{F}\leq\mbox{}\left\|A-A_{k}\right\|_{F}+\epsilon\mbox{}\left\|A\right\|_{F}$ . The matrix $C$ has $poly(k,1/\epsilon,1/\delta)$ columns and is constructed after making only two passes over $A$ using $O(m+n)$ work space. Under similar resource constraints, a series of papers have followed [FKV98, FKV04] in the past seven years [DFKVV99, dkm_matrix2, RV03], improving the dependency of $c$ on $k,1/\epsilon$ , and $1/\delta$ , and analyzing the spectral as well as the Frobenius norm, yielding bounds of the form

$$\left\|A-CX\right\|_{\xi}\leq\left\|A-A_{k}\right\|_{\xi}+\epsilon\left\|A\right\|_{F}$$

for $\xi=2,F$ , and thus providing additive-error guarantees for column-based low-rank matrix approximations.

Additive-error approximation algorithms for CUR matrix decompositions have also been analyzed by Drineas, Kannan, and Mahoney [DK03, dkm_matrix1, dkm_matrix2, dkm_matrix3, dm_kernel_CONF, dm_kernel_JRNL]. In particular, in [dkm_matrix3], they compute an approximation to an $m\times n$ matrix $A$ by sampling $c$ columns and $r$ rows from $A$ to form $m\times c$ and $r\times n$ matrices $C$ and $R$ , respectively. From $C$ and $R$ , a $c\times r$ matrix $U$ is constructed such that under appropriate assumptions

$$\left\|A-CUR\right\|_{\xi}\leq\left\|A-A_{k}\right\|_{\xi}+\epsilon\left\|A\right\|_{F}\qquad(12)$$

with high probability, for both the spectral and Frobenius norms, $\xi=2,F$ . In [dm_kernel_CONF, dm_kernel_JRNL], it is further shown that if $A$ is a symmetric positive semidefinite (SPSD) matrix, then one can choose $R=C^{T}$ and $U=W^{+}$ , where $W$ is the $c\times c$ intersection between $C$ and $R=C^{T}$ , thus obtaining an approximation $A\approx A^{\prime}=CW^{+}C^{T}$ . This approximation is SPSD and has provable bounds of the form (12), except that the scale of the additional additive error is somewhat larger [dm_kernel_CONF, dm_kernel_JRNL].
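The SPSD construction $A^{\prime}=CW^{+}C^{T}$ can be sketched as follows in NumPy. The columns are chosen by hand here rather than by the sampling schemes of the cited work, and `spsd_column_approximation` is a hypothetical helper name.

```python
import numpy as np

def spsd_column_approximation(A, cols, rcond=1e-10):
    """Return C W^+ C^T, where W is the intersection of C = A[:, cols] and C^T."""
    C = A[:, cols]
    W = A[np.ix_(cols, cols)]            # c x c intersection
    return C @ np.linalg.pinv(W, rcond=rcond) @ C.T

rng = np.random.default_rng(5)
G = rng.standard_normal((10, 4))
A = G @ G.T                              # hypothetical SPSD matrix of rank 4
cols = [0, 3, 7]
A_approx = spsd_column_approximation(A, cols)

# Symmetrize before the eigenvalue check to suppress rounding asymmetry.
eigvals = np.linalg.eigvalsh((A_approx + A_approx.T) / 2)
```

The approximation agrees with $A$ on the sampled block, since $WW^{+}W=W$ , and it inherits symmetry and positive semidefiniteness from $W^{+}$ .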

Most relevant for our relative-error CX and CUR matrix decomposition algorithms is the recent work of Rademacher, Vempala and Wang [RVW05] and Deshpande, Rademacher, Vempala and Wang [DRVW06]. Using two different methods (in one case iterative sampling in a backwards manner and an induction on $k$ argument [RVW05] and in the other case an argument which relies on estimating the volume of the simplex formed by each of the $k$ -sized subsets of the columns [DRVW06]), they reported the existence of a set of $O(k^{2}/\epsilon^{2})$ columns that provide relative-error CX matrix decomposition. No algorithmic result was presented, except for an exhaustive algorithm that ran in $\Omega(n^{k})$ time. Note that their results did not apply to columns and rows simultaneously. Thus, ours is the first CUR matrix decomposition algorithm with relative error, and it was previously not even known whether such a relative-error $CUR$ representation existed, i.e., it was not previously known whether columns and rows satisfying the conditions of Theorem 2 existed.

Other related work includes that of Rudelson and Vershynin [Rud99, V03, RV06_DRAFT], who provide an algorithm for CX matrix decomposition which has an improved additive error spectral norm bound of the form

Their proof uses an elegant result on random vectors in the isotropic position [Rud99], and since we use a variant of their result, it is described in more detail in Appendix LABEL:sxn:matrix_multiply. Achlioptas and McSherry have computed low-rank matrix approximations using sampling techniques that involve zeroing-out and/or quantizing individual elements [AM01, AM03]. The primary focus of their work was on introducing methods to accelerate orthogonal iteration and Lanczos iteration methods, and their analysis relied heavily on ideas from random matrix theory [AM01, AM03]. Agarwal, Har-Peled, and Varadarajan have analyzed so-called “core sets” as a tool for efficiently approximating various extent measures of a point set [AHV04, AHV06]. The choice of columns and/or rows we present is a “core set” for approximate matrix computations; in fact, our algorithmic solution to Theorem 1 solves an open question in their survey [AHV06]. The choice of columns and rows we present may also be viewed as a set of variables and features chosen from a data matrix [BL97, CGKRS00, GE03]. “Feature selection” is a broad area that addresses the choice of columns explicitly for dimension reduction, but the metrics there are typically optimization-based [CGKRS00] or machine-learning based [BL97]. These formulations tend to have set-cover-like solutions and are incomparable with linear-algebraic criteria, such as the low-rank criteria we consider here, that are common among data analysts.

3.3 Very Recent Work on Relative-Error Approximation Algorithms

To the best of our knowledge, the first nontrivial algorithmic result for relative-error low-rank matrix approximation was provided by a preliminary version of this paper [DMM06_relerr_110305, DMM06_relerr_TR]. In particular, an earlier version of Theorem 1 provided the first known relative-error column-based low-rank approximation in polynomial time [DMM06_relerr_110305, DMM06_relerr_TR]. The major difference between our Theorem 1 and our result in [DMM06_relerr_110305, DMM06_relerr_TR] is that the sampling probabilities in [DMM06_relerr_110305, DMM06_relerr_TR] are more complicated. (See Section LABEL:sxn:main_l2_alg:discussion for details on this.) The algorithm from [DMM06_relerr_110305, DMM06_relerr_TR] runs in $O(SVD(A,k))$ time (although it was originally reported to run in only $O(SVD(A))$ time), and it has a sampling complexity of $O(k^{2}\log(1/\delta)/\epsilon^{2})$ columns.

Subsequent to the completion of the preliminary version of this paper [DMM06_relerr_110305, DMM06_relerr_TR], several developments have been made on relative-error low-rank matrix approximation algorithms. First, Har-Peled reported an algorithm that takes as input an $m\times n$ matrix $A$ , and in roughly $O(mnk^{2}\log k)$ time returns as output a rank- $k$ matrix $A^{\prime}$ with a relative-error approximation guarantee [HarPeled06_relerr_DRAFT]. His algorithm uses geometric ideas and involves sampling and merging approximately optimal $k$ -flats; it is not clear if this approximation can be expressed in terms of a small number of columns of $A$ . Then, Deshpande and Vempala [DV06_relerr_TR] reported an algorithm that takes as input an $m\times n$ matrix $A$ and also returns an approximation with a relative-error guarantee. Their algorithm extends ideas from [RVW05, DRVW06], and it leads to a CX matrix decomposition consisting of $O(k\log k)$ columns of $A$ . The complexity of their algorithm is $O(Mk^{2}\log k)$ , where $M$ is the number of nonzero elements of $A$ , and their algorithm can be implemented in a data streaming framework with $O(k\log k)$ passes over the data. In light of these developments, we simplified and generalized our preliminary results [DMM06_relerr_110305, DMM06_relerr_TR], and we performed a more refined analysis to improve our sampling complexity to $O(k\log k)$ . Most recently, we learned of work by Sarlos [Sarlos06], who used ideas from the recently developed fast Johnson-Lindenstrauss transform of Ailon and Chazelle [AC06] to yield further improvements to a CX matrix decomposition.

Our Main Column-Based Matrix Approximation Algorithm

In this section, we describe an algorithm and a theorem, from which our first main result, Theorem 1, will follow.

Algorithm LABEL:alg:algCX takes as input an $m\times n$ matrix $A$ , a rank parameter $k$ , and an error parameter $\epsilon$ . It returns as output an $m\times c$ matrix $C$ consisting of a small number of columns of $A$ . The algorithm is very simple: sample a small number of columns according to a carefully-constructed nonuniform probability distribution. Algorithm LABEL:alg:algCX uses the sampling probabilities

$$p_{i}=\frac{1}{k}\left|\left(V_{A,k}^{T}\right)^{(i)}\right|_{2}^{2},\qquad\forall i\in[n],$$

but it will be clear from the analysis of Section LABEL:sxn:generalized_l2_regression that any sampling probabilities such that $p_{i}\geq\beta\mbox{}\left|\left(V_{A,k}^{T}\right)^{(i)}\right|_{2}^{2}/k$ , for some $\beta\in(0,1]$ , will also work with a small $\beta$ -dependent loss in accuracy. Note that Algorithm LABEL:alg:algCX actually consists of two related algorithms, depending on how exactly the columns are chosen. The Exactly( $c$ ) algorithm picks exactly $c$ columns of $A$ to be included in $C$ in $c$ i.i.d. trials, where in each trial the $i$ -th column of $A$ is picked with probability $p_{i}$ . The Expected( $c$ ) algorithm picks in expectation at most $c$ columns of $A$ to create $C$ , by including the $i$ -th column of $A$ in $C$ with probability $\min\left\{1,cp_{i}\right\}$ . See Algorithms LABEL:alg:SDconstruct_exact and LABEL:alg:SDconstruct_expected in Appendix LABEL:sxn:matrix_multiply for more details about these two column-sampling procedures.
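The two column-selection rules described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names `sample_exactly` and `sample_expected` are hypothetical, the probabilities are computed exactly from a full SVD, and any rescaling of the sampled columns (the diagonal matrix $D_{C}$ mentioned earlier) is omitted, so only the index selection is shown.

```python
import numpy as np

def sample_exactly(p, c, rng):
    """Exactly(c): c i.i.d. trials, each picking column i with probability p[i]."""
    # Sampling is with replacement, so an index may appear more than once.
    return rng.choice(len(p), size=c, replace=True, p=p)

def sample_expected(p, c, rng):
    """Expected(c): include column i independently with probability min(1, c * p[i])."""
    keep_prob = np.minimum(1.0, c * p)
    return np.flatnonzero(rng.random(len(p)) < keep_prob)

rng = np.random.default_rng(6)
A = rng.standard_normal((15, 10))        # hypothetical data matrix
k, c = 2, 5

# Probabilities from the squared column norms of V_{A,k}^T (beta = 1).
_, _, Vt = np.linalg.svd(A, full_matrices=False)
p = np.sum(Vt[:k, :] ** 2, axis=0) / k

idx_exact = sample_exactly(p, c, rng)
idx_expected = sample_expected(p, c, rng)
C_exact = A[:, idx_exact]
C_expected = A[:, idx_expected]

# Expected number of columns kept by Expected(c); at most c because p sums to one.
expected_count = np.minimum(1.0, c * p).sum()
```

The Exactly( $c$ ) variant always returns $c$ columns, possibly with repeats, whereas the Expected( $c$ ) variant returns distinct columns whose number is random.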