Efficient volume sampling for row/column subset selection

Amit Deshpande, Luis Rademacher

Introduction

Volume sampling, i.e., picking kk-subsets of the rows of any given matrix with probabilities proportional to the squared volumes of the simplicies defined by them, was introduced in in the context of low-rank approximation of matrices. It is equivalent to sampling kk-subsets of {1,,m}\{1,\dotsc,m\} with probabilities proportional to the corresponding kk by kk principal minors of any given mm by mm positive semidefinite matrix.

In the context of low-rank approximation, volume sampling is related to a problem called row/column-subset selection . Most large data sets that arise in search, microarray experiments, computer vision, data mining etc. can be thought of as matrices where rows and columns are indexed by objects and features, respectively (or vice versa), and we need to pick a small subset of features that are dominant. For example, while studying gene expression data biologists want a small subset of genes that are responsible for a particular disease. Usual dimension reduction techniques such as principal component analysis (PCA) or random projection fail to do this as they typically output singular vectors or random vectors which are linear combinations of a large number of feature vectors. A recent article by Mahoney and Drineas highlights the limitations of PCA and gives experimental data on practical applications of low-rank approximation based on row/column-subset selection.

Row/column-subset selection and volume sampling

In other words, the rows of AkA_{k} are projections of the rows of AA onto span(vi  :  1ik)\operatorname{span}\left(v_{i}\;:\;1\leq i\leq k\right). Because of this, most dimension reduction techniques based on the singular value decomposition, e.g., principal component analysis (PCA), are interpreted as giving viv_{i}’s as the dominant vectors, which happen to be linear combinations of a large number of the rows or feature vectors of AA.

Several row-sampling techniques have been considered in the past as an approximate but faster alternative to the singular value decomposition, in the context of streaming algorithms and large data sets that cannot be stored in random access memory . The first among these is the squared-length sampling of rows introduced by Frieze, Kannan and Vempala . Another sampling scheme due to Drineas, Mahoney and Muthukrishnan uses the singular values and singular vectors to decide the sampling probabilities. Later Deshpande, Rademacher, Vempala and Wang introduced volume sampling as a generalization of squared-length sampling.

The application of volume sampling to low-rank approximation and, more importantly, to the row-subset selection problem, is given by the following theorem shown in . It says that picking a subset of kk rows according to volume sampling and projecting all the rows of AA onto their span gives a (k+1)(k+1)-approximation to the nearest rank-kk matrix to AA.

As we will see later, this easily implies

Theorem 2 gives only an existence result for row-subset selection and we also know a matching lower bound that says this is the best we can possibly do.

However, no efficient algorithm was known for volume sampling prior to this work. An algorithm mentioned in Deshpande and Vempala does k!k!-approximate volume sampling in time O(kmn)O(kmn), which means that plugging it in Theorem 2 can only guarantee (k+1)!(k+1)!-approximation instead of (k+1)(k+1). Finding an efficient algorithm for volume sampling is mentioned as an open problem in the recent monograph on spectral algorithms by Kannan and Vempala (see Section 7.47.4 of ).

Boutsidis, Drineas and Mahoney gave an alternative approach to row-subset selection (without going through volume sampling) and here is a re-statement of the main theorem from their paper which uses columns instead of rows.

Row/column-subset selection problem is related to rank-revealing decompositions considered in linear algebra , and the previous best algorithmic result for row-subset selection in the spectral norm case was given by a result of Gu and Eisenstat on strong rank-revealing QR decompositions. The following theorem is a direct consequence of as pointed out in .

Moreover, this subset SS can be found in time O((m+nlogfn)n2)O\left((m+n\log_{f}n)n^{2}\right).

In the context of volume sampling, it is interesting to note that Pan has used an idea of picking submatrices of locally maximum volume (or determinants) for rank-revealing matrix decompositions. We refer the reader to for details.

The results of Goreinov, Tyrtyshnikov and Zamarashkin on pseudo-skeleton approximations of matrices look at submatrices of maximum determinants as good candidates for row/column-subset selection.

Our main result is a polynomial time algorithm for exact volume sampling. In Section 4, we give an outline of our Algorithm 1, followed by two possible subroutines given by Algorithms 2 and 3 that could be plugged into it.

The algorithm just described informally, if implemented as stated, would have a polynomial dependence in mm, nn and kk, for some low-degree polynomial. We can do better and get a linear dependence in mm by working with ATAA^{T}A in place of AATAA^{T} and computing the projected matrices using rank-11 updates (Theorem 7), while still having a polynomial time guarantee and sampling exactly. It would be even faster to perform rank-1 updates to the characteristic polynomial itself, but that requires the computation of the inverse of a polynomial matrix (Proposition 18), and it is not clear to us at this time that there is a fast enough exact algorithm that works for arbitrary matrices. Jeannerod and Villard give an algorithm to invert a generic nn-by-nn matrix with entries of degree dd, with nn a power of two, in time O(n3d)O(n^{3}d). This would lead to the computation of all marginal probabilities for one row in time O(n3+mn2)O(n^{3}+mn^{2}) (a variation of Algorithm 3 and its analysis).

Instead, if we are willing to be more practical, while sacrificing our guarantees, then we can perform rank-11 updates to the characteristic polynomial by using the singular value decomposition (SVD). In , an algorithm with cost O(min{mn2,m2n})O(\min\{mn^{2},m^{2}n\}) arithmetic operations is given for the singular value decomposition but the SVD cannot be computed exactly and we do not know how its error propagates in our algorithm which uses many such computations. If the SVD of an mm-by-nn matrix can be computed in time O(Tsvd)O(T_{svd}), this leads to a nearly-exact algorithm for volume sampling in time O(kTsvd+kmn2)O(kT_{svd}+kmn^{2}). See Proposition 18 for details.

Volume sampling was originally defined in to prove Theorem 2, in particular, to show that any matrix AA contains kk rows in whose span lie the rows of a rank-kk approximation to AA that is no worse than the best in the Frobenius norm. Efficient volume sampling leads to an efficient selection of kk rows that satisfy this guarantee, in expectation. In Section 5, we use the method of conditional expectations to derandomize this selection. This gives an efficient deterministic algorithm (Algorithm 4) for row-subset selection with the following guarantee in the Frobenius norm. This guarantee immediately implies a guarantee in the spectral norm, as follows:

This improves the O(klogk)O(k\sqrt{\log k})-approximation of Boutsidis, Drineas and Mahoney for the Frobenius norm case and matches the lower bound shown in Theorem 3 due to .

Using random projection for dimensionality reduction, the polynomial time algorithm for volume sampling mentioned in Theorem 7 (i.e., Algorithm 1 with Algorithm 2 as its subroutine), gives (1+ϵ)(1+\epsilon)-approximate volume sampling, using

Finally, we show a lower bound for row/column-subset selection in the spectral norm that almost matches our upper bound in terms of the dependence on nn.

Preliminaries and notation

Throughout the paper we assume mnm\geq n. This assumption is not needed most of the time, but justifies sometimes working with ATAA^{T}A instead of AATAA^{T} and, more generally, some choices in the design of our algorithms. It is also partially justified by our use of a random projection as a preprocessing step that makes nn small.

can be generalized into the following lemma.

where σ1,,σm\sigma_{1},\dotsc,\sigma_{m} are the singular values of AA, i.e., eigenvalues of AATAA^{T}, and

is the characteristic polynomial of AATAA^{T}. Using det(xIAAT)=xmndet(xIATA)\operatorname{det}\left(xI-AA^{T}\right)=x^{m-n}\operatorname{det}\left(xI-A^{T}A\right), we can alternatively use cmk(AAT)=cnk(ATA)c_{m-k}(AA^{T})=c_{n-k}(A^{T}A) in the above formula, for knk\leq n.

Let ω\omega be the exponent of the arithmetic complexity of matrix multiplication. We use that there is an algorithm for computing the characteristic polynomial of an nn-by-nn matrix using O(nωlogn)O(n^{\omega}\log n) arithmetic operations [2, Section 16.6].

Here is another lemma that we will need about dividing determinants into products of two determinants.

To show this, we consider two cases. If CSCSTC_{S}C_{S}^{T} is singular, then both sides of the equality are zero. If CSCSTC_{S}C_{S}^{T} is invertible, then we can perform block Gaussian elimination and write

applied to (EFGH)=C\begin{pmatrix}E&F\\ G&H\end{pmatrix}=C. Writing the determinants of the block-triangular matrices gives

Now, the projection of the rows of a matrix KK onto the row-space of a matrix LL can be written as

so that DT=CTCTCST(CSCST)1CSD_{T}=C_{T}-C_{T}C_{S}^{T}(C_{S}C_{S}^{T})^{-1}C_{S}, and

Finally, a well-known lemma about how the determinant of a matrix changes under a rank-11 update.

Efficient volume sampling algorithms

We first outline our volume sampling algorithm to convince the reader that volume sampling can be done in polynomial time. In the subsequent subsections, we give improved subroutines to get faster implementations of the same idea.

The main idea behind our algorithm is based on Lemma 14 about the marginal probabilities encountered in volume sampling. To explain this, it is more convenient to look at volume sampling defined as a distribution on kk-tuples (X1,X2,,Xk)(X_{1},X_{2},\dotsc,X_{k}) instead of kk-subsets, where each of the k!k! permutations of a kk-subset is equally likely, i.e., for any (i1,i2,,ik)[m]k(i_{1},i_{2},\dotsc,i_{k})\in[m]^{k},

Then the marginal probabilities Pr(Xt=i    X1=i1,,Xt1=it1){\sf Pr}\left(X_{t}=i\;|\;X_{1}=i_{1},\dotsc,X_{t-1}=i_{t-1}\right) have the following interpretation in terms of the coefficients of certain characteristic polynomials.

Let (i1,,it1)[m]t1(i_{1},\dotsc,i_{t-1})\in[m]^{t-1} such that Pr(X1=i1,,Xt1=it1)>0{\sf Pr}\left(X_{1}=i_{1},\dotsc,X_{t-1}=i_{t-1}\right)>0, for a random kk-tuple (X1,X2,,Xk)(X_{1},X_{2},\dotsc,X_{k}) from the extended volume sampling over kk-tuples. Let S={i1,,it1}S=\{i_{1},\dotsc,i_{t-1}\}, B=AπS(A)B=A-\pi_{S}(A) and Ci=Bπ{i}(B)=AπS{i}(A)C_{i}=B-\pi_{\{i\}}(B)=A-\pi_{S\cup\{i\}}(A). Then,

With this lemma in hand, let us consider the following outline of our algorithm. We will later give two more efficient implementations of this outline, depending on how the pip_{i}’s are computed.

Output: a subset SS of kk rows of AA picked with probability proportional to det(ASAST)\operatorname{det}\left(A_{S}A_{S}^{T}\right).

Initialize SS\leftarrow\emptyset and BAB\leftarrow A. For t=1t=1 to kk do:

where Ci=Bπ{i}(B)C_{i}=B-\pi_{\{i\}}(B) is a matrix obtained by projecting each row of BB orthogonal to bib_{i}.

Pick ii with probability proportional to pip_{i}. Let SS{i}S\leftarrow S\cup\{i\} and BCiB\leftarrow C_{i}.

Now we show the correctness of the algorithm:

The probability that our volume sampling algorithm outlined above picks a kk-subset SS is proportional to det(ASAST)\operatorname{det}\left(A_{S}A_{S}^{T}\right). This algorithm can be implemented with a cost of O(km3n+kmω+1logm)O(km^{3}n+km^{\omega+1}\log m) arithmetic operations.

By Lemma 14, for any i1,i2,,iki_{1},i_{2},\dotsc,i_{k} such that Pr(X1=i1,,Xk=ik){\sf Pr}\left(X_{1}=i_{1},\dotsc,X_{k}=i_{k}\right), the probability that our algorithm picks a sequence of rows indexed i1,i2,,iki_{1},i_{2},\dotsc,i_{k} in that order is equal to

Otherwise, the probability is zero because in the execution of the algorithm, bi=0\left\|b_{i}\right\|=0 for some step tt. This proves the correctness of our algorithm.

Given that one can compute the characteristic polynomial of an mm-by-mm matrix in O(mωlogm)O(m^{\omega}\log m) (see Section 3), our outline can be implemented with the following count of arithmetic operations: for every tt and ii, O(m2n)O(m^{2}n) to compute CiCiTC_{i}C_{i}^{T}, O(m2n+mωlogm)O(m^{2}n+m^{\omega}\log m) in total for pip_{i}. Thus, volume sampling in O(km3n+kmω+1logm)O(km^{3}n+km^{\omega+1}\log m). ∎

First subroutine for marginal probabilities

Compute the characteristic polynomial of CiTCiC_{i}^{T}C_{i} and output

BTBB^{T}B can be computed in time O(mn2)O(mn^{2}). Observe that since CiC_{i} is obtained by projecting each row of BB orthogonal to bib_{i},

So once we have BTBB^{T}B, for each ii, CiTCiC_{i}^{T}C_{i} can be computed in time O(n2)O(n^{2}) and the characteristic polynomial of CiTCiC_{i}^{T}C_{i} can be computed in time O(nωlogn)O(n^{\omega}\log n) [2, Section 16.6]. By Lemma 11, cmk+t(CiCiT)=cnk+t(CiTCi)c_{m-k+t}(C_{i}C_{i}^{T})=c_{n-k+t}(C_{i}^{T}C_{i}) and hence, the above subroutine results into an O(kmnωlogn)O(kmn^{\omega}\log n) time algorithm for volume sampling. ∎

The proof follows by combining Proposition 15 and Proposition 16, and since we compute all the pip_{i}’s simultaneously in each round in O(mnωlogn)O(mn^{\omega}\log n) arithmetic operations, the total number of arithmetic operations is O(kmnωlogn)O(kmn^{\omega}\log n). ∎

2 Efficient volume sampling using SVD

Taking further the idea that each CiC_{i} is a rank-11 update of BB, we can give a faster algorithm based on the singular value decomposition of BB. Given the singular value decomposition of a matrix and using the matrix determinant lemma (Lemma 13), one can give a precise formula for how the characteristic polynomial changes under a rank-11 update. Using this subroutine in the volume sampling algorithm outlined in Section 4 we get an algorithm for nearly-exact volume sampling (depending on the precision of the computed SVD) in time O(kTsvd+kmn2)O(kT_{svd}+kmn^{2}), where TsvdT_{svd} is the running time of SVD on an mm-by-nn matrix.

Second subroutine for marginal probabilities.

In the real arithmetic model and given exact UU and Σ\Sigma, using the Algorithm 3 as a subroutine inside Algorithm 1 outlined for volume sampling, we get an algorithm for volume sampling. If TsvdT_{svd} is the running time for computing the singular value decomposition of mm-by-nn matrices, the algorithm runs in time O(kTsvd+kmn2)O(kT_{svd}+kmn^{2}).

Using the matrix determinant lemma (Lemma 13), the characteristic polynomial of CiCiTC_{i}C_{i}^{T} can be written as

Once we have the singular value decomposition of BB, f(x)f(x) and gj(x)g_{j}(x) can all be computed in time O(n2)O(n^{2}) using polynomial products. This is because there are at most nn non-zero σi\sigma_{i}’s. Thus, f(x)f(x) and all the gj(x)g_{j}(x) for 1jm1\leq j\leq m can be computed in time O(mn2)O(mn^{2}) and then using the above formula we get cmk+t(CiCiT)c_{m-k+t}(C_{i}C_{i}^{T}). ∎

3 Approximate volume sampling in nearly linear time

Magen and Zouzias showed that the random projection lemma of Johnson and Lindenstrauss can be generalized to preserve volumes of subsets after embedding. Here is a restatement of Theorem 1 of using O(ϵ/k)O(\epsilon/k) instead of ϵ\epsilon in their original statement.

Using random projection for dimensionality reduction, the polynomial time algorithm for volume sampling mentioned in Theorem 7 (i.e., Algorithm 1 with Algorithm 2 as its subroutine), gives (1+ϵ)(1+\epsilon)-approximate volume sampling, using

Moreover, this can be implemented using only one pass over the matrix AA with extra space mlogmk2/ϵ2m\log m\cdot k^{2}/\epsilon^{2}. ∎

Derandomized row/column-subset selection

Our derandomized row-subset selection algorithm is based on a derandomization of the volume sampling algorithm in Section 4, using the method of conditional expectations. Again, it may be easier to consider volume sampling extended to random kk-tuples (X1,,Xk)(X_{1},\dotsc,X_{k}) where

where the expectation is over (X1,,Xk)(X_{1},\dotsc,X_{k}).

Let us consider i1,,it1i_{1},\dotsc,i_{t-1} for which Pr(X1=i1,,Xt1=it1)>0{\sf Pr}\left(X_{1}=i_{1},\dotsc,X_{t-1}=i_{t-1}\right)>0. Let S={i1,,it1}S=\{i_{1},\dotsc,i_{t-1}\} and look at the conditional expectation. The following lemma shows that these conditional expectations have an easy interpretation in terms of the coefficients of certain characteristic polynomials, and hence can be computed efficiently.

Let (i1,,it1)[m]t1(i_{1},\dotsc,i_{t-1})\in[m]^{t-1} be such that Pr(X1=i1,,Xt1=it1)>0{\sf Pr}\left(X_{1}=i_{1},\dotsc,X_{t-1}=i_{t-1}\right)>0 for a random kk-tuple (X1,X2,,Xk)(X_{1},X_{2},\dotsc,X_{k}) from extended volume sampling. Let S={i1,,it1}S=\{i_{1},\dotsc,i_{t-1}\} and B=AπS(A)B=A-\pi_{S}(A). Then

Knowing the above lemma, it is easy to derandomize our algorithm outlined for volume sampling. In each step, we just compute the new conditional expectations for each additional ii, and finally pick the ii that minimizes the conditional expectation.

Output: a subset SS of kk rows of AA with the guarantee

Initialize SS\leftarrow\emptyset and BAB\leftarrow A. For t=1t=1 to kk do:

For i=1i=1 to mm do: compute cnk+t1(CiTCi)c_{n-k+t-1}(C_{i}^{T}C_{i}) and cnk+t(CiTCi)c_{n-k+t}(C_{i}^{T}C_{i}), where Ci=Bπ{i}(B)C_{i}=B-\pi_{\{i\}}(B) is the matrix obtained by projecting each row of BB orthogonal to bib_{i}.

Pick ii that minimizes cnk+t1(CiTCi)/cnk+t(CiTCi)\left|c_{n-k+t-1}(C_{i}^{T}C_{i})\right|/\left|c_{n-k+t}(C_{i}^{T}C_{i})\right|. Let SS{i}S\leftarrow S\cup\{i\} and BCiB\leftarrow C_{i}.

By applying Lemma 21 to S{i}S\cup\{i\} instead of SS, as Ci=Bπ{i}(B)=AπS{i}(A)C_{i}=B-\pi_{\{i\}}(B)=A-\pi_{S\cup\{i\}}(A), we see that the step tt of our algorithm picks ii that minimizes

The correctness of our algorithm follows immediately from observing that in each step tt,

The guarantee for spectral norm follows immediately from our guarantee in the Frobenius norm, just using properties of norms and the fact that rank(AAk)nk\operatorname{rank}(A-A_{k})\leq n-k:

Moreover, this algorithm runs in time O(kmnωlogn)O(kmn^{\omega}\log n) if we use the subroutine in Subsection 4.1 to compute the characteristic polynomial of CiCiTC_{i}C_{i}^{T} using that of CiTCiC_{i}^{T}C_{i}. ∎

Lower bound for rank-111 spectral approximation using one row

Let BB be the best rank-11 approximation to AA whose rows lie in the span of (1,ϵ,0,,0)(1,\epsilon,0,\dotsc,0) (or for that matter, any fixed row of AA). Then, we want to show that

(1,1,,1)(1,1,\dotsc,1) is an eigenvector of AATAA^{T} with eigenvalue n+ϵ2n+\epsilon^{2}. Thus, σ1(A)=n+ϵ2\sigma_{1}(A)=\sqrt{n+\epsilon^{2}}. Observe that, by symmetry, all other singular values of AA must be equal, i.e., σ2(A)=σ3(A)==σn(A)\sigma_{2}(A)=\sigma_{3}(A)=\dotsc=\sigma_{n}(A). However,

Therefore, AA12=σ2(A)=ϵ\left\|A-A_{1}\right\|_{2}=\sigma_{2}(A)=\epsilon.

Now denote the ii-th row of AA by aia_{i}. By definition, the ii-th row of BB is the projection of aia_{i} onto span(a1)\operatorname{span}\left(a_{1}\right). We are interested in the singular values of ABA-B. For i2i\geq 2:

Again, (0,1,1,,1)(0,1,1,\dotsc,1) is the top eigenvector of (AB)(AB)T(A-B)(A-B)^{T} and using this we get,

Discussion

We analyzed efficient algorithms for volume sampling that can be used for row/column subset selection. Here are some ideas for future investigation suggested by this work:

It would be interesting to explore how these algorithmic ideas are related to determinantal sampling and, in particular, the generation of random spanning trees.

Find practical counterparts of the algorithms discussed here. In particular, we do not analyze the numerical stability of our algorithms.

Is there an efficient algorithm for volume sampling based on random walks? This question is inspired by MCMC as well as random walk algorithms for the generation of random spanning trees.

References