An Improved Approximation Algorithm for the Column Subset Selection Problem
Christos Boutsidis, Michael W. Mahoney, Petros Drineas
Introduction
We consider the problem of selecting the “best” set of exactly columns from an matrix . More precisely, we consider the following Column Subset Selection Problem (CSSP):
is minimized over all possible choices for the matrix . Here, denotes the projection onto the -dimensional space spanned by the columns of and denotes the spectral norm or Frobenius norm.
That is, the goal of the CSSP is to find a subset of exactly columns of that “captures” as much of as possible, with respect to the spectral norm and/or Frobenius norm, in a projection sense. The CSSP has been studied extensively in numerical linear algebra, where it has found applications in, e.g., scientific computing . More recently, a relaxation has been studied in theoretical computer science, where it has been motivated by applications to large scientific and internet data sets .
We briefly comment on the complexity of the problem. Clearly, in time we can generate all possible matrices and thus find the optimal solution in time. However, from a practical perspective, in data analysis applications of the CSSP (see Section 1.2), is often in the order of hundreds or thousands. Thus, in practice, algorithms whose running time depends exponentially on are prohibitively slow even if is, from a theoretical perspective, a constant. Finally, the NP-hardness of the CSSP (assuming is a function of ) is an open problem. Note, though, that a similar problem, asking for the columns of the matrix that maximize the volume of the parallelepiped spanned by the columns of , is provably NP-hard .
2 The CSSP in statistical data analysis
In data applications, where the input matrix models objects represented with respect to features, the CSSP corresponds to unsupervised feature selection. Standard motivations for feature selection include facilitating data visualization, reducing training times, avoiding overfitting, and facilitating data understanding.
Consider, in particular, Principal Components Analysis (PCA), which is the predominant linear dimensionality reduction technique, and which has been widely applied on datasets in all scientific domains, from the social sciences and economics, to biology and chemistry. In words, PCA seeks to map or embed data points from a high dimensional Euclidean space to a low dimensional Euclidean space while keeping all the relevant linear structure intact. PCA is an unsupervised dimensionality reduction technique, with the sole input parameters being the coordinates of the data points and the number of dimensions that will be retained in the embedding (say ), which is typically a constant independent of and ; often it is too. Data analysts often seek a subset of actual features (that is, actual columns, as opposed to the eigenvectors or eigenfeatures returned by PCA) that can accurately reproduce the structure derived by PCA. The CSSP is the obvious optimization problem associated with such unsupervised feature selection tasks.
We should note that similar formulations appeared in . In addition, applications of such ideas include: () , where a “compact CUR matrix decomposition” was applied to static and dynamic data analysis in large sparse graphs; () , where these ideas were used for compression and classification of hyperspectral medical data and the reconstruction of missing entries from recommendation systems data in order to make high-quality recommendations; and () , where the concept of “PCA-correlated SNPs” (Single Nucleotide Polymorphisms) was introduced and applied to classify individuals from throughout the world without the need for any prior ancestry information. See for a detailed evaluation of our main algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis (finance, document-term data, and genetics).
3 Our main results
We present a novel two-stage algorithm for the CSSP. This algorithm is presented in detail in Section 3 as Algorithm 1. In the first stage of this algorithm (the randomized stage), we randomly select columns of , i.e., of the transpose of the matrix consisting of the top right singular vectors of , according to a judiciously-chosen probability distribution that depends on information in the top- right singular subspace of . Then, in the second stage (the deterministic stage), we apply a deterministic column-selection procedure to select exactly columns from the set of columns of selected by the first stage. The algorithm then returns the corresponding columns of . In Section 4 we prove the following theorem.
There exists an algorithm (the two-stage Algorithm 1) that approximates the solution to the CSSP. This algorithm takes as input an matrix and a positive integer ; it runs in time; and it returns as output an matrix consisting of exactly columns of such that with probability at least :
Here, denotes a projection onto the column span of the matrix , and denotes the best rank- approximation to the matrix as computed with the singular value decomposition.
Note that we can trivially boost the success probability in the above theorem to by repeating the algorithm times. Note also that the running time of our algorithm is linear in the larger of the dimensions and , quadratic in the smaller one, and independent of . Thus, it is practically useful and efficient.
To put our results into perspective, we compare them to the best existing results for the CSSP. Prior work provided bounds of the form
where is a polynomial on and . For , i.e., for the spectral norm, the best previously-known bound for approximating the CSSP is , while for , i.e., for the Frobenius norm, the best bound is . Both results are algorithmically efficient, running in time polynomial in all three parameters , , and . The former runs in time and the latter runs in time. Our approach provides an algorithmic bound for the Frobenius norm version of the CSSP that is roughly better than the best previously-known algorithmic result. It should be noted that also proves that by exhaustively testing all possibilities for the matrix , the best one will satisfy eqn. (1) with . Our algorithmic result is only worse than this existential result. A similar existential result for the spectral norm version of the CSSP is proved in with . Our spectral norm bound depends on . In a worst case setting (e.g., when all the bottom singular values of are equal) this quantity is upper bounded by . This is worse than the best results for the spectral norm version of the CSSP by a factor of .
Finally, we should emphasize that a novel feature of the algorithm that we present in this paper is that it combines in a nontrivial manner recent algorithmic developments in the theoretical computer science community with more traditional techniques from the numerical linear algebra community in order to obtain improved bounds for the CSSP.
Background and prior work
First, recall that the -notation can be used to denote an asymptotically tight bound: or if there exist positive constants , , and such that for all . This is similar to the way in which the big--notation can be used to denote an asymptotic upper bound: if there exist positive constants and such that for all .
Finally, we will make frequent use of the following fundamental result from probability theory, known as Markov’s inequality . Let be a random variable assuming non-negative values with expectation . Then, for all , with probability at least . We will also need the so-called union bound. Given a set of probabilistic events holding with respective probabilities , the probability that all events hold (a.k.a., the probability of the union of those events) is upper bounded by .
2 Related prior work
Since solving the CSSP exactly is a hard combinatorial optimization problem, research has historically focused on computing approximate solutions to it. Since provides an immediate lower bound for , for and for any choice of , a large number of approximation algorithms have been proposed to select a subset of columns of such that the resulting matrix satisfies
for some function . Within the numerical linear algebra community, most of the work on the CSSP has focused on spectral norm bounds and is related to the so-called Rank Revealing QR (RRQR) factorization:
(The RRQR factorization) Given a matrix () and an integer (), assume partial factorizations of the form:
where is an orthonormal matrix, is upper triangular, , , , and is a permutation matrix. The above factorization is called a RRQR factorization if it satisfies
where and are functions bounded by low degree polynomials in and .
The work of Golub on pivoted factorizations was followed by much research addressing the problem of constructing an efficient RRQR factorization. Most researchers improved RRQR factorizations by focusing on improving the functions and in Definition 2. Let denote the first columns of a permutation matrix . Then, if is an matrix consisting of columns of , it is straightforward to prove that
for both . Thus, in particular, when applied to the spectral norm, it follows that
i.e., any algorithm that constructs an RRQR factorization of the matrix with provable guarantees also provides provable guarantees for the CSSP. See Table 1 for a summary of existing results, and see for a survey and an empirical evaluation of some of these algorithms. More recently, proposed random-projection type algorithms that achieve the same spectral norm bounds as prior work while improving the running time.
Within the theoretical computer science community, much work has followed that of Frieze, Kannan, and Vempala on selecting a small subset of representative columns of , forming a matrix , such that the projection of on the subspace spanned by the columns of is as close to as possible. The algorithms from this community are randomized, which means that they come with a failure probability, and focus mainly on the Frobenius norm. It is worth noting that they provide a strong tradeoff between the number of selected columns and the desired approximation accuracy. A typical scenario for these algorithms is that the desired approximation error (see below) is given as input, and then the algorithm selects the minimum number of appropriate columns in order to achieve this error. One of the most relevant results for this paper is a bound of , which states that there exist exactly columns in any matrix such that
Here, contains exactly columns of . The only known algorithm to find these columns is to try all choices and keep the best. This existential result relies on the so-called volume sampling method . In , an adaptive sampling method is used to approximate the volume sampling method and leads to an algorithm which finds columns of such that
As mentioned above, much work has also considered algorithms choosing slightly more than columns. This relaxation provides significant flexibility and improved error bounds. For example, in , an adaptive sampling method leads to an algorithm, such that
holds with high probability for some matrix consisting of columns of . Similarly, in , Drineas, Mahoney, and Muthukrishnan leverage the subspace sampling method to give an algorithm such that
holds with high probability if contains columns of .
A two-stage algorithm for the CSSP
In this section, we present and describe Algorithm 1, our main algorithm for approximating the solution to the CSSP. This algorithm takes as input an matrix and a rank parameter . After an initial setup, the algorithm has two stages: a randomized stage and a deterministic stage. In the randomized stage, a randomized procedure is run to select columns from the matrix , i.e., the transpose of the matrix containing the top- right singular vectors of . The columns are chosen by randomly sampling according to a judiciously-chosen nonuniform probability distribution that depends on information in the top- right singular subspace of . Then, in the deterministic stage, a deterministic procedure is employed to select exactly columns from the columns chosen in the randomized stage. The algorithm then outputs exactly columns of that correspond to those columns chosen from . Theorem 1 states that the projection of on the subspace spanned by these columns of is (up to bounded error) close to the best rank approximation to .
In more detail, Algorithm 1 first computes a probability distribution over the set , i.e., over the columns of , or equivalently over the columns of . The probability distribution depends on information in the top- right singular subspace of . In particular, for all , define
and note that , for all , and that . We will describe the computation of probabilities of this form below.
In the randomized stage, Algorithm 1 employs the following randomized column selection algorithm to choose columns from to pass to the second stage. Let assume the value of eqn. (11). In independent identically distributed (i.i.d.) trials, the algorithm chooses a column of where in each trial the -th column of is kept with probability . Additionally, if the -th column is kept, then a scaling factor equal to is kept as well. Thus, at the end of this process, we will be left with columns of and their corresponding scaling factors. Notice that due to random sampling in i.i.d. trials with replacement we might keep a particular column more than once.
In order to conveniently represent the selected columns and the associated scaling factors, we will use the following sampling matrix formalism. First, define an sampling matrix as follows: is initially empty; at each of the i.i.d. trials, if the -th column of is selected by the random sampling process, then (an -vector of all-zeros, except for its -th entry which is set to one) is appended to . Next, define the diagonal rescaling matrix as follows: if the -th column of is selected, then a diagonal entry of is set to . Thus, we may view the randomized stage as outputting the matrix consisting of a small number of rescaled columns of , or simply as outputting and .
In the deterministic stage, Algorithm 1 applies a deterministic column selection algorithm to the output of the first stage in order to choose exactly columns from the input matrix . To do so, we run the Algorithm 4 of (with the parameter set to ) on the matrix , i.e., the column-scaled version of the columns of chosen in the first stage. Thus, a matrix is formed, or equivalently, in the sampling matrix formalism described previously, a new matrix is constructed. Its dimensions are , since it selects exactly columns out of the columns returned after the end of the randomized stage. The algorithm then returns the corresponding columns of the original matrix , i.e., after the second stage of the algorithm is complete, the matrix is returned as the final output.
2 Running time analysis
We now discuss the running time of our algorithm. Note that manipulating the probability distribution of eqn. (10) yields:
Thus, knowledge of , i.e., the matrix consisting of the top- right singular vectors of , suffices to compute the ’s. By eqn. (12), time suffices for our theoretical analysis. In practice iterative algorithms could be used to speed up the algorithm. Note also that in order to obtain the Frobenius norm bound of Theorem 1, our theoretical analysis holds if the sampling probabilities are of the form:
That is, the Frobenius norm bound of Theorem 1 holds even if the second term in the sampling probabilities of eqns. (10) and (12) is omitted.
which is the only guarantee that we need in the deterministic step (see Lemma 3). The running time of the deterministic stage of Algorithm 1 is time, since the (padded) matrix has columns and rows.
An important open problem would be to identify other suitable importance sampling probability distributions that avoid the computation of a basis for the top- right singular subspace.
3 Intuition underlying our main algorithm
Intuitively, we achieve improved bounds for the CSSP because we apply the deterministic algorithm to a lower dimensional matrix (the matrix with columns, as opposed to the matrix with columns) in which the columns are “spread out” in a “nice” manner. To see this, note that the probability distribution of eqn. (13), and thus one of the two terms in the probability distribution of eqns. (10) or (12), equals (up to scaling) the diagonal elements of the projection matrix onto the span of the top- right singular subspace. In diagnostic regression analysis, these quantities have a natural interpretation in terms of statistical leverage, and thus they have been used extensively to identify “outlying” data points . Thus, the importance sampling probabilities that we employ in the randomized stage of our main algorithm provide a bias toward more “outlying” columns, which then provide a “nice” starting point for the deterministic stage of our main algorithm. This also provides intuition as to why using importance sampling probabilities of the form of eqn. (13) leads to relative-error low-rank matrix approximations .
Proof of Theorem 1
In this section, we provide a proof of Theorem 1. We start with an outline of our proof, pointing out conceptual improvements that were necessary in order to obtain improved bounds. An important condition in the first phase of the algorithm is that when we sample columns from the matrix , we obtain a matrix that does not lose any rank. To do so, we will apply a result from matrix perturbation theory to prove that if (see eqn. (11)) then . (See Lemma 1 below.) Then, under the assumption that has full rank, we will prove that the matrix returned by the algorithm will satisfy:
for both . (See Lemma 2 below.) Next, we will provide a bound on . In order to get a strong accuracy guarantee for the overall algorithm, the deterministic column selection algorithm must satisfy
where is a polynomial in both and . Thus, for our main theorem, we will employ Algorithm 4 with , which guarantees the above bound with . (See Lemma 3 below.) Finally, we will show, using relatively straightforward matrix perturbation techniques, that is not too much more, in a multiplicative sense, than , where we note that the factors differ for . (See Lemmas 4 and 5 below.) By combining these results, the main theorem will follow.
The following lemma provides a bound on the singular values of the matrix computed by the randomized phase of Algorithm 1, from which it will follow that the matrix is full rank. To prove the lemma, we employ Theorem 2 of the Appendix (this theorem is a variant of a result of Rudelson and Vershynin in ). Note that probabilities of the form of eqn. (13) actually suffice to establish Lemma 1.
Let and be constructed using Algorithm 1. Then, with probability at least ,
In particular, has full rank.
Proof: In order to bound , we will bound . Towards that end, we will use Theorem 2 with and , which results in the value for in eqn. (11). Note that the sampling probabilities in eqn. (12) satisfy
Now Theorem 2 and our construction of and guarantee that for as in eqn. (11)
We note here that the condition in Theorem 2 is trivially satisfied assuming that is at least one (given our choices for , , and ). Using and Markov’s inequality we get that with probability at least ,
Standard matrix perturbation theory results now imply that for all ,
Let , , and be constructed as described in Algorithm 1 and recall that . If has full rank, then for ,
Proof: We seek to bound the spectral and Frobenius norms of , where is constructed by Algorithm 1. To do so, first notice that scaling the columns of a matrix (equivalently, post-multiplying the matrix by a diagonal matrix) by any non-zero scale factors does not change the subspace spanned by the columns of the matrix. Thus,
where, in the last line, we have introduced the convenient notation that we will use throughout the remainder of this proof. In the sequel we seek to bound the residual
This implies that in eqn. (15) we can replace with any other matrix and the equality with an inequality. In particular we replace with , where is the best rank- approximation to :
Let . Then, and, using the triangle inequality,
We now bound , , and . First, for , note that:
In eqn. (17), we replaced by . This follows since the statement of our lemma assumes that the matrix has full rank. Also, the construction of guarantees that the columns of that are selected in the second stage of Algorithm 1 are linearly independent, and thus the matrix has full rank and is invertible. In eqn. (18), and can be dropped without increasing a unitarily invariant norm, while eqn. (19) follows since is a full-rank matrix. Next, note that . Finally, to conclude the proof, we bound as follows:
Eqn. (20) follows by the orthogonality of and and the fact that is a invertible matrix (see above). Eqn. (21) follows from the fact that for any two matrices and and , . Finally, eqn. (22) follows since and is an orthogonal matrix.
Let , , and be constructed using Algorithm 1. Then, with probability at least ,
Proof: From Lemma 1 we know that holds for all with probability at least . The deterministic construction of (see Algorithm 4 of with the parameter set to ) guarantees that
() If and are constructed as described in Algorithm 1, then, with probability at least ,
Proof: Let . We manipulate as follows:
Given our construction of and and applying eqn. (9) of Theorem 1 of with and , we get that with probability at least ,
Thus, by combining the above results and using and we get
To conclude the proof of the lemma we take the square roots of both sides of the above inequality.
() If and are constructed as described in Algorithm 1, then, with probability at least ,
Proof: It is straightforward to prove that with our construction of and , the expectation of is equal to . In addition, note that the latter quantity is exactly equal to . Applying Markov’s inequality, we get that, with probability at least ,
Taking square roots of both sides of the above inequality concludes the proof of the lemma.
4 Completing the proof of Theorem 1
To prove the Frobenius norm bound of Theorem 1 we combine Lemma 2 (with ) with Lemmas 3 and 5. Thus, we get
Using immediately derives the Frobenius norm bound of Theorem 1. Notice that Lemma 3 fails with probability at most and that Lemma 5 fails with probability at most ; thus, applying the standard union bound, it follows that the Frobenius norm bound of Theorem 1 holds with probability at least . To prove the spectral norm bound of Theorem 1 we combine Lemma 2 (with ) with Lemmas 3 and 4. Thus, we get
Using immediately derives the spectral norm bound of Theorem 1. Notice that Lemma 3 fails with probability at most and that Lemma 4 fails with probability at most ; thus, applying the standard union bound, it follows that the spectral norm bound of Theorem 1 holds with probability at least .
Acknowledgements
We are grateful to Daniel Spielman and Ilse Ipsen for numerous useful discussions on the results of this paper. We would also like to thank an anonymous reviewer of an earlier version of this manuscript who provided a counterexample to Lemma 4.4 of , and thus helped us identify the error in the proof of that lemma.
References
Appendix
for all for some constant . Let be an accuracy parameter, assume , and let
(Here is the unknown constant of Theorem 3.1, p. 8 of .) Then,