Fast approximation of matrix coherence and statistical leverage
Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Introduction
The concept of statistical leverage measures the extent to which the singular vectors of a matrix are correlated with the standard basis and as such it has found usefulness recently in large-scale data analysis and in the analysis of randomized matrix algorithms . A related notion is that of matrix coherence, which has been of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation . Defined more precisely below, the statistical leverage scores may be computed as the squared Euclidean norms of the rows of the matrix containing the top left singular vectors and the coherence of the matrix is the largest statistical leverage score. Statistical leverage scores have a long history in statistical data analysis, where they have been used for outlier detection in regression diagnostics . Statistical leverage scores have also proved crucial recently in the development of improved worst-case randomized matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists ; see for a detailed discussion. The naïve and best previously existing algorithm to compute these scores would compute an orthogonal basis for the dominant part of the spectrum of , e.g., the basis provided by the Singular Value Decomposition (SVD) or a basis provided by a QR decomposition , and then use that basis to compute diagonal elements of the projection matrix onto the span of that basis.
We present a randomized algorithm to compute relative-error approximations to every statistical leverage score in time qualitatively faster than the time required to compute an orthogonal basis. For the case of an arbitrary matrix , with , our main algorithm runs (under assumptions on the precise values of and , see Theorem 1 for an exact statement) in time, as opposed to the time required by the naïve algorithm. As a corollary, our algorithm provides a relative-error approximation to the coherence of an arbitrary matrix in the same time. In addition, several practically-important extensions of the basic idea underlying our main algorithm are also described in this paper.
We start with the following definition of the statistical leverage scores of a matrix.
Given an arbitrary matrix , with , let denote the matrix consisting of the left singular vectors of , and let denote the -th row of the matrix as a row vector. Then, the statistical leverage scores of the rows of are given by
for ; the coherence of the rows of is
i.e., it is the largest statistical leverage score of ; and the -cross-leverage scores are
i.e., they are the dot products between the row and the row of .
Although we have defined these quantities in terms of a particular basis, they clearly do not depend on that particular basis, but only on the space spanned by that basis. To see this, let denote the projection matrix onto the span of the columns of . Then,
That is, the statistical leverage scores of a matrix are equal to the diagonal elements of the projection matrix onto the span of its columns.In this paper, for simplicity of exposition, we consider the case that the matrix has rank equal to , i.e., has full column rank. Theoretically, the extension to rank-deficient matrices is straightforward—simply modify Definition 1 and thus Eqns. (4) and (5) to let be any orthogonal matrix (clearly, with fewer than columns) spanning the column space of . From a numerical perspective, things are substantially more subtle, and we leave this for future work that considers numerical implementations of our algorithms. Similarly, the -cross-leverage scores are equal to the off-diagonal elements of this projection matrix, i.e.,
Clearly, time suffices to compute all the statistical leverage scores exactly: simply perform the SVD or compute a QR decomposition of in order to obtain any orthogonal basis for the range of and then compute the Euclidean norm of the rows of the resulting matrix. Thus, in this paper, we are interested in algorithms that run in time.
2 Our main result
Our main result is a randomized algorithm for computing relative-error approximations to every statistical leverage score, as well as an additive-error approximation to all of the large cross-leverage scores, of an arbitrary matrix, with , in time qualitatively faster than the time required to compute an orthogonal basis for the range of that matrix. Our main algorithm for computing approximations to the statistical leverage scores (see Algorithm 1 in Section 3) will amount to constructing a “randomized sketch” of the input matrix and then computing the Euclidean norms of the rows of that sketch. This sketch can also be used to compute approximations to the large cross-leverage scores (see Algorithm 2 of Section 3).
The following theorem provides our main quality-of-approximation and running time result for Algorithm 1.
holds for all . Assuming , the running time of the algorithm is
The following theorem provides our main quality-of-approximation and running time result for Algorithm 2.
This algorithm runs in time.
Note that by setting , we can compute all the “large” cross-leverage scores, i.e., those satisfying , to within additive-error in time (treating as a constant). If the overall running time is , as desired.
3 Significance and related work
Our results are important for their applications to fast randomized matrix algorithms, as well as their applications in numerical linear algebra and large-scale data analysis more generally.
Significance in theoretical computer science. The statistical leverage scores define the key structural nonuniformity that must be dealt with (i.e., either rapidly approximated or rapidly uniformized at the preprocessing step) in developing fast randomized algorithms for matrix problems such as least-squares regression and low-rank matrix approximation . Roughly, the best random sampling algorithms use these scores (or the generalized leverage scores relative to the best rank- approximation to ) as an importance sampling distribution to sample with respect to. On the other hand, the best random projection algorithms rotate to a basis where these scores are approximately uniform and thus in which uniform sampling is appropriate. See for a detailed discussion.
Applications to numerical linear algebra. Recently, high-quality numerical implementations of variants of the basic randomized matrix algorithms have proven superior to traditional deterministic algorithms . An important question raised by our main results is how these will compare with an implementation of our main algorithm. More generally, density functional theory and uncertainty quantification are two scientific computing areas where computing the diagonal elements of functions (such as a projection or inverse) of very large input matrices is common. For example, in the former case, “heuristic” methods based on using Chebychev polynomials have been used in numerical linear algebra to compute the diagonal elements of the projector . Our main algorithm should have implications in both of these areas.
More recently, these scores (usually the largest scores) often have an interpretation in terms of the data and processes generating the data that can be exploited. For example, depending on the setting, they can have an interpretation in terms of high-degree nodes in data graphs, very small clusters in noisy data, coherence of information, articulation points between clusters, the value of a customer in a network, space localization in sensor networks, etc. . In genetics, dense matrices of size thousands by hundreds of thousands (a size scale at which even traditional deterministic QR algorithms fail to run) constructed from DNA Single Nucleotide Polymorphisms (SNP) data are increasingly common, and the statistical leverage scores can correlate strongly with other metrics of genetic interest . Our main result will permit the computation of these scores and related quantities for significantly larger SNP data sets than has been possible previously .
Remark. Lest there be any confusion, we should emphasize our main contributions. First, note that statistical leverage and matrix coherence are important concepts in statistics and machine learning. Second, recall that several random sampling algorithms for ubiquitous matrix problems such as least-squares approximation and low-rank matrix approximation use leverage scores in a crucial manner; but until now these algorithms were , where is the time required to compute a QR decomposition or a partial SVD of the input matrix. Third, note that, in some cases, algorithms exist for these problems based on fast random projections. But recall that the existence of those projection algorithms in no way implies that it is easy or obvious how to compute the statistical leverage scores efficiently. Fourth, one implication of our main result is that those random sampling algorithms can now be performed just as efficiently as those random projection algorithms; thus, the solution for those matrix problems can now be obtained while preserving the identity of the rows. That is, these problems can now be solved just as efficiently by using actual rows, rather than the arbitrary linear combinations of rows that are returned by random projections. Fifth, we provide a generalization to “fat” matrices and to obtaining the cross-leverage scores. Sixth, we develop algorithms that can compute leverage scores and related statistics even in streaming environments.
4 Empirical discussion of our algorithms
Although the main contribution of our paper is to provide a rigorous theoretical understanding of fast leverage score approximation, our paper does analyze the theoretical performance of what is meant to be a practical algorithm. Thus, one might wonder about the empirical performance of our algorithms—for example, whether hidden constants render the algorithms useless for data of realistic size. Not surprisingly, this depends heavily on the quality of the numerical implementation, whether one is interested in “tall” or more general matrices, etc. We will consider empirical and numerical aspects of these algorithms in forthcoming papers, e.g., . We will, however, provide here a brief summary of several numerical issues for the reader interested in these issues.
Empirically, the running time bottleneck for our main algorithm (Algorithm 1 of Section 3) applied to “tall” matrices is the application of the random projection . Thus, empirically the running time is similar to the running time of random projection based methods for computing approximations to the least-squares problem, which is also dominated by the application of the random projection. The state of the art here is the Blendenpik algorithm of and the LSRN algorithm of . In their Blendenpik paper, Avron, Maymounkov, and Toledo showed that their high-quality numerical implementation of a Hadamard-based random projection (and associated least-squares computation) “beats Lapack’sLapack (short for Linear Algebra PACKage) is a high-quality and widely-used software library of numerical routines for solving a wide range of numerical linear algebra problems. direct dense least-squares solver by a large margin on essentially any dense tall matrix,” and they concluded that their empirical results “suggest that random projection algorithms should be incorporated into future versions of Lapack” . The LSRN algorithm of Meng, Saunders, and Mahoney improves Blendenpik in several respects, e.g., providing better handling of sparsity and rank deficiency, but most notably the random projection underlying LSRN is particularly appropriate for solving large problems on clusters with high communication cost, e.g., it has been shown to scale well on Amazon Elastic Cloud Compute clusters. Thus, our main algorithm should extend easily to these environments with the use of the random projection underlying LSRN. Moreover, for both Blendenpik and LSRN (when implemented with a Hadamard-based random projection), the hidden constants in the Hadamard-based random projection are so small that the random projection algorithm (and thus the empirical running time of our main algorithm for approximating leverage scores) beats the traditional time algorithm for dense matrices as small as thousands of rows by hundreds of columns.
5 Outline
In Section 2, we will provide a brief review of relevant notation and concepts from linear algebra. Then, in Sections 3 and 4, we will present our main results: Section 3 will contain our main algorithm and Section 4 will contain the proof of our main theorem. Section 5 will then describe extensions of our main result to general “fat” matrices, i.e., those with . Section 6 will conclude by describing the relationship of our main result with another related estimator for the statistical leverage scores, an application of our main algorithm to the under-constrained least-squares approximation problem, and extensions of our main algorithm to streaming environments.
Preliminaries on linear algebra and fast random projections
2 The Fast Johnson-Lindenstrauss Transform (FJLT)
Given and a set of points with , a -Johnson-Lindenstrauss Transform (-JLT), denoted , is a projection of the points into such that
To construct an -JLT with high probability, simply choose every entry of independently, equal to with probability each and zero otherwise (with probability ) . Let be a matrix drawn from such a distribution over matrices.When no confusion can arise, we will use to refer to this distribution over matrices as well as to a specific matrix drawn from this distribution. Then, the following lemma holds.
Let be an arbitrary (but fixed) set of points, where and let be an accuracy parameter. If
then, with probability at least , is an -JLT .
For our main results, we will also need a stronger requirement than the simple -JLT and so we will use a version of the Fast Johnson-Lindenstrauss Transform (FJLT), which was originally introduced in . Consider an orthogonal matrix , viewed as vectors in . A FJLT projects the vectors from to , while preserving the orthogonality of ; moreover, it does so very quickly. Specifically, given , is an -FJLT for if
, and
given any , the matrix product can be computed in time.
The next lemma follows from the definition of an -FJLT, and its proof can be found in .
3 The Subsampled Randomized Hadamard Transform (SRHT)
One can use a Randomized Hadamard Transform (RHT) to construct, with high probability, an -FJLT. Our main algorithm will use this efficient construction in a crucial way.Note that the RHT has also been crucial in the development of randomized algorithms for the general overconstrained LS problem and its variants have been used to provide high-quality numerical implementations of such randomized algorithms . Recall that the (unnormalized) matrix of the Hadamard transform is defined recursively by
with . The normalized matrix of the Hadamard transform is equal to
Using the sampling matrix formalism described previously , we will represent the operation of randomly sampling rows of an matrix using an linear sampling operator . Let the matrix be generated using the SRHT.Again, when no confusion can arise, we will use to denote a specific SRHT or the distribution on matrices implied by the randomized process for constructing an SRHT. The most important property about the distribution is that if is large enough, then, with high probability, generates an -FJLT. We summarize this discussion in the following lemma (which is essentially a combination of Lemmas 3 and 4 from , restated to fit our notation).
then, with probability at least , is an -FJLT for .
Our main algorithmic results
Recall that our first goal is to approximate, for all , the quantities
where we approximated the matrix by the matrix . Computing in this way takes time, which is not efficient because (from Lemma 3).
for each , which is essentially what Algorithm 1 does. Not surprisingly, the sketch can be used in other ways: for example, by considering the dot product between two different rows of this randomized sketching matrix (and some additional manipulations) Algorithm 2 approximates the large cross-leverage scores of .
2 Approximating all the statistical leverage scores
Since has rank (by Lemma 2) and preserves this rank, is a invertible matrix. Using and properties of the pseudoinverse, we get . Thus,
3 Approximating the large cross-leverage scores
By combining Lemmas 6 and 7 (in Section 4.2 below) with the triangle inequality, one immediately obtains the following lemma.
Let be either the sketching matrix constructed by Algorithm 1, i.e., , or as described in Section 3.1. Then, the pairwise dot-products of the rows of are additive-error approximations to the leverage scores and cross-leverage scores:
That is, if one were interested in obtaining an approximation to all the cross-leverage scores to within additive error (and thus the diagonal statistical leverage scores to relative-error), then the algorithm which first computes followed by all the pairwise inner products achieves this in time , where is the time to compute from Section 3.2 and .The exact algorithm which computes a basis first and then the pairwise inner products requires time. Thus, by using the sketch, we can already improve on this running time by a factor of . The challenge is to avoid the computational complexity and this can be done if one is interested only in the large cross-leverage scores.
Proofs of our main theorems
and can be computed efficiently. Since is an -FJLT for , where , can be computed in time. By Lemma 2, , and so
Since is an -FJLT for , it follows that , i.e., that is approximately orthogonal. Theorem 1 follows from this basic idea. However, in order to prove Theorem 2, having a sketch which preserves inner products alone is not sufficient. We also need a fast algorithm to identify the large inner products and to relate these to the actual cross-leverage scores. Indeed, it is possible to efficiently find pairs of rows in a general matrix with large inner products. Combining this with the fact that the inner products are preserved, we obtain Theorem 2.
2 Proof of Theorem 1
We condition all our analysis on the events that is an -FJLT for and is an -JLT for points in . Define
The theorem follows after rescaling .
Let . Using this SVD of and Eqn. (10) in Lemma 2,
By performing standard manipulations, we can now bound :
Let the SVD of be , where is a full rotation in dimensions (because ). Then, . Thus,
where we used the fact that and the unitary invariance of the spectral norm. Finally, using Eqn. (9) of Lemma 2 the result follows.
Proof of Lemma 7.
Since is an -JLT for vectors, it preserves the norms of an arbitrary (but fixed) collection of vectors. Let . Consider the following vectors:
By the -JLT property of and the fact that ,
Combining Eqns. (17) and (18) after expanding the squares using the identity , substituting , and after some algebra, we obtain
To conclude the proof, multiply throughout by and use the homogeneity of the inner product, together with the linearity of , to obtain:
Running Times.
By Lemma 4, we can use instead of and obtain the same estimates. Since is an -FJLT, the product can be computed in while its SVD takes an additional time to return . Since , we obtain in an additional time. Finally, premultiplying by takes time, and computing and returning the squared row-norms of takes time. So, the total running time is the sum of all these operations, which is
For our implementations of the -JLTs and -FJLTs (), and . It follows that the asymptotic running time is
To simplify, suppose that and treat as a constant. Then, the asymptotic running time is
3 Proof of Theorem 2
We first construct an algorithm to estimate the large inner products among the rows of an arbitrary matrix with . This general algorithm will be applied to the matrix . Let denote the rows of ; for a given , the pair is heavy if
By the Cauchy-Schwarz inequality, this implies that
so it suffices to find all the pairs for which Eqn. (19) holds. We will call such pairs norm-heavy. Let be the number of norm-heavy pairs satisfying Eqn. (19). We first bound the number of such pairs.
Using the above notation, .
where are the singular values of . To conclude, by the definition of a heavy pair,
where the last inequality follows by Cauchy-Schwarz. ∎
Algorithm 3 returns including all the heavy pairs of in time.
Given , assume that for the pair of vectors and
where the last equality follows from . By Eqn. (20), after squaring and using ,
Using Lemma 9, the running time of our approach is . Since , and, by Eqn. (22), , the overall running time is .
Extending our algorithm to general matrices
Unfortunately, as stated, this is an ill-posed problem. Indeed, consider the degenerate case when (i.e., the identity matrix). In this case, is not unique and the leverage scores are not well-defined. Moreover, for the obvious equivalent choices for , the leverage scores defined according to any one of these choices do not provide a relative error approximation to the leverage scores defined according to any other choices. More generally, removing this trivial degeneracy does not help. Consider the matrix
In this example, the leverage scores for are well defined. However, as , it is not possible to distinguish between the top- singular space and its complement. This example suggests that it should be possible to obtain some result conditioning on the spectral gap at the singular value. For example, one might assume that , in which case the parameter would play an important role in the ability to solve this problem. Any algorithm which cannot distinguish the singular values with an error less than will confuse the -th and -th singular vectors and consequently will fail to get an accurate approximation to the leverage scores for .
In the following, we take a more natural approach which leads to a clean problem formulation. To do so, recall that the leverage scores and the related normalized leverage scores of Eqn. (23) are used to approximate the matrix in some way, e.g., we might be seeking a low-rank approximation to the matrix with respect to the spectral or the Frobenius norm, or we might be seeking useful features or data points in downstream data analysis applications , or we might be seeking to develop high-quality numerical implementations of low-rank matrix approximation algorithms , etc. In all these cases, we only care that the estimated leverage scores are a good approximation to the leverage scores of some “good” low-rank approximation to . The following definition captures the notion of a set of rank- matrices that are good approximations to .
Thus, we will seek algorithms whose output is a set of numbers, with the requirement that those numbers are good approximations to the normalized leverage scores of some matrix (instead of ). This removes the ill-posedness of the original problem. Next, we will give two examples of algorithms that compute such -approximations to the normalized leverage scores of a general matrix with a rank parameter for two popular norms, the spectral norm and the Frobenius norm.Note that we will not compute , but our algorithms will compute a matrix in that set. Moreover, that matrix can be used for high-quality low-rank matrix approximation. See the comments in Section 1.4 for more details.
Algorithm 4 approximates the statistical leverage scores of a general matrix with rank parameter in the spectral norm case. It takes as inputs a matrix Rn×d with and a rank parameter , and outputs a set of numbers for all , namely our approximations to the normalized leverage scores of with rank parameter
The matrix can be computed in time.
This version of the above lemma is proven in .More specifically, the proof may be found in Lemma 32 and in particular in Eqn. (14) in Section A.2; note that for our purposes here we replaced by after adjusting accordingly. Now, since has rank , it follows that and thus we can consider the non-negative random variable and apply Markov’s inequality to get that
holds with probability at least . Thus, with probability at least .
holds with probability at least for all . It follows that
Clearly, are the normalized leverage scores of the matrix . Recall that with probability at least and use Definition 3 to conclude that the scores of Eqn. (25) are -approximations to the normalized leverage scores of with rank parameter . The following Theorem summarizes the above discussion:
2 Leverage Scores for Frobenius Norm Approximators.
Algorithm 5 approximates the statistical leverage scores of a general matrix with rank parameter in the Frobenius norm case. It takes as inputs a matrix Rn×d with and a rank parameter , and outputs a set of numbers for all , namely our approximations to the normalized leverage scores of with rank parameter
Let and let be as in Eqn. (29). Then,
The matrix can be computed in time.
where is the best rank- approximation to the matrix ; from standard linear algebra, . Then, the above lemma is proven in .More specifically, the proof may be found in Lemma 33 in Section A.3; note that for our purposes here we set . Now, since has rank , it follows that and thus we can consider the non-negative random variable and apply Markov’s inequality to get that
holds with probability at least . Rearranging terms and taking square roots of both sides implies that
Thus, with probability at least . To conclude our proof, recall that is an orthonormal basis for the columns of . From Eqn. (29),
We briefly discuss the running time of the proposed algorithm. First, we can compute in time. Then, the computation of takes time. The computation of takes time and the computation of takes time. Thus, the total time is equal to . The following Theorem summarizes the above discussion.
Discussion
We will conclude with a discussion of our main results in a broader context: understanding the relationship between our main algorithm and a related estimator for the statistical leverage scores; applying our main algorithm to solve under-constrained least squares problems; and implementing variants of the basic algorithm in streaming environments.
Magdon-Ismail in presented the following algorithm to estimate the statistical leverage scores: given as input an matrix , with , the algorithm proceeds as follows.
Compute , where the matrix is a SRHT or another FJLT.
Although both Algorithm 1 and the algorithm of this subsection estimate by a matrix of the form , there are notable differences. The algorithm of this subsection does not actually compute or approximate directly; instead, it separates the matrix into two parts and computes the dot product between and . Positivity is sacrificed and this leads to some complications in the estimator; however, the truncation step is interesting, since, despite the fact that the estimates are “biased” (in a manner somewhat akin to what is obtained with “thresholding” or “regularization” procedures), we still obtain provable approximation guarantees. The algorithm of this subsection is simpler (since it uses an application of only one random projection), albeit at the cost of weaker theoretical guarantees and a worse running time than our main algorithm. A direction of considerable practical interest is to evaluate empirically the performance of these two estimators, either for estimating all the leverage scores or (more interestingly) for estimating the largest leverage scores for data matrices for which the leverage scores are quite nonuniform.
2 An application to under-constrained least-squares problems
Consider the following under-constrained least-squares problem:
Algorithm 6 runs in time.
Thus, all the singular values of are strictly positive and hence has full rank equal to . Also, using ,
In the above derivations we substituted the SVD of , dropped terms that do not change unitarily invariant norms, and used the fact that and have full rank in order to simplify the pseudoinverse. Now let and note that Eqn. (33) and the fact that has full rank imply
Thus, we conclude our proof by observing that
In the above we used the fact that . The running time of the algorithm follows by observing that is an matrix and thus computing its pseudoinverse takes time; computing takes an additional time.
We conclude the section with a few remarks. First, assuming that , , and are constants and , it immediately follows that Algorithm 6 runs in time. It should be clear that we can use Theorem 1 and the related Algorithm 1 to approximate the statistical leverage scores, thus bypassing the need to exactly compute them. Second, instead of approximating the statistical leverage scores needed in Algorithm 6, we could use the randomized Hadamard transform (essentially post-multiply by a randomized Hadamard transform to make all statistical leverage scores uniform). The resulting algorithm could be theoretically analyzed following the lines of . It would be interesting to evaluate experimentally the performance of the two approaches in real data.
3 Extension to streaming environments
In this section, we consider the estimation of the leverage scores and of related statistics when the input data set is so large that an appropriate way to view the data is as a data stream . In this context, one is interested in computing statistics of the data stream while making one pass (or occasionally a few additional passes) over the data from external storage and using only a small amount of additional space. For an matrix , with , small additional space means that the space complexity only depends logarithmically on the high dimension and polynomially on the low dimension . When we discuss bits of space, we assume that the entries of can be discretized to bit integers, though all of our results can be generalized to arbitrary word sizes. The general strategy behind our algorithms is as follows.
As the data streams by, compute , for an appropriate problem-dependent linear sketching matrix , and also compute , for a random projection matrix .In the offline setting, one would use an SRHT or another FJLT, while in the streaming setting one could use either of the following. If the stream is such that one sees each entire column of at once, then one could do an FJLT on the column. Alternatively, if one see updates to the individual entries of in an arbitrary order, then one could apply any sketching matrix, such as those of or of .
After the first pass over the data, compute the matrix , as described in Algorithm 1, corresponding to (or compute the pseudoinverse of or the matrix from any other QR decomposition of ).
Compute , for a random projection matrix , such as the one used by Algorithm 1.
With the procedure outlined above, the matrix is effectively applied to the rows of , i.e., to the sketch of that has rows with Euclidean norms approximately equal to the row norms of , and pairwise inner products approximately equal to those in . Thus statistics related to can be extracted.
By the coupon collector problem, running independent copies is enough to find a set containing all rows for which , and no rows for which with probability at least .